Domain-Aware Plan Ranking Engine with Relative Comparison

Author: Larry (via OpenClaw)
Date: 2026-02-22
Status: Proposal
Audience: Technical reviewers, engineers, stakeholders


Problem Statement

Current plan evaluation assumes a universal rubric (concreteness, executability, success criteria) with fixed weights. This breaks down when comparing plans from different domains:

  • A road construction plan has different success signals than a software project
  • Domain-specific KPIs (e.g., budget contingency in construction vs. MVP launch timing in software) matter more than generic signals
  • Absolute scoring ("this plan is a 6/10") doesn't tell us whether it's in the top 10% of similar plans or mediocre compared to its peers

Solution: Domain-Aware Relative Ranking

Instead of absolute scores, rank plans within their domain context using:

  1. Domain Classification — detect plan type (construction, software, marketing, operations, etc.)
  2. Domain-Specific Signal Extraction — pull KPIs relevant to that domain
  3. Corpus Bucketing — group plans by type for fair comparison
  4. Relative Elo Ranking — score each plan against similar plans, not in a vacuum
  5. Actionability — surface top-performing plans (>90th percentile) as refinement candidates, flag major-rewrite situations

Architecture

Plan Input
[Domain Classifier] → Detect plan type (construction, software, marketing, ops, etc.)
[Domain-Specific Extractor] → Pull KPIs: timeline clarity, resource estimates, risk mitigations, owner assignment, etc.
[Corpus Bucketer] → Find all similar-type plans in database
[Elo Ranker] → Compare new plan against sampled corpus neighbors
[Actionability Scorer] → Is this top 10%? Fixable? Rewrite candidate?
Output: Rank percentile, actionability flag, refinement recommendations
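
The stage flow above can be sketched as a single function chain. Every stage passed in here (classify, extract, bucket, rank, score) is a hypothetical stub standing in for the components detailed in later sections; only the data flow between stages is taken from the diagram.

```python
# Pipeline skeleton: each stage is injected as a callable so the real
# components can be swapped in later. A sketch, not the implementation.
def rank_plan(plan_text, classify, extract, bucket, rank, score):
    domain = classify(plan_text)             # detect plan type
    signals = extract(plan_text, domain)     # pull domain-specific KPIs
    peers = bucket(domain)                   # similar-type plans from corpus
    elo, percentile = rank(signals, peers)   # compare against sampled peers
    action = score(percentile, signals)      # top 10%? fixable? rewrite?
    return {"domain": domain, "signals": signals, "elo": elo,
            "percentile": percentile, "actionability": action}
```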

Implementation Details

1. Domain Classification

Input: Plan text + metadata (e.g., project title, goals, phases)

Output: Domain label (one of: construction, software, marketing, operations, research, business-development, other)

Method:

  • LLM-based classification (zero-shot with 1–2 examples per domain)
  • Fallback: keyword matching on phase names, deliverables, team roles
  • Confidence threshold: if <0.7, flag as "cross-domain" or "unclear"

Example prompts:

"Read this plan and classify it as one of: construction, software, marketing, operations, research, business-development, other. Explain your reasoning in 1 sentence."

Plan text here...
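
The keyword fallback path could look like the sketch below. The keyword lists and the `classify_by_keywords` name are assumptions for illustration, not part of any existing PlanExe API; the LLM path is omitted.

```python
# Keyword-matching fallback for domain classification. Confidence is the
# winning domain's share of all keyword hits, checked against the same
# 0.7 threshold used for the LLM path.
DOMAIN_KEYWORDS = {
    "construction": ["contractor", "permit", "site", "concrete", "inspection"],
    "software": ["api", "deploy", "sprint", "mvp", "backend"],
    "marketing": ["campaign", "audience", "channel", "cac", "content"],
    "operations": ["sla", "workflow", "escalation", "uptime"],
}
CONFIDENCE_THRESHOLD = 0.7

def classify_by_keywords(plan_text: str) -> tuple[str, float]:
    """Count keyword hits per domain; low-confidence results are flagged."""
    text = plan_text.lower()
    hits = {d: sum(text.count(k) for k in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return "other", 0.0
    domain = max(hits, key=hits.get)
    confidence = hits[domain] / total
    if confidence < CONFIDENCE_THRESHOLD:
        return "unclear", confidence
    return domain, confidence
```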

2. Domain-Specific Signal Extraction

Each domain extracts different KPIs:

Construction:

  • Budget vs. estimate variance tolerance
  • Schedule float/slack (days available for delays)
  • Risk contingency % of budget
  • Owner accountability (named PM, not "TBD")
  • Inspection/approval checkpoints

Software:

  • MVP vs. full feature clarity (what's launch, what's post-launch)
  • Tech debt acknowledgment (testing, documentation standards)
  • Team skill-market fit (do we have the right people?)
  • Dependency clarity (external APIs, third-party risks)
  • Launch/staging milestones

Marketing/Growth:

  • Channel diversification (not all eggs in one basket)
  • CAC payback period or LTV:CAC ratio (are we thinking about unit economics?)
  • Audience targeting specificity (who exactly, not "millennials")
  • Content calendar or cadence clarity
  • Success metric definition (viral coefficient, NPS, growth rate?)

Operations:

  • Process automation KPI (% manual vs. automated workflows)
  • SLA definition (response time, uptime targets)
  • Escalation clarity (who handles edge cases?)
  • Monitoring/alerting (do we know if something breaks?)

All domains:

  • Concreteness: timeline specificity, named owner, measurable KPIs (0–10)
  • Executability: phase sequencing, dependencies clear (0–10)
  • Success criteria: explicit win conditions, not vibes (0–10)

Output: JSON with domain + extracted signals (each 0–10)

{
  "domain": "software",
  "concreteness": 8,
  "executability": 7,
  "success_criteria": 6,
  "domain_specific": {
    "mvp_clarity": 9,
    "tech_debt_acknowledged": 7,
    "team_fit": 6,
    "dependency_clarity": 8,
    "launch_milestones": 9
  },
  "confidence": 0.92
}
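
Mirroring the JSON above as typed Python keeps extractor output honest about the 0–10 scale. The `PlanSignals` name and the clamping behavior for out-of-range values are assumptions; field names follow the example payload.

```python
# Typed container for extracted signals. Out-of-range extractor output is
# clamped onto the shared 0-10 scale rather than rejected (an assumption).
from dataclasses import dataclass, field

def clamp_0_10(value: float) -> float:
    """Normalize any extractor output onto the shared 0-10 scale."""
    return max(0.0, min(10.0, value))

@dataclass
class PlanSignals:
    domain: str
    concreteness: float
    executability: float
    success_criteria: float
    domain_specific: dict[str, float] = field(default_factory=dict)
    confidence: float = 0.0

    def __post_init__(self):
        self.concreteness = clamp_0_10(self.concreteness)
        self.executability = clamp_0_10(self.executability)
        self.success_criteria = clamp_0_10(self.success_criteria)
        self.domain_specific = {k: clamp_0_10(v)
                                for k, v in self.domain_specific.items()}
```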

3. Corpus Bucketing

Storage:

  • plan_corpus table extended with domain column
  • Index on domain for fast filtering
  • pgvector embeddings per domain (optional, for semantic search within domain)

Query:

SELECT * FROM plan_corpus 
WHERE domain = 'software' 
ORDER BY created_at DESC 
LIMIT 1000;

Bucketing strategy:

  • Exact domain match (software vs. software)
  • Fuzzy fallback: if bucket size <20, blend adjacent domains (e.g., "software" + "research" for AI projects)
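
The fuzzy-fallback rule can be sketched as below. The `<20` threshold and the software/research adjacency come from this proposal; the rest of the `ADJACENT` map and the in-memory `corpus` dict are placeholder assumptions standing in for the real database query.

```python
# Bucketing with fuzzy fallback: thin buckets borrow plans from adjacent
# domains so every plan has enough peers to rank against.
MIN_BUCKET = 20
ADJACENT = {
    "software": ["research"],            # from the proposal (AI projects)
    "research": ["software"],            # assumed symmetric
    "marketing": ["business-development"],  # assumed adjacency
    "business-development": ["marketing"],  # assumed adjacency
}

def bucket_for(domain: str, corpus: dict[str, list[str]]) -> list[str]:
    """Return plan ids for the domain, blending neighbors if the bucket is thin."""
    bucket = list(corpus.get(domain, []))
    if len(bucket) < MIN_BUCKET:
        for neighbor in ADJACENT.get(domain, []):
            bucket.extend(corpus.get(neighbor, []))
    return bucket
```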

4. Relative Elo Ranking

Algorithm:

  1. Extract new plan's signals → new_plan_vector
  2. Sample 5–10 existing plans from same domain (random + stratified by existing Elo)
  3. For each sampled plan, LLM pairwise comparison:
    "Plan A: [concreteness, executability, success clarity]
     Plan B: [concreteness, executability, success clarity]
    
     Which plan is more likely to succeed? Why?
     (Likert: strongly A, slightly A, neutral, slightly B, strongly B)"
    
  4. Use Likert output to compute win/loss, then update Elo:
    new_elo = old_elo + K * (actual_win - expected_win)
    where K = 32 (standard), expected_win based on current Elos
    

Result: Each plan has an Elo score within its domain, comparable across similar plans.
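
Steps 3–4 can be sketched as below. The K = 32 constant and the logistic expectation are the standard Elo formulas named in the algorithm; the exact Likert-to-score mapping (strongly = 1.0/0.0, slightly = 0.75/0.25, neutral = 0.5) is an assumption, since the proposal only says the Likert output is converted to win/loss.

```python
# One pairwise LLM judgment -> Elo updates for both plans.
K = 32  # standard K-factor, per the proposal
LIKERT_SCORE = {  # assumed mapping from the Likert judgment to A's score
    "strongly A": 1.0, "slightly A": 0.75, "neutral": 0.5,
    "slightly B": 0.25, "strongly B": 0.0,
}

def expected_win(elo_a: float, elo_b: float) -> float:
    """Standard logistic expectation for plan A."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

def update_elo(elo_a: float, elo_b: float, likert: str) -> tuple[float, float]:
    """Apply new_elo = old_elo + K * (actual_win - expected_win) to both sides."""
    score_a = LIKERT_SCORE[likert]
    exp_a = expected_win(elo_a, elo_b)
    new_a = elo_a + K * (score_a - exp_a)
    new_b = elo_b + K * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Note that the update is zero-sum: whatever plan A gains, plan B loses, so the domain bucket's average Elo stays put.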

5. Actionability Scoring

Output:

{
  "plan_id": "...",
  "domain": "software",
  "elo_score": 1650,
  "percentile": 0.87,
  "actionability": {
    "is_candidate_for_refinement": true,
    "reason": "87th percentile; fixable with clarity on tech debt and team roles",
    "needs_major_rewrite": false,
    "top_gaps": ["tech_debt_acknowledged", "team_fit"],
    "confidence": 0.92
  }
}

Decision rules:

  • >90th percentile: "High-performing; consider as template for other plans"
  • 70–90th percentile: "Good candidate for refinement; address top gaps"
  • 50–70th percentile: "Mid-tier; incremental improvements or focused refinement"
  • <50th percentile & low concreteness: "Major rewrite recommended; start over with domain-specific template"
  • <50th percentile & high concreteness: "Execution challenges; may be doable despite lower score"
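
The decision rules reduce to one function. The percentile thresholds come straight from the rules above; the concreteness cutoff of 7 is an assumption, since the proposal does not pin a number to "high concreteness".

```python
# Decision rules as a single dispatch function. Thresholds are from the
# proposal; the concreteness >= 7 cutoff is an assumed definition of "high".
def actionability(percentile: float, concreteness: float) -> str:
    if percentile > 0.9:
        return "High-performing; consider as template for other plans"
    if percentile > 0.7:
        return "Good candidate for refinement; address top gaps"
    if percentile > 0.5:
        return "Mid-tier; incremental improvements or focused refinement"
    if concreteness >= 7:
        return "Execution challenges; may be doable despite lower score"
    return "Major rewrite recommended; start over with domain-specific template"
```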

6. API Endpoints

POST /api/rank/domain-aware
  Input: { plan_text, plan_metadata }
  Output: { domain, signals, elo, percentile, actionability }

GET /api/leaderboard/by-domain?domain=software&limit=20
  Output: [ { rank, plan_id, elo, percentile }, ... ]

GET /api/corpus-stats?domain=software
  Output: { domain, count, avg_elo, elo_stdev, domain_signals_info }

Data Model

New columns in plan_corpus:

ALTER TABLE plan_corpus ADD COLUMN domain VARCHAR(50);
ALTER TABLE plan_corpus ADD COLUMN signals JSONB; -- domain-agnostic + domain-specific
ALTER TABLE plan_corpus ADD COLUMN elo_score FLOAT DEFAULT 1600;
ALTER TABLE plan_corpus ADD COLUMN percentile FLOAT; -- recomputed periodically
ALTER TABLE plan_corpus ADD COLUMN actionability_notes JSONB;

CREATE INDEX idx_plan_corpus_domain ON plan_corpus(domain);
CREATE INDEX idx_plan_corpus_elo ON plan_corpus(elo_score DESC);
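
Since percentile is recomputed periodically rather than stored per update, the recompute logic is simple: a plan's percentile is the fraction of peers in its domain bucket with a strictly lower Elo. A SQL PERCENT_RANK() window query would do the same in-database; the sketch below shows the logic only, and `percentile_of` is an illustrative name.

```python
# Periodic percentile recompute: rank one plan's Elo against its domain
# bucket. Fraction of peers strictly below, in 0.0-1.0.
def percentile_of(elo: float, domain_elos: list[float]) -> float:
    """Fraction of same-domain plans with a strictly lower Elo."""
    if not domain_elos:
        return 0.0
    below = sum(1 for e in domain_elos if e < elo)
    return below / len(domain_elos)
```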

Implementation Phases

Phase 1: Minimal Domain Classifier (2 days)

  • LLM-based domain detection (zero-shot)
  • Fallback to keyword matching
  • No Elo yet; just label & store domain

Phase 2: Domain-Specific Extractors (3 days)

  • Build 3–4 domain-specific signal extractors (software, construction, ops, marketing)
  • Normalize all to 0–10 scale
  • Store signals in plan_corpus

Phase 3: Elo Ranking Engine (4 days)

  • Implement pairwise LLM comparison
  • Elo update logic
  • Corpus bucketing & sampling
  • Percentile calculation

Phase 4: Actionability & API (2 days)

  • Actionability scoring rules
  • /api/rank/domain-aware endpoint
  • /api/leaderboard/by-domain endpoint
  • Test against real PlanExe corpus

Total estimate: 11 days

Testing Strategy

  1. Unit tests: Domain classifier (accuracy on known plans)
  2. Integration tests: Full ranking pipeline (new plan → elo score → percentile)
  3. Corpus validation: Run against existing 100+ plans, verify percentile distribution is sensible
  4. Domain coverage: Ensure top domains (software, construction, marketing) have >50 plans in corpus for ranking

Success Criteria

  • ✅ Domain classifier achieves >85% accuracy on test set
  • ✅ Elo scores converge within 10 iterations (stability)
  • ✅ Top 10% plans are consistently high-clarity and actionable
  • ✅ Cross-domain comparison is avoided (software vs. construction ranked separately)
  • ✅ API latency <2 seconds for new plan ranking

Open Questions

  1. Domain list finality: Should we start with 5 domains or leave it extensible? (Proposal: 5 initial, extensible)
  2. Sampling strategy: Random 5–10 neighbors or stratified by existing Elo? (Proposal: stratified)
  3. Elo K-factor: 32 (soft), 16 (hard), 64 (volatile)? (Proposal: 32, adaptive if needed)
  4. Corpus size threshold: When do we stop ranking due to insufficient peers? (Proposal: <20 → merge adjacent domains)
  5. Actionability UI: Does PlanExe web show percentile badges, heat maps, or refinement prompts? (Proposal: all three)

Dependencies

  • LLM: Gemini 2.0 Flash (for domain classification and pairwise comparison)
  • Embeddings: OpenAI embeddings (optional, for semantic bucketing within domain)
  • Database: PostgreSQL with pgvector (for corpus storage and fast domain filtering)
  • Rate limiting: Respect API quotas (5 req/min per key)

Risks & Mitigation

  • Risk: Domain classifier misclassifies plan → Mitigation: confidence threshold; manual override option; log misclassifications
  • Risk: Elo ranking is slow with large corpus → Mitigation: cache pairwise comparisons; use stratified sampling (not random)
  • Risk: Cross-domain contamination → Mitigation: strict bucketing; log when fallback to adjacent domain
  • Risk: Signal extraction is too generic → Mitigation: domain-specific extractors with explicit KPI lists; tune per domain

Changelog

  • 2026-02-22: Initial proposal by Larry