Domain-Aware Plan Ranking Engine with Relative Comparison
Author: Larry (via OpenClaw)
Date: 2026-02-22
Status: Proposal
Audience: Technical reviewers, engineers, stakeholders
Problem Statement
Current plan evaluation assumes a universal rubric (concreteness, executability, success criteria) with fixed weights. This breaks down when comparing plans from different domains:
- A road construction plan has different success signals than a software project
- Domain-specific KPIs (e.g., budget contingency in construction vs. MVP launch timing in software) matter more than generic signals
- Absolute scoring ("this plan is a 6/10") doesn't tell us whether it's in the top 10% of similar plans or mediocre compared to its peers
Solution: Domain-Aware Relative Ranking
Instead of absolute scores, rank plans within their domain context using:
- Domain Classification — detect plan type (construction, software, marketing, operations, etc.)
- Domain-Specific Signal Extraction — pull KPIs relevant to that domain
- Corpus Bucketing — group plans by type for fair comparison
- Relative Elo Ranking — score each plan against similar plans, not in a vacuum
- Actionability — surface top-performing plans (>90th percentile) as refinement candidates, flag major-rewrite situations
Architecture
Plan Input
↓
[Domain Classifier] → Detect plan type (construction, software, marketing, ops, etc.)
↓
[Domain-Specific Extractor] → Pull KPIs: timeline clarity, resource estimates, risk mitigations, owner assignment, etc.
↓
[Corpus Bucketer] → Find all similar-type plans in database
↓
[Elo Ranker] → Compare new plan against sampled corpus neighbors
↓
[Actionability Scorer] → Is this top 10%? Fixable? Rewrite candidate?
↓
Output: Rank percentile, actionability flag, refinement recommendations
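The pipeline above can be sketched as a thin orchestration layer. Every stage name below is a hypothetical callable, not an existing PlanExe API; real implementations would call the LLM, the database, and the Elo ranker.

```python
def rank_plan(plan_text, stages):
    """Run a plan through the five pipeline stages in diagram order.

    `stages` maps stage names to callables (all hypothetical),
    so each stage can be swapped or mocked independently.
    """
    domain = stages["classify"](plan_text)                 # Domain Classifier
    signals = stages["extract"](plan_text, domain)         # Domain-Specific Extractor
    peers = stages["bucket"](domain)                       # Corpus Bucketer
    elo, percentile = stages["rank"](signals, peers)       # Elo Ranker
    flags = stages["actionability"](percentile, signals)   # Actionability Scorer
    return {"domain": domain, "elo": elo,
            "percentile": percentile, "actionability": flags}
```

Wiring the stages as callables keeps the orchestrator testable with stubs before any stage is fully built.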
Implementation Details
1. Domain Classification
Input: Plan text + metadata (e.g., project title, goals, phases)
Output: Domain label (one of: construction, software, marketing, operations, research, business-development, other)
Method:
- LLM-based classification (zero-shot with 1–2 examples per domain)
- Fallback: keyword matching on phase names, deliverables, team roles
- Confidence threshold: if <0.7, flag as "cross-domain" or "unclear"
Example prompts:
"Read this plan and classify it as one of: construction, software, marketing, operations, research, business-development, other. Explain your reasoning in 1 sentence."
Plan text here...
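The keyword-matching fallback might look like the sketch below. The keyword lists and the confidence definition (the winning domain's share of total keyword hits) are assumptions, not the production classifier.

```python
# Assumed keyword lists; a real deployment would tune these per domain.
DOMAIN_KEYWORDS = {
    "construction": {"contractor", "permit", "inspection", "concrete", "site"},
    "software": {"api", "deploy", "sprint", "backend", "mvp"},
    "marketing": {"campaign", "audience", "channel", "cac", "funnel"},
    "operations": {"sla", "escalation", "workflow", "uptime"},
}

def classify_by_keywords(plan_text, threshold=0.7):
    """Count keyword hits per domain; return (domain, confidence).

    Confidence is the winning domain's share of all keyword hits.
    Below the threshold, the plan is flagged as 'unclear'.
    """
    words = set(plan_text.lower().split())
    hits = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return "other", 0.0
    domain, best = max(hits.items(), key=lambda kv: kv[1])
    confidence = best / total
    if confidence < threshold:
        return "unclear", confidence
    return domain, confidence
```

The fallback only runs when the LLM classifier is unavailable or below its own confidence threshold.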
2. Domain-Specific Signal Extraction
Each domain extracts different KPIs:
Construction:
- Budget vs. estimate variance tolerance
- Schedule float/slack (days available for delays)
- Risk contingency % of budget
- Owner accountability (named PM, not "TBD")
- Inspection/approval checkpoints
Software:
- MVP vs. full feature clarity (what's launch, what's post-launch)
- Tech debt acknowledgment (testing, documentation standards)
- Team skill-market fit (do we have the right people?)
- Dependency clarity (external APIs, third-party risks)
- Launch/staging milestones
Marketing/Growth:
- Channel diversification (not all eggs in one basket)
- CAC payback period or LTV:CAC ratio (are we thinking about unit economics?)
- Audience targeting specificity (who exactly, not "millennials")
- Content calendar or cadence clarity
- Success metric definition (viral coefficient, NPS, growth rate?)
Operations:
- Process automation KPI (% manual vs. automated workflows)
- SLA definition (response time, uptime targets)
- Escalation clarity (who handles edge cases?)
- Monitoring/alerting (do we know if something breaks?)
All domains:
- Concreteness: timeline specificity, named owner, measurable KPIs (0–10)
- Executability: phase sequencing, dependencies clear (0–10)
- Success criteria: explicit win conditions, not vibes (0–10)
Output: JSON with domain + extracted signals (each 0–10)
{
"domain": "software",
"concreteness": 8,
"executability": 7,
"success_criteria": 6,
"domain_specific": {
"mvp_clarity": 9,
"tech_debt_acknowledged": 7,
"team_fit": 6,
"dependency_clarity": 8,
"launch_milestones": 9
},
"confidence": 0.92
}
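A sketch of post-extraction validation, clamping LLM-reported scores onto the 0–10 scale. The key list mirrors the software KPIs above; the function name and None-for-missing convention are assumptions.

```python
CORE_KEYS = ("concreteness", "executability", "success_criteria")
SOFTWARE_KEYS = ("mvp_clarity", "tech_debt_acknowledged", "team_fit",
                 "dependency_clarity", "launch_milestones")

def normalize_signals(raw, domain_keys=SOFTWARE_KEYS):
    """Clamp each score into [0, 10]; missing or non-numeric scores become None."""
    def clamp(v):
        return max(0, min(10, v)) if isinstance(v, (int, float)) else None
    out = {k: clamp(raw.get(k)) for k in CORE_KEYS}
    specific = raw.get("domain_specific", {})
    out["domain_specific"] = {k: clamp(specific.get(k)) for k in domain_keys}
    return out
```

Clamping guards against the LLM occasionally returning out-of-range scores, which would otherwise skew Elo comparisons downstream.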
3. Corpus Bucketing
Storage:
- plan_corpus table extended with domain column
- Index on domain for fast filtering
- pgvector embeddings per domain (optional, for semantic search within domain)
Query:
Bucketing strategy:
- Exact domain match (software vs. software)
- Fuzzy fallback: if bucket size <20, blend in adjacent domains (e.g., "software" + "research" for AI projects)
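An in-memory sketch of the bucketing strategy. The adjacency map is an assumption, and the real version would query plan_corpus by its domain column rather than filter a list.

```python
ADJACENT = {"software": ["research"], "research": ["software"]}  # assumed adjacency map

def bucket(corpus, domain, min_size=20, adjacent=ADJACENT):
    """corpus: list of (plan_id, domain, elo) rows.

    Exact-domain match first; if the bucket is thin (< min_size),
    blend in rows from adjacent domains for a fairer comparison pool.
    """
    rows = [r for r in corpus if r[1] == domain]
    if len(rows) < min_size:
        for adj in adjacent.get(domain, []):
            rows += [r for r in corpus if r[1] == adj]
    return rows
```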
4. Relative Elo Ranking
Algorithm:
- Extract the new plan's signals into new_plan_vector
- Sample 5–10 existing plans from the same domain (random + stratified by existing Elo)
- For each sampled plan, run an LLM pairwise comparison with a Likert-scale verdict
- Map the Likert output to win/loss/draw, then update each plan's Elo rating
Result: Each plan has an Elo score within its domain, comparable across similar plans.
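The Elo update itself is the standard formula. The sketch below assumes K=32 (the proposed default; see Open Questions) and a Likert result already mapped to win/draw/loss.

```python
K = 32  # proposed default K-factor (see Open Questions)

def expected_score(r_a, r_b):
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a):
    """score_a: 1.0 = A wins, 0.5 = draw, 0.0 = B wins (mapped from the Likert output)."""
    e_a = expected_score(r_a, r_b)
    delta = K * (score_a - e_a)
    return r_a + delta, r_b - delta
```

Because updates are zero-sum within a bucket, the domain's average Elo stays near the 1600 starting value as the corpus grows.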
5. Actionability Scoring
Output:
{
"plan_id": "...",
"domain": "software",
"elo_score": 1650,
"percentile": 0.87,
"actionability": {
"is_candidate_for_refinement": true,
"reason": "87th percentile; fixable with clarity on tech debt and team roles",
"needs_major_rewrite": false,
"top_gaps": ["tech_debt_acknowledged", "team_fit"],
"confidence": 0.92
}
}
Decision rules:
- >90th percentile: "High-performing; consider as template for other plans"
- 70–90th percentile: "Good candidate for refinement; address top gaps"
- 50–70th percentile: "Mid-tier; incremental improvements or focused refinement"
- <50th percentile & low concreteness: "Major rewrite recommended; start over with domain-specific template"
- <50th percentile & high concreteness: "Execution challenges; may be doable despite lower score"
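The decision rules translate directly into a lookup. The label strings and the concreteness cutoff of 5 are assumptions for illustration, not settled thresholds.

```python
def decide(percentile, concreteness):
    """Map percentile (0-1) and concreteness (0-10) to a recommendation."""
    if percentile > 0.90:
        return "high-performing: consider as template"
    if percentile >= 0.70:
        return "refine: address top gaps"
    if percentile >= 0.50:
        return "mid-tier: incremental improvements"
    if concreteness < 5:  # assumed cutoff for "low concreteness"
        return "major rewrite recommended"
    return "execution challenges: may still be doable"
```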
6. API Endpoints
POST /api/rank/domain-aware
Input: { plan_text, plan_metadata }
Output: { domain, signals, elo, percentile, actionability }
GET /api/leaderboard/by-domain?domain=software&limit=20
Output: [ { rank, plan_id, elo, percentile }, ... ]
GET /api/corpus-stats?domain=software
Output: { domain, count, avg_elo, elo_stdev, domain_signals_info }
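The leaderboard endpoint reduces to a sort-and-slice over the bucket. The dict field names below are assumptions mirroring the response shape above; a real handler would read from plan_corpus instead of a list.

```python
def leaderboard(corpus, domain, limit=20):
    """GET /api/leaderboard/by-domain sketch: top plans by Elo within a domain."""
    rows = sorted((p for p in corpus if p["domain"] == domain),
                  key=lambda p: p["elo"], reverse=True)[:limit]
    return [{"rank": i + 1, "plan_id": p["plan_id"],
             "elo": p["elo"], "percentile": p["percentile"]}
            for i, p in enumerate(rows)]
```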
Data Model
New columns in plan_corpus:
ALTER TABLE plan_corpus ADD COLUMN domain VARCHAR(50);
ALTER TABLE plan_corpus ADD COLUMN signals JSONB; -- domain-agnostic + domain-specific
ALTER TABLE plan_corpus ADD COLUMN elo_score FLOAT DEFAULT 1600;
ALTER TABLE plan_corpus ADD COLUMN percentile FLOAT; -- recomputed periodically
ALTER TABLE plan_corpus ADD COLUMN actionability_notes JSONB;
CREATE INDEX idx_plan_corpus_domain ON plan_corpus(domain);
CREATE INDEX idx_plan_corpus_elo ON plan_corpus(elo_score DESC);
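The percentile column marked "recomputed periodically" can be refreshed per domain bucket with a pure function like this sketch, where percentile is the share of same-domain peers with strictly lower Elo (one possible definition; the spec does not pin one down).

```python
def recompute_percentiles(plans):
    """plans: list of (plan_id, elo) within one domain bucket.

    Returns {plan_id: percentile}, with percentile in [0, 1].
    """
    elos = [e for _, e in plans]
    n = len(plans)
    return {pid: sum(1 for e in elos if e < elo) / max(n - 1, 1)
            for pid, elo in plans}
```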
Implementation Phases
Phase 1: Minimal Domain Classifier (2 days)
- LLM-based domain detection (zero-shot)
- Fallback to keyword matching
- No Elo yet; just label and store the domain
Phase 2: Domain-Specific Extractors (3 days)
- Build 3–4 domain-specific signal extractors (software, construction, ops, marketing)
- Normalize all to 0–10 scale
- Store signals in plan_corpus
Phase 3: Elo Ranking Engine (4 days)
- Implement pairwise LLM comparison
- Elo update logic
- Corpus bucketing & sampling
- Percentile calculation
Phase 4: Actionability & API (2 days)
- Actionability scoring rules
- /api/rank/domain-aware endpoint
- /api/leaderboard/by-domain endpoint
- Test against real PlanExe corpus
Total estimate: 10 days
Testing Strategy
- Unit tests: Domain classifier (accuracy on known plans)
- Integration tests: Full ranking pipeline (new plan → elo score → percentile)
- Corpus validation: Run against existing 100+ plans, verify percentile distribution is sensible
- Domain coverage: Ensure top domains (software, construction, marketing) have >50 plans in corpus for ranking
Success Criteria
- ✅ Domain classifier achieves >85% accuracy on test set
- ✅ Elo scores converge within 10 iterations (stability)
- ✅ Top 10% plans are consistently high-clarity and actionable
- ✅ Cross-domain comparison is avoided (software vs. construction ranked separately)
- ✅ API latency <2 seconds for new plan ranking
Open Questions
- Domain list finality: Should we start with 5 domains or leave it extensible? (Proposal: 5 initial, extensible)
- Sampling strategy: Random 5–10 neighbors or stratified by existing Elo? (Proposal: stratified)
- Elo K-factor: 32 (soft), 16 (hard), 64 (volatile)? (Proposal: 32, adaptive if needed)
- Corpus size threshold: When do we stop ranking due to insufficient peers? (Proposal: <20 → merge adjacent domains)
- Actionability UI: Does PlanExe web show percentile badges, heat maps, or refinement prompts? (Proposal: all three)
Dependencies
- LLM: Gemini 2.0 Flash (for domain classification and pairwise comparison)
- Embeddings: OpenAI embeddings (optional, for semantic bucketing within domain)
- Database: PostgreSQL with pgvector (for corpus storage and fast domain filtering)
- Rate limiting: Respect API quotas (5 req/min per key)
Risks & Mitigation
| Risk | Mitigation |
|---|---|
| Domain classifier misclassifies plan | Confidence threshold; manual override option; log misclassifications |
| Elo ranking is slow with large corpus | Cache pairwise comparisons; use stratified sampling (not random) |
| Cross-domain contamination | Strict bucketing; log when fallback to adjacent domain |
| Signal extraction is too generic | Domain-specific extractors with explicit KPI lists; tune per domain |
Changelog
- 2026-02-22: Initial proposal by Larry