ELO Scoring for Plan Quality
The Problem
PlanExe's self-improvement loop currently lacks a hard metric for plan quality. Structural compliance (valid JSON, schema adherence, field counts) is measurable but insufficient — iteration 17 proved that optimizing structural compliance alone degrades content quality (6.5/10 → 5.8/10 on external review). The assessment phase catches regressions, but requires human babysitting and has no aggregate quality signal across prompts.
What's missing is a scalar quality score that:
- Works across diverse plan types (vegan butcher, rubber biosecurity, census logistics)
- Detects content quality changes, not just structural compliance
- Can run unattended as part of the optimization loop
- Is resistant to gaming by verbose, confident-sounding but hollow output
Why ELO
ELO solves the "no absolute metric" problem by converting quality assessment into pairwise comparisons. Asking "is this plan good?" is vague. Asking "is Plan A better than Plan B for this prompt?" is concrete and produces a binary signal that aggregates into a meaningful ranking.
Each system prompt variant is a player. Each plan prompt is a match. The generated plans compete head-to-head. A variant that consistently produces better plans across diverse prompts climbs the ranking. One that excels at rubber biosecurity but fails at the clay workshop gets a mediocre rating because it's inconsistent.
ELO properties that matter here:
- Self-calibrating. New variants enter with a default rating and find their level through matches. No need to define "what is a good plan" in absolute terms.
- Transitive. If variant A beats B and B beats C, A's rating reflects that without needing a direct A-vs-C comparison.
- Incremental. Each match updates ratings. You don't need a full tournament to get signal — even 5-10 comparisons per variant produce useful rankings.
- Familiar. The math is well-understood, widely implemented, and easy to explain to collaborators.
Rating System Design
Players
A "player" is a frozen configuration that produces plans:
```
player = {
    system_prompt_hash: str,   # hash of the prompt text
    code_version: str,         # git commit of pipeline code
    model_config: str,         # model name from baseline/frontier.json
    pydantic_schema_hash: str, # hash of Lever/DocumentDetails schemas
    validator_hash: str        # hash of field_validator code
}
```
Any change to any of these components creates a new player. This prevents conflating prompt improvements with model upgrades or code fixes.
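The identity rule can be sketched as a small hashing helper. The field names follow the record above; the SHA-256 scheme and the hash truncation lengths are illustrative assumptions, not a prescribed format:

```python
import hashlib
import json

def player_id(system_prompt: str, code_version: str, model_config: str,
              pydantic_schema: str, validator_source: str) -> str:
    """Derive a stable player ID; changing any component yields a new player."""
    config = {
        "system_prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest()[:12],
        "code_version": code_version,
        "model_config": model_config,
        "pydantic_schema_hash": hashlib.sha256(pydantic_schema.encode()).hexdigest()[:12],
        "validator_hash": hashlib.sha256(validator_source.encode()).hexdigest()[:12],
    }
    # Hash the canonical JSON of all components so key ordering is irrelevant.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```

Because the ID is derived rather than assigned, reruns of the same frozen configuration map back to the same player automatically.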
Matches
A match compares two players on the same plan prompt:
```
match = {
    prompt_id: str,          # from simple_plan_prompts.jsonl
    player_a: player,
    player_b: player,
    plan_a: path,            # generated plan output
    plan_b: path,            # generated plan output
    judge_model: str,        # stronger model used for judging
    verdict: "A" | "B" | "DRAW",
    judge_reasoning: str,    # stored for auditing
    presentation_order: str, # "AB" or "BA" (randomized)
    timestamp: str
}
```
Rating Update
Standard ELO with K-factor 32 (adjustable):
```python
K = 32  # adjustable; see K-factor considerations below

def update_elo(rating_a, rating_b, outcome):
    """outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = draw"""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    new_a = rating_a + K * (outcome - expected_a)
    new_b = rating_b + K * ((1 - outcome) - expected_b)
    return new_a, new_b
```
New players start at 1200. After ~20 matches the rating stabilizes enough to be directionally useful.
K-Factor Considerations
- K=32 (default): responsive to new data, ratings shift meaningfully after each match. Good for early exploration when variants differ a lot.
- K=16: more stable ratings, better when variants are close in quality and you want to avoid noise from individual judge calls.
- Consider starting at K=32 and reducing to K=16 after a player has 30+ matches (provisional → established, same pattern as FIDE chess).
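The provisional-to-established switch can be sketched directly; the K values and the 30-match threshold mirror the bullets above:

```python
def k_factor(matches_played: int, provisional_k: int = 32,
             established_k: int = 16, threshold: int = 30) -> int:
    """Responsive K for provisional players, stable K once established."""
    return provisional_k if matches_played < threshold else established_k
```

A per-player K also means a new variant can climb quickly against an established incumbent without destabilizing the incumbent's rating as much.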
Judge Design
The judge is the most critical component. A bad judge makes the entire rating system meaningless.
Model Selection
The judge model should be stronger than the generator model. If plans are generated with Gemini 2.0 Flash, judge with something heavier — Claude Sonnet, GPT-4o, or Qwen 3.5-397B. The judge only runs once per comparison (not 194 times), so the cost premium is manageable.
Cost estimate: one judge call per match, ~2k-4k tokens input (two plan excerpts + rubric), ~200 tokens output. At $3/M input tokens, that's roughly $0.01 per match.
Position Bias Mitigation
LLM judges exhibit position bias: they often favor whichever plan is presented first. For every match, run the comparison twice with the presentation order swapped:
Round 1: "Here is Plan A: [plan 1] ... Here is Plan B: [plan 2] ..."
Round 2: "Here is Plan A: [plan 2] ... Here is Plan B: [plan 1] ..."

The labels stay fixed; only the contents swap, so the judge cannot key on position.
Scoring:
- Same winner in both rounds → decisive verdict (1.0 or 0.0)
- Different winners → draw (0.5)
- Both rounds say draw → draw (0.5)
This doubles judge cost (~$0.02 per match) but cancels out systematic position bias.
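The scoring rules can be expressed as a small mapping. Treating a mixed decisive-plus-draw pair as a draw is a conservative assumption on my part, since the rules above don't cover that case:

```python
def combine_verdicts(round1: str, round2: str) -> float:
    """Map two swapped-order verdicts ("A", "B", or "DRAW") to an outcome.

    Verdicts are expressed relative to the true players (already un-swapped).
    Only agreement across both presentation orders counts as decisive;
    everything else, including one decisive verdict plus one draw, scores 0.5.
    """
    if round1 == round2 == "A":
        return 1.0  # A wins decisively
    if round1 == round2 == "B":
        return 0.0  # B wins decisively
    return 0.5      # disagreement or draws
```

The returned value feeds straight into `update_elo` as the `outcome` argument.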
Judge Prompt
The judge prompt must evaluate what you actually care about — content quality, not structural compliance. The rubric should reflect lessons from OPTIMIZE_INSTRUCTIONS:
Compare these two plans generated for the same prompt. Evaluate on:
1. **Groundedness**: Are claims supported by the project context, or
fabricated? Penalize invented percentages, cost estimates, and
performance figures that have no basis in the prompt.
2. **Actionability**: Could a project manager actually schedule and
resource the proposed actions? Penalize vague aspirations posing
as options.
3. **Specificity**: Do options name domain-specific mechanisms and
constraints, or do they use generic strategy language that could
apply to any project?
4. **Conciseness**: Is the content earning its length? Penalize
verbose restatements, marketing language, and padding that adds
words without adding information.
5. **Risk honesty**: Does the plan acknowledge real constraints and
trade-offs, or does it present an unrealistically optimistic path?
Which plan is better overall? Respond with exactly one of:
A_BETTER | B_BETTER | DRAW
Then explain your reasoning in 2-3 sentences.
What NOT to Judge
Do not ask the judge about structural compliance (valid JSON, field counts, schema adherence). That's already covered by validators. The ELO system should measure exclusively what validators cannot: content quality, groundedness, and usefulness.
Judge Calibration
Run the judge on known-good vs known-bad pairs to verify it produces sensible rankings. Use the baseline training data as a reference:
- Compare a baseline plan against one with fabricated percentages injected → judge should prefer baseline.
- Compare a concise plan against a 3× verbose version with the same content → judge should prefer concise or call it a draw.
- Compare a plan with domain-specific options against one with generic "optimize X" options → judge should prefer domain-specific.
If the judge fails these sanity checks, fix the rubric before trusting match results.
A/B Testing Framework
Datasets
Maintain two prompt datasets:
- Train set: prompts used during optimization. The system can overfit to these. Currently the 5 plans in baseline/train/.
- Verify set: held-out prompts never seen during optimization. Used only for final validation of a candidate change. Currently in baseline/verify/.
For ELO scoring, use a third set drawn from simple_plan_prompts.jsonl:
- Tournament set: 15-20 diverse prompts spanning business/personal/other, different scales (coffee → $30B programs), different domains, and including 2-3 red-team prompts. This set tests generalization.
The tournament set should be fixed for a scoring season (e.g., one month) so ratings are comparable. Rotate prompts between seasons to prevent overfitting.
Running an A/B Test
When a PR proposes a system prompt change:
1. Generate plans for the tournament set using the current best prompt (A).
- Skip if A's plans are already cached from a previous run.
2. Generate plans for the tournament set using the candidate prompt (B).
3. Run pairwise comparisons: A vs B on each prompt.
4. Update ELO ratings for both A and B.
5. If B's rating exceeds A's by a significance threshold, promote B.
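The steps above can be sketched as a driver function. `generate` and `judge` are injected callables standing in for the real (expensive) pipeline stages, so this is a structural sketch rather than a runnable pipeline:

```python
def run_ab_test(prompts, generate, judge,
                rating_a=1200.0, rating_b=1200.0, k=32):
    """Run candidate B against incumbent A over the tournament set.

    generate(player, prompt_id) returns a plan (A's plans may come from cache);
    judge(plan_a, plan_b) returns 1.0 (A wins), 0.0 (B wins), or 0.5 (draw).
    """
    for prompt_id in prompts:
        plan_a = generate("A", prompt_id)  # incumbent, cached upstream
        plan_b = generate("B", prompt_id)  # candidate, generated fresh
        outcome = judge(plan_a, plan_b)
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        rating_a += k * (outcome - expected_a)
        rating_b += k * ((1 - outcome) - (1 - expected_a))
    return rating_a, rating_b
```

Note that each update is zero-sum: the points A loses are exactly the points B gains, which keeps the pool's average rating fixed.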
Significance Threshold
Don't promote on a single match win. Require:
- B's ELO exceeds A's by ≥50 points after all tournament matches, OR
- B wins ≥60% of decisive (non-draw) matches across the tournament set
Both thresholds should be tunable. Start conservative and loosen if the system is too slow to adopt improvements.
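The promotion rule might be encoded as follows, with both thresholds kept tunable per the note above:

```python
def should_promote(rating_a: float, rating_b: float,
                   decisive_wins_b: int, decisive_matches: int,
                   elo_margin: float = 50.0, win_rate: float = 0.60) -> bool:
    """Promote candidate B if either significance threshold is met."""
    if rating_b - rating_a >= elo_margin:
        return True
    if decisive_matches > 0 and decisive_wins_b / decisive_matches >= win_rate:
        return True
    return False
```

Guarding the win-rate check behind `decisive_matches > 0` matters: a tournament of all draws should never promote.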
Caching and Cost Control
Plan generation is the expensive step (~$0.36 per plan, 15 minutes). Judging is cheap (~$0.02 per match). Optimize accordingly:
- Cache generated plans by player hash + prompt ID. A plan only needs to be generated once; it can be judged against multiple opponents.
- When testing a new candidate (B), generate B's plans fresh but reuse A's cached plans from the previous round.
- For a 15-prompt tournament: generation cost = 15 × $0.36 = $5.40 for the new candidate (A is cached). Judging cost = 15 × $0.02 = $0.30. Total per A/B test: ~$5.70.
- Compare to current approach: ~$1/iteration with manual babysitting. The ELO approach costs more per test but eliminates babysitting and produces a reusable rating.
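The caching rule (one generation per player hash + prompt ID) can be sketched as a get-or-generate helper; the `player_id__prompt_id.json` file layout is a hypothetical choice, not an existing convention:

```python
import os

def cached_plan_path(cache_dir: str, player_id: str, prompt_id: str) -> str:
    """Plans are keyed by player + prompt, so a plan generated once
    can be judged against many opponents."""
    return os.path.join(cache_dir, f"{player_id}__{prompt_id}.json")

def get_or_generate(cache_dir, player_id, prompt_id, generate):
    """Return the cached plan if present, otherwise generate and cache it."""
    path = cached_plan_path(cache_dir, player_id, prompt_id)
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    plan = generate(prompt_id)  # the expensive step (~$0.36, ~15 min)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        f.write(plan)
    return plan
```

Because the key includes the player hash, a prompt or code change automatically invalidates the cache for that player without any manual bookkeeping.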
Budget-Conscious Mode
For $1/iteration budget constraints:
- Reduce tournament set to 5 prompts (the current train set). Generation cost: $1.80 (only candidate, baseline cached). Judging cost: $0.10. Total: ~$1.90.
- Or run partial tournaments: 3 prompts per iteration, rotating through the full set across iterations. Ratings update incrementally.
- Or judge only on cached plans: if both A and B already have generated plans from previous runs, the entire cost is judging only ($0.10-0.30).
Integration with the Existing Loop
Where ELO Fits in the Pipeline
Current loop:
PR → runner → insight → code review → synthesis → assessment → verdict
With ELO:
PR → runner → insight → code review → synthesis → ELO matches → verdict
The ELO matches replace or supplement the assessment phase.
The synthesis phase still identifies *why* something changed.
ELO provides the *did it actually improve* signal.
Coexistence with Assessment
Don't remove the assessment phase immediately. Run both in parallel:
- Assessment provides causal analysis (what changed and why).
- ELO provides aggregate quality signal (did it get better).
- If they disagree, investigate. A PR that assessment calls YES but ELO rates as worse likely has a quality regression masked by a structural improvement.
Over time, as trust in ELO builds, the assessment phase can be simplified to focus on causal analysis only, with the quality verdict delegated to ELO.
Tracking Model-Specific Effects
Create separate ELO pools per model if needed. A prompt change that helps haiku but hurts llama3.1 should be visible:
```
ratings = {
    "prompt_v47": {
        "overall": 1285,
        "by_model": {
            "haiku": 1310,
            "llama3.1": 1190,
            "gemini-flash": 1300,
            ...
        }
    }
}
```
This catches the haiku-specific problems that dominate current optimization work (template lock, verbosity, fabricated percentages) without hiding them in an aggregate score.
Data Model
Storage
All ELO data in a single JSON file, human-readable and diffable:
```json
{
  "schema_version": 1,
  "k_factor": 32,
  "players": {
    "prompt_v47_gemini-flash_abc123": {
      "rating": 1285,
      "matches": 23,
      "wins": 14,
      "losses": 6,
      "draws": 3,
      "created_at": "2026-04-01T12:00:00Z",
      "config": {
        "system_prompt_hash": "abc123",
        "code_version": "def456",
        "model_config": "openrouter-gemini-2.0-flash-001"
      }
    }
  },
  "matches": [
    {
      "id": "match_001",
      "prompt_id": "a6158408-...",
      "player_a": "prompt_v46_gemini-flash_xyz789",
      "player_b": "prompt_v47_gemini-flash_abc123",
      "verdict": "B",
      "judge_model": "claude-sonnet-4-20250514",
      "presentation_order": "BA",
      "judge_reasoning": "Plan B provides more specific...",
      "timestamp": "2026-04-01T12:05:00Z"
    }
  ]
}
```
Leaderboard
Generate a leaderboard from the ratings file:
| Rank | Player | ELO | W-L-D | Last Match |
|------|--------|-----|-------|------------|
| 1 | prompt_v47 + gemini-flash | 1285 | 14-6-3 | 2026-04-01 |
| 2 | prompt_v47 + gpt-4o-mini | 1270 | 12-5-3 | 2026-04-01 |
| 3 | prompt_v46 + gemini-flash | 1230 | 10-8-2 | 2026-03-28 |
| 4 | prompt_v47 + haiku | 1195 | 8-9-3 | 2026-04-01 |
| 5 | prompt_v45 + gemini-flash | 1150 | 6-10-4 | 2026-03-25 |
This makes model-specific regressions immediately visible. If haiku consistently ranks below other models on the same prompt version, that's a signal to investigate haiku-specific issues — exactly the kind of problem the current OPTIMIZE_INSTRUCTIONS documents.
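Generating the leaderboard from the ratings file is a one-liner sort; this minimal sketch assumes the JSON schema shown in the Storage section:

```python
def leaderboard(ratings: dict) -> list:
    """Return (player, rating, W-L-D) rows sorted by ELO, descending."""
    rows = [
        (name, p["rating"], f'{p["wins"]}-{p["losses"]}-{p["draws"]}')
        for name, p in ratings["players"].items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Because the ratings file is plain JSON, this runs anywhere with no infrastructure, in line with the "human-readable and diffable" storage goal.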
Red-Team ELO (Optional Extension)
For the red-team prompts in simple_plan_prompts.jsonl, define "better" differently. Instead of "which plan is more operationally useful," ask which plan responds with more appropriate caution, flagging legal, ethical, and safety concerns rather than operationalizing a harmful objective.
This creates a parallel safety ELO alongside the quality ELO. A system prompt change that improves quality but reduces safety caution would show up as a quality ELO gain paired with a safety ELO loss — a clear signal that the change needs review.
This is exploratory and may not produce useful signal if all models comply equally (which current evidence suggests). But it costs nothing extra to track if matches are already being run on red-team prompts.
Step-Specific ELO (Scaling Beyond Levers)
The current optimization loop covers identify_potential_levers and
identify_documents. ELO can be applied per pipeline step:
- Levers ELO: judges compare lever quality (groundedness, option diversity, review depth).
- Documents ELO: judges compare document identification quality (completeness, specificity, relevance).
- Full-plan ELO: judges compare the final HTML report end-to-end.
Start with levers (most iteration history, best-understood failure modes). Add other steps as the system matures.
Risks and Mitigations
Judge unreliability. LLM judges are noisy. Mitigations: position bias swap, sanity check calibration, requiring consistent verdicts across presentation orders. If judge noise is too high, increase matches per comparison (best-of-3 instead of single match).
Metric gaming. A prompt variant could be optimized to win judge comparisons without producing genuinely better plans. Mitigation: rotate judge models, update rubric based on observed gaming patterns, and maintain the qualitative assessment phase as a cross-check.
Overfitting to tournament set. A prompt variant could improve on the 15 tournament prompts but regress on novel prompts. Mitigation: rotate tournament prompts seasonally, validate finalists against the held-out verify set before promotion.
Cost creep. Each additional tournament prompt adds ~$0.38 per A/B test. Mitigation: cache aggressively, run partial tournaments when budget is tight, use budget-conscious mode.
ELO inflation/deflation. If new variants always enter at 1200 but the overall quality is rising, 1200 becomes relatively weaker over time. Mitigation: anchor ratings to a reference player (the original baseline prompt) that is never modified and always participates in tournaments.
Implementation Priority
1. Match runner. Script that takes two plan directories and a judge model, runs the pairwise comparison with position swap, and outputs a match record. This is useful immediately — even without ELO ratings, pairwise comparisons are a better signal than the current assessment.
2. Rating tracker. JSON file + update script. Read match records, compute ELO updates, write leaderboard. Simple, no infrastructure.
3. Tournament runner. Script that takes a candidate player, generates plans for the tournament set, runs matches against the current best, and reports the result. This replaces manual A/B testing.
4. Integration with run_optimization_iteration.py. Auto-run a mini-tournament after synthesis, use ELO delta as the verdict signal. This is the point where babysitting becomes optional.
Step 1 is a single Python script. It produces value on its own. Start there.