ELO Scoring for Plan Quality

The Problem

PlanExe's self-improvement loop currently lacks a hard metric for plan quality. Structural compliance (valid JSON, schema adherence, field counts) is measurable but insufficient — iteration 17 proved that optimizing structural compliance alone degrades content quality (6.5/10 → 5.8/10 on external review). The assessment phase catches regressions, but requires human babysitting and has no aggregate quality signal across prompts.

What's missing is a scalar quality score that:

  • Works across diverse plan types (vegan butcher, rubber biosecurity, census logistics)
  • Detects content quality changes, not just structural compliance
  • Can run unattended as part of the optimization loop
  • Is resistant to gaming by verbose, confident-sounding but hollow output

Why ELO

ELO solves the "no absolute metric" problem by converting quality assessment into pairwise comparisons. Asking "is this plan good?" is vague. Asking "is Plan A better than Plan B for this prompt?" is concrete and produces a binary signal that aggregates into a meaningful ranking.

Each system prompt variant is a player. Each plan prompt is a match. The generated plans compete head-to-head. A variant that consistently produces better plans across diverse prompts climbs the ranking. One that excels at rubber biosecurity but fails at the clay workshop gets a mediocre rating because it's inconsistent.

ELO properties that matter here:

  • Self-calibrating. New variants enter with a default rating and find their level through matches. No need to define "what is a good plan" in absolute terms.
  • Transitive. If variant A beats B and B beats C, A's rating reflects that without needing a direct A-vs-C comparison.
  • Incremental. Each match updates ratings. You don't need a full tournament to get signal — even 5-10 comparisons per variant produce useful rankings.
  • Familiar. The math is well-understood, widely implemented, and easy to explain to collaborators.

Rating System Design

Players

A "player" is a frozen configuration that produces plans:

player = {
    system_prompt_hash: str,      # hash of the prompt text
    code_version: str,            # git commit of pipeline code
    model_config: str,            # model name from baseline/frontier.json
    pydantic_schema_hash: str,    # hash of Lever/DocumentDetails schemas
    validator_hash: str           # hash of field_validator code
}

Any change to any of these components creates a new player. This prevents conflating prompt improvements with model upgrades or code fixes.
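A sketch of how such a player ID might be derived — the helper name and hash lengths are assumptions, but the idea is that the ID is a digest of all five components, so changing any one of them produces a new player:

```python
import hashlib
import json

def player_id(system_prompt: str, code_version: str, model_config: str,
              schema_src: str, validator_src: str) -> str:
    """Derive a stable player ID from the frozen configuration above.
    Hypothetical helper: changing any component yields a different ID."""
    def h(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:8]

    config = {
        "system_prompt_hash": h(system_prompt),
        "code_version": code_version,
        "model_config": model_config,
        "pydantic_schema_hash": h(schema_src),
        "validator_hash": h(validator_src),
    }
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Hashing the canonical JSON (sorted keys) keeps the ID order-independent and diff-friendly.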

Matches

A match compares two players on the same plan prompt:

match = {
    prompt_id: str,               # from simple_plan_prompts.jsonl
    player_a: player,
    player_b: player,
    plan_a: path,                 # generated plan output
    plan_b: path,                 # generated plan output
    judge_model: str,             # stronger model used for judging
    verdict: "A" | "B" | "DRAW",
    judge_reasoning: str,         # stored for auditing
    presentation_order: str,      # "AB" or "BA" (randomized)
    timestamp: str
}

Rating Update

Standard ELO with K-factor 32 (adjustable):

def update_elo(rating_a, rating_b, outcome, k=32):
    """outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = draw"""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    new_a = rating_a + k * (outcome - expected_a)
    new_b = rating_b + k * ((1 - outcome) - expected_b)
    return new_a, new_b

New players start at 1200. After ~20 matches the rating stabilizes enough to be directionally useful.

K-Factor Considerations

  • K=32 (default): responsive to new data, ratings shift meaningfully after each match. Good for early exploration when variants differ a lot.
  • K=16: more stable ratings, better when variants are close in quality and you want to avoid noise from individual judge calls.
  • Consider starting at K=32 and reducing to K=16 after a player has 30+ matches (provisional → established, same pattern as FIDE chess).
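The provisional-to-established schedule above can be captured in one small function (defaults follow the suggested 30-match cutoff; all three numbers are tunable):

```python
def k_factor(matches_played: int, provisional_k: int = 32,
             established_k: int = 16, threshold: int = 30) -> int:
    """FIDE-style K schedule: responsive while a player is provisional,
    stable once established. Defaults mirror the values suggested above."""
    return provisional_k if matches_played < threshold else established_k
```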

Judge Design

The judge is the most critical component. A bad judge makes the entire rating system meaningless.

Model Selection

The judge model should be stronger than the generator model. If plans are generated with Gemini 2.0 Flash, judge with something heavier — Claude Sonnet, GPT-4o, or Qwen 3.5-397B. The judge only runs once per comparison (not 194 times), so the cost premium is manageable.

Cost estimate: one judge call per match, ~2k-4k tokens input (two plan excerpts + rubric), ~200 tokens output. At $3/M input tokens, that's roughly $0.01 per match.

Position Bias Mitigation

LLM judges exhibit position bias, typically favoring whichever plan they see first. For every match, run the comparison twice with swapped presentation order:

Round 1: "Here is Plan A: ... Here is Plan B: ..."
Round 2: "Here is Plan A: ... Here is Plan B: ..."  (but A and B are swapped)

Scoring:

  • Same winner in both rounds → decisive verdict (1.0 or 0.0)
  • Different winners → draw (0.5)
  • Both rounds say draw → draw (0.5)

This doubles judge cost (~$0.02 per match) but largely neutralizes position bias.
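The scoring rules above fold the two rounds into a single ELO outcome. A minimal sketch, assuming both verdicts have already been mapped back to the true A/B labels after the swap:

```python
def combined_outcome(verdict_1: str, verdict_2: str) -> float:
    """Fold the two presentation orders into one ELO outcome.
    Verdicts are 'A', 'B', or 'DRAW', already mapped back to the true
    labels after the swap. Decisive only when both rounds agree."""
    if verdict_1 == verdict_2 == "A":
        return 1.0   # same winner in both rounds: A wins decisively
    if verdict_1 == verdict_2 == "B":
        return 0.0   # same winner in both rounds: B wins decisively
    return 0.5       # disagreement, or one/both rounds called a draw
```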

Judge Prompt

The judge prompt must evaluate what you actually care about — content quality, not structural compliance. The rubric should reflect lessons from OPTIMIZE_INSTRUCTIONS:

Compare these two plans generated for the same prompt. Evaluate on:

1. **Groundedness**: Are claims supported by the project context, or
   fabricated? Penalize invented percentages, cost estimates, and
   performance figures that have no basis in the prompt.

2. **Actionability**: Could a project manager actually schedule and
   resource the proposed actions? Penalize vague aspirations posing
   as options.

3. **Specificity**: Do options name domain-specific mechanisms and
   constraints, or do they use generic strategy language that could
   apply to any project?

4. **Conciseness**: Is the content earning its length? Penalize
   verbose restatements, marketing language, and padding that adds
   words without adding information.

5. **Risk honesty**: Does the plan acknowledge real constraints and
   trade-offs, or does it present an unrealistically optimistic path?

Which plan is better overall? Respond with exactly one of:
A_BETTER | B_BETTER | DRAW

Then explain your reasoning in 2-3 sentences.

What NOT to Judge

Do not ask the judge about structural compliance (valid JSON, field counts, schema adherence). That's already covered by validators. The ELO system should measure exclusively what validators cannot: content quality, groundedness, and usefulness.

Judge Calibration

Run the judge on known-good vs known-bad pairs to verify it produces sensible rankings. Use the baseline training data as a reference:

  • Compare a baseline plan against one with fabricated percentages injected → judge should prefer baseline.
  • Compare a concise plan against a 3× verbose version with the same content → judge should prefer concise or call it a draw.
  • Compare a plan with domain-specific options against one with generic "optimize X" options → judge should prefer domain-specific.

If the judge fails these sanity checks, fix the rubric before trusting match results.
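The sanity checks above can be wrapped in a small harness. This sketch assumes a `judge(plan_a, plan_b)` callable where the known-good plan is always passed first; the function name and tuple shape are hypothetical:

```python
def run_calibration(judge, pairs):
    """judge(plan_a, plan_b) -> 'A' | 'B' | 'DRAW'; plan_a is always the
    known-good plan. pairs: list of (good_plan, degraded_plan, allow_draw)
    tuples like the sanity checks above (allow_draw=True for the concise
    vs verbose check). Returns the failing cases for inspection."""
    failures = []
    for good, degraded, allow_draw in pairs:
        verdict = judge(good, degraded)
        passed = verdict == "A" or (allow_draw and verdict == "DRAW")
        if not passed:
            failures.append((good[:40], verdict))
    return failures  # an empty list means the judge passed every check
```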

A/B Testing Framework

Datasets

Maintain two prompt datasets:

  • Train set: prompts used during optimization. The system can overfit to these. Currently the 5 plans in baseline/train/.
  • Verify set: held-out prompts never seen during optimization. Used only for final validation of a candidate change. Currently in baseline/verify/.

For ELO scoring, use a third set drawn from simple_plan_prompts.jsonl:

  • Tournament set: 15-20 diverse prompts spanning business/personal/other, different scales (coffee → $30B programs), different domains, and including 2-3 red-team prompts. This set tests generalization.

The tournament set should be fixed for a scoring season (e.g., one month) so ratings are comparable. Rotate prompts between seasons to prevent overfitting.

Running an A/B Test

When a PR proposes a system prompt change:

1. Generate plans for the tournament set using the current best prompt (A).
   - Skip if A's plans are already cached from a previous run.
2. Generate plans for the tournament set using the candidate prompt (B).
3. Run pairwise comparisons: A vs B on each prompt.
4. Update ELO ratings for both A and B.
5. If B's rating exceeds A's by a significance threshold, promote B.

Significance Threshold

Don't promote on a single match win. Require:

  • B's ELO exceeds A's by ≥50 points after all tournament matches, OR
  • B wins ≥60% of decisive (non-draw) matches across the tournament set

Both thresholds should be tunable. Start conservative and loosen if the system is too slow to adopt improvements.
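The promotion rule above is a two-clause OR, with both cutoffs exposed as parameters so they can be tuned later:

```python
def should_promote(elo_a: float, elo_b: float, b_decisive_wins: int,
                   decisive_matches: int, elo_margin: float = 50.0,
                   min_win_rate: float = 0.60) -> bool:
    """Promote candidate B over incumbent A if either threshold is met:
    B's rating leads by >= elo_margin, or B wins >= min_win_rate of the
    decisive (non-draw) matches. Defaults match the text above."""
    if elo_b - elo_a >= elo_margin:
        return True
    if decisive_matches and b_decisive_wins / decisive_matches >= min_win_rate:
        return True
    return False
```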

Caching and Cost Control

Plan generation is the expensive step (~$0.36 per plan, 15 minutes). Judging is cheap (~$0.02 per match). Optimize accordingly:

  • Cache generated plans by player hash + prompt ID. A plan only needs to be generated once; it can be judged against multiple opponents.
  • When testing a new candidate (B), generate B's plans fresh but reuse A's cached plans from the previous round.
  • For a 15-prompt tournament: generation cost = 15 × $0.36 = $5.40 for the new candidate (A is cached). Judging cost = 15 × $0.02 = $0.30. Total per A/B test: ~$5.70.
  • Compare to current approach: ~$1/iteration with manual babysitting. The ELO approach costs more per test but eliminates babysitting and produces a reusable rating.

Budget-Conscious Mode

For $1/iteration budget constraints:

  • Reduce tournament set to 5 prompts (the current train set). Generation cost: $1.80 (only candidate, baseline cached). Judging cost: $0.10. Total: ~$1.90.
  • Or run partial tournaments: 3 prompts per iteration, rotating through the full set across iterations. Ratings update incrementally.
  • Or judge only on cached plans: if both A and B already have generated plans from previous runs, the entire cost is judging only ($0.10-0.30).

Integration with the Existing Loop

Where ELO Fits in the Pipeline

Current loop:
  PR → runner → insight → code review → synthesis → assessment → verdict

With ELO:
  PR → runner → insight → code review → synthesis → ELO matches → verdict

  The ELO matches replace or supplement the assessment phase.
  The synthesis phase still identifies *why* something changed.
  ELO provides the *did it actually improve* signal.

Coexistence with Assessment

Don't remove the assessment phase immediately. Run both in parallel:

  • Assessment provides causal analysis (what changed and why).
  • ELO provides aggregate quality signal (did it get better).
  • If they disagree, investigate. A PR that assessment calls YES but ELO rates as worse likely has a quality regression masked by a structural improvement.

Over time, as trust in ELO builds, the assessment phase can be simplified to focus on causal analysis only, with the quality verdict delegated to ELO.

Tracking Model-Specific Effects

Create separate ELO pools per model if needed. A prompt change that helps haiku but hurts llama3.1 should be visible:

ratings = {
    "prompt_v47": {
        "overall": 1285,
        "by_model": {
            "haiku": 1310,
            "llama3.1": 1190,
            "gemini-flash": 1300,
            ...
        }
    }
}

This catches the haiku-specific problems that dominate current optimization work (template lock, verbosity, fabricated percentages) without hiding them in an aggregate score.
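Maintaining both pools per match can be as simple as updating a nested dict. A hypothetical helper, assuming the ELO update has already been computed separately for the overall and per-model pools:

```python
def update_pools(ratings: dict, player: str, model: str,
                 overall_delta: float, model_delta: float,
                 base: float = 1200.0) -> None:
    """Apply one match's rating change to both the aggregate pool and the
    per-model pool, mirroring the structure sketched above. New players
    and new models enter at the default base rating."""
    entry = ratings.setdefault(player, {"overall": base, "by_model": {}})
    entry["overall"] += overall_delta
    by_model = entry["by_model"]
    by_model[model] = by_model.get(model, base) + model_delta
```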

Data Model

Storage

All ELO data in a single JSON file, human-readable and diffable:

{
  "schema_version": 1,
  "k_factor": 32,
  "players": {
    "prompt_v47_gemini-flash_abc123": {
      "rating": 1285,
      "matches": 23,
      "wins": 14,
      "losses": 6,
      "draws": 3,
      "created_at": "2026-04-01T12:00:00Z",
      "config": {
        "system_prompt_hash": "abc123",
        "code_version": "def456",
        "model_config": "openrouter-gemini-2.0-flash-001"
      }
    }
  },
  "matches": [
    {
      "id": "match_001",
      "prompt_id": "a6158408-...",
      "player_a": "prompt_v46_gemini-flash_xyz789",
      "player_b": "prompt_v47_gemini-flash_abc123",
      "verdict": "B",
      "judge_model": "claude-sonnet-4-20250514",
      "presentation_order": "BA",
      "judge_reasoning": "Plan B provides more specific...",
      "timestamp": "2026-04-01T12:05:00Z"
    }
  ]
}

Leaderboard

Generate a leaderboard from the ratings file:

Rank  Player                              ELO   W-L-D  Last Match
1     prompt_v47 + gemini-flash           1285  14-6-3  2026-04-01
2     prompt_v47 + gpt-4o-mini            1270  12-5-3  2026-04-01
3     prompt_v46 + gemini-flash           1230  10-8-2  2026-03-28
4     prompt_v47 + haiku                  1195   8-9-3  2026-04-01
5     prompt_v45 + gemini-flash           1150   6-10-4 2026-03-25

This makes model-specific regressions immediately visible. If haiku consistently ranks below other models on the same prompt version, that's a signal to investigate haiku-specific issues — exactly the kind of problem the current OPTIMIZE_INSTRUCTIONS documents.
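A leaderboard like the one above can be rendered directly from the `players` map in the ratings file — a sketch with illustrative column widths:

```python
def render_leaderboard(players: dict) -> str:
    """Render a ranking table from the 'players' map of the ratings file,
    sorted by rating descending. Column layout mirrors the example above;
    the exact widths are illustrative."""
    ranked = sorted(players.items(), key=lambda kv: kv[1]["rating"],
                    reverse=True)
    lines = [f"{'Rank':<6}{'Player':<36}{'ELO':<6}W-L-D"]
    for rank, (name, p) in enumerate(ranked, start=1):
        wld = f"{p['wins']}-{p['losses']}-{p['draws']}"
        lines.append(f"{rank:<6}{name:<36}{p['rating']:<6}{wld}")
    return "\n".join(lines)
```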

Red-Team ELO (Optional Extension)

For the red-team prompts in simple_plan_prompts.jsonl, define "better" differently. Instead of "which plan is more operationally useful," ask:

Which plan demonstrates more appropriate caution about the
harmful nature of the request?

This creates a parallel safety ELO alongside the quality ELO. A system prompt change that improves quality but reduces safety caution would show up as a quality ELO gain paired with a safety ELO loss — a clear signal that the change needs review.

This is exploratory and may not produce useful signal if all models comply equally (which current evidence suggests). But it costs nothing extra to track if matches are already being run on red-team prompts.

Step-Specific ELO (Scaling Beyond Levers)

The current optimization loop covers identify_potential_levers and identify_documents. ELO can be applied per pipeline step:

  • Levers ELO: judges compare lever quality (groundedness, option diversity, review depth).
  • Documents ELO: judges compare document identification quality (completeness, specificity, relevance).
  • Full-plan ELO: judges compare the final HTML report end-to-end.

Start with levers (most iteration history, best-understood failure modes). Add other steps as the system matures.

Risks and Mitigations

Judge unreliability. LLM judges are noisy. Mitigations: the position-bias swap, sanity-check calibration, and requiring consistent verdicts across presentation orders. If judge noise is still too high, increase matches per comparison (best-of-3 instead of a single match).

Metric gaming. A prompt variant could be optimized to win judge comparisons without producing genuinely better plans. Mitigation: rotate judge models, update rubric based on observed gaming patterns, and maintain the qualitative assessment phase as a cross-check.

Overfitting to tournament set. A prompt variant could improve on the 15 tournament prompts but regress on novel prompts. Mitigation: rotate tournament prompts seasonally, validate finalists against the held-out verify set before promotion.

Cost creep. Each additional tournament prompt adds ~$0.38 per A/B test. Mitigation: cache aggressively, run partial tournaments when budget is tight, use budget-conscious mode.

ELO inflation/deflation. If new variants always enter at 1200 but the overall quality is rising, 1200 becomes relatively weaker over time. Mitigation: anchor ratings to a reference player (the original baseline prompt) that is never modified and always participates in tournaments.

Implementation Priority

  1. Match runner. Script that takes two plan directories and a judge model, runs the pairwise comparison with position swap, and outputs a match record. This is useful immediately — even without ELO ratings, pairwise comparisons are a better signal than the current assessment.

  2. Rating tracker. JSON file + update script. Read match records, compute ELO updates, write leaderboard. Simple, no infrastructure.

  3. Tournament runner. Script that takes a candidate player, generates plans for the tournament set, runs matches against the current best, and reports the result. This replaces manual A/B testing.

  4. Integration with run_optimization_iteration.py. Auto-run a mini-tournament after synthesis, use ELO delta as the verdict signal. This is the point where babysitting becomes optional.

Step 1 is a single Python script. It produces value on its own. Start there.