Prompt Optimization with A/B Testing
Author: Simon Strandgaard
Date: 2026-02-18
Status: Proposal
Tags: prompting, ab-testing, evaluation, quality, automation
Pitch
Build a repeatable A/B testing system for system-prompt optimization so PlanExe can improve prompts with evidence, not manual taste.
The target is small, controlled prompt edits, measured against a baseline across many plan prompts and multiple LLMs, with automatic promotion only when improvements are consistent.
Problem
Prompt tuning is currently manual and expensive:
- A human creates prompt variants (often via ChatGPT, Gemini, Grok), runs attempts, and inspects outputs.
- `worker_plan_internal/diagnostics/redline_gate.py` already shows this pattern: multiple candidate system prompts, pairwise comparisons, and manual selection.
- Humans cannot reliably detect which prompt variant consistently wins across many tasks and model providers.
- A variant that looks better on one run can regress quality on other domains.
Result: prompt changes are hard to trust, hard to reproduce, and slow to iterate.
Feasibility
This is feasible with existing building blocks:
- Prompt corpora already exist (`simple_plan_prompts.jsonl` and MCP prompt examples).
- The PlanExe pipeline already emits many intermediate artifacts that can be scored.
- Redline-gate code already demonstrates multi-prompt evaluation mechanics.
- Existing diagnostics/self-audit modules can provide objective checks.
Constraints:
- Full-plan runs are costly; experiments must support stratified sampling (see the sampler sketch after this list).
- Model variance is high; evaluation must run across multiple providers/models.
- Promotion must be conservative to avoid silent quality regressions.
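As one way to meet the stratified-sampling constraint, here is a minimal sketch, assuming test prompts are tagged with hypothetical `domain` and `difficulty` fields:

```python
import random
from collections import defaultdict

def stratified_sample(prompts, per_stratum=5, seed=42):
    """Group test prompts by (domain, difficulty) and draw a fixed number
    from each stratum, so a cheap experiment slice still covers every domain.

    `prompts` is a list of dicts with hypothetical keys
    "domain", "difficulty", and "text".
    """
    rng = random.Random(seed)  # fixed seed keeps A/B inputs identical across runs
    strata = defaultdict(list)
    for p in prompts:
        strata[(p["domain"], p["difficulty"])].append(p)
    sample = []
    for key in sorted(strata):  # deterministic stratum order
        bucket = strata[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```

A fixed seed matters here: both arms of a paired A/B must see exactly the same inputs.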
Proposal
Add a Prompt Experiment Harness for controlled A/B tests.
Core rules:
- Change one system prompt at a time (single-variable experiments).
- Keep baseline prompt fixed and versioned in git.
- Run paired A/B on identical input sets.
- Evaluate with objective checks first, then judge-model scoring.
- Promote only if win criteria are met across domains and models.
Experiment Workflow
- Select target prompt in pipeline (earliest stage linked to observed weakness).
- Define baseline prompt version and one candidate variant.
- Build test set:
- curated prompt suite (easy/medium/hard, multiple domains)
- stress tests and edge cases
- Run A/B matrix:
- same inputs
- same config
- multiple LLMs/reasoning models
- Score outcomes with fixed rubric.
- Compute win/loss/draw and confidence.
- Promote candidate only when thresholds pass.
- Store artifacts and decision log in repo.
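A minimal runner for this workflow might look like the sketch below. It assumes two hypothetical callables, `generate_plan(model, system_prompt, user_prompt)` and `score_plan(plan)`, standing in for the real PlanExe pipeline and scorer, and streams one record per matrix cell to `runs.jsonl`:

```python
import json

def run_ab_matrix(baseline_prompt, candidate_prompt, test_inputs, models,
                  generate_plan, score_plan, out_path="runs.jsonl"):
    """Run the paired A/B matrix: same inputs, same config, both prompt
    variants, across every model. Writes one JSONL record per cell."""
    records = []
    with open(out_path, "w") as f:
        for model in models:
            for user_prompt in test_inputs:
                record = {"model": model, "input": user_prompt}
                for arm, system_prompt in (("baseline", baseline_prompt),
                                           ("candidate", candidate_prompt)):
                    plan = generate_plan(model, system_prompt, user_prompt)
                    record[f"{arm}_score"] = score_plan(plan)
                f.write(json.dumps(record) + "\n")
                records.append(record)
    return records
```

Keeping the loop order models-outer, inputs-inner makes it easy to cut a low-cost slice by truncating the model list without unbalancing the input set.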
Scoring Framework
Use a weighted composite score per run:
- realism/feasibility
- internal consistency (no contradictions between summary, schedule, risks)
- constraint adherence (budget, timeline, location)
- safety/compliance behavior
- actionability (clear next steps and ownership)
Recommended approach:
- Objective checks: schema validity, contradiction flags, missing-core-sections.
- Judge scoring: LLM-as-judge with fixed rubric and pairwise comparison.
- Tie-breakers: lower hallucination risk and better constraint fidelity.
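To make the composite concrete, a sketch of the weighting step follows. The weights are placeholder assumptions, not tuned values, and the objective checks act as a hard gate rather than another weighted term:

```python
# Placeholder weights over the rubric dimensions listed above; real values
# would be pinned per experiment in config.json.
WEIGHTS = {
    "realism": 0.25,
    "consistency": 0.25,
    "constraint_adherence": 0.20,
    "safety": 0.15,
    "actionability": 0.15,
}

def composite_score(dimension_scores, objective_checks_passed):
    """Combine per-dimension judge scores (0..1) into one number.
    A failed objective check (invalid schema, contradiction flag,
    missing core section) zeroes the run instead of averaging it away."""
    if not objective_checks_passed:
        return 0.0
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```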
Also include an Elo-based relative quality signal (see docs/proposals/07-elo-ranking.md) so we measure whether a candidate prompt shifts plan quality up or down against corpus peers, not only against one baseline pair.
Elo Integration
Use Elo as a secondary decision layer for prompt promotion.
How:
- For each A/B run, rank produced plans with the existing pairwise KPI-to-Elo method.
- Compute `elo_delta` for candidate-generated plans versus baseline-generated plans.
- Track short-term experiment win rate and medium-term Elo trend.
Promotion guard:
- Candidate must win the local A/B matrix.
- Candidate must also show non-negative (preferably positive) Elo movement on holdout slices.
- Any persistent negative Elo drift blocks promotion even if small-sample A/B looks positive.
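For reference, the standard Elo update behind `elo_delta` can be sketched as follows; the K-factor here is an assumption, and the actual KPI-to-Elo method is specified in docs/proposals/07-elo-ranking.md:

```python
def expected_score(rating_a, rating_b):
    """Probability that plan A beats plan B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, outcome_a, k=16):
    """Update both ratings after one pairwise comparison.
    `outcome_a` is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# elo_delta for a candidate arm: mean rating of candidate-generated plans
# minus mean rating of baseline-generated plans, after all comparisons.
```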
Redline-Gate-Inspired Pattern
Adopt the proven pattern from `worker_plan_internal/diagnostics/redline_gate.py`:
- Keep several system-prompt candidates.
- Run the same user prompts against each candidate.
- Compare outcomes in bulk, not one-by-one.
Generalize this pattern to the full planning pipeline:
- not only safety classification prompts
- also planning, assumptions, lever selection, and review prompts
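The bulk comparison across many candidates can be tallied as a win matrix, as in this sketch; `evaluate(candidate, prompt)` is a hypothetical stand-in for whichever pipeline stage is under test:

```python
from itertools import combinations

def win_matrix(candidates, user_prompts, evaluate):
    """Run every candidate system prompt against the same user prompts
    and tally pairwise wins in bulk, rather than eyeballing one-by-one.
    `evaluate(candidate, prompt)` returns a numeric score for one run."""
    scores = {c: [evaluate(c, p) for p in user_prompts] for c in candidates}
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        for sa, sb in zip(scores[a], scores[b]):
            if sa > sb:
                wins[a] += 1
            elif sb > sa:
                wins[b] += 1
    return wins
```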
Candidate Generation
Support assisted prompt drafting while keeping evaluation strict.
Allowed sources for new variants:
- human-written edits
- AI-assisted edits (ChatGPT, Gemini, Grok, etc.)
- templated rewrites from internal rules
Requirement:
- each candidate must include a short changelog (what changed and intended effect)
- no promotion without A/B evidence
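The changelog could be a small metadata record stored next to the variant; the field names below are illustrative, not an agreed schema:

```json
{
  "candidate_id": "planning_v13_tighten_constraints",
  "base_prompt_version": "planning_v12",
  "changed": "Added explicit instruction to restate budget and timeline before planning.",
  "intended_effect": "Reduce constraint-adherence failures on budget-capped prompts.",
  "source": "ai-assisted (ChatGPT draft, human-reviewed)"
}
```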
Integration Points
- Prompt definitions in `worker_plan_internal/*` modules.
- Diagnostics harness under `worker_plan_internal/diagnostics/`.
- Prompt datasets in `worker_plan/worker_plan_api/prompt/data/`.
- Report/log layer for experiment metadata and decisions.
- Elo ranking components and data flow described in `docs/proposals/07-elo-ranking.md`.
Suggested artifacts:
- `experiments/prompt_ab/<experiment_id>/config.json`
- `experiments/prompt_ab/<experiment_id>/runs.jsonl`
- `experiments/prompt_ab/<experiment_id>/scores.jsonl`
- `experiments/prompt_ab/<experiment_id>/decision.md`
- `experiments/prompt_ab/<experiment_id>/elo_summary.json`
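A hypothetical `config.json` for a single experiment might look like this (every field name and value is illustrative):

```json
{
  "experiment_id": "2026-02-20-planning-prompt-v13",
  "target_prompt": "planning",
  "baseline_version": "planning_v12",
  "candidate_version": "planning_v13_tighten_constraints",
  "models": ["provider-a/model-x", "provider-b/model-y"],
  "test_set": "simple_plan_prompts.jsonl",
  "per_stratum": 5,
  "min_win_rate": 0.55,
  "confidence": 0.95
}
```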
Promotion Policy
Promote a candidate prompt only when all of the following pass (a gate sketch closes this section):
- net positive win rate over baseline on full test matrix
- no critical regressions on safety/compliance checks
- consistent gains on at least two model families
- statistically credible margin (configured confidence threshold)
- non-negative Elo trend on holdout and benchmark slices
If not promoted:
- keep baseline
- log failure mode
- schedule next micro-iteration for the same weakness
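The policy can be encoded as one gate function. The sketch below interprets "statistically credible margin" as a two-sided binomial sign test over paired wins and losses, which is only one reasonable choice; the actual test and thresholds remain open configuration decisions:

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided binomial sign test over paired comparisons (draws
    excluded): probability of a split at least this lopsided under a
    fair coin."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def should_promote(wins, losses, safety_regressions, model_families_won,
                   elo_trend, alpha=0.05):
    """All gates from the promotion policy above must pass."""
    return (wins > losses
            and safety_regressions == 0
            and model_families_won >= 2
            and sign_test_p(wins, losses) < alpha
            and elo_trend >= 0.0)
```

Draws are excluded from the sign test, matching the win/loss/draw tally in the experiment workflow; a conservative `alpha` keeps promotion rare by default.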
Phased Implementation
Phase A: Minimal Harness
- A/B runner for one prompt target.
- Paired inputs and simple win/loss metrics.
- Manual review report output.
Phase B: Multi-Model Robustness
- Add cross-model matrix and stratified prompt sets.
- Add confidence intervals and regression guards.
Phase C: Promotion Automation
- Auto-create candidate branches/PR notes with experiment summary.
- Require explicit promotion gate before baseline replacement.
Phase D: Continuous Improvement Loop
- Weekly or per-release prompt experiment batch.
- Track long-term drift and rollback when needed.
Success Metrics
- Higher average plan quality score versus baseline across benchmark suite.
- Lower variance in quality across models/providers.
- Reduced manual time spent on prompt trial-and-error.
- Fewer regressions after prompt updates.
- Faster cycle time from weakness discovery to tested prompt improvement.
- Positive median `elo_delta` after prompt promotion windows.
Risks
- Overfitting to benchmark prompts instead of real traffic.
- Judge-model bias or instability.
- High experiment cost (tokens, runtime).
- Metric gaming (improving score without real output quality gains).
Mitigations:
- Keep hidden holdout sets and periodically refresh benchmarks.
- Use multiple judges and objective checks together.
- Start with low-cost slices before full-matrix runs.
- Include human spot-check audits on promoted variants.
Open Questions
- Which score dimensions should be hard gates vs soft preferences?
- How much statistical confidence is required for promotion?
- Should promotions be fully automatic or always human-approved?
- What Elo horizon should be used for rollback (for example 7-day vs 30-day drift)?