System prompt optimizer
My overall plan is to optimize all system prompts across PlanExe, starting with the earliest system prompt: when a candidate consistently performs better than the current prompt, it's a keeper. Then move on to the next system prompt in the Luigi pipeline. By the end of this prompt optimization, the overall plan quality should have improved.
I want to track metrics for how much improvement has happened.
Status (2026-03-17)
Infrastructure
- Data repo: PlanExe-prompt-lab — holds baseline data, run history, and analysis artifacts.
- Baseline data: 5 train plans and 9 verify plans, extracted from zips into `baseline/train/` and `baseline/verify/`.
- Runner (`self_improve/runner.py`): re-executes a single pipeline step against baseline plans using the system prompt constant from PlanExe's code. Auto-incrementing history in `history/{bucket}/{counter}_{step}/`. Supports `--model`, `--prompt-lab-dir`, resume, and parallel workers.
- Analysis pipeline (5 phases in `analysis/`):
  - Phase 0: `prepare_iteration.py` — verifies PR state, pre-creates history dirs, and writes analysis `meta.json` with PR info. Replaces the former `create_analysis_dir.py` + `update_meta_pr.py`.
  - Phase 1: `run_insight.py` — Claude Code + Codex in parallel produce independent `insight_*.md` files with quantitative metrics and PR impact verdicts.
  - Phase 2: `run_code_review.py` — both agents review PlanExe source code, producing `code_*.md` with file:line references.
  - Phase 3: `run_synthesis.py` — single agent reconciles all findings into `synthesis.md` with top 5 ranked directions.
  - Phase 4: `run_assessment.py` — before/after comparison, metric table, keeper verdict (YES/NO/CONDITIONAL), and evaluation of the synthesis recommendation for the next iteration.
- `run_analysis.py` — orchestrates phases 1–4 sequentially; stops hard on first phase failure.
- `run_optimization_iteration.py` — orchestrates a full loop: implement recommendation → create PR → run experiments → run analysis pipeline. Supports `--skip-implement`, `--skip-runner`, `--skip-analysis`, `--models`.
- Conventions: `analysis/AGENTS.md`.
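The auto-incrementing run counter can be derived by scanning `history/` rather than keeping a counter file. A minimal sketch, assuming the `history/{counter // 100}/{counter % 100:02d}_{step}` layout described above (the function name is illustrative):

```python
import re
from pathlib import Path

def next_history_dir(history_root: Path, step_name: str) -> Path:
    """Scan history/ for the highest existing run number and return the
    directory for the next run: history/{n // 100}/{n % 100:02d}_{step}/."""
    highest = -1
    for bucket in history_root.glob("[0-9]*"):
        if not bucket.is_dir():
            continue
        for run_dir in bucket.iterdir():
            m = re.match(r"(\d{2})_", run_dir.name)
            if m:
                highest = max(highest, int(bucket.name) * 100 + int(m.group(1)))
    counter = highest + 1  # an empty history/ yields run 0
    return history_root / str(counter // 100) / f"{counter % 100:02d}_{step_name}"
```

Because the counter is recovered from the directory names themselves, runs for different steps can interleave chronologically without any coordination file.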
Models
7 models tested per iteration (5 plans each = 35 runs):
| Model | Alias | Status |
|---|---|---|
| ollama-llama3.1 | llama | Working (3-5/5) |
| openrouter-openai-gpt-oss-20b | gpt-oss | Working (4-5/5) |
| openai-gpt-5-nano | gpt5-nano | Working (5/5) |
| openrouter-qwen3-30b-a3b | qwen | Working (5/5) |
| openrouter-openai-gpt-4o-mini | gpt4o-mini | Working (5/5) |
| openrouter-gemini-2.0-flash-001 | gemini-flash | Working (5/5) — added iter 13 |
| anthropic-claude-haiku-4-5-pinned | haiku | Working (4-5/5) |
Removed: GLM (PR #266, schema-echoing), StepFun (removed from config), nemotron (0/5 all iterations, structural incompatibility).
Iteration History
Currently optimizing: identify_potential_levers (the first step after plan intake).
24 analysis rounds (0–23).
| Iter | PR | Change | Verdict | Runs | Analysis |
|---|---|---|---|---|---|
| 0 | — | Baseline | — | 00–08 | analysis/0 |
| 1 | #268 | Fix doubled user prompt | YES | 09–16 | analysis/1 |
| 2 | #270 | Fix assistant turn serialization | CONDITIONAL | 17–23 | analysis/2 |
| 3 | #272 | Novelty-aware follow-up prompts | YES | 24–31 | analysis/3 |
| 4 | #273 | Remove exemplar strings + make wrapper fields optional | CONDITIONAL | 32–38 | analysis/4 |
| 5 | #274 | Align Pydantic field descriptions with system prompt | YES | 39–45 | analysis/5 |
| 6 | #275 | Fix consequences length + trade-off requirement + review format | YES | 46–52 | analysis/6 |
| 7 | #276 | Enforce schema contract: levers min/max 5, summary required | CONDITIONAL | 53–59 | analysis/7 |
| 8 | #278 | Fresh context per call + relax lever count to 5–7 | CONDITIONAL | 60–66 | analysis/8 |
| 9 | #279 | Remove naming template | YES | 67–73 | analysis/9 |
| 10 | #281 | Keyword quality gate (reverted in #282) | NO | 74–80 | analysis/10 |
| 11 | #283 | RetryConfig in runner (reverted in #284) | NO | 81–87 | analysis/11 |
| 12 | #286 | Remove max_length=7 hard constraint on levers | YES | 88–94 | analysis/12 |
| 13 | #289 | Add options count and review format validators | CONDITIONAL | 95–101 | analysis/13 |
| 14 | #292 | Recover partial results when a call fails | YES | 102–108 | analysis/14 |
| 15 | #294 | Consolidate review_lever prompt to prevent format alternation | YES | 09–16 | analysis/15 |
| 16 | #295 | Continue loop after call failure instead of breaking | YES | 17–23 | analysis/16 |
| 17 | #296 | Auto-correct review_lever before hard-rejecting | CONDITIONAL | 24–30 | analysis/17 |
| 18 | #297 | Simplify lever prompt to restore content quality | YES | 31–37 | analysis/18 |
| 19 | #299 | Remove bracket placeholders and fragile English-only validator | CONDITIONAL | 38–45 | analysis/19 |
| 20 | #309 | Add option-quality reminder to call-2/3 prompts | YES | 46–52 | analysis/20 |
| 21 | #313 | Add anti-fabrication reminder to call-2/3 prompts | CONDITIONAL | 53–59 | analysis/21 |
| 22 | #316 | Replace two-bullet review_lever prompt with single flowing example | CONDITIONAL | 60–66 | analysis/22 |
| 23 | #326 | Add second review_lever example to break template lock | KEEP | 67–73 | analysis/23 |
Key Improvements So Far
Comparing iteration 0 (baseline) to iteration 14 (current state):
- Review format violations: 67 → ~4 (enforced by validator since iter 13)
- Consequence chain violations: 35 → 7
- Bracket placeholder leakage: ~17 → 0
- Template leakage (naming suffix): 83–100% → 2–10%
- Cross-call duplication: eliminated via novelty-aware follow-ups
- Chat-history contamination: eliminated via fresh context per call
- Validation failures discarding valid levers: eliminated — `max_length=7` removed (iter 12), partial result recovery added (iter 14)
- Option count violations: caught by `check_option_count` validator (iter 13)
- Overall success rate: 88.6% → 91.4% (iter 14, partial recovery)
- Gemini-flash baseline comparison: name uniqueness 71%→100%, cross-call duplication 15→0, consequence richness +40%, option prefix leakage 16→0 (comparison)
Key Lessons Learned
- No hardcoded English keywords in validators. PlanExe users create plans in many languages. Keyword-based quality gates (checking for "Controls", "Weakness:") break for non-English plans. All validation must be language-agnostic — structural checks, field length ratios, duplicate detection. Iteration 10 was a disaster because of this.
- Don't merge PRs before the verdict. The correct order is: create PR → run experiments on the branch → run analysis → read verdict → merge only if verdict confirms improvement. Iteration 11 was merged prematurely.
- The runner always uses the code constant. There is no external prompt file or CLI override — the prompt is whatever is committed in PlanExe's code at run time.
- Auto-implementing synthesis recommendations can conflict with user intent. The `run_optimization_iteration.py` script auto-applies the top recommendation, but this can conflict with explicit user preferences (e.g., reverting to "exactly 5 levers" when the user wanted 5–7). Use `--skip-implement` when needed.
- Dual-agent analysis catches real errors. Codex corrected Claude's success rate miscount in iteration 0. Independent analysis is worth the extra cost.
- Prefer soft prompt guidance over hard Pydantic constraints. `max_length=7` on the levers field discarded entire LLM responses when a model returned 8 levers. The downstream dedup step already handles extras. Hard caps waste tokens on retries; soft guidance in the prompt is cheaper and more fault-tolerant.
- Recover partial results rather than failing completely. When a 3-call loop has call 2 or 3 fail, keeping levers from prior successful calls is better than discarding everything. A single validator rejection should not wipe out 10+ valid levers.
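The last two lessons combine into one simple pattern: keep whatever earlier calls produced, and continue the loop on failure instead of breaking. A hedged sketch, with `run_call` standing in for the real LLM call plus validation (names are illustrative, not PlanExe's actual API):

```python
def collect_levers(call_prompts, run_call):
    """Run a multi-call loop, keeping results from successful calls even
    when a later call fails. run_call(prompt) returns a list of levers or
    raises on validation failure."""
    levers, errors = [], []
    for i, prompt in enumerate(call_prompts, start=1):
        try:
            levers.extend(run_call(prompt))
        except Exception as exc:
            # A single rejected call must not wipe out prior valid levers.
            errors.append((i, exc))
            continue  # keep looping (iter 16) instead of breaking
    return levers, errors
```

The caller can then decide whether the partial lever list is usable (e.g., meets the minimum count) rather than treating any single failure as fatal.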
Current State of identify_potential_levers.py
After 23 iterations, the step has these characteristics:
- System prompt says "5 to 7 levers per response"; schema has `min_length=5` (no `max_length` — downstream dedup handles extras)
- Follow-up calls use novelty-aware prompts (exclude already-generated lever names)
- Each LLM call gets a fresh context (no chat history accumulation)
- Naming guidance says "avoid formulaic patterns or repeated prefixes" (no template)
- Pydantic validators: `check_option_count` (exactly 3 options), `check_review_format` (structural only — min length + no bracket placeholders, language-agnostic)
- Partial result recovery: if call 2 or 3 fails, keep levers from prior successful calls instead of discarding everything
- Quality gate: duplicate name filter (language-agnostic)
- `review_lever` prompt uses two structurally distinct examples to prevent template lock
- `OPTIMIZE_INSTRUCTIONS` documents 6 known problems (including template lock)
- System prompt: `IDENTIFY_POTENTIAL_LEVERS_SYSTEM_PROMPT` constant in code
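The two validators can be approximated with plain, dependency-free functions. The real code uses Pydantic validators, so treat this only as a sketch of the structural, language-agnostic checks (the minimum length is an assumed value):

```python
def check_option_count(options: list) -> list:
    """Structural check: exactly 3 options per lever, regardless of language."""
    if len(options) != 3:
        raise ValueError(f"expected exactly 3 options, got {len(options)}")
    return options

def check_review_format(review: str, min_length: int = 40) -> str:
    """Structural check only: minimum length and no bracket placeholders.
    No keyword matching, so it works for plans written in any language."""
    if len(review) < min_length:
        raise ValueError("review too short")
    if "[" in review or "]" in review:
        raise ValueError("bracket placeholder leaked into review")
    return review
```

The key design constraint from iteration 10 is visible here: nothing inspects the text's words, only its shape.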
Not Started
- Numeric evaluator — no deterministic scoring script. Metrics are computed by LLM agents, not reproducible code.
- Candidate generator — prompt variants are proposed by synthesis agents and implemented manually, not auto-generated.
- Train/verify scoring loop — no automated optimizer that tests candidates against `baseline/verify/`.
- Other pipeline steps — only `identify_potential_levers` has been optimized so far.
Next Steps
- Reduce remaining llama3.1 template lock — after iteration 23, llama3.1 still shows ~71% template lock on "This lever governs the tension". Consider adding an explicit anti-pattern prohibition.
- Address gpt-5-nano second-example lock — gpt-5-nano swapped one template lock for another (now copies the second example's format). May need a third structurally distinct example or different approach.
- Build a deterministic evaluator — compute metrics (uniqueness, option count, template leakage, format compliance) without LLM calls. Makes assessment reproducible and fast.
- Move to the next pipeline step once `identify_potential_levers` is stable.
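A deterministic evaluator along the lines of the third bullet could start as a small pure-Python metric function. This is a sketch under an assumed lever schema (`name` and `options` keys); the real field names may differ:

```python
from collections import Counter

def score_levers(levers: list) -> dict:
    """Compute deterministic quality metrics over a list of lever dicts,
    without any LLM calls: duplicate names, option-count violations, and
    a crude template-leakage proxy (repetition of the leading name word)."""
    names = [lv["name"] for lv in levers]
    dup_names = sum(c - 1 for c in Counter(names).values())
    option_violations = sum(1 for lv in levers if len(lv.get("options", [])) != 3)
    first_words = Counter(n.split()[0] for n in names if n.split())
    leading_rep = (max(first_words.values()) / len(names)) if names else 0.0
    return {
        "lever_count": len(names),
        "duplicate_names": dup_names,
        "option_count_violations": option_violations,
        "leading_word_repetition": round(leading_rep, 2),
    }
```

Because the output is reproducible, two runs over the same artifacts always yield the same numbers, unlike agent-counted metrics.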
Stage 1 - one improvement iteration
Here is how one iteration works: optimize one system prompt for a single step in the Luigi pipeline in `run_plan_pipeline.py`. Take the "dataset train" and unzip the files to a work dir. Alter the system prompt of a named task in the pipeline. For MODEL_LIST, loop through the models (these are LLMs without reasoning):
- It's very expensive to run the full pipeline, so only rerun the Luigi pipeline for that single step and see if the new system prompt does a better job than the reference plan.
- Capture the output.
Have a reasoning model compare the captured outputs. If a candidate does better than the original prompt, it's a candidate for keeping. If it does worse than the original prompt, write it to a failed-attempt log. Find prompts that consistently yield better outputs than the original prompt. Store the feedback.
At this point there should be a ranked list of candidate system prompts that beat the original. Go through the "dataset verify" and compute the score of the best candidate prompts. Pick the best one and commit it to git.
For all plans, both train and verify, commit the newly generated files for each plan. Commit improvement stats.
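The loop above can be sketched end to end. Here `run_step` and `compare` are stand-ins for the real runner and the reasoning-model judge, and the majority-win threshold is an assumption, not a decided policy:

```python
def run_stage1(candidate_prompts, train_plans, models, run_step, compare):
    """One improvement iteration (sketch): rerun a single step per model and
    train plan, judge each output against the baseline, and return candidates
    that beat the baseline on a majority of runs, best first."""
    win_rate = {}
    for prompt in candidate_prompts:
        wins = total = 0
        for model in models:
            for plan in train_plans:
                new_output = run_step(prompt, plan, model)
                if compare(new_output, plan["baseline_output"]) == "better":
                    wins += 1
                total += 1
        win_rate[prompt] = wins / total
    # Ranked list of candidates that "consistently" do better (here: > 50%).
    return sorted((p for p, r in win_rate.items() if r > 0.5),
                  key=lambda p: win_rate[p], reverse=True)
```

The ranked list would then be re-scored against the verify set before anything is committed.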
Stage 2 - multiple improvement iteration
Run "one improvement iteration" multiple times, until the score stops improving significantly.
Stage 3 - next item in the luigi pipeline
As more upstream tasks get optimized, move on to further downstream tasks.
From time to time, regenerate a full plan and have a reasoning model compare how much better or worse the new full plan is across various KPIs.
Dataset
A config JSON file defines the "dataset train" and "dataset verify" splits.
The dataset lives in the PlanExe-prompt-lab repository.
The zip files are what PlanExe currently outputs, so they serve as the baseline to compare against.
Dataset train
- 20260310_hong_kong_game.zip
- 20250329_gta_game.zip
- 20250321_silo.zip
- 20260311_parasomnia_research_unit.zip
- 20260308_sovereign_identity.zip
Dataset verify
- 20260303_crate_recovery_campaign.zip
- 20260215_nuuk_clay_workshop.zip
- 20260202_heatwave_resilience.zip
- 20260201_media_rescue.zip
- 20260201_yellowstone_evacuation.zip
- 20260131_clear_english.zip
- 20260129_euro_adoption.zip
- 20260114_cbc_validation.zip
- 20251101_e_bus_security.zip
Feedback
Strengths
- Incremental approach is smart — optimizing one pipeline step at a time and proving it before moving on avoids compounding errors.
- Train/verify split — using separate datasets to prevent overfitting to the evaluation set is the right call.
- Failed attempt log — tracking what didn't work is often overlooked but very valuable.
Questions & Gaps
- What are the KPIs? The document mentions "score," "better," "worse," and "various KPIs" but never defines them. This is the most critical gap. Without concrete metrics, the reasoning model's comparison will be subjective and inconsistent. What makes one output "better"? Completeness? Specificity? Actionability? Structure?
- How is the reasoning model prompted for comparison? "Have a reasoning model do a comparison" is underspecified. Will it use a rubric? Pairwise ranking? A numeric score? The evaluation prompt itself will heavily influence which candidate "wins." This prompt may need its own optimization.
- What does "consistently" mean? "Find prompts that consistently yield better outputs" — across all train plans? A majority? With what confidence threshold? A candidate that's great for 3 plans but terrible for 1 might still be worse overall.
- MODEL_LIST is undefined. Which models are being tested? Are you running each candidate prompt across multiple LLMs to find prompts that are model-robust, or optimizing per-model?
- How are candidate prompts generated? The doc describes the evaluation loop but not how new prompt variants are created. Manual authoring? LLM-generated rewrites? A mix?
- "Rerun just that step" — dependency handling. If you rerun step N with a new prompt, its output feeds into step N+1 in the real pipeline. Are you only evaluating step N's output in isolation, or propagating through downstream steps? Evaluating in isolation is faster but may miss regressions that only show up downstream.
- Budget/cost controls. You mention "it's very expensive to run the full pipeline." What about the cost of the optimization loop itself? With multiple models × multiple candidate prompts × multiple train plans × multiple iterations, this can get expensive fast. Consider setting a budget cap or max iterations.
- The env var for the dataset path — good instinct, but also specify a fallback or error message if it's unset.
Suggestions
- Define a scoring rubric (even a simple 1-5 scale across 3-4 dimensions) before starting.
- Version-control the evaluation prompts alongside the system prompts.
- Consider running a baseline score for the current prompts first so you have a numeric starting point, not just pairwise comparisons.
- Add a "no regression" gate: the new prompt must not score worse on any plan by more than X%, even if average improves.
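The "no regression" gate in the last suggestion is easy to make precise. A sketch, assuming per-plan numeric scores and a configurable maximum allowed drop:

```python
def passes_regression_gate(baseline: dict, candidate: dict,
                           max_drop_pct: float = 5.0) -> bool:
    """Accept a candidate only if no single plan's score drops by more than
    max_drop_pct percent versus baseline, even if the average improves.
    Both arguments map plan name -> numeric score."""
    for plan, base_score in baseline.items():
        drop_pct = (base_score - candidate[plan]) / base_score * 100
        if drop_pct > max_drop_pct:
            return False
    return True
```

This guards against a candidate that wins on average by sacrificing one plan badly, which a pure mean comparison would miss.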
Repo Structure
Two repos: the optimization engine lives in PlanExe, the data/artifacts live in a separate repo.
Repo 1: PlanExe (existing — add optimization tooling)
worker_plan/
worker_plan_internal/
assume/
identify_purpose.py # contains SYSTEM_PROMPT (current)
...
...
self_improve/ # NEW — the optimization engine
__init__.py
config.py # loads dataset config, env vars
runner.py # reruns a single pipeline step with a candidate prompt
evaluator.py # calls reasoning model to score/compare outputs
scorer.py # rubric definition, numeric scoring
candidate_generator.py # generates prompt variants
optimizer.py # orchestrates the train/verify loop (Stage 1-2)
cli.py # CLI entry point
rubrics/ # evaluation rubrics per pipeline step
identify_purpose.yaml
make_assumptions.yaml
...
tests/
...
This keeps the optimizer close to the code it's modifying — it needs to import pipeline tasks, swap prompts, and run individual steps.
Repo 2: PlanExe-prompt-lab (data repo)
README.md
dataset.json # train/verify split definition
populate_baseline.py # script to populate baseline from zip files
baseline/ # current outputs (extracted from dataset zips)
train/
20260310_hong_kong_game/
001-1-start_time.json
001-2-plan.txt
...
030-report.html
20250329_gta_game/
...
20250321_silo/
...
20260311_parasomnia_research_unit/
...
20260308_sovereign_identity/
...
verify/
20260303_crate_recovery_campaign/
...
20260215_nuuk_clay_workshop/
...
(7 more plans)
history/ # captured output, global run counter
# Path: history/{counter // 100}/{counter % 100:02d}_{step_name}/
# Counter is auto-incremented: scan history/ for the highest existing
# run number and add 1. No counter file needed.
# Runs for different steps are interleaved chronologically.
0/ # runs 0-99
1/ # runs 100-199
2/ # runs 200-299
00_identify_purpose/ # run 200
01_identify_potential_levers/ # run 201
02_identify_potential_levers/ # run 202
meta.json # which step, which system prompt, what model used
events.jsonl
outputs.jsonl
outputs/
20250321_silo/
002-9-potential_levers_raw.json
002-10-potential_levers.json
activity_overview.json
usage_metrics.jsonl
20260310_hong_kong_game/
20260201_media_rescue/
03_identify_potential_levers/
...
98_identify_potential_levers/
99_identify_potential_levers/
3/
analysis/
AGENTS.md # conventions for analysis artifacts
prepare_iteration.py # phase 0: verify PR, pre-create history dirs
run_analysis.py # orchestrates phases 1-4 sequentially
run_insight.py # phase 1: Claude + Codex insight in parallel
run_code_review.py # phase 2: Claude + Codex code review in parallel
run_synthesis.py # phase 3: Claude synthesis (single agent)
run_assessment.py # phase 4: before/after comparison + verdict
0_identify_potential_levers/ # pre-fix analysis (runs 00-08)
meta.json
insight_claude.md
insight_codex.md
code_claude.md
code_codex.md
synthesis.md
1_identify_potential_levers/ # post-fix analysis (runs 09-16, PR #268)
meta.json # includes pr_url, pr_title, pr_description
insight_claude.md
insight_codex.md
code_claude.md
code_codex.md
synthesis.md
assessment.md # verdict: YES/NO/CONDITIONAL
run_optimization_iteration.py # orchestrates full iteration
scores/ # longitudinal tracking
scoreboard.csv # step, date, baseline_score, best_score, delta
history.json # full history for charting
full_plan_comparisons/ # Stage 3 periodic full-plan regenerations
2026-03-20/
hong_kong_game/
030-report.html
kpi_comparison.json
Connecting the two repos
dataset.json defines the train/verify split. The populate_baseline.py script reads this file and extracts the zips into baseline/.
{
"comment": "Replace the zip files with your own. Run the populate_baseline.py script to populate the baseline/ directory.",
"train": [
"20260310_hong_kong_game.zip",
"20250329_gta_game.zip",
"20250321_silo.zip",
"20260311_parasomnia_research_unit.zip",
"20260308_sovereign_identity.zip"
],
"verify": [
"20260303_crate_recovery_campaign.zip",
"20260215_nuuk_clay_workshop.zip",
"20260202_heatwave_resilience.zip",
"20260201_media_rescue.zip",
"20260201_yellowstone_evacuation.zip",
"20260131_clear_english.zip",
"20260129_euro_adoption.zip",
"20260114_cbc_validation.zip",
"20251101_e_bus_security.zip"
]
}
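The extraction step that populate_baseline.py is described as performing could look roughly like this (a sketch; the actual script may differ):

```python
import json
import zipfile
from pathlib import Path

def populate_baseline(config_path: Path, zips_dir: Path, baseline_dir: Path) -> None:
    """Read dataset.json and extract each listed zip into
    baseline/train/<plan>/ or baseline/verify/<plan>/."""
    config = json.loads(config_path.read_text())
    for split in ("train", "verify"):
        for zip_name in config[split]:
            plan_name = zip_name.removesuffix(".zip")
            dest = baseline_dir / split / plan_name
            dest.mkdir(parents=True, exist_ok=True)
            with zipfile.ZipFile(zips_dir / zip_name) as zf:
                zf.extractall(dest)
```

Keeping the split definition in dataset.json means swapping in new plans is a config change, not a code change.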
Why this split
- `PlanExe` stays clean — no multi-GB plan artifacts in the code repo.
- `PlanExe-prompt-lab` can use Git LFS for the zip files and large outputs.
- The data repo is a full audit trail — every candidate, evaluation, and score is committed.
- `scoreboard.csv` gives the metrics tracking at a glance.
Feedback from Bubba (2026-03-13)
Additional Questions & Gaps
Evaluation prompt is itself a prompt — it needs the same rigor. The reasoning model used for comparison will be guided by an evaluation prompt. That prompt encodes the scoring rubric. If the evaluation prompt is poorly specified, the optimizer will find prompts that score well on the evaluator but don't actually produce better plans. The evaluation prompt should be versioned, reviewed, and held constant across a run (or explicitly varied as a separate experiment).
Step isolation vs. cascade quality. Optimizing step N in isolation assumes that step N's output quality is a meaningful proxy for end-to-end plan quality. This may not hold: a step N output that scores 5/5 in isolation might feed step N+1 in a way that degrades the downstream result. Recommendation: after each isolated step optimization, run at least one full-pipeline spot check on the best candidate to confirm no downstream regressions before committing.
FAST_BUT_SKIP_DETAILS vs. ALL_DETAILS_BUT_SLOW — which mode for the optimizer?
The optimizer reruns individual pipeline steps. Those steps behave differently depending on SPEED_VS_DETAIL. Which mode should the optimizer use? Using FAST_BUT_SKIP_DETAILS is cheaper but may not reflect production output quality. This should be explicit in meta.json and the scoring rubric should account for it.
Candidate prompt generation is the missing piece. The doc describes the evaluation loop thoroughly but says nothing about how candidate prompts are created. This is the creative/generative step that drives the whole system. Options to consider:
- LLM-generated rewrites (give the current prompt + failure examples to a reasoning model, ask for 3-5 variants)
- Manual authoring with structured variation (change one constraint at a time)
- Mutation from the failed-attempts log (learn what patterns correlate with failure)
The choice here will determine how fast the optimizer converges.
Implementation Suggestion: Prompt diff in failed_attempts.log
The failed_attempts.log currently captures that a candidate failed. It should also capture:
- The diff between the candidate prompt and the base prompt
- Which specific train plans it failed on (not just aggregate)
- The evaluator's reasoning (not just the score)
This turns the failure log into a learning dataset that can feed the next round of candidate generation.
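One possible shape for such a log entry, written as JSONL so each failure is one machine-readable line (field names are illustrative):

```python
import difflib
import json

def failed_attempt_entry(base_prompt: str, candidate_prompt: str,
                         failed_plans: list, evaluator_reasoning: str) -> str:
    """Build one JSONL line for failed_attempts.log capturing the prompt diff,
    the specific plans that failed, and the evaluator's reasoning."""
    diff = list(difflib.unified_diff(base_prompt.splitlines(),
                                    candidate_prompt.splitlines(),
                                    lineterm=""))
    return json.dumps({
        "prompt_diff": diff,
        "failed_plans": failed_plans,
        "evaluator_reasoning": evaluator_reasoning,
    })
```

Storing the diff rather than the full candidate makes patterns across failures (e.g., "adding hard caps correlates with failure") easy to mine later.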
Implementation Suggestion: Freeze the evaluation model
The reasoning model used for comparison should be pinned to a specific model version (not a rolling alias) for the duration of an optimization run. Model updates can shift evaluation scores without any change to the system prompts being tested. meta.json should record the exact evaluator model used.
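A sketch of recording the pinned evaluator in meta.json (the field names and the example model string are illustrative, not PlanExe's actual schema):

```python
import json
from pathlib import Path

def write_meta(meta_path: Path, step: str, model: str,
               evaluator_model: str, speed_vs_detail: str) -> None:
    """Record run metadata including the exact evaluator model, so scores
    stay comparable for the duration of an optimization run."""
    meta = {
        "step": step,
        "model": model,
        # A pinned version string, never a rolling alias like "latest".
        "evaluator_model": evaluator_model,
        "speed_vs_detail": speed_vs_detail,
    }
    meta_path.write_text(json.dumps(meta, indent=2))
```

With the evaluator pinned in meta.json, a score shift between runs can be attributed to the prompt under test rather than a silent evaluator update.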
Feedback from Claude Code (2026-03-14)
After building and running the first complete optimization iteration (B1 fix, PR #268), here are observations on the loop's strengths and weaknesses.
What works well
The audit trail is the real win. Every change has a PR, every PR has an assessment with a comparison table and verdict. You can trace any decision back to evidence. Most optimization systems are black boxes; this one is readable.
Dual-agent analysis reduces blind spots. Claude and Codex independently analyze the same data, then synthesis resolves disagreements by checking source code. This caught real issues — Codex corrected Claude's success rate miscount in iteration 0 (Claude counted raw outputs.jsonl rows including pre-retry errors; Codex counted final artifacts on disk).
It found a real bug on the first iteration. B1 was a genuine code defect, not a prompt tweak. The loop surfaced it, fixed it, verified the fix, and moved on to deeper issues. That's the loop working as intended.
Concerns
It's slow and expensive. Each iteration runs 7 models × 5 plans, then 4 phases of LLM analysis (2 insight + 2 code review + 1 synthesis + 1 assessment = 6 LLM sessions). That's a lot of API calls for one recommendation. The low-hanging fruit (B1) justified it, but as fixes get more marginal the cost-per-insight ratio will worsen.
There's no numeric scoring yet. The assessment says "review violations 67→4" but those numbers come from agents counting by hand, not a deterministic evaluator. Two runs of the same analysis could produce different counts. Until evaluator.py exists with a pinned rubric, the loop is making keep/reject decisions on qualitative agent impressions, not reproducible scores.
The synthesis bottleneck. Everything funnels through one synthesis agent that picks THE recommendation. If it picks wrong (e.g., a prompt tweak when a code fix would help more), you burn a full iteration discovering that. The ranking is only as good as that single agent's judgment.
It optimizes in series, not parallel. One recommendation per iteration. You could test multiple fixes simultaneously on separate branches and compare, but the current flow is strictly sequential.
Suggested priorities
- Build the numeric evaluator — a deterministic script that computes the same metrics the insight agents compute (uniqueness, option count violations, template leakage, format compliance) without LLM calls. This makes assessment reproducible and fast.
- Use the evaluator to gate the assessment — run it before the expensive 4-phase analysis. If the numbers clearly improve, fast-track. If they clearly regress, skip analysis entirely.
- Consider parallelizing fixes — the synthesis already produces a top-5 list. You could implement 2-3 in parallel on separate branches and let the evaluator pick the winner.
The loop is a solid foundation. The main risk is that it stays in "expensive qualitative analysis" mode when it should graduate to "cheap quantitative scoring with qualitative analysis reserved for ambiguous cases."