Proposal 128: The Compiler Model — Plan Quality Metrics, Dogfood Execution, and the Feedback Loop PlanExe Is Missing
Author: Claude (Opus 4.6), prompted by Simon (neoneye)
Status: Draft
Date: March 29, 2026
Context: Comparative analysis of two PlanExe reports (reverse aging lab, cryosleep program) and discussion about the execution gap, cost constraints, and what PlanExe actually is
What PlanExe actually is
PlanExe is a prompt-to-plan compiler. A human writes a paragraph. PlanExe compiles it into a structured execution plan with ~285 tasks, a dependency graph with ~245 links, risk analysis, governance structures, and role assignments. The HTML report is the .map file — useful for debugging, not the artifact itself.
The intended consumer is not a human skimming slides. It is an AI agent that needs unambiguous, structured text to reason over. The 50,000 words of output are not a flaw — they are the instruction set. The fixed 21-section template is not rigidity — it is a contract, an API shape that downstream agents can parse reliably.
This reframing matters because it changes what "quality" means. A plan is not good because it reads well. A plan is good because an agent can execute it without ambiguity, without hallucinating missing context, and without deadlocking on decisions nobody owns.
Part 1: What the report comparison revealed
The data
Two plans were compared side-by-side:
| Metric | Reverse aging (~100-word prompt) | Cryosleep (~900-word prompt) |
|---|---|---|
| File size | 790 KB | 877 KB |
| Word count | 42,263 | 51,428 |
| Unique vocabulary | 5,815 | 6,567 |
| Gantt tasks / links | 285 / 245 | 293 / 250 |
| Documents listed | 14 | 23 |
| Pipeline files | 188 | 212 |
| Generation time | ~12 min | unknown |
A 9x increase in prompt investment yielded a 1.2x increase in output volume. The structural scaffolding — section count, Gantt task count, expert count (8 each), Q&A pairs (15 each) — was nearly identical.
Where prompt investment paid off
Three sections showed meaningful divergence:
- Data Collection: 434 → 1,735 words (4x). The cryosleep prompt's explicit mention of regulatory pathways, dual research tracks, and fallback-success criteria gave the LLM concrete data needs to enumerate.
- Documents to Create and Find: 6,739 → 11,426 words (1.7x). More prompt specificity → more concrete deliverables. The cryosleep plan knew it needed an NMPA regulatory filing, not a generic "regulatory document."
- Initial Prompt Vetted: 1,957 → 3,059 words (1.6x). A longer prompt gives the vetting pipeline more surface area to challenge.
Where it didn't matter
Executive Summary, Pitch, Scenarios, SWOT, Q&A, Self Audit, and the Execute Plan section (~32K words) were all within ~5% of each other. The pipeline fills these with a fixed volume regardless of input depth.
The insight
Prompt investment doesn't scale linearly with output volume. It scales with output specificity. The extra 800 words bought sharper domain grounding in the sections that depend on understanding the project's actual structure (documents, data collection, expert selection), while template-driven sections stayed flat.
This is healthy. The variability is in the right places. But it also means the pipeline currently has no way to measure this difference, nor to tell the user "your prompt was too vague for sections X and Y."
Part 2: PlanExe needs a quality signal
The problem
Right now, every plan looks equally confident. A 100-word prompt and a 900-word prompt both produce a plan with 21 sections, ~285 Gantt tasks, and a professional-looking HTML report. There is no internal metric that says "this plan is well-grounded" versus "this plan is plausible-sounding but vague."
Without a quality signal, you cannot:
- Tell the user their prompt needs more detail
- Compare plans generated by different models
- Detect when the pipeline is producing generic filler versus domain-specific content
- Validate that execution agents have enough specificity to act on
Proposed: plan quality score
A lightweight post-pipeline pass that scores each section on specificity rather than length. Metrics could include:
Grounding density: ratio of domain-specific terms to generic project-management vocabulary. "Conduct stakeholder analysis" scores low. "File NMPA regulatory pre-submission for cryoprotectant formulation" scores high. This can be computed cheaply — a simple TF-IDF variant where the "corpus" is a baseline of generic PMO language and the "document" is the plan section. Terms that deviate from the generic baseline are grounding terms.
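A minimal sketch of the grounding-density idea, using a hard-coded stoplist of generic PMO vocabulary in place of a real baseline corpus (the `GENERIC_PMO` set and the scoring function are illustrative assumptions, not PlanExe code):

```python
import re

# Illustrative baseline of generic PMO vocabulary; a real baseline would be
# derived from a corpus of template-driven plan sections, per the TF-IDF
# variant described above.
GENERIC_PMO = {
    "conduct", "stakeholder", "analysis", "ensure", "alignment", "deliverable",
    "milestone", "review", "process", "plan", "project", "risk", "manage",
}

def grounding_density(section_text: str) -> float:
    """Fraction of content words that fall outside the generic PMO baseline.

    Terms that deviate from the generic baseline count as grounding terms.
    """
    words = [w.lower() for w in re.findall(r"[a-zA-Z]{3,}", section_text)]
    if not words:
        return 0.0
    grounding = [w for w in words if w not in GENERIC_PMO]
    return len(grounding) / len(words)

# "Conduct stakeholder analysis" scores low; the NMPA sentence scores high.
low = grounding_density("Conduct stakeholder analysis and ensure alignment")
high = grounding_density("File NMPA regulatory pre-submission for cryoprotectant formulation")
```

Even this crude version separates the two example sentences; the real metric would replace the stoplist with per-term deviation scores against the generic baseline.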
Numeric concreteness: count of specific quantities (dates, dollar amounts, percentages, durations) versus hedging language ("approximately," "as needed," "TBD"). The aging plan's Assumptions section had 107 numbers but many were generic thresholds (e.g., "15% contingency"). A score should distinguish between plan-specific numbers derived from the prompt and boilerplate numbers the LLM always generates.
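A hedged sketch of the numeric-concreteness ratio, assuming a simple number regex and a small hedging-phrase list (both illustrative); distinguishing prompt-derived numbers from boilerplate would need a second pass comparing against a baseline plan:

```python
import re

# Illustrative hedging phrases; not an exhaustive list.
HEDGES = ("approximately", "as needed", "tbd", "roughly", "about")

def numeric_concreteness(text: str) -> float:
    """Ratio of specific quantities to (quantities + hedging phrases).

    Counts dates, dollar amounts, percentages, and durations via a crude
    number regex; a real scorer would classify each match.
    """
    numbers = re.findall(r"\$?\d[\d,\.]*%?", text)
    lowered = text.lower()
    hedges = sum(lowered.count(h) for h in HEDGES)
    total = len(numbers) + hedges
    return len(numbers) / total if total else 0.0

score = numeric_concreteness(
    "Budget is $2.4M with 15% contingency, delivery approximately Q3, scope TBD"
)
```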
Cross-reference density: how often do sections reference each other? A well-integrated plan has the Expert Criticism section citing specific assumptions from the Assumptions section, and the Premortem referencing specific Gantt tasks. A generic plan has sections that could be shuffled without anyone noticing.
Prompt echo rate: what percentage of the user's prompt concepts appear in the plan versus being ignored? If the user mentioned "dual research tracks" and the plan only discusses a single track, that's a coverage gap. Measurable via semantic similarity between prompt segments and plan sections.
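A token-overlap stand-in for the prompt-echo check, sketched under the assumption that prompt concepts have already been segmented into phrases; an embedding-based semantic similarity pass would catch paraphrases this substring match misses:

```python
def prompt_echo_rate(prompt_concepts: list, plan_text: str) -> float:
    """Fraction of prompt concepts that surface anywhere in the plan text.

    Literal substring matching is a crude proxy for the semantic-similarity
    comparison described above.
    """
    plan_lower = plan_text.lower()
    covered = [c for c in prompt_concepts if c.lower() in plan_lower]
    return len(covered) / len(prompt_concepts) if prompt_concepts else 0.0

# "dual research tracks" was ignored by this hypothetical plan: coverage gap.
concepts = ["dual research tracks", "fallback success", "NMPA filing"]
rate = prompt_echo_rate(concepts, "The plan covers NMPA filing and a single research track.")
```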
The score doesn't need to be a single number. A per-section heatmap ("Data Collection: high specificity, SWOT: generic, Expert Criticism: high specificity") would be more actionable. It could be rendered as an additional collapsible section in the HTML report, or as metadata in the JSON output for agent consumption.
Part 3: The dogfood execution strategy
The cost constraint
Validating plans for real-world projects (construction, biotech, logistics) is prohibitively expensive. You cannot run a $500M reverse aging lab to see if the Gantt chart was accurate.
The solution: PlanExe plans its own development
PlanExe should generate plans for its own next features, then have agents execute those plans. This creates a closed validation loop at near-zero marginal cost (only API tokens):
If the plan said "build SpawnAgentTask in 5 days with 3 agents" and it took 12 days and the output was broken, you have learned something concrete about plan quality without spending money on real-world resources.
What this produces
Every self-referential run generates a plan-vs-reality diff: the delta between what the plan predicted and what actually happened. This diff is training data for:
- Drift measurement (proposals 82/83): how far did execution diverge from plan?
- Fermi sanity check (proposal 88): were the time/effort estimates in the right ballpark?
- Evidence calibration (proposal 123): which assumptions held and which were wrong?
- Quality scoring (this proposal): which plan sections correlated with execution success?
Over time, a corpus of dogfood runs builds an empirical answer to "what makes a good plan." You don't need external validation — you need the feedback loop.
Phasing the dogfood
Phase 0 (now, free): Pick a small PlanExe feature. Generate a plan for it using PlanExe itself. Have a human (Simon) execute the plan manually. Record the diff: what matched, what was wrong, what was missing.
Phase 1 (cheap): Same as above, but the OpenClaw agents (Larry, Egon, Bubba) execute the plan. Record agent decisions, blockers, and deviations. Compare to the plan's predictions.
Phase 2 (the real test): Close the loop automatically. PlanExe generates a plan, spawns agents, agents execute, the system records the diff, and the diff feeds back into plan quality scoring. Now you have a self-improving compiler.
Part 4: The template ceiling and how to raise it
The problem
The Execute Plan section is ~32K words in both analyzed reports — nearly identical regardless of prompt investment. This is the bulk of the output and it's where agents will spend most of their time. Yet it appears to run on autopilot, filled by the template regardless of input depth.
If Execute Plan is the agent's instruction set, its quality ceiling is set by the template, not the prompt. This is the single biggest bottleneck for execution quality.
Proposed: two-tier Execute Plan
The current Execute Plan appears to be a flat expansion of the WBS. Consider splitting it into two layers:
Skeleton layer (template-driven, stable): the task sequence, dependencies, role assignments, and acceptance criteria. This is the scheduler's input. It should be consistent and predictable — the template is an asset here.
Context layer (prompt-driven, variable): per-task briefing notes that incorporate domain-specific knowledge from the prompt. For the cryosleep plan, task "T3.2: Develop cryoprotectant delivery protocol" should carry context about Track A vs Track B, the specific temperature targets (10–15°C core body temperature), and the fallback-success framing from the prompt. This context is what differentiates a plan an agent can execute from a plan an agent can only follow mechanically.
The skeleton layer is cheap to generate and should be highly deterministic. The context layer is where LLM effort should concentrate, and where the quality score would focus its grounding-density metric.
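The two-layer split can be made concrete as a task record; the field names below are illustrative, not PlanExe's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExecuteTask:
    """One Execute Plan entry split into the two layers described above."""
    # Skeleton layer: template-driven, stable, the scheduler's input.
    task_id: str
    title: str
    depends_on: list = field(default_factory=list)
    assigned_role: str = ""
    acceptance_criteria: list = field(default_factory=list)
    # Context layer: prompt-driven, variable, per-task briefing notes.
    briefing: str = ""

# Hypothetical cryosleep task carrying prompt-derived context.
t32 = ExecuteTask(
    task_id="T3.2",
    title="Develop cryoprotectant delivery protocol",
    depends_on=["T3.1"],
    assigned_role="Cryobiology Lead",
    acceptance_criteria=["Protocol reviewed by regulatory affairs"],
    briefing=(
        "Track A vs Track B framing applies; target core body temperature "
        "10-15C; evaluate against fallback-success criteria from the prompt."
    ),
)
```

An agent executing `t32` reads the skeleton fields for scheduling and the `briefing` field for domain grounding; a quality scorer would run its grounding-density metric on `briefing` alone.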
Part 5: The lever system as execution infrastructure
Proposals 120 and 121 both mention the lever system but undervalue it. The lever system — mapping which decisions belong to which roles — is arguably more important for execution than the task graph itself.
Why levers matter more than tasks
Tasks are sequential. An agent picks them up, does the work, moves on. That part is easy — it's a queue.
Decisions are where execution breaks down. Two agents with overlapping authority deadlock. An agent without authority on a decision either blocks waiting for escalation or makes an unauthorized call that cascades. A decision that nobody owns falls through the cracks.
The lever system already solves this by mapping every decision to an owner. For agent execution, this becomes a routing table:
If the Environmental Engineer agent hits a regulatory ambiguity, the lever map tells it: "this is a Regulatory Affairs decision, escalate to the Regulatory Affairs agent, timeout after 4 hours, default action is to flag for human review."
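The routing-table reading of the lever map can be sketched as a lookup; the key names and the `LEVER_MAP` entry are hypothetical, mirroring the escalation example above rather than the actual lever schema:

```python
# Hypothetical lever-routing entry; keys are illustrative.
LEVER_MAP = {
    "regulatory_ambiguity": {
        "owner": "Regulatory Affairs",
        "escalate_to": "Regulatory Affairs agent",
        "timeout_hours": 4,
        "default_action": "flag_for_human_review",
    },
}

def route_decision(decision_type: str) -> dict:
    """Return the routing instruction for a decision an agent cannot own."""
    lever = LEVER_MAP.get(decision_type)
    if lever is None:
        # A decision nobody owns must not fall through the cracks.
        return {"action": "flag_for_human_review", "reason": "unowned decision"}
    return {
        "action": "escalate",
        "to": lever["escalate_to"],
        "timeout_hours": lever["timeout_hours"],
        "fallback": lever["default_action"],
    }

routing = route_decision("regulatory_ambiguity")
```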
Proposed: lever-first execution
Instead of dispatching agents by task queue (the approach in proposals 120/121), dispatch by decision domain. Each agent owns a set of levers (decision types), not just a set of tasks. Tasks flow to whoever owns the relevant decision authority.
This is closer to how real organizations work. A construction foreman doesn't just execute a task list — they own decisions about crew scheduling, material sequencing, and safety compliance. Tasks that touch those domains route through them regardless of the WBS assignment.
For PlanExe, this means SpawnAgentTask should emit agent manifests organized around lever ownership, not just task assignment. An agent's SOUL.md should say "you own decisions about regulatory compliance, environmental impact, and permit applications" rather than "you own tasks T1.2, T1.3, and T2.1."
Part 6: The time-sharing agent model
The cost problem with N parallel agents
Proposals 120/121 envision one agent per team member. For a plan with 8 team members, that is 8 concurrent LLM sessions consuming tokens continuously. For a hobby project, this adds up fast — especially if agents are idle waiting on dependencies.
Proposed: single-process role rotation
Instead of N parallel agent processes, run a single executor process that cycles through roles. The executor:
- Reads the dependency graph
- Identifies the next unblocked task
- Loads the appropriate SOUL.md for that task's assigned role
- Executes the task in that role's persona
- Writes results back to shared state
- Moves to the next unblocked task
This is cheaper (one LLM session, not eight), eliminates coordination overhead (no inter-agent messaging needed — the executor IS the coordinator), and avoids race conditions on shared state (serial execution by design).
The tradeoff is speed — parallel agents can work simultaneously on independent tasks. But for a hobby project running on API credits, the cost savings outweigh the speed penalty. And the serial model is simpler to debug, since you can replay the execution log linearly.
A Postgres-backed state machine fits this naturally: a tasks table with status, assigned_role, depends_on, and output columns. The executor polls for status = 'unblocked', picks the highest-priority task, and runs it. This is 200 lines of Python against your existing Flask + Postgres stack.
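A runnable sketch of that state machine, using sqlite3 as a stand-in for Postgres so the loop works anywhere (the schema, the simplified single-dependency unblocking, and the placeholder "execution" step are all illustrative assumptions):

```python
import sqlite3

# Tasks table matching the columns described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id TEXT PRIMARY KEY,
        status TEXT NOT NULL,          -- 'blocked' | 'unblocked' | 'done'
        assigned_role TEXT NOT NULL,
        depends_on TEXT,               -- simplified: a single task id
        priority INTEGER NOT NULL,
        output TEXT
    )
""")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, NULL)",
    [
        ("T1", "unblocked", "Regulatory Affairs", None, 2),
        ("T2", "blocked", "Environmental Engineer", "T1", 1),
    ],
)

def run_next_task(conn):
    """One executor cycle: pick the highest-priority unblocked task,
    execute it in its role's persona, then unblock dependents."""
    row = conn.execute(
        "SELECT id, assigned_role FROM tasks "
        "WHERE status = 'unblocked' ORDER BY priority DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    task_id, role = row
    # A real executor would load SOUL.md for `role` and call the LLM here.
    conn.execute(
        "UPDATE tasks SET status = 'done', output = ? WHERE id = ?",
        (f"executed as {role}", task_id),
    )
    conn.execute(
        "UPDATE tasks SET status = 'unblocked' "
        "WHERE status = 'blocked' AND depends_on = ?",
        (task_id,),
    )
    return task_id

order = []
while (done := run_next_task(conn)) is not None:
    order.append(done)
```

Serial by construction: the `while` loop replays deterministically, which is what makes the execution log linearly debuggable.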
Scaling later
The serial executor is Phase 1. When cost isn't a constraint, upgrading to parallel agents is a configuration change — N workers polling the same task table instead of one. The task table schema doesn't change. The SOUL.md files don't change. The lever routing doesn't change. You've just increased the worker count.
Part 7: Broader observations
The proposal sprawl
PlanExe has 127+ proposals. Many are ambitious, some are speculative, and the numbering suggests they were generated faster than they can be implemented. This is not inherently bad — a project with too many ideas is healthier than one with too few. But it creates a prioritization problem, especially for a solo developer.
The proposals themselves could benefit from PlanExe's own template discipline. Many are vision documents without acceptance criteria. "Build X" is a direction, not a task. "Build X, verify it by running plan Y through it and checking that property Z holds" is a task with a testable outcome.
Suggestion: run PlanExe on "implement proposal 120" and see what it produces. If the generated plan is actionable enough for the lobster swarm to execute, the pipeline works. If the plan is too vague, you've found a quality gap.
The naming is an asset
"PlanExe" communicates the vision in five characters: plan + execute. The fact that execution isn't built yet isn't a naming failure — it's a roadmap. The name sets the expectation correctly. Ship the executor and the name goes from aspirational to descriptive.
The 21-section template is a standard in disguise
If PlanExe's section structure stabilizes, it becomes a de facto interchange format for AI-generated project plans. Other tools could consume PlanExe output if the section names, JSON schema, and Gantt data format were documented as a spec. This is how standards happen — not by committee, but by one tool doing it consistently enough that others adapt.
Worth considering: publishing the plan schema (section names, expected fields per section, Gantt JSON format, lever schema) as a standalone specification, separate from PlanExe itself. Even if nobody adopts it immediately, it clarifies what the pipeline promises and what execution agents can rely on.
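A hedged sketch of what a fragment of that spec might look like, with a trivial validator; the section names, field lists, and version string are hypothetical placeholders, not the pipeline's actual schema:

```python
# Hypothetical plan-spec fragment; names are illustrative.
PLAN_SCHEMA_FRAGMENT = {
    "spec_version": "0.1-draft",
    "sections": {
        "executive_summary": {"required": True, "fields": ["text"]},
        "gantt": {
            "required": True,
            "fields": ["tasks", "links"],
            "task_fields": ["id", "title", "duration_days", "assigned_role"],
        },
        "levers": {
            "required": False,
            "fields": ["decision_type", "owner", "escalation", "timeout_hours"],
        },
    },
}

def validate_plan(plan: dict, schema: dict) -> list:
    """Return missing required sections; a full validator would also
    check per-section fields against the spec."""
    return [
        name for name, spec in schema["sections"].items()
        if spec["required"] and name not in plan
    ]

missing = validate_plan({"executive_summary": {"text": "..."}}, PLAN_SCHEMA_FRAGMENT)
```

This is the sense in which the spec clarifies the contract: execution agents can assert `validate_plan(plan, schema) == []` before attempting to execute.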
Summary of proposals
| ID | Name | Effort | Depends on |
|---|---|---|---|
| 128a | Plan quality score (per-section specificity heatmap) | Medium | Nothing — can run as a post-pipeline pass on existing plans |
| 128b | Dogfood execution Phase 0 (manual plan-vs-reality diff) | Low | A small PlanExe feature to plan for |
| 128c | Two-tier Execute Plan (skeleton + context layers) | Medium | 128a (to measure improvement) |
| 128d | Lever-first agent dispatch | Medium | Lever pipeline (proposal 119) |
| 128e | Serial executor (single-process role rotation) | Low-Medium | 128d for lever routing, but a basic version works with task-queue dispatch |
| 128f | Plan schema specification | Low | Nothing — documenting what already exists |
Recommended order: 128b → 128a → 128f → 128e → 128c → 128d
Start with the dogfood loop because it produces the feedback signal everything else depends on. Measure before you optimize.
Appendix: Raw comparison data
Section-by-section word counts from the analyzed reports:
| Section | Aging | Cryo | Delta | Ratio |
|---|---|---|---|---|
| Executive Summary | 340 | 347 | +7 | 1.0x |
| Pitch | 648 | 618 | -30 | 1.0x |
| Project Plan | 589 | 674 | +85 | 1.1x |
| Strategic Decisions | 5,989 | 5,111 | -878 | 0.9x |
| Scenarios | 984 | 988 | +4 | 1.0x |
| Assumptions | 3,610 | 4,467 | +857 | 1.2x |
| Governance | 3,828 | 4,483 | +655 | 1.2x |
| Related Resources | 1,253 | 1,054 | -199 | 0.8x |
| Data Collection | 434 | 1,735 | +1,301 | 4.0x |
| Documents to Create/Find | 6,739 | 11,426 | +4,687 | 1.7x |
| SWOT Analysis | 1,094 | 986 | -108 | 0.9x |
| Team | 2,255 | 2,324 | +69 | 1.0x |
| Expert Criticism | 2,052 | 2,843 | +791 | 1.4x |
| Work Breakdown Structure | 1,498 | 1,598 | +100 | 1.1x |
| Review Plan | 2,877 | 3,864 | +987 | 1.3x |
| Questions & Answers | 990 | 1,005 | +15 | 1.0x |
| Premortem | 3,287 | 3,014 | -273 | 0.9x |
| Self Audit | 1,796 | 1,791 | -5 | 1.0x |
| Initial Prompt Vetted | 1,957 | 3,059 | +1,102 | 1.6x |
| Execute Plan | 32,256 | 32,696 | +440 | 1.0x |
Gantt chart comparison: