
Proposal 128: The Compiler Model — Plan Quality Metrics, Dogfood Execution, and the Feedback Loop PlanExe Is Missing

Author: Claude (Opus 4.6), prompted by Simon (neoneye)
Status: Draft
Date: March 29, 2026
Context: Comparative analysis of two PlanExe reports (reverse aging lab, cryosleep program) and discussion about the execution gap, cost constraints, and what PlanExe actually is


What PlanExe actually is

PlanExe is a prompt-to-plan compiler. A human writes a paragraph. PlanExe compiles it into a structured execution plan with ~285 tasks, a dependency graph with ~245 links, risk analysis, governance structures, and role assignments. The HTML report is the .map file — useful for debugging, not the artifact itself.

The intended consumer is not a human skimming slides. It is an AI agent that needs unambiguous, structured text to reason over. The 50,000 words of output are not a flaw — they are the instruction set. The fixed 21-section template is not rigidity — it is a contract, an API shape that downstream agents can parse reliably.

This reframing matters because it changes what "quality" means. A plan is not good because it reads well. A plan is good because an agent can execute it without ambiguity, without hallucinating missing context, and without deadlocking on decisions nobody owns.


Part 1: What the report comparison revealed

The data

Two plans were compared side-by-side:

Metric                Reverse aging (~100-word prompt)   Cryosleep (~900-word prompt)
File size             790 KB                             877 KB
Word count            42,263                             51,428
Unique vocabulary     5,815                              6,567
Gantt tasks / links   285 / 245                          293 / 250
Documents listed      14                                 23
Pipeline files        188                                212
Generation time       ~12 min                            unknown

A 9x increase in prompt investment yielded a 1.2x increase in output volume. The structural scaffolding — section count, Gantt task count, expert count (8 each), Q&A pairs (15 each) — was nearly identical.

Where prompt investment paid off

Three sections showed meaningful divergence:

  • Data Collection: 432 → 1,735 words (4x). The cryosleep prompt's explicit mention of regulatory pathways, dual research tracks, and fallback-success criteria gave the LLM concrete data needs to enumerate.
  • Documents to Create and Find: 6,739 → 11,426 words (1.7x). More prompt specificity → more concrete deliverables. The cryosleep plan knew it needed an NMPA regulatory filing, not a generic "regulatory document."
  • Initial Prompt Vetted: 1,957 → 3,059 words (1.6x). A longer prompt gives the vetting pipeline more surface area to challenge.

Where it didn't matter

Executive Summary, Pitch, Scenarios, SWOT, Q&A, Self Audit, and the Execute Plan section (~32K words) were all within ~5% of each other. The pipeline fills these with a fixed volume regardless of input depth.

The insight

Prompt investment doesn't scale linearly with output volume. It scales with output specificity. The extra 800 words bought sharper domain grounding in the sections that depend on understanding the project's actual structure (documents, data collection, expert selection), while template-driven sections stayed flat.

This is healthy. The variability is in the right places. But it also means the pipeline currently has no way to measure this difference, nor to tell the user "your prompt was too vague for sections X and Y."


Part 2: PlanExe needs a quality signal

The problem

Right now, every plan looks equally confident. A 100-word prompt and a 900-word prompt both produce a plan with 21 sections, ~285 Gantt tasks, and a professional-looking HTML report. There is no internal metric that says "this plan is well-grounded" versus "this plan is plausible-sounding but vague."

Without a quality signal, you cannot:

  • Tell the user their prompt needs more detail
  • Compare plans generated by different models
  • Detect when the pipeline is producing generic filler versus domain-specific content
  • Validate that execution agents have enough specificity to act on

Proposed: plan quality score

A lightweight post-pipeline pass that scores each section on specificity rather than length. Metrics could include:

Grounding density: ratio of domain-specific terms to generic project-management vocabulary. "Conduct stakeholder analysis" scores low. "File NMPA regulatory pre-submission for cryoprotectant formulation" scores high. This can be computed cheaply — a simple TF-IDF variant where the "corpus" is a baseline of generic PMO language and the "document" is the plan section. Terms that deviate from the generic baseline are grounding terms.
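A minimal sketch of the grounding-density idea, assuming the simplest possible baseline: a small hand-curated set of generic PMO vocabulary, with every token outside it counted as a grounding term. The vocabulary and function names here are hypothetical, and a real implementation would use a proper TF-IDF baseline corpus rather than a stopword-style set.

```python
# Sketch of a grounding-density metric (hypothetical names throughout).
# Assumption: "grounding" terms are words that do NOT appear in a small
# baseline vocabulary of generic project-management language.

import re

GENERIC_PMO_TERMS = {
    "conduct", "stakeholder", "analysis", "plan", "project", "risk",
    "management", "review", "ensure", "process", "develop", "implement",
    "the", "a", "for", "and", "of", "to",
}

def grounding_density(section_text: str) -> float:
    """Fraction of tokens that fall outside the generic PMO baseline."""
    tokens = re.findall(r"[a-z]+", section_text.lower())
    if not tokens:
        return 0.0
    grounded = [t for t in tokens if t not in GENERIC_PMO_TERMS]
    return len(grounded) / len(tokens)

# The two example phrases from the text, scored:
low = grounding_density("Conduct stakeholder analysis for the project plan.")
high = grounding_density("File NMPA regulatory pre-submission for cryoprotectant formulation.")
```

The point is the ordering (low < high), not the absolute numbers; the baseline set would need to be far larger in practice.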

Numeric concreteness: count of specific quantities (dates, dollar amounts, percentages, durations) versus hedging language ("approximately," "as needed," "TBD"). The aging plan's Assumptions section had 107 numbers but many were generic thresholds (e.g., "15% contingency"). A score should distinguish between plan-specific numbers derived from the prompt and boilerplate numbers the LLM always generates.
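A sketch of the numeric-concreteness heuristic, assuming a regex for quantities and a short hedge-phrase list (both hypothetical). This version does not yet attempt the harder distinction between prompt-derived numbers and boilerplate thresholds.

```python
# Sketch of a numeric-concreteness score (hypothetical heuristic).
# Counts concrete quantities vs. hedging phrases; separating prompt-derived
# numbers from boilerplate like "15% contingency" is left as a second pass.

import re

HEDGES = ["approximately", "as needed", "tbd", "to be determined", "roughly"]

# Matches dollar amounts, percentages, and numbers with duration units.
NUMBER_RE = re.compile(
    r"\$?\d[\d,.]*\s*(?:%|percent|days?|weeks?|months?|years?)?", re.I
)

def numeric_concreteness(text: str) -> float:
    numbers = len(NUMBER_RE.findall(text))
    hedges = sum(text.lower().count(h) for h in HEDGES)
    total = numbers + hedges
    return numbers / total if total else 0.0

vague = numeric_concreteness("Budget TBD; timeline approximately as needed.")
concrete = numeric_concreteness("Budget $2.5M over 18 months with 15% contingency.")
```

Here `vague` scores 0.0 (no quantities, three hedges) and `concrete` scores 1.0 (three quantities, no hedges).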

Cross-reference density: how often do sections reference each other? A well-integrated plan has the Expert Criticism section citing specific assumptions from the Assumptions section, and the Premortem referencing specific Gantt tasks. A generic plan has sections that could be shuffled without anyone noticing.

Prompt echo rate: what percentage of the user's prompt concepts appear in the plan versus being ignored? If the user mentioned "dual research tracks" and the plan only discusses a single track, that's a coverage gap. Measurable via semantic similarity between prompt segments and plan sections.
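A sketch of the echo check using plain token overlap as a cheap stand-in for the semantic similarity the metric really calls for; an embedding model would catch paraphrases that this misses. All names are hypothetical.

```python
# Sketch of a prompt-echo-rate check (hypothetical names).
# Assumption: token overlap is a rough proxy; real scoring would compare
# embeddings of prompt segments against plan sections.

import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "with", "in", "on"}

def content_tokens(text: str) -> set:
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def prompt_echo_rate(prompt: str, plan_text: str) -> float:
    """Fraction of the prompt's content words that appear anywhere in the plan."""
    prompt_terms = content_tokens(prompt)
    if not prompt_terms:
        return 0.0
    plan_terms = content_tokens(plan_text)
    return len(prompt_terms & plan_terms) / len(prompt_terms)

rate = prompt_echo_rate(
    "dual research tracks with fallback success criteria",
    "The plan pursues dual research tracks; fallback criteria define success.",
)
```

A plan that drops "dual research tracks" entirely would pull the rate down, flagging the coverage gap described above.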

The score doesn't need to be a single number. A per-section heatmap ("Data Collection: high specificity, SWOT: generic, Expert Criticism: high specificity") would be more actionable. It could be rendered as an additional collapsible section in the HTML report, or as metadata in the JSON output for agent consumption.


Part 3: The dogfood execution strategy

The cost constraint

Validating plans for real-world projects (construction, biotech, logistics) is prohibitively expensive. You cannot run a $500M reverse aging lab to see if the Gantt chart was accurate.

The solution: PlanExe plans its own development

PlanExe should generate plans for its own next features, then have agents execute those plans. This creates a closed validation loop at near-zero marginal cost (only API tokens):

prompt → PlanExe plan → agent execution → PRs → merged code → did the feature work?

If the plan said "build SpawnAgentTask in 5 days with 3 agents" and it took 12 days and the output was broken, you have learned something concrete about plan quality without spending money on real-world resources.

What this produces

Every self-referential run generates a plan-vs-reality diff: the delta between what the plan predicted and what actually happened. This diff is training data for:

  • Drift measurement (proposals 82/83): how far did execution diverge from plan?
  • Fermi sanity check (proposal 88): were the time/effort estimates in the right ballpark?
  • Evidence calibration (proposal 123): which assumptions held and which were wrong?
  • Quality scoring (this proposal): which plan sections correlated with execution success?

Over time, a corpus of dogfood runs builds an empirical answer to "what makes a good plan." You don't need external validation — you need the feedback loop.

Phasing the dogfood

Phase 0 (now, free): Pick a small PlanExe feature. Generate a plan for it using PlanExe itself. Have a human (Simon) execute the plan manually. Record the diff: what matched, what was wrong, what was missing.

Phase 1 (cheap): Same as above, but the OpenClaw agents (Larry, Egon, Bubba) execute the plan. Record agent decisions, blockers, and deviations. Compare to the plan's predictions.

Phase 2 (the real test): Close the loop automatically. PlanExe generates a plan, spawns agents, agents execute, the system records the diff, and the diff feeds back into plan quality scoring. Now you have a self-improving compiler.


Part 4: The template ceiling and how to raise it

The problem

The Execute Plan section is ~32K words in both analyzed reports — nearly identical regardless of prompt investment. This is the bulk of the output and it's where agents will spend most of their time. Yet it appears to run on autopilot, filled by the template regardless of input depth.

If Execute Plan is the agent's instruction set, its quality ceiling is set by the template, not the prompt. This is the single biggest bottleneck for execution quality.

Proposed: two-tier Execute Plan

The current Execute Plan appears to be a flat expansion of the WBS. Consider splitting it into two layers:

Skeleton layer (template-driven, stable): the task sequence, dependencies, role assignments, and acceptance criteria. This is the scheduler's input. It should be consistent and predictable — the template is an asset here.

Context layer (prompt-driven, variable): per-task briefing notes that incorporate domain-specific knowledge from the prompt. For the cryosleep plan, task "T3.2: Develop cryoprotectant delivery protocol" should carry context about Track A vs Track B, the specific temperature targets (10–15°C core body temperature), and the fallback-success framing from the prompt. This context is what differentiates a plan an agent can execute from a plan an agent can only follow mechanically.

The skeleton layer is cheap to generate and should be highly deterministic. The context layer is where LLM effort should concentrate, and where the quality score would focus its grounding-density metric.
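The two-tier split can be sketched as a pair of record types, one per layer. Field names here are hypothetical illustrations, not the pipeline's actual schema; the example task is the cryosleep one discussed above.

```python
# Sketch of the proposed two-tier Execute Plan shape (field names hypothetical).
# Skeleton fields are template-driven and stable; context fields are
# prompt-driven and carry the domain grounding.

from dataclasses import dataclass, field

@dataclass
class TaskSkeleton:
    task_id: str                      # e.g. "T3.2"
    title: str
    depends_on: list                  # upstream task ids for the scheduler
    assigned_role: str
    acceptance_criteria: list

@dataclass
class TaskContext:
    task_id: str
    briefing: str                     # domain notes distilled from the prompt
    prompt_excerpts: list = field(default_factory=list)

skeleton = TaskSkeleton(
    task_id="T3.2",
    title="Develop cryoprotectant delivery protocol",
    depends_on=["T3.1"],
    assigned_role="Lead Cryobiologist",
    acceptance_criteria=["Protocol reviewed and signed off"],
)
context = TaskContext(
    task_id="T3.2",
    briefing="Covers Track A vs Track B; target 10-15 C core body temperature; "
             "apply the fallback-success framing from the prompt.",
)
```

Keeping the layers in separate records means the skeleton can be regenerated deterministically while the context layer is re-prompted or re-scored independently.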


Part 5: The lever system as execution infrastructure

Proposals 120 and 121 both mention the lever system but undervalue it. The lever system — mapping which decisions belong to which roles — is arguably more important for execution than the task graph itself.

Why levers matter more than tasks

Tasks are sequential. An agent picks them up, does the work, moves on. That part is easy — it's a queue.

Decisions are where execution breaks down. Two agents with overlapping authority deadlock. An agent without authority on a decision either blocks waiting for escalation or makes an unauthorized call that cascades. A decision that nobody owns falls through the cracks.

The lever system already solves this by mapping every decision to an owner. For agent execution, this becomes a routing table:

decision_type → owner_agent → escalation_path → timeout → default_action

If the Environmental Engineer agent hits a regulatory ambiguity, the lever map tells it: "this is a Regulatory Affairs decision, escalate to the Regulatory Affairs agent, timeout after 4 hours, default action is to flag for human review."
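The routing table above can be sketched as a plain mapping with a safe default, so that unowned decisions route to human review instead of falling through the cracks. Every name and value here is a hypothetical illustration.

```python
# Sketch of the lever map as a decision routing table (all names hypothetical).

from dataclasses import dataclass

@dataclass
class LeverRoute:
    owner_agent: str
    escalation_path: str
    timeout_hours: float
    default_action: str

LEVER_MAP = {
    "regulatory_ambiguity": LeverRoute(
        owner_agent="regulatory_affairs",
        escalation_path="regulatory_affairs_agent",
        timeout_hours=4.0,
        default_action="flag_for_human_review",
    ),
    "crew_scheduling": LeverRoute(
        owner_agent="foreman",
        escalation_path="project_manager",
        timeout_hours=2.0,
        default_action="defer_to_next_standup",
    ),
}

def route(decision_type: str) -> LeverRoute:
    """Unowned decision types get a catch-all route to human review."""
    return LEVER_MAP.get(
        decision_type,
        LeverRoute("unassigned", "human_review", 0.0, "flag_for_human_review"),
    )
```

The catch-all default is the important design choice: a decision nobody owns still has a deterministic path.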

Proposed: lever-first execution

Instead of dispatching agents by task queue (the approach in proposals 120/121), dispatch by decision domain. Each agent owns a set of levers (decision types), not just a set of tasks. Tasks flow to whoever owns the relevant decision authority.

This is closer to how real organizations work. A construction foreman doesn't just execute a task list — they own decisions about crew scheduling, material sequencing, and safety compliance. Tasks that touch those domains route through them regardless of the WBS assignment.

For PlanExe, this means SpawnAgentTask should emit agent manifests organized around lever ownership, not just task assignment. An agent's SOUL.md should say "you own decisions about regulatory compliance, environmental impact, and permit applications" rather than "you own tasks T1.2, T1.3, and T2.1."


Part 6: The time-sharing agent model

The cost problem with N parallel agents

Proposals 120/121 envision one agent per team member. For a plan with 8 team members, that is 8 concurrent LLM sessions consuming tokens continuously. For a hobby project, this adds up fast — especially if agents are idle waiting on dependencies.

Proposed: single-process role rotation

Instead of N parallel agent processes, run a single executor process that cycles through roles. The executor:

  1. Reads the dependency graph
  2. Identifies the next unblocked task
  3. Loads the appropriate SOUL.md for that task's assigned role
  4. Executes the task in that role's persona
  5. Writes results back to shared state
  6. Moves to the next unblocked task

This is cheaper (one LLM session, not eight), eliminates coordination overhead (no inter-agent messaging needed — the executor IS the coordinator), and avoids race conditions on shared state (serial execution by design).

The tradeoff is speed — parallel agents can work simultaneously on independent tasks. But for a hobby project running on API credits, the cost savings outweigh the speed penalty. And the serial model is simpler to debug, since you can replay the execution log linearly.

A Postgres-backed state machine fits this naturally: a tasks table with status, assigned_role, depends_on, and output columns. The executor polls for status = 'unblocked', picks the highest-priority task, and runs it. This is 200 lines of Python against your existing Flask + Postgres stack.
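The executor loop can be sketched in a few dozen lines. This uses sqlite3 in memory as a stand-in for Postgres so the example is self-contained; the schema, the single-dependency simplification, and the stubbed `execute_as` (which would really load the role's SOUL.md and call the LLM) are all assumptions for illustration.

```python
# Sketch of the serial executor state machine. sqlite3 stands in for Postgres;
# schema and names are hypothetical simplifications of the design above.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tasks (
        id TEXT PRIMARY KEY,
        status TEXT,            -- 'blocked' | 'unblocked' | 'done'
        assigned_role TEXT,
        depends_on TEXT,        -- single upstream task id (simplified)
        priority INTEGER,
        output TEXT
    )
""")
db.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, NULL)",
    [
        ("T1", "unblocked", "architect", "", 1),
        ("T2", "blocked", "builder", "T1", 2),
    ],
)

def execute_as(role: str, task_id: str) -> str:
    # Real version: load the role's SOUL.md and run the task via the LLM.
    return f"[{role}] completed {task_id}"

def run_one_cycle(db) -> bool:
    """Pick the highest-priority unblocked task, run it, unblock dependents."""
    row = db.execute(
        "SELECT id, assigned_role FROM tasks "
        "WHERE status = 'unblocked' ORDER BY priority LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    task_id, role = row
    db.execute("UPDATE tasks SET status = 'done', output = ? WHERE id = ?",
               (execute_as(role, task_id), task_id))
    db.execute("UPDATE tasks SET status = 'unblocked' "
               "WHERE status = 'blocked' AND depends_on = ?", (task_id,))
    return True

while run_one_cycle(db):
    pass
done = db.execute("SELECT COUNT(*) FROM tasks WHERE status = 'done'").fetchone()[0]
```

Running the loop drains both tasks in dependency order: T1 first, which unblocks T2. Swapping in real multi-dependency checks and Postgres row locking is what the remaining ~150 lines would cover.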

Scaling later

The serial executor is Phase 1. When cost isn't a constraint, upgrading to parallel agents is a configuration change — N workers polling the same task table instead of one. The task table schema doesn't change. The SOUL.md files don't change. The lever routing doesn't change. You've just increased the worker count.


Part 7: Broader observations

The proposal sprawl

PlanExe has 127+ proposals. Many are ambitious, some are speculative, and the numbering suggests they were generated faster than they can be implemented. This is not inherently bad — a project with too many ideas is healthier than one with too few. But it creates a prioritization problem, especially for a solo developer.

The proposals themselves could benefit from PlanExe's own template discipline. Many are vision documents without acceptance criteria. "Build X" is a direction, not a task. "Build X, verify it by running plan Y through it and checking that property Z holds" is a task with a testable outcome.

Suggestion: run PlanExe on "implement proposal 120" and see what it produces. If the generated plan is actionable enough for the lobster swarm to execute, the pipeline works. If the plan is too vague, you've found a quality gap.

The naming is an asset

"PlanExe" communicates the vision in five characters: plan + execute. The fact that execution isn't built yet isn't a naming failure — it's a roadmap. The name sets the expectation correctly. Ship the executor and the name goes from aspirational to descriptive.

The 21-section template is a standard in disguise

If PlanExe's section structure stabilizes, it becomes a de facto interchange format for AI-generated project plans. Other tools could consume PlanExe output if the section names, JSON schema, and Gantt data format were documented as a spec. This is how standards happen — not by committee, but by one tool doing it consistently enough that others adapt.

Worth considering: publishing the plan schema (section names, expected fields per section, Gantt JSON format, lever schema) as a standalone specification, separate from PlanExe itself. Even if nobody adopts it immediately, it clarifies what the pipeline promises and what execution agents can rely on.


Summary of proposals

ID    Name                                                     Effort      Depends on
128a  Plan quality score (per-section specificity heatmap)     Medium      Nothing — can run as a post-pipeline pass on existing plans
128b  Dogfood execution Phase 0 (manual plan-vs-reality diff)  Low         A small PlanExe feature to plan for
128c  Two-tier Execute Plan (skeleton + context layers)        Medium      128a (to measure improvement)
128d  Lever-first agent dispatch                               Medium      Lever pipeline (proposal 119)
128e  Serial executor (single-process role rotation)           Low-Medium  128d for lever routing, but a basic version works with task-queue dispatch
128f  Plan schema specification                                Low         Nothing — documenting what already exists

Recommended order: 128b → 128a → 128f → 128e → 128c → 128d

Start with the dogfood loop because it produces the feedback signal everything else depends on. Measure before you optimize.


Appendix: Raw comparison data

Section-by-section word counts from the analyzed reports:

Section                         Aging    Cryo    Delta   Ratio
Executive Summary                 340     347       +7   1.0x
Pitch                             648     618      -30   1.0x
Project Plan                      589     674      +85   1.1x
Strategic Decisions             5,989   5,111     -878   0.9x
Scenarios                         984     988       +4   1.0x
Assumptions                    3,610   4,467     +857   1.2x
Governance                     3,828   4,483     +655   1.2x
Related Resources              1,253   1,054     -199   0.8x
Data Collection                  434   1,735   +1,301   4.0x
Documents to Create/Find       6,739  11,426   +4,687   1.7x
SWOT Analysis                  1,094     986     -108   0.9x
Team                           2,255   2,324      +69   1.0x
Expert Criticism               2,052   2,843     +791   1.4x
Work Breakdown Structure       1,498   1,598     +100   1.1x
Review Plan                    2,877   3,864     +987   1.3x
Questions & Answers              990   1,005      +15   1.0x
Premortem                      3,287   3,014     -273   0.9x
Self Audit                     1,796   1,791       -5   1.0x
Initial Prompt Vetted          1,957   3,059   +1,102   1.6x
Execute Plan                  32,256  32,696     +440   1.0x

Gantt chart comparison:

                    Aging       Cryo
Tasks               285         293
Dependency links    245         250
Project groups       61          60
Regular tasks       224         233
Date span           2026-2046   2026-2053