
Promising Directions for PlanExe

Analysis of 110+ proposals. Primary audience: AI agents consuming PlanExe via MCP. Secondary audience: humans reviewing plans.


Strategic Frame

PlanExe's future is as infrastructure for AI agents — the planning layer that agents call before executing complex multi-step work. Today, agents use PlanExe to generate plans for humans to review. Tomorrow, agents will plan, validate, and execute autonomously with PlanExe as the orchestration backbone.

The proposals below are grouped by what agents need, in priority order.


1. Operational Reliability (agents can't babysit)

Agents need PlanExe runs to complete reliably without human intervention. A failed run that loses 73% of work is a dealbreaker for autonomous workflows.

| # | Proposal | Agent Impact |
|---|----------|--------------|
| 87 | Plan Resume MCP Tool | Implemented. plan_resume MCP tool lets agents recover from failures without restarting the full pipeline. Includes pipeline version checking and zip snapshot restore |
| 109 | LLM Executor Retry Improvements | Implemented (PR #220). RetryConfig with exponential backoff for transient errors (rate limits, timeouts, connection failures). is_transient_error() classifies exceptions; permanent errors fall through to the next model immediately. Agents no longer need their own retry wrappers |
| 102 | Pipeline Intelligence Layer | ⚙️ Foundation implemented (PR #221). LLMExecutor extracts structured Pydantic validation feedback and exposes it via the validation_feedback property; max_validation_retries controls the per-model retry count. Remaining: wire individual tasks to inject feedback into prompts |
| 103 | Pipeline Hardening for Local Models | Fix silent truncation and context-window overflows. Critical for agents running local models, where failures are subtle |
| 113 | LLM Error Traceability | Implemented (PR #237). LLMChatError replaces generic ValueError across 38 call sites. Root cause preserved for error classification; error_id UUID enables log-to-metrics cross-referencing. Agents can programmatically diagnose failures |
| 101 | Luigi Resume Enhancements | Webhook hooks on task completion/failure — agents can subscribe to events instead of polling |
| 114-I1 | Stopped vs Failed State | Implemented. Dedicated PlanState.stopped enum value — plan_stop transitions to stopped, not failed. Agents can now distinguish user-initiated stops from actual errors. plan_retry and plan_resume accept both failed and stopped |
| 114-I2 | Failure Diagnostics in plan_status | Implemented. Four DB columns (failure_reason, failed_step, error_message, recoverable) populated by the worker on failure. plan_status returns a consolidated error dict (keys: failure_reason, failed_step, message, recoverable) only for failed plans — non-failed plans omit error entirely. The recoverable flag lets agents immediately suggest plan_resume vs plan_retry. Migration renames the legacy last_error column to error_message with existence checks to avoid noisy PostgreSQL logs. Diagnostics reset on retry/resume |
| 114-I7 | Stalled-Plan Detection | ⚙️ Partially implemented. timing.last_progress_at added to plan_status — agents can compute time-since-last-progress to detect stalls. last_llm_call_at deferred (requires LLM executor integration) |
| 114-I10 | Silent Partial Failures in Completed Plans | A plan can reach completed with empty or stub-quality sections (e.g. 2/8 experts responding). No quality_summary in plan_status — agents can't tell whether completed means "all sections produced quality output" or just "all steps ran." Trust gap for autonomous workflows |
| 114-I12 | Remote Files Visibility During Processing | plan_status on remote returns files_count: 0 while processing; local shows files incrementally. Agents lose the ability to confirm output is being produced and to see which pipeline sections completed. Behavioral inconsistency between servers breaks workflows silently |
| 114-I13 | Remote SSE Reliability | The SSE stream on local was 100% reliable across 12 plans; remote dropped at ~48% on the first plan. Only matters if SSE is kept for non-agent consumers (see the I5 correction below) |
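The retry behavior described for #109 can be sketched as a minimal illustration. This is not the actual LLMExecutor code: the TRANSIENT_MARKERS table and both function signatures are assumptions for the example, and the real is_transient_error() may classify exceptions differently.

```python
import random
import time

# Hypothetical classification rules for the sketch; the real classifier
# in PlanExe may inspect exception types rather than message substrings.
TRANSIENT_MARKERS = ("rate limit", "timeout", "connection")

def is_transient_error(exc: Exception) -> bool:
    """Transient errors get retried on the same model; permanent ones fall through."""
    message = str(exc).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)

def call_with_retry(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry fn with exponential backoff plus jitter, but only on transient errors."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_transient_error(exc) or attempt == max_retries:
                raise  # permanent error or retry budget exhausted: next model's turn
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Because PR #220 implements this server-side, agents consuming PlanExe should not need their own wrapper; the sketch only shows the mechanism.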

2. Agent Interface & Discoverability (reduce integration friction)

Agents need to discover PlanExe, understand its tools, and consume outputs programmatically. The current interface assumes a human is in the loop.

| # | Proposal | Agent Impact |
|---|----------|--------------|
| 86 | Agent-Optimized Pipeline | Implemented (PR #223). Autonomous agent guide and agent-oriented prompts. Removes key friction points for autonomous agent use |
| 62 | Agent-First Frontend Discoverability | Implemented (PR #224). robots.txt and a README agent-discoverability section — standard discovery protocols so agents find PlanExe without human guidance |
| 110 | Usage Metrics for Local Runs | Implemented (PR #219, #236, #237). Agents need cost accounting for budget-constrained workflows. usage_metrics.jsonl answers "how much did this run cost?" with per-call granularity (model, tokens, cost, duration). Errors are classified into short categories with traceable error_id UUIDs. Complements the aggregated totals in activity_overview.json |
| 114-I3 | Plan Delete / Archive | Stopped and failed plans persist in plan_list forever. After 10 plans, the list is noisy. Add plan_delete (hard delete) or plan_archive (soft delete: hidden from the list but retained for billing) |
| 114-I4 | Idempotency Guard on plan_create | Implemented (PR #242). Server-side auto-dedup on (user_id, prompt, model_profile) within a configurable time window (default 10 min) |
| 114-I5 | SSE Wrong for Agents; MCP Notifications Deprioritized | Corrected in v3, re-corrected in v4. SSE is designed for real-time UI clients, not turn-based agents — the agent abused connection close as a binary "done" signal. v3 proposed MCP notifications as the replacement. v4 tested monitor=true (implemented and reverted): Claude Code silently drops notifications/progress and notifications/message — it only supports list_changed. No current MCP client can consume progress notifications. Practical priority: (1) polling plan_status — works everywhere; (2) a blocking plan_wait tool (I8) — would eliminate polling; (3) MCP notifications — the correct long-term solution, blocked by client support. SSE remains useful as a local completion detector for non-agent consumers |
| 114-I6 | Download URL TTL Extension | The 15-minute signed-URL expiry surprises users who review before downloading. Extend to 30–60 min and make it configurable via an env var |
| 114-I8 | plan_wait Tool | Upgraded from P3 to P2. The most practical improvement for agents today, given that MCP notifications are blocked by client support (I5). A blocking plan_wait(plan_id, timeout) tool returns the final status on completion — replacing both SSE monitoring and plan_status polling with a single call |
| 114-I9 | Prompt Iteration Linking | Each plan_create is independent. An optional parent_plan_id would link iteration chains so agents and users can track prompt refinement across plan versions |
| 114-I11 | Server Identity in Responses | New in v3. An agent cannot programmatically determine which backend it's connected to (local vs remote). Plans from one server aren't accessible from the other, speed expectations differ (13 min vs 28 min), and SSE behavior varies. plan_status should include a server field, or plan_create should return server_info with capabilities |
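Until a server-side plan_wait (I8) ships, an agent can approximate it client-side with the polling pattern that I5 ranks first. A sketch, assuming plan_status returns a dict with a state field (the exact field and state names are assumptions, loosely mirroring the PlanState values mentioned above):

```python
import time

TERMINAL_STATES = ("completed", "failed", "stopped")  # assumed terminal PlanState values

def plan_wait(get_status, plan_id: str, timeout: float = 1800, interval: float = 15):
    """Block until the plan reaches a terminal state, emulating the proposed plan_wait tool.

    get_status is whatever callable wraps the plan_status MCP call.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(plan_id)
        if status.get("state") in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"plan {plan_id} did not finish within {timeout}s")
```

A server-side plan_wait would collapse this loop into one tool call, which is exactly why I8 is ranked as the most practical improvement.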

Key friction points from #86 that block autonomous agent use:

- F1: Human approval step before plan_create — autonomous agents can't proceed
- F4: No machine-readable summary — agents must parse HTML or iterate 100+ zip files
- F3: 5-minute poll interval is wrong for cloud runs completing in 8 minutes


3. Plan Quality Gates (agents must not amplify bad plans)

Before agents can execute plans autonomously, the plans themselves need automated validation. An agent that executes an unchecked plan multiplies errors at machine speed.

| # | Proposal | Agent Impact |
|---|----------|--------------|
| 58 | Boost Initial Prompt | ⚙️ Open PR #222. Single LLM call to strengthen weak prompts before the pipeline runs. Especially valuable for agent-originated prompts, which may be terse or overly technical |
| 42 | Evidence Traceability Ledger | Links every claim to evidence with freshness scoring. Agents can programmatically check whether assumptions are grounded |
| 43 | Assumption Drift Monitor | Watches key variables (costs, FX rates) against baselines and triggers a re-plan on breach. Agents re-planning on a schedule need this to detect when a plan has gone stale |
| 57 | Banned Words + Lever Realism | Auto-detects hype-heavy or impractical outputs without human review |
| 56 | Adversarial Red-Team Reality Check | Multi-model adversarial review with judge scoring. Catches the optimism bias that agents are prone to propagating |
| 88 | Fermi Sanity Check Validation Gate | ⚙️ Open PR #225. Rule-based guard on every assumption: bounds present, span ratio sane, evidence for low-confidence claims. First line of defense before expensive downstream processing |
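The #88 rule set can be illustrated with a small guard. The assumption schema here (low/high bounds, confidence, evidence fields) is hypothetical; PR #225 defines the real one:

```python
def fermi_sanity_check(assumption: dict, max_span_ratio: float = 100.0) -> list[str]:
    """Return rule violations for one assumption; an empty list means it passes the gate."""
    problems = []
    low, high = assumption.get("low"), assumption.get("high")
    if low is None or high is None:
        problems.append("bounds missing")
    elif low <= 0 or high / low > max_span_ratio:
        problems.append("span ratio implausible")  # e.g. a 1000x low-to-high spread
    if assumption.get("confidence") == "low" and not assumption.get("evidence"):
        problems.append("low-confidence claim without evidence")
    return problems
```

Because the checks are pure rules (no LLM call), they can run on every assumption for free, which is what makes this a sensible first line of defense.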

4. Autonomous Prompt Optimization (agents improving PlanExe itself)

Luigi's caching makes prompt optimization uniquely practical: changing one prompt template and re-running regenerates only that single task (seconds, not the full 15-minute pipeline). This turns a pipeline of 60+ LLM calls into a 2–21 call experiment.

| # | Proposal | Agent Impact |
|---|----------|--------------|
| 94 | Autoresearch-Style Prompt Optimization | ⚙️ Open PR #226 (LLM-as-judge foundation). Autonomous overnight loops: the agent modifies one prompt, re-runs one task, scores the output, and keeps or reverts. Hundreds of experiments per night, exploiting Luigi resumability |
| 59 | Prompt Optimizing with A/B Testing | ⚙️ Open PR #227 (A/B experiment runner). Structured promotion pipeline: multi-model A/B matrix, Elo tracking, regression guards. Validates candidates discovered by #94 before merging them into the baseline |

Two-stage system: #94 discovers promising variants at high volume (greedy, autonomous), #59 validates them with rigor (conservative, human-gated). Exploration feeds promotion.
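The #94 exploration stage reduces to a greedy keep-or-revert loop. A sketch with illustrative names only (run_task stands in for re-running the single cached Luigi task, score for the LLM-as-judge):

```python
def optimize_prompt(run_task, score, baseline: str, candidates: list[str]) -> str:
    """Greedy keep-or-revert: adopt a candidate prompt only if it scores strictly better.

    run_task(prompt) re-executes just the one task whose prompt changed;
    score(output) is the judge metric. Winners would still go through the
    conservative A/B promotion stage before touching the real baseline.
    """
    best_prompt = baseline
    best_score = score(run_task(baseline))
    for candidate in candidates:
        candidate_score = score(run_task(candidate))
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
        # else: revert, i.e. keep best_prompt unchanged
    return best_prompt
```

Greedy search is deliberately cheap and noisy; the rigor lives in the promotion stage, not here.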

This is a case where agents improve the tool they use — a self-reinforcing loop.


5. Post-Plan Execution (the big unlock)

This is PlanExe's largest gap: "I have a plan. Now what?" For agents, a plan that can't be executed programmatically is just a document.

| # | Proposal | Agent Impact |
|---|----------|--------------|
| 41 | Autonomous Execution of Plan | Converts the static Gantt into a live execution engine with AI/human task delegation. The plan becomes a runnable workflow, not a PDF |
| 60–66 | Plan-to-Repo + Agent Swarm | Auto-provision a repo, spawn agents (research, issues, compliance), and use git as the state machine. Plans become collaborative artifacts with continuous enrichment |
| 92 | Task Complexity Scoring & Model Routing | Each task gets a complexity score and a recommended model tier. Agents can route cheap tasks to cheap models and expensive tasks to expensive ones — 55% cost savings in benchmarks |
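The #92 routing idea, sketched with a hypothetical tier table (the thresholds and model names are illustrative, not from the proposal):

```python
# Hypothetical (threshold, model) tiers, ordered by ascending complexity.
MODEL_TIERS = [
    (0.3, "small-local"),
    (0.7, "mid-hosted"),
    (1.0, "frontier"),
]

def route_model(complexity: float) -> str:
    """Map a task's complexity score in [0, 1] to the cheapest adequate model tier."""
    for threshold, model in MODEL_TIERS:
        if complexity <= threshold:
            return model
    return MODEL_TIERS[-1][1]  # out-of-range scores fall back to the top tier
```

The claimed 55% savings comes from most tasks landing in the cheap tiers; the routing logic itself is trivial once each task carries a score.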

#41 is the most transformative proposal — but it only works if the quality gates (#42, #43, #56, #88) prevent automation from multiplying bad plans.

Phased Roadmap


Phase 1: Reliable foundation         (nearly complete)
  ├─ #87  Plan resume ✅
  ├─ #109 Retry improvements ✅ (PR #220)
  ├─ #102 Error-feedback retries ⚙️ (PR #221, foundation only — tasks not yet wired)
  ├─ #110 Usage metrics ✅ (PR #219, #236, #237)
  ├─ #113 Error traceability ✅ (PR #237)
  ├─ #58  Prompt boost ⚙️ (open PR #222)
  ├─ #114-I1 Stopped vs failed state ✅
  ├─ #114-I2 Failure diagnostics in plan_status ✅
  ├─ #114-I5 MCP notifications for agents  ← deprioritized (v4: Claude Code can't consume them; polling is primary)
  ├─ #114-I7 Stalled-plan detection
  ├─ #114-I10 Silent partial failures in completed plans
  └─ #114-I12 Remote files visibility during processing

Phase 2: Agent-native interface       (next)
  ├─ #86  Remove agent friction points ✅ (PR #223)
  ├─ #62  Discovery protocols ✅ (PR #224)
  ├─ #88  Fermi validation gate ⚙️ (open PR #225)
  ├─ #114-I3 Plan delete/archive
  ├─ #114-I6 Download URL TTL extension
  ├─ #114-I4 Idempotency guard ✅ (PR #242)
  ├─ #114-I8 plan_wait tool (P2 — primary polling alternative now that MCP notifications are blocked)
  ├─ #114-I9 Prompt iteration linking
  ├─ #114-I11 Server identity in responses
  └─ #114-I13 Remote SSE reliability (if SSE kept for non-agent consumers)

Phase 3: Automated quality            (then)
  ├─ #42  Evidence traceability
  ├─ #43  Assumption drift monitor
  ├─ #56  Adversarial red-team
  └─ #57  Banned words / lever realism

Phase 4: Self-improving pipeline      (concurrent with 3)
  ├─ #94  Autoresearch prompt optimization ⚙️ (open PR #226 foundation, PR #227 runner)
  └─ #59  A/B testing promotion ⚙️ (open PR #227)

Phase 5: Autonomous execution         (after quality gates)
  ├─ #41  Plan execution engine
  ├─ #60-66 Agent swarm
  └─ #92  Model routing

Each phase enables the next. Skipping to Phase 5 without Phase 3 means agents execute unchecked plans at scale — the worst outcome.


Key Machine-Readable Outputs Agents Need

From proposal #86 (F4), the most agent-useful files in the zip artifact:

| File | Contents |
|------|----------|
| assumptions/distilled_assumptions.json | Key planning assumptions |
| pre_project_assessment/pre_project_assessment.json | Go/no-go recommendation |
| negative_feedback/negative_feedback.json | Risk register |
| wbs/wbs_level2.json | Work breakdown structure |

A future plan_summary.json collating these into a single file would eliminate the need for agents to parse multiple artifacts.
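Until plan_summary.json exists, an agent can collate the four files itself. A sketch assuming the zip has already been extracted to run_dir (the key names in the output dict are arbitrary):

```python
import json
from pathlib import Path

# Relative paths from the table above.
SOURCES = {
    "assumptions": "assumptions/distilled_assumptions.json",
    "assessment": "pre_project_assessment/pre_project_assessment.json",
    "risks": "negative_feedback/negative_feedback.json",
    "wbs": "wbs/wbs_level2.json",
}

def build_plan_summary(run_dir: Path) -> dict:
    """Collate the agent-critical artifacts into one dict; missing files become None."""
    summary = {}
    for key, rel_path in SOURCES.items():
        path = run_dir / rel_path
        summary[key] = json.loads(path.read_text()) if path.exists() else None
    return summary
```

The None-for-missing convention matters given I10: an agent can treat absent artifacts in a completed plan as a quality signal rather than crashing on a KeyError.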