
Proposal 85: Pipeline Benchmark Prompt Suite (10 Prompts)

Author: EgonBot
Date: 2026-03-07
Status: Proposal
Source: worker_plan/worker_plan_api/prompt/data/simple_plan_prompts.jsonl
Purpose: Define a canonical 10-prompt benchmark suite for measuring pipeline reliability and plan fidelity across diverse domains and global locations.


1. Why a Benchmark Suite

A single run reveals one failure. Ten diverse runs reveal a pattern.

This suite provides a standardised, repeatable set of inputs that can be run against any PlanExe installation (local Qwen, planexe.org, Docker, Railway) and the results compared directly. Each run produces:

  • A completion status (passed / failed at task X)
  • A failure classification (which failure mode, if any)
  • A drift score (once the proposal 84 DriftEvaluationTask is implemented)

Because all prompts come from the existing simple_plan_prompts.jsonl, they are maintained in the repo and do not duplicate effort.


2. Selection Criteria

Each selected prompt:

  • Is already present in simple_plan_prompts.jsonl (no new content created)
  • Emphasizes geographic diversity across regions, allowing occasional country repeats when the domain differs materially
  • Has an explicit budget, constraints, and success criteria (strong prompt quality)
  • Tests a different domain (infrastructure, healthcare, entertainment, defence, environment, etc.)
  • Is suitable for comparison across model profiles without ethical blockers

Geographic coverage: Denmark, Global/Space, India (policy + infrastructure), Ghana, Uruguay, USA, Global (SE Asia/Brazil/Africa), Spain+Morocco, Estonia


3. The 10 Selected Prompts

| #  | UUID | Location | Domain | Tags |
|----|------|----------|--------|------|
| 01 | ce2fbf38-9700-4ed1-814e-78772f7b7700 | Denmark | CSR / logistics | denmark, plastic, waste, business |
| 02 | e6ddd953-939f-4d15-89ec-fd3988f79123 | Global / Space | Defence / research | laser, space, defense, research |
| 03 | eaed8d7d-461c-48a5-b16c-76dbdba044c4 | India | Labor policy / productivity / public governance | india, work, life, health, family |
| 04 | 22f35414-c01b-4b52-a229-7dc5a78e2b96 | Accra, Ghana | Healthcare / Africa | healthcare, malaria, accra, ghana |
| 05 | a6bef08b-c768-4616-bc28-7503244eff02 | Delhi, India | Infrastructure / water | water, pollution, india, delhi |
| 06 | 62f48a04-6f2c-4e60-9e65-34686a13c95a | Uruguay | AI / research / biotech | uruguay, ai, brain, research |
| 07 | 50c0f31f-d9a3-442a-81b8-1d885db05623 | Yellowstone, USA | Emergency / government | yellowstone, volcano, evacuation |
| 08 | e9a73d5b-f274-4286-a619-4f0e1303cdc2 | Global (SE Asia / Brazil / Africa) | Food security / supply chain | rubber, disease, supply, global |
| 09 | b9afce6c-f98d-4e9d-8525-267a9d153b51 | Spain + Morocco | Infrastructure / cross-border | bridge, tunnel, europe, morocco |
| 10 | ab700769-c3ba-4f8a-913d-8589fea4624e | Tallinn, Estonia | Resilience / hardware | prepping, tallinn, estonia |
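The suite can be extracted from the repo file by UUID. A minimal loading sketch, assuming each line of simple_plan_prompts.jsonl is a JSON object with `id` and `prompt` fields (the exact field names should be verified against the file):

```python
import json

# UUIDs from the table above, in suite order.
BENCHMARK_SUITE = [
    "ce2fbf38-9700-4ed1-814e-78772f7b7700",  # 01 Denmark
    "e6ddd953-939f-4d15-89ec-fd3988f79123",  # 02 Global / Space
    "eaed8d7d-461c-48a5-b16c-76dbdba044c4",  # 03 India
    "22f35414-c01b-4b52-a229-7dc5a78e2b96",  # 04 Accra, Ghana
    "a6bef08b-c768-4616-bc28-7503244eff02",  # 05 Delhi, India
    "62f48a04-6f2c-4e60-9e65-34686a13c95a",  # 06 Uruguay
    "50c0f31f-d9a3-442a-81b8-1d885db05623",  # 07 Yellowstone, USA
    "e9a73d5b-f274-4286-a619-4f0e1303cdc2",  # 08 Global (SE Asia / Brazil / Africa)
    "b9afce6c-f98d-4e9d-8525-267a9d153b51",  # 09 Spain + Morocco
    "ab700769-c3ba-4f8a-913d-8589fea4624e",  # 10 Tallinn, Estonia
]

def load_benchmark_prompts(jsonl_path):
    """Return {uuid: prompt_text} for suite UUIDs found in the JSONL file."""
    wanted = set(BENCHMARK_SUITE)
    prompts = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("id") in wanted:
                prompts[rec["id"]] = rec.get("prompt", "")
    return prompts
```

Run tooling can then iterate over the returned mapping instead of hard-coding prompt text.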

4. Prompt Rationale

01 — Arla Foods Milk Crate Return (Denmark)

  • Why: Strong CSR logistics plan with explicit KPIs, timeline, multi-stakeholder coordination, banned words list, and charitable mechanic. Tests whether the pipeline can handle a real-world corporate campaign with measurable success criteria.
  • Drift risk: Scope inflation (pilot → national programme), confidence inflation on recovery rates.
  • Pipeline stress: SelectScenarioTask, AssumptionsTask, ExpertCriticismTask.

02 — Space-Based Coherent Beam Combining (Global/Space)

  • Why: Highly technical prompt with precise engineering specs, performance thresholds, and explicit definitions. Tests whether the pipeline can handle deep-domain content without hallucinating or generalising away the technical constraints.
  • Drift risk: Unsupported invention (fabricated specs), confidence inflation, mechanism drift.
  • Pipeline stress: PremiseAttackTask, ReviewPlanTask, structured output under high token load.

03 — 4-Day Work Week National Program (India)

  • Why: National policy programme with explicit governance design (single PMO under NITI Aayog), phased rollout, and measurable productivity/equity outcomes. Real-world labour policy problem with political and implementation constraints.
  • Drift risk: Scope expansion (pilot policy → nationwide mandate too quickly), unsupported adoption claims, confidence inflation on productivity gains.
  • Pipeline stress: GovernanceTask, StakeholderTask, ReviewPlanTask, NegativeFeedbackTask.

04 — Malaria Response Post-USAID (Accra, Ghana)

  • Why: Crisis-driven healthcare plan in sub-Saharan Africa with no specified budget. Tests how the pipeline handles resource-constrained plans and whether it fabricates Western-centric solutions.
  • Drift risk: Unsupported invention (invented NGO partners), customer drift (community → international org).
  • Pipeline stress: PreProjectAssessmentTask, ExpertDetails, assumption handling.

05 — Advanced Water Purification Hub (Delhi, India)

  • Why: Large-scale ($250M) infrastructure programme in South Asia. Tests cost modelling, regulatory posture (Indian law), and supply chain assumptions.
  • Drift risk: Confidence inflation on adoption rates, unsupported technology claims.
  • Pipeline stress: CostBreakdownTask, WBSTask, GanttTask.

06 — Upload Intelligence Neural Connectome (Uruguay)

  • Why: Speculative biotech/AI plan with a massive budget ($10B) and genuine ethical and scientific uncertainty. Tests whether the pipeline preserves epistemic caution on unproven science.
  • Drift risk: Confidence inflation (treats unproven science as settled), scope expansion.
  • Pipeline stress: DistillAssumptionsTask, RedlineGateTask, PremiseAttackTask.

07 — Yellowstone Caldera Emergency Response (USA)

  • Why: Crisis management plan for a low-probability, extreme-consequence event. Tests whether the pipeline can reason about multi-stakeholder emergency coordination without scope-expanding into long-term recovery.
  • Drift risk: Scope expansion (72-hour response → national recovery plan), confidence inflation on coordination outcomes.
  • Pipeline stress: GovernanceTask, StakeholderTask, NegativeFeedbackTask.

08 — Global Rubber Supply De-Risking from SALB (Global / SE Asia / Brazil / Africa)

  • Why: $30B, 25-year public-private programme to end global rubber supply dependence on a single crop vulnerable to South American Leaf Blight. Explicit Phase 1 deliverable (SALB Containment Protocol), multi-jurisdiction phytosanitary coordination. Real-world food security and supply chain problem.
  • Drift risk: Scope expansion, confidence inflation on containment timelines, unsupported invention of containment mechanisms.
  • Pipeline stress: GovernanceTask, StakeholderTask, WBSTask, GanttTask.

09 — Spain–Morocco Transoceanic Tunnel (Europe + Africa)

  • Why: Cross-border megaproject (€40B, 20 years, two continents, two regulatory systems). Tests whether the pipeline can handle political, geotechnical, and financial complexity at scale.
  • Drift risk: Scope expansion, confidence inflation on political feasibility, unsupported engineering claims.
  • Pipeline stress: PremiseAttackTask, WBSTask, GanttTask, CostBreakdownTask.

10 — Carrington Event Prep / Faraday Enclosure (Tallinn, Estonia)

  • Why: Small-budget hardware product (€750K) with specific certification path, cash-flow milestones, and low-risk pilot framing. Tests the pipeline on a product-hardware plan in a small Eastern European market.
  • Drift risk: Scope inflation (single SKU → platform), confidence inflation on regulatory approval.
  • Pipeline stress: MakeAssumptionsTask, ExpertDetails, financial structured output.

5. How to Run the Suite

Baseline pass (single model)

for prompt_id in BENCHMARK_SUITE:
    initial_plan_text = load_prompt(prompt_id, "simple_plan_prompts.jsonl")
    run_dir = create_run_dir(prompt_id, model_profile)
    seed_run_dir(run_dir, initial_plan_text)
    result = run_pipeline(run_dir, model_profile)
    record(prompt_id, model_profile, result.status, result.failed_task, result.error_type)

Comparison pass (multiple models)

for model in [baseline, premium, frontier, custom_qwen]:
    for prompt_id in BENCHMARK_SUITE:
        initial_plan_text = load_prompt(prompt_id, "simple_plan_prompts.jsonl")
        result = run_pipeline(prompt_id, model)
        drift_score = drift_evaluate(initial_plan_text, result.final_report)  # proposal 84
        record(prompt_id, model, result.status, drift_score)

What to look for

  • Which tasks fail most often across prompts? → structural pipeline weakness
  • Which drift types appear most often per model? → model-specific tendency
  • Do local models fail at different task gates than cloud models? → model capability floor
  • Do any prompts cause consistent failure across all models? → pipeline design issue (not model issue)
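The four diagnostics above can be computed mechanically from the benchmark log. A sketch, using a hypothetical `summarize_runs` helper over records shaped like the results schema in section 6:

```python
from collections import Counter

def summarize_runs(records):
    """Aggregate benchmark records into failure-pattern views.

    `records` is a list of dicts with at least prompt_id, model_profile,
    status, and failed_task keys (see the results schema).
    """
    # Which tasks fail most often across prompts? -> structural pipeline weakness
    failed_tasks = Counter(r["failed_task"] for r in records if r.get("failed_task"))

    # Per-model status breakdown -> model-specific tendencies / capability floor
    per_model_status = {}
    for r in records:
        per_model_status.setdefault(r["model_profile"], Counter())[r["status"]] += 1

    # Prompts that fail under every model -> pipeline design issue, not model issue
    models = {r["model_profile"] for r in records}
    failed_models_by_prompt = {}
    for r in records:
        failed_models_by_prompt.setdefault(r["prompt_id"], set())
        if r["status"] != "completed":
            failed_models_by_prompt[r["prompt_id"]].add(r["model_profile"])
    universal_failures = [
        p for p, failed in failed_models_by_prompt.items() if failed == models
    ]

    return {
        "failed_tasks": failed_tasks,
        "per_model_status": per_model_status,
        "universal_failures": universal_failures,
    }
```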

6. Results Schema

Each run should produce a record in a benchmark log:

{
  "run_id": "...",
  "prompt_id": "ce2fbf38-9700-4ed1-814e-78772f7b7700",
  "model_profile": "custom",
  "model_name": "lmstudio-qwen3.5-35b-a3b",
  "timestamp": "2026-03-07T15:00:00Z",
  "status": "completed",
  "failed_task": null,
  "error_type": null,
  "tasks_completed": 61,
  "tasks_total": 61,
  "duration_seconds": 3240,
  "drift_score": 3.8,
  "drift_risk": "medium",
  "notes": ""
}
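Appending records to a JSONL benchmark log, with a key check against the schema above, takes only a few lines. A minimal sketch (`append_benchmark_record` is a hypothetical helper, not existing tooling):

```python
import json

# Keys from the results schema above.
REQUIRED_KEYS = {
    "run_id", "prompt_id", "model_profile", "model_name", "timestamp",
    "status", "failed_task", "error_type", "tasks_completed", "tasks_total",
    "duration_seconds", "drift_score", "drift_risk", "notes",
}

def append_benchmark_record(log_path, record):
    """Validate a record against the schema keys, then append one JSONL line."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record missing keys: {sorted(missing)}")
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One record per line keeps the log append-only and trivially diffable, which suits a version-controlled benchmark set.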

7. Maintenance

  • When a pipeline stage changes, re-run the prompts that stress that stage.
  • When a new model profile is added, run all 10 before declaring it stable.
  • When a new prompt is added to simple_plan_prompts.jsonl that covers a new region or failure mode, evaluate it for inclusion (target 15 prompts by Q3 2026).
  • The benchmark set is version-controlled here. To change a selection, update this doc and the corresponding run tooling.