Autonomous Execution of a Plan (AI + Human Delegation)
Pitch
Turn a PlanExe plan into a living execution program that runs autonomously where possible, delegates to humans where necessary, and continuously re-plans as reality changes.
Why
A static plan is not enough. Real execution requires:
- resource-aware scheduling
- continuous evidence checks
- delegation to humans when AI capability is insufficient
- iterative re-planning based on new data and outcomes
This proposal extends plan generation into plan execution, with orchestration, governance, and adaptive control.
Problem
- Plans often assume ideal conditions and static timelines.
- AI agents can execute some tasks, but not all (legal, physical, negotiation).
- Human resources are limited and unevenly available.
- Without re-planning, early deviations compound into failure.
Proposed Solution
Build an Execution Engine that:
- Converts plan JSON into an executable task graph.
- Assigns tasks to AI agents or humans based on capability and risk.
- Tracks progress, evidence, and constraints in real time.
- Re-plans dynamically as facts change.
- Produces audit-ready reports and governance checkpoints.
Execution Architecture
PlanExe JSON
-> Task Graph Builder
-> Orchestrator
-> Agent Pool (AI)
-> Human Queue
-> Evidence Validator
-> Risk Gates
-> Progress + Metrics
-> Adaptive Re-Plan
Core Components
1) Task Graph Builder
- Parse PlanExe outputs into a DAG of tasks with:
- dependencies
- required inputs/outputs
- estimated duration and cost
- risk class and verification level
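A minimal sketch of this graph model, assuming the plan JSON exposes a "tasks" array; the field names here are illustrative, not the actual PlanExe schema:

# Sketch: build a task DAG from plan JSON. Field names ("tasks",
# "depends_on") are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    task_id: str
    capability: str
    risk_class: str = "low"        # "low" | "medium" | "high" | "critical"
    depends_on: list = field(default_factory=list)
    est_hours: float = 0.0
    status: str = "queued"

def build_task_graph(plan: dict) -> dict:
    """Return {task_id: TaskNode}, validating that every dependency exists."""
    nodes = {
        t["task_id"]: TaskNode(
            task_id=t["task_id"],
            capability=t["capability"],
            risk_class=t.get("risk_class", "low"),
            depends_on=t.get("depends_on", []),
            est_hours=t.get("est_hours", 0.0),
        )
        for t in plan["tasks"]
    }
    for node in nodes.values():
        for dep in node.depends_on:
            if dep not in nodes:
                raise ValueError(f"{node.task_id} depends on unknown task {dep}")
    return nodes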
2) Capability Registry
- Agents and humans register their capabilities as schemas.
- Each task specifies the capability required.
- Orchestrator performs matching based on skills, availability, and cost.
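A sketch of the matching step; the executor record shape and the scoring rule are assumptions, not an existing PlanExe API:

# Rank registered executors (AI or human) that can perform a task.
def match_executors(task: dict, executors: list) -> list:
    candidates = [
        e for e in executors
        if task["capability"] in e["capabilities"] and e["available"]
    ]
    # Prefer cheaper executors; break ties on higher self-reported skill.
    return sorted(candidates, key=lambda e: (e["cost_per_hour"], -e["skill"]))

executors = [
    {"id": "ai:researcher_1", "capabilities": ["data_gathering"],
     "available": True, "cost_per_hour": 1, "skill": 0.7},
    {"id": "human:legal_01", "capabilities": ["regulatory_filing"],
     "available": True, "cost_per_hour": 120, "skill": 0.95},
]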
3) Orchestrator
- Dispatches tasks to agents or humans.
- Manages retries, fallbacks, and escalation.
- Tracks state transitions (queued, running, blocked, done).
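The state transitions could be encoded as a small table; this sketch adds an assumed "failed" state alongside the four named above:

# Allowed task state transitions; anything else is rejected.
VALID_TRANSITIONS = {
    "queued":  {"running", "blocked"},
    "blocked": {"queued"},
    "running": {"done", "failed", "blocked"},
    "failed":  {"queued"},          # retried tasks re-enter the queue
}

def transition(task: dict, new_status: str) -> None:
    current = task["status"]
    if new_status not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    task["status"] = new_status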
4) Human Task Queue
- Dedicated UI showing:
- task description
- required evidence
- urgency and deadline
- instructions and context
- Human actions logged with rationale.
5) Evidence Validator
- Every task output must pass a schema and evidence check.
- Outputs are tagged with confidence.
- Low-confidence or high-impact results trigger human review.
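A sketch of the review gate; the confidence floor and impact classes are illustrative defaults, not tuned values:

REVIEW_CONFIDENCE_FLOOR = 0.8
HIGH_IMPACT_CLASSES = {"high", "critical"}

def needs_human_review(output_confidence: float, risk_class: str) -> bool:
    """Route low-confidence or high-impact outputs to a human reviewer."""
    return (output_confidence < REVIEW_CONFIDENCE_FLOOR
            or risk_class in HIGH_IMPACT_CLASSES)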
6) Risk Gates
- High-impact tasks require approval before execution.
- Risk gates enforce compliance, safety, and budget limits.
Delegation Model
Capability Matching
- Match task requirements to agent or human skill profiles.
- Consider risk class, data sensitivity, and cost.
Decision Rules
- AI-first for low-risk, repeatable tasks.
- Human-first for legal, regulatory, and negotiation tasks.
- Hybrid for tasks requiring AI prep plus human judgment.
Escalation
- If AI confidence falls below threshold, reroute to human.
- If humans reject, re-plan or adjust scope.
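The decision rules and escalation path could collapse into one routing function; the category set and confidence threshold below are assumptions:

HUMAN_FIRST = {"legal", "regulatory", "negotiation"}
CONFIDENCE_THRESHOLD = 0.75

def route(task: dict, ai_confidence: float) -> str:
    if task["category"] in HUMAN_FIRST:
        return "human"
    if task["risk_class"] in ("high", "critical"):
        return "hybrid"             # AI prepares, human decides
    if ai_confidence < CONFIDENCE_THRESHOLD:
        return "human"              # escalation on low confidence
    return "ai"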
Adaptive Re-Planning
Execution is not linear. The system must re-plan when:
- task outputs contradict assumptions
- dependencies are delayed
- budgets exceed thresholds
- external factors change (regulation, market, supplier)
Re-Planning Loop
- Detect deviation vs baseline.
- Update plan assumptions and constraints.
- Recompute schedule and resource allocation.
- Issue new tasks or adjust milestones.
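A sketch of the deviation check that opens the loop; the tolerance values are placeholders:

def detect_deviation(baseline: dict, actual: dict,
                     cost_tol: float = 0.10, schedule_tol_days: int = 3) -> list:
    """Return a list of deviation reasons; empty means stay on plan."""
    reasons = []
    if actual["cost"] > baseline["cost"] * (1 + cost_tol):
        reasons.append("budget threshold breached")
    if actual["days_elapsed"] - baseline["days_planned"] > schedule_tol_days:
        reasons.append("schedule slip beyond tolerance")
    if actual.get("failed_dependencies"):
        reasons.append("dependency failure")
    return reasons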
Re-Plan Outputs
- Updated Gantt (baseline vs current)
- Revised risk register
- Decision log with rationale
Resource and Capacity Management
- Track available AI resources and human availability.
- Enforce concurrency limits.
- Prioritize high-impact work when capacity is constrained.
Capacity Profile Example
{
  "ai_agents": 12,
  "human_fte": {
    "legal": 1,
    "engineering": 3,
    "procurement": 0.5
  },
  "max_parallel_tasks": 20
}
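A sketch of enforcing this profile at dispatch time; the one-concurrent-task-per-FTE rule is a deliberately crude assumption:

def can_dispatch(profile: dict, running: list, task: dict) -> bool:
    """Return True if the task fits within the capacity profile."""
    if len(running) >= profile["max_parallel_tasks"]:
        return False
    role = task.get("human_role")           # None for AI-executed tasks
    if role is not None:
        busy = sum(1 for t in running if t.get("human_role") == role)
        return busy < profile["human_fte"].get(role, 0)
    ai_busy = sum(1 for t in running if t.get("human_role") is None)
    return ai_busy < profile["ai_agents"]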
Data Contracts
Task Schema
{
  "task_id": "t_001",
  "title": "Obtain environmental permit",
  "capability": "regulatory_filing",
  "priority": "high",
  "risk_class": "critical",
  "inputs": ["site_plan", "impact_assessment"],
  "outputs": ["permit_document"],
  "status": "queued"
}
Execution Event Log
{
  "event_id": "e_445",
  "task_id": "t_001",
  "actor": "human:legal_01",
  "action": "approved",
  "timestamp": "2026-02-11T10:22:00Z",
  "notes": "Permit submitted"
}
Reporting Outputs
- Progress dashboard: status by milestone.
- Risk dashboard: unresolved risks and blockers.
- Evidence coverage: which claims are verified.
- Final execution report:
- timeline vs baseline
- deviations and corrective actions
- decision log
- confidence profile
Example Execution Timeline (With Change Events)
Baseline (Day 0):
- Week 1-2: feasibility + regulatory scoping
- Week 3-4: procurement + vendor selection
- Week 5-8: build + integration
- Week 9: compliance review
- Week 10: launch
Execution Events:
- Day 9: Regulator requests additional environmental study.
- Day 10: Task graph updated, compliance lane becomes critical.
- Day 11: Human legal capacity bottleneck detected.
- Day 12: Orchestrator proposes contractor augmentation and re-baselines timeline.
- Day 14: Updated plan published with new dependencies and budget deltas.
Output: Updated Gantt, risk register, and decision log entry that explains the change and its impact.
Failure Handling and Rollback
Autonomous execution must assume failure cases and encode recovery:
Failure Types
- Data failure: missing or invalid inputs.
- Task failure: agent returns low-confidence or incorrect output.
- External failure: third-party dependency delays or rejects.
- Policy failure: execution violates budget, compliance, or ethics constraints.
Recovery Actions
- Retry with alternative agent (for data and task failures).
- Escalate to human for high-risk or repeated failures.
- Re-plan to adjust dependencies and milestones.
- Rollback to last stable snapshot if downstream impacts are unsafe.
Rollback Mechanism
- Every milestone creates a snapshot of plan state and evidence.
- Rollback reverts to the last validated snapshot.
- All downstream tasks are invalidated and re-queued with updated inputs.
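A minimal in-memory sketch of the snapshot store; a real system would persist snapshots durably alongside the event log:

import copy

class SnapshotStore:
    def __init__(self):
        self._snapshots = []                # list of (milestone_id, state)

    def snapshot(self, milestone_id: str, state: dict) -> None:
        self._snapshots.append((milestone_id, copy.deepcopy(state)))

    def rollback(self) -> tuple:
        """Return the last validated (milestone_id, state) pair."""
        if not self._snapshots:
            raise RuntimeError("no stable snapshot to roll back to")
        return copy.deepcopy(self._snapshots[-1])

The caller then invalidates and re-queues every task downstream of the restored state.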
AI-to-Human Handoff Contract (SLA + Requirements)
For tasks that require human intervention, define a strict handoff contract:
Required Fields
- Task purpose and required outcome.
- Evidence needed and acceptance criteria.
- Deadline and urgency.
- Context pack (inputs, dependencies, prior decisions).
- Suggested next steps and contact references.
SLA Targets
- Acknowledgement: within 24 hours.
- Completion: within task-defined window (default 5 business days).
- Escalation: if no response within SLA, route to alternate human or reduce scope.
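A sketch of the SLA check; the defaults mirror the targets above, with the 5-business-day window approximated here as calendar days:

from datetime import datetime, timedelta, timezone

ACK_SLA = timedelta(hours=24)
COMPLETION_SLA = timedelta(days=5)          # business days approximated

def check_sla(handoff: dict, now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    age = now - handoff["created_at"]
    if handoff.get("acknowledged_at") is None and age > ACK_SLA:
        return "escalate: no acknowledgement within SLA"
    if handoff.get("completed_at") is None and age > COMPLETION_SLA:
        return "escalate: completion window exceeded"
    return "ok"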
Audit Requirement
Every handoff must record:
- human assignee
- response timestamp
- decision rationale
- evidence provided
Safety and Governance
- Immutable audit log for all actions and outputs.
- Explicit sign-offs on high-impact tasks.
- Budget and compliance thresholds enforced by policy.
- Kill-switch to halt execution in emergencies.
Success Metrics
- % tasks executed without manual intervention
- Median deviation vs baseline schedule
- % high-risk tasks with evidence verified
- Time saved vs manual execution coordination
Risks
- Over-automation may obscure context that only humans hold.
- Insufficient human capacity causes bottlenecks.
- False confidence from AI outputs.
Feasibility Tiers
Autonomous execution is feasible only with clear boundaries. Define tiers by risk and controllability:
- Tier 1 (Feasible Now): low-risk, repeatable tasks (data gathering, summarization, formatting, internal reporting).
- Tier 2 (Partially Feasible): tasks that can be AI-driven but require human approval (procurement drafts, compliance checklists, vendor shortlists).
- Tier 3 (Not Feasible Without Human Lead): legal filings, negotiations, physical execution, regulatory approvals, and high-stakes financial commitments.
Feasibility Adjustments
To make autonomous execution practical, enforce the following adjustments:
- Scope limits: constrain execution to well-defined domains and task types.
- Hard policy gates: require explicit human approval for high-impact steps.
- Evidence sufficiency checks: block execution when inputs are unverified.
- Capacity-aware scheduling: align execution with real human availability.
- Rollback readiness: snapshot after each milestone and auto-revert on critical deviation.
Staged Rollout Plan
Phase 1: Autonomous Coordination
- Task graphing, scheduling, and progress tracking.
- Human execution for all tasks.
- Evidence collection and audit logging only.
Phase 2: Low-Risk Autonomous Execution
- AI executes low-risk tasks under strict validation.
- Human approval for all medium- and high-risk tasks.
- Adaptive re-planning enabled.
Phase 3: Selective Autonomous Execution
- AI executes a broader set of tasks with confidence thresholds.
- High-risk steps remain human-led.
- Continuous monitoring and rollback enforced.
Scope Checklist (Tier Classification)
Use this checklist to classify each task into Tier 1/2/3. If any Tier 3 condition is true, the task is human-led.
Tier 1 (AI-Executable)
- Internal analysis or data aggregation only.
- No external commitments or legal exposure.
- Outputs are reversible and low-cost to redo.
- Evidence inputs are verified and structured.
Tier 2 (AI-With-Approval)
- External communication or procurement drafts.
- Moderate budget impact but reversible.
- Requires validation by a human reviewer.
- Evidence is strong but not fully audited.
Tier 3 (Human-Led)
- Legal filings, regulatory submissions, or compliance sign-offs.
- Negotiations, contracts, or financial commitments.
- Physical execution or safety-critical actions.
- High budget impact or irreversible decisions.
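The checklist reduces to a short classifier; the boolean flags below are illustrative names, and any Tier 3 condition wins:

def classify_tier(task: dict) -> int:
    if (task.get("legal_exposure") or task.get("irreversible")
            or task.get("physical_action") or task.get("high_budget_impact")):
        return 3                            # human-led
    if task.get("external_communication") or task.get("moderate_budget_impact"):
        return 2                            # AI with human approval
    return 1                                # AI-executable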
Roadmap
- MVP orchestrator with basic DAG execution.
- Capability registry and task schema.
- Human task queue integration.
- Evidence validator and risk gate layer.
- Adaptive re-planning and reporting.
- Pilot on a real plan with human oversight.
Detailed Implementation Plan
Phase A — Execution Core (3–4 weeks)
- Build canonical execution graph model in code:
- task node
- dependency edge
- execution state machine
- risk class and verification requirements
- Implement orchestrator service with deterministic scheduling:
- queueing
- dependency resolution
- retries
- timeout + cancellation semantics
- Add durable event stream:
- all state transitions append-only
- idempotent event handlers
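A sketch of the durable stream with idempotent handling, keyed on event_id as in the Execution Event Log contract; a real implementation would rebuild the seen-set from the log on startup:

import json

class EventStream:
    def __init__(self, path: str = "events.log"):
        self.path = path
        self.seen = set()

    def append(self, event: dict) -> None:
        if event["event_id"] in self.seen:
            return                          # idempotent: duplicates are no-ops
        with open(self.path, "a") as f:     # append-only
            f.write(json.dumps(event) + "\n")
        self.seen.add(event["event_id"])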
Phase B — Capability and Assignment Layer (2–3 weeks)
- Define capability contract schema for agents/humans:
- capability id
- confidence profile
- availability window
- trust tier
- Implement assignment policy engine:
- AI-first for low-risk tasks
- human-first for legal/regulatory/irreversible actions
- hybrid path for AI-prep + human decision
- Add capacity-aware scoring:
- minimize queue aging
- avoid overloading constrained human roles
Phase C — Validation + Governance (2 weeks)
- Add evidence validator gate before task completion.
- Add policy gate before high-impact task execution.
- Implement emergency kill-switch + rollback to last stable milestone snapshot.
Phase D — Adaptive Re-Planning (2–3 weeks)
- Detect deviation triggers:
- cost/schedule threshold breach
- failed dependencies
- assumption drift from external signals
- Re-plan workflow:
- clone active graph state
- regenerate local schedule around affected subgraph
- preserve completed task evidence and audit lineage
- Publish delta artifacts:
- baseline vs current timeline
- budget delta
- decision rationale log
Data model additions
- execution_tasks
- execution_events
- assignment_decisions
- policy_gate_results
- replan_snapshots
All tables should be indexed by plan_id, run_id, and task_id for audit and replay.
API additions (suggested)
- POST /api/execution/start/{plan_id}
- GET /api/execution/status/{run_id}
- POST /api/execution/replan/{run_id}
- POST /api/execution/stop/{run_id}
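A sketch of these endpoints, assuming a FastAPI-style service with an in-memory run store (not necessarily what PlanExe uses today):

from uuid import uuid4
from fastapi import FastAPI

app = FastAPI()
RUNS = {}   # in-memory stand-in for the execution tables above

@app.post("/api/execution/start/{plan_id}")
def start(plan_id: str):
    run_id = str(uuid4())
    RUNS[run_id] = {"plan_id": plan_id, "status": "running"}
    return {"run_id": run_id}

@app.get("/api/execution/status/{run_id}")
def status(run_id: str):
    return RUNS.get(run_id, {"error": "unknown run_id"})

@app.post("/api/execution/stop/{run_id}")
def stop(run_id: str):
    run = RUNS.get(run_id)
    if run is None:
        return {"error": "unknown run_id"}
    run["status"] = "stopped"
    return run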
Rollout safety controls
- Start with “coordination-only mode” (humans execute all tasks)
- Enable autonomous execution for Tier 1 tasks only
- Require opt-in per workspace for Tier 2
- Keep Tier 3 permanently human-led unless explicit governance override
Validation checklist
- Event consistency under retries/restarts
- Correct dependency execution ordering
- No policy-gated task executes without approval
- Rollback replay determinism
- Human handoff SLA conformance