Autonomous Execution of a Plan (AI + Human Delegation)
Pitch
Turn a PlanExe plan into a living execution program that runs autonomously where possible, delegates to humans where necessary, and continuously re-plans as reality changes.
Why
A static plan is not enough. Real execution requires:
- resource-aware scheduling
- continuous evidence checks
- delegation to humans when AI capability is insufficient
- iterative re-planning based on new data and outcomes
This proposal extends plan generation into plan execution, with orchestration, governance, and adaptive control.
Problem
- Plans often assume ideal conditions and static timelines.
- AI agents can execute some tasks, but not all (legal, physical, negotiation).
- Human resources are limited and unevenly available.
- Without re-planning, early deviations compound into failure.
Proposed Solution
Build an Execution Engine that:
- Converts plan JSON into an executable task graph.
- Assigns tasks to AI agents or humans based on capability and risk.
- Tracks progress, evidence, and constraints in real time.
- Re-plans dynamically as facts change.
- Produces audit-ready reports and governance checkpoints.
Execution Architecture
PlanExe JSON
-> Task Graph Builder
-> Orchestrator
-> Agent Pool (AI)
-> Human Queue
-> Evidence Validator
-> Risk Gates
-> Progress + Metrics
-> Adaptive Re-Plan
Core Components
1) Task Graph Builder
- Parse PlanExe outputs into a DAG of tasks with:
- dependencies
- required inputs/outputs
- estimated duration and cost
- risk class and verification level
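A minimal sketch of this graph model, assuming the plan JSON exposes a "tasks" array; the field names here are illustrative, not the actual PlanExe schema:

# Sketch: build a task DAG from plan JSON. Field names ("tasks",
# "depends_on") are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    task_id: str
    capability: str
    risk_class: str = "low"        # "low" | "medium" | "high" | "critical"
    depends_on: list = field(default_factory=list)
    est_hours: float = 0.0
    status: str = "queued"

def build_task_graph(plan: dict) -> dict:
    """Return {task_id: TaskNode}, validating that every dependency exists."""
    nodes = {
        t["task_id"]: TaskNode(
            task_id=t["task_id"],
            capability=t["capability"],
            risk_class=t.get("risk_class", "low"),
            depends_on=t.get("depends_on", []),
            est_hours=t.get("est_hours", 0.0),
        )
        for t in plan["tasks"]
    }
    for node in nodes.values():
        for dep in node.depends_on:
            if dep not in nodes:
                raise ValueError(f"{node.task_id} depends on unknown task {dep}")
    return nodes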
2) Capability Registry
- Agents and humans register their capabilities as schemas.
- Each task specifies the capability required.
- Orchestrator performs matching based on skills, availability, and cost.
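A sketch of the matching step; the executor record shape and the scoring rule are assumptions, not an existing PlanExe API:

# Rank registered executors (AI or human) that can perform a task.
def match_executors(task: dict, executors: list) -> list:
    candidates = [
        e for e in executors
        if task["capability"] in e["capabilities"] and e["available"]
    ]
    # Prefer cheaper executors; break ties on higher self-reported skill.
    return sorted(candidates, key=lambda e: (e["cost_per_hour"], -e["skill"]))

executors = [
    {"id": "ai:researcher_1", "capabilities": ["data_gathering"],
     "available": True, "cost_per_hour": 1, "skill": 0.7},
    {"id": "human:legal_01", "capabilities": ["regulatory_filing"],
     "available": True, "cost_per_hour": 120, "skill": 0.95},
]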
3) Orchestrator
- Dispatches tasks to agents or humans.
- Manages retries, fallbacks, and escalation.
- Tracks state transitions (queued, running, blocked, done).
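The state transitions could be encoded as a small table; this sketch adds an assumed "failed" state alongside the four named above:

# Allowed task state transitions; anything else is rejected.
VALID_TRANSITIONS = {
    "queued":  {"running", "blocked"},
    "blocked": {"queued"},
    "running": {"done", "failed", "blocked"},
    "failed":  {"queued"},          # retried tasks re-enter the queue
}

def transition(task: dict, new_status: str) -> None:
    current = task["status"]
    if new_status not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    task["status"] = new_status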
4) Human Task Queue
- Dedicated UI showing:
- task description
- required evidence
- urgency and deadline
- instructions and context
- Human actions logged with rationale.
5) Evidence Validator
- Every task output must pass a schema and evidence check.
- Outputs are tagged with confidence.
- Low-confidence or high-impact results trigger human review.
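A sketch of the review gate; the confidence floor and impact classes are illustrative defaults, not tuned values:

REVIEW_CONFIDENCE_FLOOR = 0.8
HIGH_IMPACT_CLASSES = {"high", "critical"}

def needs_human_review(output_confidence: float, risk_class: str) -> bool:
    """Route low-confidence or high-impact outputs to a human reviewer."""
    return (output_confidence < REVIEW_CONFIDENCE_FLOOR
            or risk_class in HIGH_IMPACT_CLASSES)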
6) Risk Gates
- High-impact tasks require approval before execution.
- Risk gates enforce compliance, safety, and budget limits.
Delegation Model
Capability Matching
- Match task requirements to agent or human skill profiles.
- Consider risk class, data sensitivity, and cost.
Decision Rules
- AI-first for low-risk, repeatable tasks.
- Human-first for legal, regulatory, and negotiation tasks.
- Hybrid for tasks requiring AI prep plus human judgment.
Escalation
- If AI confidence falls below threshold, reroute to human.
- If humans reject, re-plan or adjust scope.
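The decision rules and escalation path could collapse into one routing function; the category set and confidence threshold below are assumptions:

HUMAN_FIRST = {"legal", "regulatory", "negotiation"}
CONFIDENCE_THRESHOLD = 0.75

def route(task: dict, ai_confidence: float) -> str:
    if task["category"] in HUMAN_FIRST:
        return "human"
    if task["risk_class"] in ("high", "critical"):
        return "hybrid"             # AI prepares, human decides
    if ai_confidence < CONFIDENCE_THRESHOLD:
        return "human"              # escalation on low confidence
    return "ai"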
Adaptive Re-Planning
Execution is not linear. The system must re-plan when:
- task outputs contradict assumptions
- dependencies are delayed
- budgets exceed thresholds
- external factors change (regulation, market, supplier)
Re-Planning Loop
- Detect deviation vs baseline.
- Update plan assumptions and constraints.
- Recompute schedule and resource allocation.
- Issue new tasks or adjust milestones.
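A sketch of the deviation check that opens the loop; the tolerance values are placeholders:

def detect_deviation(baseline: dict, actual: dict,
                     cost_tol: float = 0.10, schedule_tol_days: int = 3) -> list:
    """Return a list of deviation reasons; empty means stay on plan."""
    reasons = []
    if actual["cost"] > baseline["cost"] * (1 + cost_tol):
        reasons.append("budget threshold breached")
    if actual["days_elapsed"] - baseline["days_planned"] > schedule_tol_days:
        reasons.append("schedule slip beyond tolerance")
    if actual.get("failed_dependencies"):
        reasons.append("dependency failure")
    return reasons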
Re-Plan Outputs
- Updated Gantt (baseline vs current)
- Revised risk register
- Decision log with rationale
Resource and Capacity Management
- Track available AI resources and human availability.
- Enforce concurrency limits.
- Prioritize high-impact work when capacity is constrained.
Capacity Profile Example
{
  "ai_agents": 12,
  "human_fte": {
    "legal": 1,
    "engineering": 3,
    "procurement": 0.5
  },
  "max_parallel_tasks": 20
}
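A sketch of enforcing this profile at dispatch time; the one-concurrent-task-per-FTE rule is a deliberately crude assumption:

def can_dispatch(profile: dict, running: list, task: dict) -> bool:
    """Return True if the task fits within the capacity profile."""
    if len(running) >= profile["max_parallel_tasks"]:
        return False
    role = task.get("human_role")           # None for AI-executed tasks
    if role is not None:
        busy = sum(1 for t in running if t.get("human_role") == role)
        return busy < profile["human_fte"].get(role, 0)
    ai_busy = sum(1 for t in running if t.get("human_role") is None)
    return ai_busy < profile["ai_agents"]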
Data Contracts
Task Schema
{
  "task_id": "t_001",
  "title": "Obtain environmental permit",
  "capability": "regulatory_filing",
  "priority": "high",
  "risk_class": "critical",
  "inputs": ["site_plan", "impact_assessment"],
  "outputs": ["permit_document"],
  "status": "queued"
}
Execution Event Log
{
  "event_id": "e_445",
  "task_id": "t_001",
  "actor": "human:legal_01",
  "action": "approved",
  "timestamp": "2026-02-11T10:22:00Z",
  "notes": "Permit submitted"
}
Reporting Outputs
- Progress dashboard: status by milestone.
- Risk dashboard: unresolved risks and blockers.
- Evidence coverage: which claims are verified.
- Final execution report:
- timeline vs baseline
- deviations and corrective actions
- decision log
- confidence profile
Example Execution Timeline (With Change Events)
Baseline (Day 0):
- Week 1-2: feasibility + regulatory scoping
- Week 3-4: procurement + vendor selection
- Week 5-8: build + integration
- Week 9: compliance review
- Week 10: launch
Execution Events:
- Day 9: Regulator requests additional environmental study.
- Day 10: Task graph updated, compliance lane becomes critical.
- Day 11: Human legal capacity bottleneck detected.
- Day 12: Orchestrator proposes contractor augmentation and re-baselines timeline.
- Day 14: Updated plan published with new dependencies and budget deltas.
Output: Updated Gantt, risk register, and decision log entry that explains the change and its impact.
Failure Handling and Rollback
Autonomous execution must assume failure cases and encode recovery:
Failure Types
- Data failure: missing or invalid inputs.
- Task failure: agent returns low-confidence or incorrect output.
- External failure: third-party dependency delays or rejects.
- Policy failure: execution violates budget, compliance, or ethics constraints.
Recovery Actions
- Retry with alternative agent (for data and task failures).
- Escalate to human for high-risk or repeated failures.
- Re-plan to adjust dependencies and milestones.
- Rollback to last stable snapshot if downstream impacts are unsafe.
Rollback Mechanism
- Every milestone creates a snapshot of plan state and evidence.
- Rollback reverts to the last validated snapshot.
- All downstream tasks are invalidated and re-queued with updated inputs.
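A minimal in-memory sketch of the snapshot store; a real system would persist snapshots durably alongside the event log:

import copy

class SnapshotStore:
    def __init__(self):
        self._snapshots = []                # list of (milestone_id, state)

    def snapshot(self, milestone_id: str, state: dict) -> None:
        self._snapshots.append((milestone_id, copy.deepcopy(state)))

    def rollback(self) -> tuple:
        """Return the last validated (milestone_id, state) pair."""
        if not self._snapshots:
            raise RuntimeError("no stable snapshot to roll back to")
        return copy.deepcopy(self._snapshots[-1])

The caller then invalidates and re-queues every task downstream of the restored state.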
AI-to-Human Handoff Contract (SLA + Requirements)
For tasks that require human intervention, define a strict handoff contract:
Required Fields
- Task purpose and required outcome.
- Evidence needed and acceptance criteria.
- Deadline and urgency.
- Context pack (inputs, dependencies, prior decisions).
- Suggested next steps and contact references.
SLA Targets
- Acknowledgement: within 24 hours.
- Completion: within task-defined window (default 5 business days).
- Escalation: if no response within SLA, route to alternate human or reduce scope.
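A sketch of the SLA check; the defaults mirror the targets above, with the 5-business-day window approximated here as calendar days:

from datetime import datetime, timedelta, timezone

ACK_SLA = timedelta(hours=24)
COMPLETION_SLA = timedelta(days=5)          # business days approximated

def check_sla(handoff: dict, now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    age = now - handoff["created_at"]
    if handoff.get("acknowledged_at") is None and age > ACK_SLA:
        return "escalate: no acknowledgement within SLA"
    if handoff.get("completed_at") is None and age > COMPLETION_SLA:
        return "escalate: completion window exceeded"
    return "ok"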
Audit Requirement
Every handoff must record:
- human assignee
- response timestamp
- decision rationale
- evidence provided
Safety and Governance
- Immutable audit log for all actions and outputs.
- Explicit sign-offs on high-impact tasks.
- Budget and compliance thresholds enforced by policy.
- Kill-switch to halt execution in emergencies.
Success Metrics
- % tasks executed without manual intervention
- Median deviation vs baseline schedule
- % high-risk tasks with evidence verified
- Time saved vs manual execution coordination
Risks
- Over-automation may obscure context that only humans hold.
- Insufficient human capacity causes bottlenecks.
- False confidence from AI outputs.
Feasibility Tiers
Autonomous execution is feasible only with clear boundaries. Define tiers by risk and controllability:
- Tier 1 (Feasible Now): low-risk, repeatable tasks (data gathering, summarization, formatting, internal reporting).
- Tier 2 (Partially Feasible): tasks that can be AI-driven but require human approval (procurement drafts, compliance checklists, vendor shortlists).
- Tier 3 (Not Feasible Without Human Lead): legal filings, negotiations, physical execution, regulatory approvals, and high-stakes financial commitments.
Feasibility Adjustments
To make autonomous execution practical, enforce the following adjustments:
- Scope limits: constrain execution to well-defined domains and task types.
- Hard policy gates: require explicit human approval for high-impact steps.
- Evidence sufficiency checks: block execution when inputs are unverified.
- Capacity-aware scheduling: align execution with real human availability.
- Rollback readiness: snapshot after each milestone and auto-revert on critical deviation.
Staged Rollout Plan
Phase 1: Autonomous Coordination
- Task graphing, scheduling, and progress tracking.
- Human execution for all tasks.
- Evidence collection and audit logging only.
Phase 2: Low-Risk Autonomous Execution
- AI executes low-risk tasks under strict validation.
- Human approval for all medium- and high-risk tasks.
- Adaptive re-planning enabled.
Phase 3: Selective Autonomous Execution
- AI executes a broader set of tasks with confidence thresholds.
- High-risk steps remain human-led.
- Continuous monitoring and rollback enforced.
Scope Checklist (Tier Classification)
Use this checklist to classify each task into Tier 1/2/3. If any Tier 3 condition is true, the task is human-led.
Tier 1 (AI-Executable)
- Internal analysis or data aggregation only.
- No external commitments or legal exposure.
- Outputs are reversible and low-cost to redo.
- Evidence inputs are verified and structured.
Tier 2 (AI-With-Approval)
- External communication or procurement drafts.
- Moderate budget impact but reversible.
- Requires validation by a human reviewer.
- Evidence is strong but not fully audited.
Tier 3 (Human-Led)
- Legal filings, regulatory submissions, or compliance sign-offs.
- Negotiations, contracts, or financial commitments.
- Physical execution or safety-critical actions.
- High budget impact or irreversible decisions.
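The checklist reduces to a short classifier; the boolean flags below are illustrative names, and any Tier 3 condition wins:

def classify_tier(task: dict) -> int:
    if (task.get("legal_exposure") or task.get("irreversible")
            or task.get("physical_action") or task.get("high_budget_impact")):
        return 3                            # human-led
    if task.get("external_communication") or task.get("moderate_budget_impact"):
        return 2                            # AI with human approval
    return 1                                # AI-executable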
Roadmap
- MVP orchestrator with basic DAG execution.
- Capability registry and task schema.
- Human task queue integration.
- Evidence validator and risk gate layer.
- Adaptive re-planning and reporting.
- Pilot on a real plan with human oversight.
Detailed Implementation Plan
Phase A — Execution Core (3–4 weeks)
- Build canonical execution graph model in code:
- task node
- dependency edge
- execution state machine
- risk class and verification requirements
- Implement orchestrator service with deterministic scheduling:
- queueing
- dependency resolution
- retries
- timeout + cancellation semantics
- Add durable event stream:
- all state transitions append-only
- idempotent event handlers
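A sketch of the durable stream with idempotent handling, keyed on event_id as in the Execution Event Log contract; a real implementation would rebuild the seen-set from the log on startup:

import json

class EventStream:
    def __init__(self, path: str = "events.log"):
        self.path = path
        self.seen = set()

    def append(self, event: dict) -> None:
        if event["event_id"] in self.seen:
            return                          # idempotent: duplicates are no-ops
        with open(self.path, "a") as f:     # append-only
            f.write(json.dumps(event) + "\n")
        self.seen.add(event["event_id"])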
Phase B — Capability and Assignment Layer (2–3 weeks)
- Define capability contract schema for agents/humans:
- capability id
- confidence profile
- availability window
- trust tier
- Implement assignment policy engine:
- AI-first for low-risk tasks
- human-first for legal/regulatory/irreversible actions
- hybrid path for AI-prep + human decision
- Add capacity-aware scoring:
- minimize queue aging
- avoid overloading constrained human roles
Phase C — Validation + Governance (2 weeks)
- Add evidence validator gate before task completion.
- Add policy gate before high-impact task execution.
- Implement emergency kill-switch + rollback to last stable milestone snapshot.
Phase D — Adaptive Re-Planning (2–3 weeks)
- Detect deviation triggers:
- cost/schedule threshold breach
- failed dependencies
- assumption drift from external signals
- Re-plan workflow:
- clone active graph state
- regenerate local schedule around affected subgraph
- preserve completed task evidence and audit lineage
- Publish delta artifacts:
- baseline vs current timeline
- budget delta
- decision rationale log
Data model additions
- execution_tasks
- execution_events
- assignment_decisions
- policy_gate_results
- replan_snapshots
All tables should be indexed by plan_id, run_id, and task_id for audit and replay.
API additions (suggested)
- POST /api/execution/start/{plan_id}
- GET /api/execution/status/{run_id}
- POST /api/execution/replan/{run_id}
- POST /api/execution/stop/{run_id}
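A sketch of these endpoints, assuming a FastAPI-style service with an in-memory run store (not necessarily what PlanExe uses today):

from uuid import uuid4
from fastapi import FastAPI

app = FastAPI()
RUNS = {}   # in-memory stand-in for the execution tables above

@app.post("/api/execution/start/{plan_id}")
def start(plan_id: str):
    run_id = str(uuid4())
    RUNS[run_id] = {"plan_id": plan_id, "status": "running"}
    return {"run_id": run_id}

@app.get("/api/execution/status/{run_id}")
def status(run_id: str):
    return RUNS.get(run_id, {"error": "unknown run_id"})

@app.post("/api/execution/stop/{run_id}")
def stop(run_id: str):
    run = RUNS.get(run_id)
    if run is None:
        return {"error": "unknown run_id"}
    run["status"] = "stopped"
    return run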
Rollout safety controls
- Start with “coordination-only mode” (humans execute all tasks)
- Enable autonomous execution for Tier 1 tasks only
- Require opt-in per workspace for Tier 2
- Keep Tier 3 permanently human-led unless explicit governance override
Validation checklist
- Event consistency under retries/restarts
- Correct dependency execution ordering
- No policy-gated task executes without approval
- Rollback replay determinism
- Human handoff SLA conformance