Distributed Plan Execution - Worker Pool Parallelism
Overview
PlanExe's plan generation pipeline currently runs sequentially on a single worker. For complex, multi-stage plans (research → outline → expand → review), this creates bottlenecks and wastes compute when stages could run in parallel.
This proposal introduces a distributed execution model with worker pool parallelism and DAG-based scheduling for compute-heavy plan stages.
Problem
- Single-threaded execution = slow generation for complex plans
- Wasted compute: Outline stage could start while research continues
- No horizontal scaling: Can't throw more workers at the problem
- Railway infrastructure supports multi-worker deployments, but the pipeline doesn't use it
Proposed Solution
Architecture
┌──────────────────────┐
│ Plan Request │
│ (HTTP API) │
└──────────┬───────────┘
│
v
┌──────────────────────┐
│ DAG Scheduler │ ← Determines stage dependencies
│ (Coordinator) │ and dispatches to workers
└──────────┬───────────┘
│
┌─────┴─────┬─────────┬─────────┐
v v v v
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │
│(Research)│ │(Outline) │ │(Expand)  │ │(Review)  │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
└───────────┴─────────┴─────────┘
│
v
┌───────────────┐
│ Redis Queue │ ← Job state + results
└───────────────┘
Stage Dependency DAG
# Example DAG for a standard business plan
plan_dag = {
    "research": {
        "depends_on": [],
        "parallelizable": True,
        "subtasks": ["market_research", "competitor_analysis", "regulatory_research"],
    },
    "outline": {
        "depends_on": ["research"],
        "parallelizable": False,
    },
    "expand_sections": {
        "depends_on": ["outline"],
        "parallelizable": True,
        "subtasks": ["exec_summary", "market_analysis", "operations", "financial"],
    },
    "review": {
        "depends_on": ["expand_sections"],
        "parallelizable": False,
    },
    "format": {
        "depends_on": ["review"],
        "parallelizable": False,
    },
}
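Given such a table, the coordinator can compute which stages are dispatchable with a simple topological pass. A minimal sketch (condensed `plan_dag`; the real scheduler would also consult `parallelizable` and `subtasks`):

```python
# Minimal DAG scheduler sketch: group stages into "waves" such that every
# stage in a wave has all of its dependencies satisfied by earlier waves.
# All stages within one wave can be dispatched to workers concurrently.
def execution_waves(dag):
    done, waves = set(), []
    while len(done) < len(dag):
        ready = [s for s, spec in dag.items()
                 if s not in done and all(d in done for d in spec["depends_on"])]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in DAG")
        waves.append(sorted(ready))
        done.update(ready)
    return waves

# Condensed version of the plan_dag above (only depends_on matters here).
plan_dag = {
    "research":        {"depends_on": []},
    "outline":         {"depends_on": ["research"]},
    "expand_sections": {"depends_on": ["outline"]},
    "review":          {"depends_on": ["expand_sections"]},
    "format":          {"depends_on": ["review"]},
}
```

For the standard business plan the waves are strictly sequential; DAGs with independent branches would produce multi-stage waves that the worker pool can run in parallel.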
Worker Pool Management
Railway Configuration:
# railway.toml — worker service (key names may vary by Railway version)
[build]
builder = "DOCKERFILE"
dockerfilePath = "Dockerfile.worker"

[deploy]
numReplicas = 5  # scale based on load

# Service variables (set in the Railway dashboard or via shared variables):
#   REDIS_URL   = ${REDIS_URL}
#   WORKER_POOL = plan_execution
Task Queue (Celery-style):
from celery import Celery
app = Celery('planexe', broker='redis://localhost:6379/0')
@app.task(name='stage.research')
def execute_research_stage(plan_id, prompt_context):
# Run research subtasks in parallel
results = group([
research_market.s(plan_id, prompt_context),
research_competitors.s(plan_id, prompt_context),
research_regulatory.s(plan_id, prompt_context)
])()
return results.get()
@app.task(name='stage.outline')
def execute_outline_stage(plan_id, research_results):
# Depends on research completion
return generate_outline(plan_id, research_results)
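Until the Celery workers are deployed, the same subtask fan-out can be exercised in-process. A sketch using a thread pool as a local stand-in for `group()` (the subtask functions here are placeholders, not the real research calls):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel_subtasks(subtasks, plan_id, context):
    """Run a stage's subtasks concurrently and collect results in submission order."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [pool.submit(fn, plan_id, context) for fn in subtasks]
        return [f.result() for f in futures]

# Placeholder subtasks standing in for the real research calls.
def market(plan_id, ctx):
    return {"subtask": "market", "plan": plan_id}

def competitors(plan_id, ctx):
    return {"subtask": "competitors", "plan": plan_id}

results = run_parallel_subtasks([market, competitors], "plan-1", {})
```

Because results come back in submission order, swapping this loop for a Celery `group` later should not change downstream stage inputs.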
Implementation Plan
Phase 1: DAG Scheduler (Week 1-2)
- Define stage dependency graph schema (YAML config)
- Build coordinator service that parses the DAG and dispatches tasks
- Add Redis for job state management
- Single-worker proof of concept
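The YAML schema from the first bullet might mirror the Python DAG above; a draft (exact field names to be decided):

```yaml
# dag.yaml — stage dependency graph (draft schema)
stages:
  research:
    depends_on: []
    parallelizable: true
    subtasks: [market_research, competitor_analysis, regulatory_research]
  outline:
    depends_on: [research]
    parallelizable: false
```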
Phase 2: Worker Pool (Week 3)
- Deploy 3-5 workers on Railway
- Implement task routing and load balancing
- Add retry logic and failure handling
- Monitor queue depth and worker utilization
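The retry bullet can follow standard capped exponential backoff. A broker-free sketch of the policy (in production, Celery's `autoretry_for`, `retry_backoff`, and `max_retries` options cover this):

```python
def backoff_schedule(max_retries=3, base=2.0, cap=60.0):
    """Delay in seconds before each retry attempt: base**attempt, capped."""
    return [min(cap, base ** attempt) for attempt in range(1, max_retries + 1)]

def run_with_retries(task, max_retries=3):
    """Run a task, retrying on failure; returns (result, attempts_used)."""
    last_exc = None
    for attempt in range(1, max_retries + 2):  # initial try + max_retries retries
        try:
            return task(), attempt
        except Exception as exc:  # in production: retry only transient errors
            last_exc = exc
    raise last_exc
```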
Phase 3: Parallel Stages (Week 4)
- Enable parallel execution for research subtasks
- Enable parallel execution for section expansion
- Add progress reporting (% complete across all workers)
- Optimize stage chunking for latency
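Cross-worker progress can be derived from per-stage subtask counters kept in Redis. A sketch of the aggregation as a pure function (equal stage weighting is an assumption; weights could differ per stage):

```python
def plan_progress(stage_status):
    """Overall % complete for a plan.

    stage_status maps stage name -> (done_subtasks, total_subtasks); each
    stage contributes done/total, averaged equally over all stages."""
    if not stage_status:
        return 0.0
    fractions = [done / total for done, total in stage_status.values()]
    return round(100.0 * sum(fractions) / len(fractions), 1)

status = {
    "research":        (3, 3),  # all subtasks done
    "outline":         (1, 1),
    "expand_sections": (2, 4),  # half done
    "review":          (0, 1),
    "format":          (0, 1),
}
```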
Phase 4: Auto-Scaling (Week 5+)
- Dynamic worker scaling based on queue depth
- Cost optimization (scale down during off-hours)
- Priority queues (premium users get dedicated workers)
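Queue-depth-based scaling reduces to a small decision function the coordinator can evaluate periodically. A sketch (the tasks-per-worker ratio and bounds are illustrative, not tuned):

```python
def desired_replicas(queue_depth, tasks_per_worker=4, min_workers=1, max_workers=10):
    """Target worker count: one worker per `tasks_per_worker` queued tasks,
    clamped to [min_workers, max_workers] to bound cost and cold starts."""
    want = -(-queue_depth // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, want))
```

Scaling down aggressively (the cost-optimization bullet) falls out of the same function: an empty queue yields `min_workers`.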
Benefits
- 3-5x faster plan generation for complex plans
- Horizontal scaling - add more workers as load increases
- Better resource utilization - multiple stages run concurrently
- Resilience - worker failure doesn't kill entire plan generation
- Cost efficiency - pay for compute only when queue is deep
Technical Stack
- Task Queue: Celery + Redis (battle-tested, Python-native)
- DAG Engine: Custom lightweight scheduler (simpler than Airflow for our use case)
- Worker Runtime: Docker containers on Railway
- State Storage: Redis (job metadata) + PostgreSQL (completed plans)
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Added complexity | Start with simple DAG, expand gradually |
| Redis becomes bottleneck | Use Redis cluster, cache subtask results |
| Worker coordination overhead | Keep DAG shallow (max 5 stages), minimize inter-worker communication |
| Cost increase | Monitor worker utilization, scale down aggressively |
| Debugging harder | Centralized logging (Sentry), trace IDs across workers |
Success Metrics
- Average plan generation time decreases by 50%+
- Worker CPU utilization stays at 60-80% (not idle, not maxed)
- Task retry rate < 2% (most jobs succeed on the first try)
- P95 latency under 10 minutes for a standard business plan
Future Enhancements
- GPU workers for vision/multimodal stages
- Speculative execution (start likely next stage before deps finish)
- Agent-specific worker pools (specialized workers for finance plans vs. tech plans)
References
- Celery documentation: https://docs.celeryq.dev/
- Railway multi-service deploys: https://docs.railway.app/
- DAG scheduling patterns: Apache Airflow, Prefect, Temporal
Detailed Implementation Plan
Phase A — Distributed Runtime Topology
- Define coordinator + worker architecture.
- Partition execution graph into shardable task groups.
- Add worker heartbeat and lease ownership semantics.
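Heartbeats and lease ownership can follow Redis's `SET key owner NX PX` pattern: a worker owns a task while its lease is live and renews it on each heartbeat. An in-memory sketch of the semantics (the injected clock makes expiry testable; production would use Redis itself):

```python
class LeaseTable:
    """Task leases mirroring Redis `SET key owner NX PX ttl` semantics in memory."""

    def __init__(self, clock):
        self.clock = clock   # callable returning current time in seconds
        self.leases = {}     # task_id -> (owner, expires_at)

    def acquire(self, task_id, owner, ttl):
        """Take the lease unless another worker holds a live one."""
        now = self.clock()
        holder = self.leases.get(task_id)
        if holder and holder[1] > now and holder[0] != owner:
            return False
        self.leases[task_id] = (owner, now + ttl)
        return True

    def heartbeat(self, task_id, owner, ttl):
        """Renew the lease, but only if this worker still owns it."""
        holder = self.leases.get(task_id)
        if holder and holder[0] == owner and holder[1] > self.clock():
            self.leases[task_id] = (owner, self.clock() + ttl)
            return True
        return False
```

A worker that stops heartbeating loses its lease on expiry, so the coordinator can safely reassign the task.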
Phase B — Queue and Retry Semantics
- Introduce queue topics by task class and priority.
- Implement idempotent workers with attempt counters.
- Add dead-letter queues and replay tooling.
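Idempotent consumption with attempt counters and a dead-letter queue can be sketched in memory (`seen_results` stands in for a Redis result cache, `dead_letter` for a DLQ topic):

```python
MAX_ATTEMPTS = 3  # illustrative; would be configured per task class

def process_once(task, seen_results, attempts, dead_letter, handler):
    """Idempotent consume: duplicate deliveries return the cached result
    without re-running side effects; repeated failures route the task to
    the dead-letter queue for replay tooling."""
    task_id = task["id"]
    if task_id in seen_results:
        return seen_results[task_id]
    attempts[task_id] = attempts.get(task_id, 0) + 1
    try:
        result = handler(task)
    except Exception:
        if attempts[task_id] >= MAX_ATTEMPTS:
            dead_letter.append(task)
        return None
    seen_results[task_id] = result
    return result
```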
Phase C — Consistency and Recovery
- Persist checkpoint snapshots per milestone.
- Implement coordinator failover strategy.
- Add exactly-once/at-least-once mode selection by task type.
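Milestone checkpoints reduce to snapshot/restore against the state store. A sketch with a dict standing in for Redis/PostgreSQL (key layout is illustrative):

```python
import json

def save_checkpoint(store, plan_id, milestone, state):
    """Persist a snapshot after each completed milestone; the `latest`
    pointer lets recovery find the newest snapshot in one lookup."""
    store[f"checkpoint:{plan_id}:{milestone}"] = json.dumps(state)
    store[f"checkpoint:{plan_id}:latest"] = milestone

def restore_checkpoint(store, plan_id):
    """Resume from the most recent snapshot, or start fresh if none exists."""
    milestone = store.get(f"checkpoint:{plan_id}:latest")
    if milestone is None:
        return None, {}
    return milestone, json.loads(store[f"checkpoint:{plan_id}:{milestone}"])
```

On coordinator failover, the new coordinator restores the latest checkpoint and re-dispatches only the stages after that milestone.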
Validation Checklist
- Throughput scaling under worker expansion
- Recovery time after worker/node failure
- No duplicate side effects for idempotent tasks