Proposal: LLMExecutor Retry & Resilience Improvements
Status: Implemented
Author: Bubba + Egon
Related PRs: #198 (DeduplicateLeversTask per-lever decomposition), #220 (transient retries), #221 (validation error feedback retries)
Background
llm_executor.py implements a fallback chain: when one LLM fails, it tries the next configured LLM in sequence. This handles model-level failures well.
Several tasks (e.g., DeduplicateLeversTask, ReviewPlanTask) had added their own local retry loops on top of llm_executor.run(), creating duplicated, inconsistent retry logic scattered across task files.
neoneye's design direction (from PR #198 review): retry logic belongs in llm_executor.py, not in individual tasks.
Implementation (Option 3: Two-tier resilience)
PR #220 implemented the recommended two-tier approach.
Tier 1 — LLMExecutor handles transient errors
RetryConfig dataclass controls per-model retry behaviour:
```python
from dataclasses import dataclass

@dataclass
class RetryConfig:
    max_retries: int = 2            # retries per model (0 = no retries)
    base_delay: float = 1.0         # seconds
    max_delay: float = 30.0         # seconds
    backoff_multiplier: float = 2.0
```
is_transient_error() classifies exceptions by matching error strings against known transient patterns (rate limits, timeouts, connection errors, 5xx status codes). Only transient errors trigger retries — permanent errors (auth failures, validation errors, schema mismatches) immediately fall through to the next model.
The retry loop uses exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
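The classification and backoff described above can be sketched as follows. The pattern list and function signatures are illustrative stand-ins, not the actual `llm_executor.py` API:

```python
import re

# Hypothetical transient-error patterns; the real list lives in llm_executor.py.
TRANSIENT_PATTERNS = [r"rate limit", r"timeout", r"connection", r"\b5\d\d\b"]

def is_transient_error(exc: Exception) -> bool:
    """Classify an exception as transient by matching its message text."""
    text = str(exc).lower()
    return any(re.search(pattern, text) for pattern in TRANSIENT_PATTERNS)

def backoff_delay(attempt: int, base: float = 1.0,
                  multiplier: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * multiplier**attempt, capped at max_delay."""
    return min(base * multiplier ** attempt, cap)

# Delay schedule for six attempts: 1s, 2s, 4s, 8s, 16s, capped at 30s.
delays = [backoff_delay(a) for a in range(6)]
```

Only exceptions matching a transient pattern re-enter this loop; anything else falls through to the next model in the fallback chain immediately.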
Tier 2 — Luigi resume handles task-level failures
If all LLM attempts and retries are exhausted, the task raises and Luigi marks it failed. Re-running the pipeline resumes from the failed task. The operator can swap model config between runs to address systematic failures.
Integration with related features
- LLMChatError (#237): Tasks wrap LLM exceptions in `LLMChatError` with a unique `error_id` for traceability. `is_transient_error()` works through the wrapper because `LLMChatError.__str__()` preserves the original cause text.
- Usage metrics (#110): Failed attempts are recorded to `usage_metrics.jsonl` with error classification and `error_id`. Successful attempts are recorded by `TrackActivity` instrumentation.
- Pipeline stop: `PipelineStopRequested` is never retried — it propagates immediately.
- Validation error feedback (#102, PR #221): The `max_validation_retries` parameter enables retrying the same model on Pydantic validation errors. Before each retry, `_extract_validation_feedback()` walks the exception chain to extract structured error details (missing fields, type mismatches) and stores them in `executor.validation_feedback`. The caller's `execute_function` can read this property and inject correction feedback into the prompt. Validation retries run after transient retries are exhausted: if the last error is a validation error rather than a transient one, the validation retry loop fires. This converts blind retries into a self-correcting dialogue (see proposal #102).
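The feedback-extraction step can be sketched as below. `ValidationError` here is a minimal stand-in for pydantic's class (which exposes `.errors()` the same way), and the wrapping mirrors how tasks raise `LLMChatError` from the original cause; all names in the sketch are illustrative:

```python
class ValidationError(Exception):
    """Stand-in for pydantic.ValidationError, which exposes .errors()."""
    def __init__(self, errors):
        super().__init__("validation failed")
        self._errors = errors

    def errors(self):
        return self._errors

def extract_validation_feedback(exc):
    """Walk the __cause__ chain looking for a ValidationError and render
    its structured details as correction feedback for the next prompt."""
    while exc is not None:
        if isinstance(exc, ValidationError):
            return "\n".join(
                f"- {'.'.join(map(str, e['loc']))}: {e['msg']}"
                for e in exc.errors()
            )
        exc = exc.__cause__
    return None

# The wrapper preserves the cause, so extraction works through it:
try:
    try:
        raise ValidationError([{"loc": ("levers",), "msg": "field required"}])
    except ValidationError as ve:
        raise RuntimeError("LLM call failed") from ve
except RuntimeError as wrapped:
    feedback = extract_validation_feedback(wrapped)
```

The rendered feedback string is what the caller's `execute_function` would append to the prompt before the validation retry.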
Context Overflow Handling (separate from retries)
DeduplicateLeversTask uses compact-history fallback for context window overflow. This is not a retry — it's a prompt restructuring. The distinction matters:
- Retry: Same prompt, same schema, try again. Appropriate for transient errors. Handled by `LLMExecutor`.
- Prompt restructuring: Different prompt (compacted context), same schema. Appropriate for context overflow. Stays in the task.
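The distinction above can be made concrete with a sketch. The function name, the overflow check, and the stand-in executor are all hypothetical; the real logic lives inside DeduplicateLeversTask:

```python
def run_with_compact_fallback(execute, full_history, compact_history):
    """On context overflow, restructure the prompt (compacted history,
    same schema) rather than retrying the same prompt."""
    try:
        return execute(full_history)
    except Exception as exc:
        if "context" in str(exc).lower():  # crude overflow detection (sketch)
            return execute(compact_history)  # one restructured attempt, not a retry loop
        raise  # transient/permanent errors stay LLMExecutor's concern

# Usage with a stand-in executor that overflows on long histories:
def fake_execute(history):
    if len(history) > 2:
        raise RuntimeError("context length exceeded")
    return {"messages_used": len(history)}

result = run_with_compact_fallback(fake_execute, [1, 2, 3], [1])
```

Note that the fallback changes the input, not the attempt count, which is why it belongs in the task rather than in `LLMExecutor`.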
Future Work
- Config-driven retry settings via `llm_config/custom.json` (currently hardcoded defaults).
- Remove remaining per-task retry loops as each task is updated (follow-on PRs).
- Wire individual tasks to read `executor.validation_feedback` and append correction messages to prompts (see proposal #102).
References
- `worker_plan/worker_plan_internal/llm_util/llm_executor.py` — retry logic and `RetryConfig`
- `worker_plan/worker_plan_internal/llm_util/llm_errors.py` — `LLMChatError` with error traceability
- `worker_plan/worker_plan_internal/llm_util/usage_metrics.py` — file-based usage metrics
- `worker_plan/worker_plan_internal/lever/deduplicate_levers.py` — example of per-task compact fallback (PR #198)