Proposal: LLM Error Traceability via Error UUIDs
Author: neoneye Date: 2026-03-10 Status: Implemented (PR #237)
Problem
When an LLM call failed inside a pipeline task, the exception was caught and re-raised as a generic ValueError:
except Exception as e:
    logger.debug(f"LLM chat interaction failed: {e}")
    logger.error("LLM chat interaction failed.", exc_info=True)
    raise ValueError("LLM chat interaction failed.") from e
This pattern appeared in 38 files across the codebase. By the time LLMExecutor caught the re-raised ValueError, str(e) was the fixed string "LLM chat interaction failed." — the root cause (timeout, auth error, invalid JSON, etc.) was lost.
PR #236 added classify_error() to categorize error strings into short labels for usage_metrics.jsonl. But because the original exception was masked, these errors all classified as "unknown" with error_detail: "LLM chat interaction failed." — defeating the purpose of classification.
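classify_error() itself is not part of this proposal, but a sketch conveys the kind of mapping it performs. Only the "unknown" and "invalid_json" labels appear in this document; the other labels and the matching rules below are assumptions for illustration.

```python
def classify_error(error_message: str) -> str:
    """Illustrative sketch: map a raw error string to a short label
    for usage_metrics.jsonl. The real rules in PR #236 may differ."""
    msg = error_message.lower()
    if "timeout" in msg or "timed out" in msg:
        return "timeout"
    if "validation error" in msg or "json" in msg:
        return "invalid_json"
    if "api key" in msg or "unauthorized" in msg:
        return "auth_error"
    return "unknown"
```

With the masked string, classify_error("LLM chat interaction failed.") can only ever return "unknown".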
The root cause was logged via logger.error(..., exc_info=True), but there was no way to correlate a log line with a specific row in usage_metrics.jsonl.
Goals
- Preserve the root cause so classify_error() can categorize it correctly.
- Correlate metrics with logs so a user can look up the full traceback for any failed metric row.
- Keep the change mechanical — the call sites should all follow the same pattern.
- No behaviour change — callers that catch ValueError must continue to work.
Implementation
1. LLMChatError exception
A dedicated exception that carries structured context, defined in worker_plan/worker_plan_internal/llm_util/llm_errors.py:
import uuid

class LLMChatError(ValueError):
    """Raised when an LLM chat interaction fails.

    Carries the root-cause exception and a unique error_id for
    cross-referencing log entries with usage_metrics.jsonl rows.

    Extends ValueError for backward compatibility with existing
    except ValueError catch sites.
    """

    def __init__(self, cause: Exception, error_id: str | None = None, message: str | None = None):
        self.cause = cause
        self.error_id = error_id or uuid.uuid4().hex[:12]
        self.message = message or "LLM chat interaction failed"
        super().__init__(f"{self.message} [{self.error_id}]: {cause}")
Key properties:
- str(LLMChatError) includes the root cause, so classify_error() categorizes correctly.
- error_id is a short UUID (12 hex chars) printed in both the log and the metric row.
- message allows callers to distinguish between numbered interactions (e.g. interaction 1 vs 2).
- Extends ValueError for backward compatibility with existing catch sites.
2. Updated call sites (38 files)
The standard pattern (30 files):
from worker_plan_internal.llm_util.llm_errors import LLMChatError

except Exception as e:
    llm_error = LLMChatError(cause=e)
    logger.debug(f"LLM chat interaction failed [{llm_error.error_id}]: {e}")
    logger.error(f"LLM chat interaction failed [{llm_error.error_id}]", exc_info=True)
    raise llm_error from e
For files with numbered interactions (e.g. expert_finder.py, questions_answers.py):
except Exception as e:
    llm_error = LLMChatError(cause=e, message="LLM chat interaction 2 failed")
    logger.debug(f"{llm_error.message} [{llm_error.error_id}]: {e}")
    logger.error(f"{llm_error.message} [{llm_error.error_id}]", exc_info=True)
    raise llm_error from e
3. error_id in usage_metrics.jsonl
record_usage_metric() accepts an optional error_id parameter.
LLMExecutor._record_attempt_token_metrics() extracts the error_id from the exception:
if not success:
    error_id = getattr(exception, "error_id", None) if exception else None
    record_usage_metric(
        model=llm_model_name,
        duration_seconds=duration,
        success=False,
        error_message=error_message,
        error_id=error_id,
    )
4. Verified output
Log output:
ERROR LLM chat interaction failed [4c2a64973bcd]: 1 validation error for ...
Traceback (most recent call last):
...
usage_metrics.jsonl row:
{"timestamp": "2026-03-10T19:50:18.821350", "success": false, "model": "openrouter-gemini-2.0-flash-001", "duration_seconds": 5.391, "error": "invalid_json", "error_id": "4c2a64973bcd"}
A user can now grep for 4c2a64973bcd in the logs to find the full traceback.
Files changed
| File | Change |
|---|---|
| worker_plan/worker_plan_internal/llm_util/llm_errors.py | New. LLMChatError exception class with cause, error_id, message. |
| worker_plan/worker_plan_internal/llm_util/usage_metrics.py | Added error_id parameter to record_usage_metric(). |
| worker_plan/worker_plan_internal/llm_util/llm_executor.py | Extract error_id from the exception, pass to record_usage_metric(). |
| 36 files with raise ValueError("LLM chat interaction failed.") | Mechanical replacement with LLMChatError. |
| 2 files with raise ValueError("LLM chat interaction N failed.") | Replaced with LLMChatError(cause=e, message=...). |
| worker_plan/tests/test_llm_errors.py | New. 8 tests for LLMChatError. |
| worker_plan/tests/test_usage_metrics.py | Added tests for the error_id field and LLMChatError integration. |
Out of scope
- Structured logging (JSON log format) — useful but a separate concern.
- Retry-level error tracking — LLMExecutor already tracks per-attempt results.
- Persisting error IDs to the database for cloud runs — can be added later via TrackActivity.