Proposal: Usage Metrics for Local Runs (No Database Required)
Author: Bubba (VoynichLabs)
Date: 2026-03-08
Status: Implemented (PR #219)
Problem
PlanExe has solid token-counting infrastructure (token_counter.py, token_instrumentation.py,
token_metrics_store.py), but it only activates when a PLANEXE_TASK_ID is set and a database
connection is available. For local CLI runs — the primary use case for self-hosted users — zero
usage data is recorded.
This is not specific to LM Studio. Any local inference server (Ollama, llama.cpp server, vLLM, LMDeploy, Jan, etc.) has the same problem: performance metrics live only in the server's UI or stdout, not in the plan output.
As a result, users cannot answer basic questions:
- How many tokens did this plan consume?
- How long did each task take?
- Which tasks were the heaviest?
- How does Qwen 9B compare to Qwen 35B on the same prompt?
Solution
Write a lightweight usage_metrics.json file into the run output directory at the end of every
pipeline run. No database. No task ID. No web stack. Works for any OpenAI-compatible backend.
What to capture (per LLM call, from response.raw.usage)
| Field | Source | Notes |
|---|---|---|
| `input_tokens` | `usage.prompt_tokens` | Already extracted by `extract_token_count()` |
| `output_tokens` | `usage.completion_tokens` | Already extracted |
| `duration_seconds` | `time.perf_counter()` delta | Already measured in `LLMExecutor._try_one_attempt()` |
| `llm_model` | config key | Already available in `LLMAttempt` |
| `task_name` | Luigi task class name | Available in `PlanTask.run()` |
| `success` | `LLMAttempt.success` | Already in `LLMAttempt` |
Output file: {run_id_dir}/usage_metrics.json
```json
{
  "run_id_dir": "/path/to/run",
  "generated_at": "2026-03-08T17:45:00Z",
  "summary": {
    "total_input_tokens": 412847,
    "total_output_tokens": 98341,
    "total_duration_seconds": 2847.3,
    "total_calls": 63,
    "successful_calls": 63,
    "failed_calls": 0,
    "avg_tokens_per_second": 38.7
  },
  "calls": [
    {
      "task_name": "IdentifyPurposeTask",
      "llm_model": "lmstudio-qwen35b",
      "input_tokens": 1243,
      "output_tokens": 412,
      "duration_seconds": 14.2,
      "tokens_per_second": 29.0,
      "success": true,
      "timestamp": "2026-03-08T12:28:14Z"
    }
  ]
}
```
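The summary block is derivable mechanically from the calls list. Below is a minimal sketch; `build_summary` is an illustrative helper, not code from the PR, and it assumes `avg_tokens_per_second` means total output tokens divided by total wall-clock duration (the actual definition is not stated above):

```python
def build_summary(calls: list[dict]) -> dict:
    """Fold a list of per-call records into the proposed summary block.

    Illustrative sketch: assumes avg_tokens_per_second is aggregate
    output tokens over aggregate duration, not a mean of per-call rates.
    """
    total_in = sum(c["input_tokens"] for c in calls)
    total_out = sum(c["output_tokens"] for c in calls)
    total_dur = sum(c["duration_seconds"] for c in calls)
    ok = sum(1 for c in calls if c["success"])
    return {
        "total_input_tokens": total_in,
        "total_output_tokens": total_out,
        "total_duration_seconds": round(total_dur, 1),
        "total_calls": len(calls),
        "successful_calls": ok,
        "failed_calls": len(calls) - ok,
        "avg_tokens_per_second": round(total_out / total_dur, 1) if total_dur else 0.0,
    }
```

With the single example call above (412 output tokens in 14.2 s), this yields an aggregate rate of 29.0 tok/s, matching the per-call `tokens_per_second` field.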
Implementation
Option A — File-based accumulator in LLMExecutor (minimal change)
Extend _record_attempt_token_metrics() in llm_executor.py to also append to a local JSON
file when PLANEXE_TASK_ID is not set (i.e., always for CLI runs):
```python
def _record_attempt_token_metrics(self, ...):
    # existing DB path (unchanged)
    try:
        from worker_plan_internal.llm_util.token_instrumentation import record_attempt_tokens
        record_attempt_tokens(...)
    except Exception as exc:
        logger.debug("Failed to record token metrics for attempt: %s", exc)

    # NEW: always write to file if run_id_dir is available
    run_id_dir = os.environ.get("RUN_ID_DIR")
    if run_id_dir:
        _append_usage_record(run_id_dir, {
            "llm_model": llm_model_name,
            "duration_seconds": duration,
            "success": success,
            "error_message": error_message,
            "token_count": extract_token_count(response).to_dict() if response else None,
            "timestamp": datetime.utcnow().isoformat() + "Z",
        })
```
_append_usage_record() would be a thread-safe JSON-lines appender (one record per line, atomic
rename on flush) so concurrent Luigi workers don't corrupt the file.
Option B — Summary writer in run_plan_pipeline.py
At the end of the pipeline run, walk all output JSON files and reconstruct timing from
start_time.json + task completion timestamps. Less accurate (no per-call token counts unless
providers embed them in task outputs) but zero changes to the hot path.
Recommended: Option A
Option A captures real per-call token counts from the API response usage object. The
extract_token_count() function already handles all providers (OpenAI, Anthropic, OpenRouter,
Ollama, LM Studio). No new dependencies. The RUN_ID_DIR env var is already required to run the
pipeline — it's always set.
Provider Coverage
| Provider | `usage` in response | Notes |
|---|---|---|
| LM Studio | ✅ | prompt_tokens, completion_tokens in every response |
| Ollama | ✅ | prompt_eval_count, eval_count (needs mapping) |
| llama.cpp server | ✅ | Same as LM Studio |
| OpenAI | ✅ | Standard |
| OpenRouter | ✅ | Plus cost field |
| Anthropic | ✅ | input_tokens, output_tokens |
| vLLM | ✅ | OpenAI-compatible |
| Jan | ✅ | OpenAI-compatible |
All local inference servers that implement the OpenAI /v1/chat/completions spec return
usage in the response body. extract_token_count() already handles the field name
variations.
Why Not Just Use PLANEXE_TASK_ID?
The existing DB path requires:
- A running database (web app stack)
- `PLANEXE_TASK_ID` env var set to a valid UUID from the DB
- `database_api.planexe_db_singleton` importable
None of these are present in a bare CLI run. The file-based approach requires only RUN_ID_DIR
(already mandatory) and standard library (json, os, threading).
What This Enables
- Model comparison: run the same prompt on Qwen 9B vs 35B and compare `usage_metrics.json` (total tokens, tok/s, task-level breakdown).
- Regression detection: Task X used 2K tokens last week and now uses 20K; prompt bloat caught.
- Hardware benchmarking: share `summary.avg_tokens_per_second` alongside run results.
- Cost estimation: for cloud providers (OpenRouter, OpenAI), multiply token counts by pricing from config.
- Self-documenting runs: the output directory is self-contained; no external DB query is needed to understand what happened.
Implementation Notes (PR #219)
The final implementation diverges from Option A above in several ways:
File format: JSONL, not JSON
usage_metrics.jsonl uses one JSON object per line (append-only). This avoids the need for
atomic rename or in-memory accumulation — each LLM call appends a single line. Thread-safe
by nature since each write is a short append to a file handle.
Recording source: llama_index instrumentation, not LLMExecutor
Successful calls are recorded by TrackActivity (the llama_index BaseEventHandler) which
receives the actual ChatResponse with full token counts, cost, and provider:model info.
LLMExecutor._record_attempt_token_metrics() only records failures, since instrumentation
end events are not emitted when the LLM call fails.
This was necessary because execute_function(llm) returns the processed result (a Pydantic
model or string), not the raw ChatResponse. The instrumentation layer is the only place
with access to the real response.
Model field includes provider
The model field contains the full provider:model string (e.g.
Google AI Studio:google/gemini-2.0-flash-001), matching activity_overview.json.
Example output
```json
{"timestamp": "2026-03-10T13:36:48.250446", "success": true, "model": "Google AI Studio:google/gemini-2.0-flash-001", "duration_seconds": 4.879, "input_tokens": 5316, "output_tokens": 643, "cost_usd": 0.0007888}
{"timestamp": "2026-03-10T13:36:53.554864", "success": true, "model": "Google:google/gemini-2.0-flash-001", "duration_seconds": 5.237, "input_tokens": 8877, "output_tokens": 562, "cost_usd": 0.0011125}
```
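A few lines of Python suffice to fold such a file into per-model totals. `summarize_usage` below is an illustrative reader, not part of the PR:

```python
import json


def summarize_usage(jsonl_path: str) -> dict:
    """Aggregate a usage_metrics.jsonl file into per-model totals (sketch)."""
    totals: dict[str, dict] = {}
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            rec = json.loads(line)
            t = totals.setdefault(rec["model"], {
                "calls": 0, "input_tokens": 0, "output_tokens": 0,
                "duration_seconds": 0.0, "cost_usd": 0.0,
            })
            t["calls"] += 1
            # Failure rows may omit token fields; treat missing as zero.
            t["input_tokens"] += rec.get("input_tokens") or 0
            t["output_tokens"] += rec.get("output_tokens") or 0
            t["duration_seconds"] += rec.get("duration_seconds") or 0.0
            t["cost_usd"] += rec.get("cost_usd") or 0.0
    return totals
```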
Key files
| File | Role |
|---|---|
| `worker_plan/worker_plan_internal/llm_util/usage_metrics.py` | Core module: `set_usage_metrics_path()`, `record_usage_metric()` |
| `worker_plan/worker_plan_internal/llm_util/track_activity.py` | Records successful calls via `_record_file_usage_metric()` |
| `worker_plan/worker_plan_internal/llm_util/llm_executor.py` | Records failed calls only |
| `worker_plan/worker_plan_internal/plan/run_plan_pipeline.py` | Sets/clears the metrics path around pipeline execution |
| `worker_plan/worker_plan_api/filenames.py` | `USAGE_METRICS_JSONL` constant |
Resolved open questions
- Ollama field mapping: handled by `extract_token_count()` and `TrackActivity._extract_token_usage()`, which already support multiple field-name variations.
- Thread safety: JSONL append-per-line is safe for concurrent Luigi workers.
- Thinking tokens: recorded when available (e.g. from OpenRouter reasoning models); `null` for providers that don't expose them.
- Retention on resume: appended. A resumed run adds only the new calls alongside the restored snapshot's existing metrics.
Follow-up: Error Classification (PR #236)
Raw Python exception strings in the "error" field were verbose and could leak raw model output.
PR #236 added classify_error() to replace them with short category labels:
| Category | Matches on |
|---|---|
| `invalid_json` | "json", "validation error", "pydantic" |
| `timeout` | "timeout", "timed out" |
| `empty_response` | "empty", "none", "no response" |
| `connection_error` | "connection", "connect", "dns", "network" |
| `rate_limit` | "rate limit", "429", "too many requests" |
| `auth_error` | "auth", "401", "403", "unauthorized", "forbidden" |
| `server_error` | "500", "502", "503", "504", "server error", "internal server" |
| `model_not_found` | "model not found", "not found", "404" |
| `unknown` | fallback; raw message preserved in `error_detail` (truncated to 200 chars) |
Follow-up: Error Traceability (PR #237)
All 38 call sites that raised ValueError("LLM chat interaction failed.") were converted to
LLMChatError (see proposal 113). This:
- Preserves the root cause in `str(LLMChatError)`, so `classify_error()` categorizes correctly (previously all failures landed in `"unknown"`).
- Adds an `error_id` (12-char UUID) to both the log output and the JSONL row, enabling cross-referencing.
Example failure row after both PRs:
```json
{"timestamp": "2026-03-10T19:50:18.821350", "success": false, "model": "openrouter-gemini-2.0-flash-001", "duration_seconds": 5.391, "error": "invalid_json", "error_id": "4c2a64973bcd"}
```
A user can grep `4c2a64973bcd` in the logs to find the full traceback.