# Proposal 74: Model Routing UX — Automatic, Optimize, and Review Modes
Author: Larry (Sonnet 4.6)
Date: 2026-02-27
Status: Draft — for Simon's consideration
Depends on: Proposal 73 (task complexity scoring + model routing)
## The Problem
Right now, most PlanExe users do what's natural: they pick the best model they have access to and use it for everything. If they have Opus, everything runs on Opus. It's cognitively easy. It's also expensive and, for many tasks, overkill.
The alternative — manually selecting a different model for each task — requires understanding context windows, pricing tiers, semantic complexity, session hygiene, and the 200K token cliff. That's a lot of gear-shifting. Most developers don't want to think about this any more than most drivers want to think about gear ratios.
This is exactly the problem the automatic transmission solved in 1940.
## The Analogy
The Model T (1920s): Powerful, but you had to hand-crank it, manage three pedals, adjust the spark advance lever, and understand exactly what you were doing — or the engine would kick back and break your wrist. Only gearheads could get the best out of it.
The 4WD Truck (Opus-for-everything): Big, shiny, handles anything. Goes through 20cm of snow, 5 inches of mud, carries anything in the bed. Gets 5 miles to the gallon. When you're heading to the store on a sunny day, it's overkill — but it works, and you don't have to think.
The Street Racer's Manual Transmission (power users): These are the people who WANT control. They feel every gear change. They know exactly which model to use for which task. They can zip through a codebase like a street race through downtown. Maximum efficiency, maximum control — but requires skill and attention.
The 1940 Hydra-Matic (what PlanExe should offer): GM's Earl Thompson spent a decade figuring out how to encode the expertise of gear-shifting into the machine itself. The driver just says "go." The car figures out the gears. The knowledge is built in — you benefit from it without needing to possess it.
We are in the 1935 moment for AI agents. The tools work. But you need to be a gearhead to get the best out of them without blowing your budget. PlanExe can be the Hydra-Matic.
## Proposed: Three Routing Modes

### Mode 1: `auto`
"The Truck"
Use whatever model is configured as the default for everything. No routing logic. No complexity scoring. Maximum capability, maximum simplicity.
- Who it's for: Teams with budget flexibility who value simplicity. Spotify-style enterprise use. Developers who are new to AI-assisted coding and don't want cognitive overhead.
- Behavior: Every task runs on the user's configured model, regardless of complexity score.
- Cost profile: Highest, predictable.
- Config: `model_routing: auto`
### Mode 2: `optimize`
"The Hydra-Matic"
PlanExe scores each task using the complexity rubric (Proposal 73), selects the appropriate model tier automatically, and executes without asking. The user never thinks about model selection — the expertise is encoded in the system.
- Who it's for: Cost-conscious developers and small teams who trust the rubric. The "I just want to drive" crowd who also care about gas mileage.
- Behavior: Tasks scored 4–7 route to Minimax, 8–11 to Haiku, 12–15 to Sonnet, 16–20 to Opus. Session boundaries are managed automatically at context thresholds.
- Cost profile: Lowest, variable.
- Config: `model_routing: optimize`
What the user sees: A summary after plan generation showing estimated cost vs all-Opus cost. E.g.:
```
Routing plan: 3 tasks → Minimax | 2 tasks → Haiku | 1 task → Sonnet | 1 task → Opus
Estimated cost: $4.20 (vs $18.00 at Opus-only) — 77% savings
```
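The tier lookup and savings summary behind this mode could be sketched as follows. This is illustrative only — tier names follow the Proposal 73 starting-point table, and the function names and cost figures are hypothetical, not PlanExe's actual API:

```python
from collections import Counter

# Score bands from the Proposal 73 starting point:
# 4-7 -> Minimax, 8-11 -> Haiku, 12-15 -> Sonnet, 16-20 -> Opus.
ROUTING_TABLE = [
    (range(4, 8), "minimax"),
    (range(8, 12), "haiku"),
    (range(12, 16), "sonnet"),
    (range(16, 21), "opus"),
]

def route(score: int) -> str:
    """Map a Proposal 73 complexity score to a model tier."""
    for band, model in ROUTING_TABLE:
        if score in band:
            return model
    raise ValueError(f"score {score} outside rubric range 4-20")

def savings_summary(scores, est_cost, opus_cost):
    """Format the post-generation summary shown to the user.

    est_cost / opus_cost are assumed to be precomputed estimates."""
    counts = Counter(route(s) for s in scores)
    plan = " | ".join(f"{n} task{'s' if n != 1 else ''} -> {m}"
                      for m, n in counts.items())
    pct = round(100 * (1 - est_cost / opus_cost))
    return (f"Routing plan: {plan}\n"
            f"Estimated cost: ${est_cost:.2f} "
            f"(vs ${opus_cost:.2f} at Opus-only): {pct}% savings")
```

With the scores from the example above (`[5, 5, 5, 10, 10, 13, 19]`, $4.20 estimated vs $18.00 all-Opus), this reproduces the 77% savings figure.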
### Mode 3: `review`
"The Street Racer with a Co-Pilot"
PlanExe scores each task and generates a routing recommendation, but pauses for human approval before executing. The user sees exactly what model is recommended for each task, the reasoning, and the estimated cost. They can override any routing decision before committing.
- Who it's for: Power users who want control AND want the system's analysis as a starting point. Developers building intuition for model selection. Simon.
- Behavior: After plan generation, presents a routing summary with per-task recommendations and estimated costs. User can accept, modify per-task, or override globally. Execution begins only after approval.
- Cost profile: Same as `optimize` if accepted as-is; variable if overridden.
- Config: `model_routing: review`
What the user sees:
```
Task A: Module split (http_server.py, 1089 lines)
  Score: 19/20 → Opus recommended
  Reason: Cross-file architectural refactor, whole-codebase context dependency
  Estimated: $3.20
  [Accept] [Override: Sonnet] [Override: Haiku]

Task B: API rename (task_id → plan_id, 88 files)
  Score: 13/20 → Sonnet (planning) + Minimax (execution) recommended
  Reason: Large file surface, but mechanical pattern — plan once, execute cheap
  Estimated: $0.85
  [Accept] [Override: all-Opus]
```
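One row of that review screen could be carried by a small data structure like the one below. All field and class names here are hypothetical — a sketch of the shape, not PlanExe's actual interface:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoutingRecommendation:
    """One per-task row of the review-mode summary (hypothetical shape)."""
    task: str
    score: int            # Proposal 73 complexity score, 4-20
    recommended: str      # model tier from the routing table
    reason: str
    estimated_cost: float
    override: Optional[str] = None  # set when the user picks a different tier

    @property
    def effective_model(self) -> str:
        """The user's override wins; otherwise the recommendation stands."""
        return self.override or self.recommended

rec = RoutingRecommendation(
    task="Module split (http_server.py, 1089 lines)",
    score=19,
    recommended="opus",
    reason="Cross-file architectural refactor",
    estimated_cost=3.20,
)
rec.override = "sonnet"  # user clicks [Override: Sonnet]
```

Keeping the override separate from the recommendation preserves the system's analysis for later calibration, even when the user disagrees with it.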
## The 200K Token Cliff (Session Hygiene)
Relevant in both `optimize` and `review` modes.
When using Opus through Anthropic directly (not OpenRouter), per-token prices jump after 200K tokens in a session — input doubles and output rises 50%:
- ≤200K: $5 input / $25 output per 1M tokens
- >200K: $10 input / $37.50 output per 1M tokens
This is not a hard rule to never cross. Sometimes the large context is exactly what you need — Spotify's use case of ingesting full service dependency trees in one pass is the canonical example. The value of that large-context read can absolutely justify the higher rate.
The waste happens when sessions drift past 200K tokens carrying context that's no longer active: old file reads, abandoned exploration paths, prior conversation history. That's paying the premium rate for tokens that aren't working.
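The cost impact can be sketched with a simple tiered estimator. Note one loud assumption: the function below bills the whole request at the premium rate once it crosses the cliff, which is a simplification — the section above leaves the exact billing boundary unspecified, so check the provider's pricing docs before relying on this:

```python
def opus_input_cost(input_tokens: int) -> float:
    """Estimated input cost (USD) for one request, at the rates quoted
    above: $5/M up to 200K tokens, $10/M beyond.

    ASSUMPTION: the premium rate applies to the entire request once it
    exceeds 200K tokens. Verify against the provider's billing docs."""
    rate = 10.0 if input_tokens > 200_000 else 5.0
    return input_tokens * rate / 1_000_000

# A 210K-token session full of stale context costs $2.10 per request;
# checkpointed down to a 60K-token summary, the same work costs $0.30.
stale = opus_input_cost(210_000)
lean = opus_input_cost(60_000)
```

That 7x gap per request is the waste this section is describing: the premium rate applied to tokens that aren't working.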
PlanExe's session management in optimize mode should:
1. Track estimated session token count per model
2. Flag when approaching 200K with Opus
3. Offer to checkpoint: summarize active context → close session → open new session with summary + current task
4. Never force-close — the decision to pay for large context is the user's
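The four steps above can be sketched as a small tracker. Class name, warning threshold, and method names are all illustrative assumptions, not an existing PlanExe API:

```python
CLIFF = 200_000
WARN_AT = 0.9  # flag at 90% of the cliff (illustrative threshold)

class SessionTracker:
    """Sketch of optimize-mode session hygiene: track estimated tokens,
    flag the approach to 200K, and suggest — never force — a checkpoint."""

    def __init__(self) -> None:
        self.tokens = 0

    def add(self, n: int) -> None:
        """Step 1: accumulate the estimated token count for this session."""
        self.tokens += n

    def should_suggest_checkpoint(self) -> bool:
        """Step 2: flag when approaching the 200K cliff."""
        return self.tokens >= CLIFF * WARN_AT

    def checkpoint(self, summary_tokens: int) -> None:
        """Step 3: summarize active context and restart the session.
        Step 4: only the caller (the user) decides to invoke this —
        the tracker itself never force-closes."""
        self.tokens = summary_tokens
```

The key design point is in step 4: `checkpoint` is offered, not triggered, so a deliberate large-context read (the Spotify case above) stays available.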
## Implementation Notes (for Simon's consideration)
This proposal is UX/product-layer only. The complexity scoring engine (Proposal 73) is the prerequisite. Once that exists, these three modes are essentially:
- `auto`: bypass the scoring engine entirely
- `optimize`: run the scoring engine, apply the routing table, execute
- `review`: run the scoring engine, present the routing plan as an interactive confirmation step, then execute
The routing table (score → model mapping) should be configurable per-project, not hardcoded. Simon may have different thresholds than a default user. The table in Proposal 73 (4–7=Minimax, 8–11=Haiku, 12–15=Sonnet, 16–20=Opus) is a starting point for calibration.
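A per-project override might look like this, extending the `model_routing` setting already shown. The `routing_table` and `session` keys are hypothetical — a sketch of the shape such a config could take, not an existing PlanExe schema:

```yaml
# Hypothetical per-project routing config; key names are illustrative.
model_routing: optimize
routing_table:            # score range -> model tier (Proposal 73 defaults)
  minimax: [4, 7]
  haiku: [8, 11]
  sonnet: [12, 15]
  opus: [16, 20]
session:
  opus_token_warning: 180000   # flag before the 200K cliff
```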
## Questions for Simon
- Does the three-mode framing resonate with how you'd want to configure PlanExe on a new project?
- For `review` mode — is per-task override granularity useful, or would a project-level override (e.g., "use Sonnet minimum for everything") be sufficient?
- The 200K session checkpoint behavior — should this be automatic in `optimize` mode or always require confirmation?
- Are there task types in PlanExe's current plan generation that would never make sense to route below Sonnet, regardless of score? (I'm thinking: anything touching the plan's core reasoning chain probably needs to stay at Sonnet+.)
This proposal is a companion to Proposal 73 (complexity scoring) and should be reviewed together. Both are docs-only — no code changes.