
Proposal 93: Local Model Roadmap — March 2026

Author: PlanExe Core Team
Date: March 7, 2026
Status: Accepted
Target: VoynichLabs/PlanExe2026 main branch


Executive Summary

PlanExe achieved its first complete pipeline run on local hardware using Qwen 3.5-9B, executing all 63 tasks with zero failures. This milestone unlocks offline-first operation at zero API cost on consumer hardware (Mac Mini M4 Pro, 64GB). The root causes of all prior local model failures have been identified and remediated. This proposal documents the breakthrough, the technical fixes, and the roadmap for multi-model comparison and production hardening.


Milestone Achieved: March 7, 2026

The Run

  • Model: Qwen 3.5-9B — lmstudio-community/Qwen3.5-9B-GGUF, Q4_K_M quantization, 6.1 GB
  • HuggingFace: https://huggingface.co/lmstudio-community/Qwen3.5-9B-GGUF
  • LM Studio key: qwen/qwen3.5-9b@q4_k_m
  • Max context: 262,144 tokens
  • Hardware: Mac Mini M4 Pro (14-core CPU, 64GB unified memory)
  • Tasks: 63-task PlanExe pipeline
  • Result: 0 failures, 100% task completion rate
  • Cost: $0.00 (fully offline)

Significance

PlanExe can now execute its entire planning pipeline without cloud dependencies, internet connectivity, or API billing. This establishes a baseline for local-first AI planning workflows and opens the door to edge deployment, privacy-sensitive use cases, and cost-neutral scaling.


Root Cause Analysis: Why Local Models Failed Until Now

All prior local model attempts failed silently or incompletely due to four specific issues:

1. Wrong Adapter Class (LMStudio vs OpenAILike)

The original integration used class: LMStudio from llama_index, which sends JSON schemas as plain text in the user message with no grammar enforcement. The local model could "see" the schema, but nothing constrained its output to actually conform to it.

Fix: Switched to class: OpenAILike with should_use_structured_outputs: true, which sends a proper response_format: json_schema payload to LM Studio. LM Studio enforces this via Outlines grammar, guaranteeing JSON structure at inference time.
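For reference, the wire-level difference: with structured outputs enabled, each request carries a response_format block in the OpenAI structured-outputs shape, which LM Studio compiles into a grammar. The schema body below is a placeholder, not an actual PlanExe task schema:

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "task_output",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": { "summary": { "type": "string" } },
        "required": ["summary"]
      }
    }
  }
}
```

With the old LMStudio class, this block was absent and the schema traveled as inert prose inside the prompt.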

2. Silent 60-Second Timeout

The OpenAILike adapter configuration used request_timeout as a field name, but OpenAILike ignores that field; the correct name is timeout. With no explicit timeout set, the adapter fell back to its 60-second default, so any task whose inference ran longer was cut off with no visible error.

Fix: Corrected field name from request_timeout to timeout in local.json adapter config.
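A minimal sketch of the corrected adapter entry in config/local.json. The class, should_use_structured_outputs, and timeout keys come from the fixes above; the remaining keys (api_base, model, is_chat_model) are illustrative and depend on how PlanExe maps config onto the llama_index OpenAILike constructor:

```json
{
  "class": "OpenAILike",
  "should_use_structured_outputs": true,
  "api_base": "http://localhost:1234/v1",
  "model": "qwen/qwen3.5-9b@q4_k_m",
  "timeout": 600.0,
  "is_chat_model": true
}
```

Note that timeout (not request_timeout) is the field OpenAILike actually reads; 600 seconds is an arbitrary example value, chosen only to be comfortably above the longest observed task.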

3. Pydantic Enum Fields → $defs/$ref Issues

Pydantic's default JSON schema serialization for Enum fields generates $defs and $ref pointers. Outlines grammar cannot resolve these refs at runtime, causing schema validation to fail on tasks with Enum-typed fields.

Fix: Implemented FlatSchemaModel pattern across 9 core task files, converting Enum fields to Literal[...] unions which expand inline and require no refs.
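The difference is easy to see in the generated schemas. A minimal sketch (model names are illustrative, not from the PlanExe codebase) contrasting an Enum field with its Literal replacement:

```python
from enum import Enum
from typing import Literal

from pydantic import BaseModel


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


class EnumTask(BaseModel):
    # Enum field: Pydantic emits a "$defs" entry plus a "$ref" pointer,
    # which grammar-based enforcement (Outlines) cannot resolve at runtime.
    risk: Risk


class FlatTask(BaseModel):
    # Literal union: the allowed values expand inline; no refs are needed.
    risk: Literal["low", "high"]


print("$defs" in EnumTask.model_json_schema())  # True
print("$defs" in FlatTask.model_json_schema())  # False
```

The two models validate identical payloads; only the schema serialization differs.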

4. GLM 4.7 Flash MLX Thinking Mode

When thinking mode is enabled, GLM 4.7 Flash puts all of its output in reasoning_content and leaves content empty, which broke response parsing even when the schema structure was correct.

Fix: Disable thinking mode via LM Studio preset configuration.
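Disabling thinking mode is the shipped fix; as a belt-and-braces measure, a parser can also tolerate the pattern. A defensive sketch (not the actual PlanExe parser) that falls back to reasoning_content when content is empty:

```python
def extract_message_text(message: dict) -> str:
    """Return usable text from a chat completion message dict.

    Thinking-mode models such as GLM 4.7 Flash may place everything in
    `reasoning_content` and leave `content` empty, so fall back instead
    of returning an empty string.
    """
    content = (message.get("content") or "").strip()
    if content:
        return content
    return (message.get("reasoning_content") or "").strip()
```

This guard is harmless for well-behaved models, since a non-empty content field always wins.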


Fixes Shipped (March 7, 2026)

  1. Adapter Config (config/local.json)
     • Switched to OpenAILike class
     • Added should_use_structured_outputs: true
     • Corrected timeout field

  2. Literal/Enum Pattern (9 task files)
     • Replaced Pydantic Enum with Literal[...] unions
     • Removed all $defs/$ref dependencies
     • Verified schema flattening

  3. CI Parity Test (tests/test_local_model_parity.py)
     • 63-task end-to-end test using local model
     • Passes with zero failures (baseline established)

  4. LM Studio Preset (presets/planexe-agents.yml)
     • Optimized for task completion
     • Disabled thinking mode
     • Structured output enforcement

What Was Blocking Local Models (Technical Deep Dive)

| Issue | Impact | Root Cause | Solution |
|---|---|---|---|
| LMStudio class | Schema not enforced | No grammar support | Switch to OpenAILike + json_schema format |
| request_timeout field | 60s silent timeout | OpenAILike ignores field | Rename to timeout |
| Pydantic Enum → $defs | Schema validation fails | Outlines can't resolve refs | Use Literal[...] unions instead |
| GLM thinking mode | Empty content field | Reasoning takes all output | Disable via preset |

Roadmap: Next Steps

Phase 1: Multi-Model Baseline Comparison (Weeks 1–2)

1.1 GLM 4.7 Flash Full Pipeline Run

  • Disable thinking mode via LM Studio preset
  • Execute full 63-task pipeline
  • Compare latency, accuracy, and token efficiency vs. Qwen 9B baseline

1.2 Comparison Report

Likert-scale scoring matrix:

  • Models tested: Qwen 3.5-9B, Qwen 35B, GLM 4.7 Flash
  • Baseline cloud: OpenRouter Gemini 3.1 Flash Lite
  • Metrics: task completion rate, avg latency, reasoning quality, cost per run, memory footprint
  • Deliverable: reports/local-vs-cloud-comparison-march-2026.md
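One way to structure the per-model results that feed the scoring matrix. This is an illustrative record type, not from the PlanExe codebase; field names are assumptions:

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Results of one full 63-task pipeline run for one model."""

    model: str
    tasks_completed: int
    tasks_total: int
    avg_latency_s: float   # average per-task latency, seconds
    cost_usd: float        # $0.00 for fully local runs
    peak_memory_gb: float

    @property
    def completion_rate(self) -> float:
        return self.tasks_completed / self.tasks_total
```

The established baseline would then be recorded as RunMetrics with 63/63 tasks and zero cost, with latency and memory filled in from the run logs.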


Phase 2: Schema Hardening (Weeks 2–3)

2.1 FlatSchemaModel / $defs Pipeline-Wide Audit

Scan all 63 task definitions for:

  • Remaining Enum fields (should be converted to Literal)
  • Nested dataclasses with defaults that might trigger $defs
  • Union types that don't flatten to Outlines grammar
  • Document findings in audit report
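The scan can be mechanized. A minimal sketch (the helper name is hypothetical) that flags any task model whose generated schema still contains ref machinery:

```python
import json
from enum import Enum

from pydantic import BaseModel


def schema_has_refs(model: type[BaseModel]) -> bool:
    # Serialize the generated JSON schema and look for the $defs/$ref
    # indirection that grammar compilation cannot resolve.
    text = json.dumps(model.model_json_schema())
    return "$defs" in text or "$ref" in text


# Example: an Enum-typed field trips the check.
class Status(Enum):
    OK = "ok"
    FAIL = "fail"


class LegacyTask(BaseModel):
    status: Status


print(schema_has_refs(LegacyTask))  # True
```

Running this over all 63 task models gives the audit its worklist; anything returning True needs the FlatSchemaModel treatment.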

2.2 null Guard PR

  • Branch: fix/structured-response-null-guard
  • Utility: require_raw() — ensures structured output never returns None
  • Target: 33 task files across pipeline
  • Prevent silent null-reference errors in task output
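A sketch of what such a guard could look like; the real require_raw() signature and exception type in the PR may differ:

```python
class StructuredOutputError(RuntimeError):
    """Raised when a structured LLM response carries no payload."""


def require_raw(response, task_name="unknown"):
    # Proposed guard: fail loudly at the call site instead of letting a
    # None payload surface later as an opaque AttributeError deep in the
    # pipeline.
    raw = getattr(response, "raw", None)
    if raw is None:
        raise StructuredOutputError(
            f"Task {task_name!r}: structured output returned None"
        )
    return raw
```

Call sites then replace bare response.raw accesses with require_raw(response, task_name), turning a silent null into an actionable error message.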

Phase 3: Local Model Infrastructure (Weeks 3–4)

3.1 LM Studio /api/v1/chat Migration

  • Current: /v1/chat/completions (OpenAI compatibility layer)
  • Target: /api/v1/chat (native LM Studio endpoint)
  • Rationale: Better latency, fewer abstraction layers, direct feature access
  • Update OpenAILike adapter to support endpoint override

3.2 Hub Preset Publishing

  • Once PR #192 merges, publish preset to lmstudio.ai/82deutschmark/planexe-agents
  • Community-discoverable, one-click install from LM Studio UI
  • Includes: all task schema fixes, timeout tuning, thinking mode disable, Outlines grammar

Phase 4: Production Readiness (Weeks 4–5)

4.1 Multi-Model Fallback Strategy

  • Primary: GLM 4.7 Flash (reasoning + speed)
  • Fallback 1: Qwen 3.5-9B (proven stable, lower latency)
  • Fallback 2: Cloud (OpenRouter Gemini 3.1 Flash Lite, if all local models unavailable)
  • Implement graceful degradation with retry logic and telemetry
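The degradation chain above can be sketched as a priority list of backends with per-backend retries; names and structure here are illustrative, not the eventual PlanExe implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Backend:
    name: str
    run: Callable[[str], str]  # executes one task prompt, returns text


def run_with_fallback(prompt: str, backends: list[Backend], retries: int = 1) -> str:
    # Try each backend in priority order (e.g. GLM -> Qwen -> cloud),
    # retrying transient failures before falling through to the next;
    # collect every error for telemetry.
    errors: list[str] = []
    for backend in backends:
        for attempt in range(1, retries + 2):
            try:
                return backend.run(prompt)
            except Exception as exc:  # degrade, don't crash
                errors.append(f"{backend.name} attempt {attempt}: {exc}")
    raise RuntimeError("all backends exhausted: " + "; ".join(errors))
```

The collected errors list is the natural hook for the telemetry mentioned above, since it records which backends failed and why before the fallback succeeded.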

Technical Debt & Known Risks

  1. Schema Refs in Non-OpenAILike Tasks — Phase 2 audit will identify scope
  2. LM Studio Latency Under Load — Compare sustained throughput with cloud baseline
  3. Quantization Quality — 9B vs 35B vs GLM reasoning trade-offs not yet quantified
  4. Offline Inference Infrastructure — LM Studio dependency; explore Ollama/vLLM alternatives later

Success Criteria

  • ✅ Baseline: Qwen 3.5-9B established (63/63 tasks, 0 failures)
  • 🔄 Phase 1: GLM 4.7 Flash achieves 63/63 completion; comparison report published
  • 🔄 Phase 2: FlatSchemaModel audit complete; null guards merged
  • 🔄 Phase 3: Native endpoint migration done; Hub preset published
  • 🔄 Phase 4: Multi-model fallback operational; PlanExe deployable offline on M4 Pro baseline hardware; docs published

Hardware Baseline

  • CPU: Apple M4 Pro (14-core)
  • Memory: 64GB unified
  • Storage: 1TB SSD (model cache)
  • OS: macOS 26.3
  • Runtime: LM Studio 0.x + OpenAI-compatible adapter

Timeline

| Phase | Weeks | Owner | Deliverables |
|---|---|---|---|
| Phase 1 | 1–2 | Core team | GLM run, comparison report |
| Phase 2 | 2–3 | Core team | Audit report, null guard PR |
| Phase 3 | 3–4 | Core team | LM Studio native endpoint, preset publish |
| Phase 4 | 4–5 | Core team | Fallback logic, production docs |

References

  • Qwen 3.5-9B: https://huggingface.co/lmstudio-community/Qwen3.5-9B-GGUF (Q4_K_M, 6.1 GB)
  • GLM 4.7 Flash: https://huggingface.co/THUDM/glm-4-9b
  • LM Studio: https://lmstudio.ai/
  • Outlines Grammar: https://github.com/dottxt-ai/outlines
  • OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/chat/create

Approval & Sign-Off

  • Date: March 7, 2026
  • Status: Ready for review by VoynichLabs/PlanExe2026 maintainers
  • Next Step: Merge to main; begin Phase 1 execution