Codebase Cleanliness Remediation Roadmap
Author: PlanExe Team
Date: 2026-03-31
Status: Proposal
Tags: codebase, refactor, maintainability, testing, architecture
Pitch
PlanExe has strong architectural intent, good repo-level guardrails, and a meaningful test footprint, but several core services have accumulated large load-bearing modules, mixed production and experimental code, and uneven operational hygiene. This proposal defines a cleanup program that improves maintainability without breaking service contracts or slowing ongoing product work.
Problem
The codebase is not chaotic, but it is carrying too much complexity in too few files. The main issues found during inspection were:
- Giant modules that concentrate unrelated responsibilities.
- Debug and operational scaffolding mixed into production entrypoints.
- Experimental and proof-of-concept code living too close to production paths.
- Broad exception handling that weakens failure diagnosis.
- Uneven logging and runtime hygiene across services.
- Incomplete test coverage in some high-risk areas, especially the multi-user frontend.
This matters because PlanExe is evolving into a multi-service execution engine. Large files and mixed concerns slow down review, increase regression risk, and make it harder for both humans and autonomous agents to safely change the system.
Feasibility
This cleanup is feasible now because the repo already has several advantages:
- Service boundaries are documented in package-level `AGENTS.md` files.
- Shared contracts are called out explicitly for `worker_plan`, `database_api`, `worker_plan_api`, `mcp_cloud`, and the frontends.
- There is already a meaningful unit-test base across `worker_plan`, `mcp_cloud`, `mcp_local`, and shared utilities.
The main constraint is backward compatibility. We should not redesign public APIs while cleaning internals. The cleanup must preserve:
- `worker_plan` request and response shapes.
- `mcp_cloud` and `mcp_local` tool contracts.
- Shared DB models and legacy compatibility behavior.
Proposal
Define a staged remediation program focused on six concrete hygiene issues.
Issue 1: Giant Load-Bearing Modules
Evidence
Several files are too large and own too many responsibilities:
- `worker_plan/worker_plan_internal/plan/run_plan_pipeline.py` at 4,288 lines.
- `frontend_multi_user/src/app.py` at 3,857 lines.
- `worker_plan_database/app.py` at 1,520 lines.
- `mcp_cloud/http_server.py` at 1,431 lines.
These files are not just large. They mix routing, orchestration, validation, operational policy, artifact handling, billing, auth, or workflow logic in the same module.
Why it is a problem
Large modules make the code harder to reason about, harder to test in isolation, and easier to break when adding unrelated features. They also push reviewers toward shallow approval because a single diff can span too many concerns.
Fix steps
- ~~Split `frontend_multi_user/src/app.py` by concern into `auth`, `billing`, `admin`, `downloads`, `account`, and `plan_routes`.~~ Done (PR #476): Split 3,857-line monolith into 6 Flask Blueprint modules + utils (`app.py` reduced to 1,441 lines). Follow-up fix: updated all `url_for()` calls in templates to use blueprint-prefixed endpoint names (`plan_routes.*`, `auth.*`, `downloads.*`).
- ~~Split `mcp_cloud/http_server.py` into `middleware`, `route_registration`, `tool_http_bridge`, and `server_boot`.~~ Done: Split 1,439-line monolith into 4 focused modules + re-export shim.
- ~~Convert `worker_plan/worker_plan_internal/plan/run_plan_pipeline.py` from a giant task registry file into a thin pipeline assembly module plus task-specific modules grouped by stage.~~ Done: Split 4,257-line monolith into ~66 individual stage files under `stages/` + a framework-only core module (563 lines).
- Extract reusable orchestration helpers from `worker_plan_database/app.py` into focused worker, billing, and queue modules.
- Set an internal size target for service modules. As a starting rule, new files should stay below roughly 500 lines unless there is a strong reason not to.
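The blueprint-based split can be sketched as follows. This is a minimal illustration, not PlanExe's actual code: the blueprint names come from the proposal, while the routes and handler bodies are hypothetical placeholders.

```python
# Sketch of the post-split app assembly, assuming each concern module
# exposes a Flask Blueprint. In the real split, each blueprint would
# live in its own file (e.g. frontend_multi_user/src/auth.py).
from flask import Flask, Blueprint

auth = Blueprint("auth", __name__)
plan_routes = Blueprint("plan_routes", __name__)

@auth.route("/login")
def login():
    return "login form"  # placeholder handler

@plan_routes.route("/plans")
def list_plans():
    return "plan list"  # placeholder handler

def create_app() -> Flask:
    # app.py shrinks to assembly: register each concern's blueprint.
    app = Flask(__name__)
    for bp in (auth, plan_routes):
        app.register_blueprint(bp)
    return app
```

Note that blueprint registration prefixes endpoint names (`auth.login`, `plan_routes.list_plans`), which is exactly why template `url_for()` calls needed the follow-up fix mentioned above.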
Issue 2: Debug Scaffolding in Production Entrypoints
Evidence
Production-facing files still contain direct startup `print()` diagnostics or ad hoc debugging traces, for example in:
- `mcp_cloud/http_server.py`
- `worker_plan/worker_plan_internal/llm_factory.py`
- Several task modules that print query and response payloads in executable paths
Why it is a problem
Direct prints are sometimes useful during incident response, but they are not a coherent observability strategy. They create inconsistent runtime output, complicate log filtering, and encourage one-off diagnostics instead of structured instrumentation.
Fix steps
- ~~Replace entrypoint `print()` startup breadcrumbs with structured `logging` calls at `INFO` or `DEBUG`.~~ Done (PR #474): Converted 22 `[startup]` prints in `mcp_cloud/http_server.py` and `mcp_cloud/db_setup.py` to a `_startup_log()` helper.
- ~~Gate verbose diagnostics behind explicit env vars such as `PLANEXE_DEBUG_STARTUP` or module-specific debug flags.~~ Done (PR #474): Startup breadcrumbs now gated behind `PLANEXE_DEBUG_STARTUP=1`.
- Move sample-driver code and debugging helpers into `if __name__ == "__main__":` blocks or dedicated scripts.
- Add a lightweight test or lint-like assertion that production modules do not contain uncategorized top-level `print()` calls.
Issue 3: Experimental Code Mixed with Production Code
Evidence
The repo contains `worker_plan/worker_plan_internal/proof_of_concepts/` with 19 Python files, plus several `experimental_premise_attack*.py` modules under production-adjacent diagnostics paths.
Why it is a problem
Experimental work is good. Leaving it in production-adjacent trees without strong boundaries makes discovery noisier, encourages accidental coupling, and makes the production surface look less intentional than it is.
Fix steps
- ~~Move proof-of-concept code into a clearly isolated top-level `experiments/` or `research/` area, or explicitly mark it as non-production in module names and docs.~~ Done (PR #474): Moved 18 PoC scripts to top-level `experiments/`.
- ~~Move experimental diagnostics variants out of the default runtime namespace unless they are active candidates for shipping.~~ Done (PR #474): Deleted 6 `experimental_premise_attack{1-6}.py` files superseded by production `premise_attack.py`.
- ~~Add a short README in each experimental area defining its status, owner, and graduation criteria.~~ Done (PR #474): Added `experiments/README.md`.
- Ensure production imports never depend on experimental modules.
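The last step above can be enforced mechanically. A possible guard, sketched here under the assumption that experimental code lives in a top-level `experiments/` package, is a plain AST walk over each production module's source:

```python
# Check whether a module's source imports anything from the
# experiments/ package. The directory name follows the proposal;
# the check could run inside the existing test suite over all
# production source files.
import ast

def imports_experiments(source: str) -> bool:
    """Return True if the source imports from the experiments package."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "experiments" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like `from . import x`.
            if node.module and node.module.split(".")[0] == "experiments":
                return True
    return False
```

A test would iterate production files and assert `imports_experiments(...)` is false for each.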
Issue 4: Broad Exception Handling
Evidence
The repo contains many `except Exception:` blocks across service and runtime code, including in:
- `frontend_multi_user/src/app.py`
- `worker_plan_database/app.py`
- `mcp_cloud/http_server.py`
- `worker_plan/app.py`
- `worker_plan_internal/llm_util/*`
Some are reasonable containment boundaries, but many are too broad to preserve actionable failure context.
Why it is a problem
Broad exception handling hides root causes, weakens retry logic, and makes it easier for bugs to silently degrade behavior rather than fail in a controlled and visible way.
Fix steps
- Audit all `except Exception:` blocks and classify them as `intentional boundary`, `temporary workaround`, or `should narrow`.
- Narrow handlers to specific exception classes where possible.
- Where a broad boundary is required, log structured context and re-raise domain-specific exceptions rather than generic failures.
- Add tests for failure classification in critical paths such as MCP request handling, billing, downloads, and pipeline stop/retry logic.
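The boundary pattern in the third step can be sketched as follows. `PipelineStageError` and `run_stage` are illustrative names, not existing PlanExe APIs; the exception classes caught are stand-ins for whatever a given stage can actually raise.

```python
# Narrowing a broad handler at a containment boundary: catch specific
# classes, log structured context, and re-raise a domain exception so
# retry logic sees a classified failure instead of a bare Exception.
import logging

logger = logging.getLogger(__name__)

class PipelineStageError(RuntimeError):
    """Domain-specific failure raised at the pipeline boundary."""
    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"stage {stage!r} failed: {cause}")
        self.stage = stage

def run_stage(stage: str, task):
    try:
        return task()
    except (ValueError, KeyError, TimeoutError) as exc:
        # Known, classifiable failures: keep the context and chain the
        # original cause so diagnosis is not lost.
        logger.error("stage failed", extra={"stage": stage, "kind": type(exc).__name__})
        raise PipelineStageError(stage, exc) from exc
```

Anything outside the listed classes still propagates unchanged, which is the point: unexpected bugs fail loudly instead of being silently absorbed.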
Issue 5: Uneven Logging and Runtime Hygiene
Evidence
The codebase contains many local `logging.basicConfig(...)` calls spread across services and executable modules, plus a mix of logging styles and one-off debug behavior.
Why it is a problem
Distributed `basicConfig` calls make runtime behavior inconsistent and harder to control. They also blur the line between library code, service entrypoints, and local scripts.
Fix steps
- Restrict `logging.basicConfig(...)` to service entrypoints and dedicated CLI scripts.
- Remove logging configuration from reusable library modules.
- Define a small shared logging helper for PlanExe service startup so format and level handling are consistent.
- Standardize logger naming and expected levels for normal operation, diagnostics, and failure cases.
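A possible shape for the shared helper, assuming each service entrypoint calls it exactly once. The helper name, env var, and format string are proposals, not existing PlanExe code.

```python
# Shared startup logging setup, called once per service entrypoint.
# Library modules only ever do `logging.getLogger(__name__)` and never
# call basicConfig themselves.
import logging
import os

def setup_service_logging(service: str) -> logging.Logger:
    """Configure root logging once, at the service entrypoint only."""
    level_name = os.environ.get("PLANEXE_LOG_LEVEL", "INFO")
    logging.basicConfig(
        level=getattr(logging, level_name.upper(), logging.INFO),
        format=f"%(asctime)s {service} %(name)s %(levelname)s %(message)s",
    )
    return logging.getLogger(service)
```

Note that `basicConfig` is a no-op if the root logger is already configured, which is the desired behavior here: only the first caller (the entrypoint) wins.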
Issue 6: Test Gaps in High-Risk Service Areas
Evidence
The overall repo has a respectable test footprint, but `frontend_multi_user` explicitly notes that no automated tests currently exist for many UI or DB flows.
Why it is a problem
The multi-user frontend handles auth, admin flows, billing, downloads, and user account state. That is too much business risk to leave mostly protected by manual confidence and good intentions.
Fix steps
- Add focused unit tests around billing, account state, admin user resolution, plan retry, and artifact download behavior.
- Add tests for helpers extracted from `frontend_multi_user/src/app.py` as part of the module split.
- Prioritize tests for failure paths, not just success paths.
- Keep tests close to the logic they protect so refactors remain cheap.
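The failure-path emphasis can be illustrated with a small test in the proposed style. `charge_credits` and `InsufficientCredits` are hypothetical names invented for this sketch, not real PlanExe billing code.

```python
# A focused unit test that exercises the failure branch of a
# hypothetical billing helper, asserting both the error type and
# that the message carries actionable context.
class InsufficientCredits(Exception):
    pass

def charge_credits(balance: int, cost: int) -> int:
    """Deduct cost from balance, refusing to go negative."""
    if cost > balance:
        raise InsufficientCredits(f"need {cost}, have {balance}")
    return balance - cost

def test_charge_rejects_overdraft():
    try:
        charge_credits(balance=3, cost=10)
    except InsufficientCredits as exc:
        assert "need 10" in str(exc)
    else:
        raise AssertionError("overdraft was silently allowed")
```

The success path (`charge_credits(10, 3) == 7`) is worth one assertion; the refusal path is where the business risk lives.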
Implementation Plan
Phase 1: Inventory and Safety Rails
- Create an inventory of oversized modules, broad exception handlers, and top-level print/debug usage.
- Tag each item as `refactor now`, `refactor when touched`, or `leave as boundary`.
- Add lightweight tests around current behavior before moving code in the largest modules.
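The inventory pass can be largely automated. A sketch of a per-module scanner, where the returned field names and the scope of the checks are assumptions for illustration:

```python
# Inventory a module's source: line count, broad exception handlers,
# and top-level print() calls, using a plain AST walk.
import ast

def inventory(source: str) -> dict:
    tree = ast.parse(source)
    # Bare `except:` or `except Exception:` handlers.
    broad = sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler)
        and (node.type is None
             or (isinstance(node.type, ast.Name) and node.type.id == "Exception"))
    )
    # print(...) calls at module top level (not nested in functions).
    top_prints = sum(
        1
        for node in tree.body
        if isinstance(node, ast.Expr)
        and isinstance(node.value, ast.Call)
        and isinstance(node.value.func, ast.Name)
        and node.value.func.id == "print"
    )
    return {
        "lines": len(source.splitlines()),
        "broad_excepts": broad,
        "top_level_prints": top_prints,
    }
```

Running this over every production module yields the raw list that the tagging step then classifies.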
Phase 2: Split the Worst Offenders
- ~~Refactor `frontend_multi_user/src/app.py` first because it mixes the most distinct business concerns.~~ Done (PR #476). Template `url_for()` references fixed to match new blueprint endpoints.
- ~~Refactor `mcp_cloud/http_server.py` second because it sits on a public protocol boundary.~~ Done.
- Refactor `worker_plan_database/app.py` and `run_plan_pipeline.py` in smaller slices to avoid destabilizing the execution engine.
Phase 3: Remove Operational Noise
- Replace production `print()` usage with structured logging.
- Centralize startup logging setup per service.
- Move or label experimental modules so the production tree is easier to navigate.
Phase 4: Exception and Test Cleanup
- Narrow broad exception handlers in the services touched during earlier phases.
- Add targeted regression tests for every cleanup area.
- Update docs where module entrypoints or development workflows change.
Integration Points
This proposal integrates with existing PlanExe boundaries rather than fighting them:
- `worker_plan/AGENTS.md` already defines the public worker API and internal separation rules.
- `mcp_cloud/AGENTS.md` already documents the internal split that `http_server.py` should better reflect in code.
- `frontend_multi_user/AGENTS.md` already calls out DB and artifact invariants that should survive route extraction.
- The existing `python test.py` convention can remain the top-level test entrypoint while coverage expands.
Success Metrics
- No production-facing Python module above 1,500 lines after the first cleanup wave.
- ~~`frontend_multi_user/src/app.py` reduced by at least 50% through route and helper extraction.~~ Done (PR #476): Reduced by 63% (3,857 → 1,441 lines).
- ~~`mcp_cloud/http_server.py` reduced to a focused HTTP assembly module rather than a mixed implementation file.~~ Done: Split into `server_boot.py`, `middleware.py`, `tool_http_bridge.py`, `route_registration.py` + re-export shim.
- Zero uncategorized top-level `print()` statements in production service modules.
- Documented justification for all remaining `except Exception:` boundaries in service code.
- New automated tests covering multi-user billing, retry, and download flows.
Risks
- Refactors may accidentally break backward compatibility across services. Mitigation: keep public contracts frozen and add regression tests before moving code.
- Cleanup work may turn into an endless style exercise with no shipping value. Mitigation: prioritize only high-leverage areas with operational or review cost.
- Pipeline refactors may destabilize long-running plan generation. Mitigation: split orchestration carefully and preserve task behavior while moving code.
Why Now
PlanExe is already at the point where architectural cleanliness affects product velocity. The codebase has enough quality and enough structure to justify a cleanup pass, but it also has enough scale that delaying the work will make later changes slower, riskier, and more expensive. This is the right time to pay down the structural debt while the service boundaries are still understandable and before more execution-engine features pile on top.