
Proposal 80 — Zero-Downtime Railway Deployment

Status

Proposed

Problem

Deploying to Railway currently causes disruption:

  • MCP clients (Claude Code, Cursor, etc.) have active HTTP connections to mcp.planexe.org. When the mcp_cloud container restarts, in-flight plan_create calls fail, SSE progress streams disconnect, and download URLs return 502 during the restart window.
  • Frontend users on home.planexe.org get 502 errors while frontend_multi_user rebuilds and restarts.
  • Running plans are abandoned mid-execution when worker_plan_database restarts. The task stays in processing state indefinitely (or until a heartbeat timeout marks it failed), wasting the user's credits.

Railway rebuilds each service from its Dockerfile on every deploy. During the build + restart window (30-120 seconds per service), the old container is stopped and the new one is not yet healthy.

Current Deployment Topology

Internet
  |
  +-- mcp.planexe.org    --> Railway: mcp_cloud        (FastAPI/Uvicorn, port 8001)
  +-- home.planexe.org   --> Railway: frontend_multi_user (Flask/Gunicorn, port 5000)
  |
  +-- (internal) worker_plan          (FastAPI/Uvicorn, port 8000)
  +-- (internal) worker_plan_database (Flask, no HTTP — polls DB)
  +-- (internal) database_postgres    (PostgreSQL 16)

Current railway.toml settings

Service                Restart Policy            Health Check
mcp_cloud              ON_FAILURE (10 retries)   /healthcheck (100s timeout)
frontend_multi_user    NEVER                     10s interval
worker_plan            NEVER                     10s interval
worker_plan_database   NEVER                     10s interval

What happens today on deploy

  1. Railway receives a git push (or manual redeploy trigger).
  2. All services with matching watchPatterns are rebuilt in parallel.
  3. For each service, Railway stops the old container (sends SIGTERM, waits ~10s, SIGKILL) then starts the new one.
  4. Traffic is routed to the new container only after health check passes.
  5. During the gap between old-container-down and new-container-healthy, Railway returns 502 to all requests.

Solution

A combination of Railway-side and application-side changes to eliminate or minimize downtime.

Part 1: Deploy Order (Manual Checklist)

Services have a dependency graph. Deploying in the wrong order can cause cascading failures (e.g., mcp_cloud starts before database_postgres migration finishes). The safe order:

1. database_postgres     (no downtime — Railway Postgres addon persists across deploys)
2. worker_plan           (stateless HTTP API, no active connections from end users)
3. worker_plan_database  (background worker — see Part 3 for graceful drain)
4. mcp_cloud             (public-facing — see Part 2)
5. frontend_multi_user   (public-facing — see Part 2)

Steps 4 and 5 are independent and can run in parallel, but each individually needs the zero-downtime treatment from Part 2.

For routine code changes that only affect one service, deploy just that service. The watchPatterns in each railway.toml already control which services rebuild on a git push.
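
The safe order above is just a topological sort of the service dependency graph. A stdlib sketch — the edge set is an assumption inferred from this proposal's topology diagram, not taken from the codebase:

```python
from graphlib import TopologicalSorter

# Service -> services it depends on. Edges inferred from the topology
# diagram above; treat them as an assumption.
DEPENDENCIES = {
    "database_postgres": set(),
    "worker_plan": {"database_postgres"},
    "worker_plan_database": {"database_postgres", "worker_plan"},
    "mcp_cloud": {"database_postgres", "worker_plan"},
    "frontend_multi_user": {"database_postgres", "worker_plan"},
}

def deploy_order() -> list[str]:
    """Services in dependency-safe deploy order (dependencies first)."""
    return list(TopologicalSorter(DEPENDENCIES).static_order())

order = deploy_order()
assert order[0] == "database_postgres"
assert order.index("worker_plan") < order.index("mcp_cloud")
```

Keeping the graph in one place makes the checklist easy to extend when a new service is added, instead of relying on everyone remembering the ordering.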

Part 2: Zero-Downtime for Public-Facing Services

Option A (preferred): Replicas with rolling deploys. Railway supports running multiple replicas per service. When a new version is deployed, Railway:

  1. Starts new replica(s) alongside the old ones.
  2. Waits for new replica(s) to pass health checks.
  3. Drains traffic from old replica(s) — stops routing new requests but allows in-flight requests to complete.
  4. Stops old replica(s).

Configuration changes needed:

mcp_cloud/railway.toml:

[build]
builder = "DOCKERFILE"
dockerfilePath = "/mcp_cloud/Dockerfile"
watchPatterns = ["/mcp_cloud/**", "/database_api/**", "/worker_plan/worker_plan_api/**"]
context = "."

[deploy]
startCommand = "python -m mcp_cloud.http_server"
healthcheckPath = "/healthcheck"
healthcheckTimeout = 100
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 10
numReplicas = 2

frontend_multi_user/railway.toml:

[build]
builder = "DOCKERFILE"
dockerfilePath = "/frontend_multi_user/Dockerfile"
watchPatterns = ["/frontend_multi_user/**", "/worker_plan/worker_plan_api/**", "/database_api/**"]
context = "."

[deploy]
healthcheckPath = "/healthcheck"
healthcheckTimeout = 100
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
numReplicas = 2

Cost: 2x for these two services. On Railway's Pro plan, this is typically the cheapest path to zero downtime.

Why it works: Both services are stateless. mcp_cloud uses FastAPI/Uvicorn and keeps all state in Postgres. frontend_multi_user uses Flask/Gunicorn with signed session cookies: the session payload travels with the client, so as long as every replica shares the same SECRET_KEY, a cookie minted by one replica validates on any other and users are not logged out by a replica swap.

Option B: Blue-Green via Railway Environments (Alternative)

If replicas are not available or too costly:

  1. Create a staging environment in Railway that mirrors production.
  2. Deploy to staging first. Smoke-test against the staging URL.
  3. Swap the custom domain (mcp.planexe.org, home.planexe.org) from the production service to the staging service via Railway dashboard or CLI.
  4. Domain swap is near-instant (Railway updates its reverse proxy, no DNS change needed).

Drawback: Manual process, requires maintaining two environments.
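
Step 2's smoke test can be scripted. A minimal sketch — the staging URL is a placeholder, and the fetch callable is injectable so the helper can be exercised without a network:

```python
from urllib.request import urlopen

def smoke_test(base_url: str, fetch=None) -> bool:
    """True if {base_url}/healthcheck answers 200. Run this before
    swapping the custom domain from production to staging."""
    fetch = fetch or (lambda url: urlopen(url, timeout=10).status)
    try:
        return fetch(base_url + "/healthcheck") == 200
    except OSError:
        return False

# Example (placeholder Railway-generated staging domain):
# smoke_test("https://planexe-staging.up.railway.app")
```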

Part 3: Graceful Shutdown for worker_plan_database

The background worker is the highest-risk service during deploys. It runs long tasks (10-20 minutes) and has no HTTP interface for Railway to drain.

Current behavior on SIGTERM

worker_plan_database runs start_task_monitor(), a while True loop around process_pending_tasks(). On SIGTERM:

  • Python raises SystemExit (or the process is killed after Railway's grace period).
  • If a task is mid-execution, the subprocess running the pipeline is killed.
  • The task stays in processing state in the database.

Proposed: SIGTERM handler with graceful drain

Add a signal handler that:

  1. Sets a _shutdown_requested flag.
  2. Stops the main loop from picking up new tasks once the flag is set.
  3. Waits for the current task (if any) to finish, up to a configurable timeout.
  4. If the timeout expires, marks the in-progress task as failed with a clear message so it can be retried.

Code change in worker_plan_database/app.py:

import signal
import threading

# `logger` is the module-level logger already configured in app.py.
_shutdown_requested = threading.Event()

def _handle_sigterm(signum, frame):
    logger.info("SIGTERM received. Will stop after current task completes.")
    _shutdown_requested.set()

# Install for both SIGTERM (Railway stop) and SIGINT (local Ctrl-C).
signal.signal(signal.SIGTERM, _handle_sigterm)
signal.signal(signal.SIGINT, _handle_sigterm)

Update start_task_monitor():

def start_task_monitor():
    logger.info("Started monitoring database for pending tasks.")
    try:
        last_heartbeat_time = time.time()
        while not _shutdown_requested.is_set():
            processed_something = process_pending_tasks()
            # Use Event.wait() instead of time.sleep() so SIGTERM wakes us immediately
            _shutdown_requested.wait(timeout=1 if processed_something else 5)

            new_heartbeat_time = time.time()
            if processed_something:
                last_heartbeat_time = new_heartbeat_time
            if new_heartbeat_time - last_heartbeat_time > HEARTBEAT_INTERVAL_IN_SECONDS:
                last_heartbeat_time = new_heartbeat_time
                with app.app_context():
                    WorkerItem.upsert_heartbeat(worker_id=WORKER_ID)
    except KeyboardInterrupt:
        # Largely unreachable once the SIGINT handler above is installed;
        # kept as a safety net for startup paths before handler registration.
        logger.info("KeyboardInterrupt received. Stopping task monitor...")
    except Exception as e:
        logger.critical(f"Unhandled exception in task monitor: {e}", exc_info=True)
    finally:
        logger.info("Task monitor shut down.")
        logging.shutdown()
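
The drain behavior can be verified locally without Railway. A self-contained sketch showing the key property of the Event-based loop — setting the event wakes the wait immediately instead of sleeping out the full poll interval:

```python
import threading
import time

shutdown = threading.Event()
cycles = []

def monitor():
    # Same shape as start_task_monitor(): poll, then interruptible sleep.
    while not shutdown.is_set():
        cycles.append(time.monotonic())  # stand-in for process_pending_tasks()
        shutdown.wait(timeout=5)         # returns immediately once set

t = threading.Thread(target=monitor)
start = time.monotonic()
t.start()
time.sleep(0.2)   # let the loop enter its 5-second wait
shutdown.set()    # what the SIGTERM handler does
t.join()
assert time.monotonic() - start < 1.0  # exited well before the 5s timeout
```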

Railway grace period: Increase the SIGTERM-to-SIGKILL grace period so the worker has time to finish its current task. Railway's default is 10 seconds, which is far too short for a 10-20 minute pipeline run.

In the Railway dashboard (per-service settings), set the drain timeout / shutdown grace period to a value that covers the worst case. Unfortunately, railway.toml does not currently expose a shutdownGracePeriodSeconds key — this must be set in the Railway dashboard or via the Railway API.

As a fallback, the worker should mark its current task as failed on forced shutdown, so the user can retry via plan_retry:

def _mark_current_task_failed_on_shutdown():
    """Best-effort: mark current processing task as failed so user can retry."""
    with app.app_context():
        task = PlanItem.query.filter_by(
            state=PlanState.processing,
        ).filter(
            PlanItem.parameters.contains({"_worker_id": WORKER_ID})
        ).first()
        if task:
            task.state = PlanState.failed
            task.progress_message = "Worker restarted during execution. Use plan_retry to re-run."
            db.session.commit()
            logger.info("Marked task %s as failed due to shutdown.", task.id)

Call this in the finally block of start_task_monitor().

Part 4: SSE Reconnection Resilience

MCP clients monitoring plan progress via SSE (GET /sse/plan/{plan_id}) will lose their connection during a deploy. The SSE implementation already sends a retry: 5000 hint (line 95 of sse.py), which tells well-behaved SSE clients (including EventSource in browsers) to automatically reconnect after 5 seconds.

Current state: This already works. After the new container passes its health check, the client reconnects and resumes receiving progress events. No code change needed.

Improvement (optional): Add Last-Event-ID support so clients can resume from where they left off instead of receiving only the current snapshot. This is a nice-to-have, not a requirement — the current behavior (reconnect and get the latest state) is sufficient because SSE events are idempotent status snapshots, not an ordered event log.
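
For reference, resumable streams hinge on the id: field of each SSE frame — the browser's EventSource echoes the last one it saw back as a Last-Event-ID request header on reconnect. A sketch of the frame layout (the function and field values are illustrative, not the actual sse.py code):

```python
import json

def format_sse(data: dict, event_id: int, retry_ms: int = 5000) -> str:
    """One SSE frame: retry hint, resumption id, JSON payload.
    The client echoes `id:` back as Last-Event-ID on reconnect."""
    return (
        f"retry: {retry_ms}\n"
        f"id: {event_id}\n"
        f"data: {json.dumps(data)}\n"
        "\n"  # blank line terminates the frame
    )

frame = format_sse({"state": "processing", "progress": 40}, event_id=7)
assert "id: 7" in frame and frame.endswith("\n\n")
```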

Part 5: MCP Client Retry Guidance

MCP clients that get a 502 during a deploy should retry. For plan_status, plan_file_info, and plan_list, retries are naturally safe (read-only operations). plan_create has no partial-execution hazard — each call simply creates a new plan — but for that same reason a blind retry after an ambiguous failure can produce a duplicate plan, so clients should confirm via plan_list before re-issuing it.

This is primarily a documentation concern. The MCP server instructions already tell clients to poll plan_status periodically, which implicitly handles transient failures.
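
Client-side, a transient 502 during the deploy window is handled with bounded exponential backoff. A sketch — the exception class stands in for whatever the client's HTTP transport raises on a 502, and the sleep function is injectable for testing:

```python
import time

class Transient502(Exception):
    """Stand-in for an HTTP 502 raised by the client's transport."""

def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn() on 502 with exponential backoff: 0.5s, 1s, 2s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except Transient502:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```

Read-only calls (plan_status, plan_file_info, plan_list) can be wrapped freely; wrap plan_create only when a duplicate plan on retry is acceptable.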

Part 6: Database Migrations During Deploy

Schema migrations run at service startup (ensure_*_columns() functions). These use ALTER TABLE ADD COLUMN IF NOT EXISTS, which is:

  • Idempotent: safe to run from multiple services simultaneously.
  • Fast in PostgreSQL: ADD COLUMN without a volatile default takes only a brief lock and does not rewrite the table.
  • Backward compatible: new columns are nullable, so old code (still running during the rolling deploy) is unaffected.

No changes needed. The current migration pattern is already deploy-safe.
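
The pattern generalizes. Postgres gets idempotence directly from the IF NOT EXISTS clause; the sketch below expresses the same additive migration against sqlite, chosen only so the example runs on the stdlib (sqlite lacks that clause, hence the explicit existence check). Table and column names are illustrative:

```python
import sqlite3

def ensure_column(conn, table: str, column: str, ddl_type: str) -> None:
    """Additive, idempotent migration: add the column only if missing.
    In PostgreSQL this is simply:
        ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} {ddl_type};
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plan_item (id INTEGER PRIMARY KEY)")
ensure_column(conn, "plan_item", "worker_id", "TEXT")
ensure_column(conn, "plan_item", "worker_id", "TEXT")  # safe to run twice
```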


Deploy Checklist

Before deploying

  • [ ] Verify the change is on the correct branch and CI passes.
  • [ ] Check watchPatterns in railway.toml — confirm only the intended services will rebuild.
  • [ ] If schema migrations are included, verify they are additive (nullable columns, IF NOT EXISTS).

During deploy

  • [ ] If deploying worker_plan_database: check that no critical long-running tasks are in processing state. If there are, either wait for them to finish or accept that they will be retried.
  • [ ] Monitor Railway deploy logs for health check pass on each service.
  • [ ] Check /healthcheck on both mcp.planexe.org and home.planexe.org after deploy.

After deploy

  • [ ] Verify mcp.planexe.org/mcp/tools returns the tool list (confirms MCP is operational).
  • [ ] Verify home.planexe.org/healthcheck returns {"status": "ok", "database": "ok"}.
  • [ ] Check for any tasks stuck in processing state that may need plan_retry.
  • [ ] Spot-check SSE: curl -N https://mcp.planexe.org/sse/plan/<recent-plan-id> returns heartbeats.

Files Changed

File                              Change
mcp_cloud/railway.toml            Add numReplicas = 2, add healthcheckTimeout
frontend_multi_user/railway.toml  Add numReplicas = 2, add healthcheckPath, healthcheckTimeout
worker_plan_database/app.py       Add SIGTERM handler, graceful drain in start_task_monitor(), mark-failed-on-shutdown

Risks and Mitigations

  • Risk: 2x replica cost for mcp_cloud and frontend.
    Mitigation: Scale back to 1 replica during low-traffic periods. Railway bills per-usage, so idle replicas cost very little.
  • Risk: Worker SIGTERM handler doesn't fire (SIGKILL before the handler runs).
    Mitigation: The _mark_current_task_failed_on_shutdown() fallback in finally handles this. Even without it, the existing heartbeat timeout mechanism eventually marks stale tasks as failed.
  • Risk: Rolling deploy routes traffic to a new replica before migrations finish.
    Mitigation: Migrations are idempotent ADD COLUMN IF NOT EXISTS — they complete in milliseconds and are safe to run concurrently. The health check only passes after the Flask/FastAPI app fully initializes (including migrations).
  • Risk: SSE connections drop during deploy.
    Mitigation: Already handled — the retry: 5000 hint causes automatic reconnection, and the new replica serves the reconnected client with current state.
  • Risk: Session cookies from the old frontend replica are invalid on the new replica.
    Mitigation: Flask signs session cookies with SECRET_KEY (from an env var). As long as SECRET_KEY is the same across replicas (it is — set via Railway shared variables), sessions are valid on any replica.
  • Risk: Database connection pool exhaustion with 2x replicas.
    Mitigation: Each service uses SQLAlchemy with pool_recycle=280 and pool_pre_ping=True. The default pool is 5 connections per worker, plus up to 10 overflow, so 2 replicas x 4 gunicorn workers is at most 2 x 4 x (5 + 10) = 120 connections for the frontend. Railway Postgres supports hundreds of connections.