Adversarial Red-Team Reality Check for Plans
Author: Simon Strandgaard
Date: 2026-02-18
Status: Proposal
Tags: red-team, verification, anti-sycophancy, quality, governance
Pitch
After PlanExe generates a draft plan, send it to a panel of external models that aggressively challenge assumptions, feasibility, timelines, costs, and hidden constraints. The goal is not polite feedback; the goal is stress testing under hostile scrutiny and measuring whether the original planner can defend valid reasoning or collapses into sycophantic agreement.
Problem
A single-model plan can look coherent while being fragile.
Common failure modes:
- The planner overcommits with optimistic assumptions.
- The planner misses obvious real-world constraints.
- The planner folds too easily when challenged, or agrees with contradictory criticism.
- Users receive smooth prose instead of resilient plans.
Current outputs are under-tested against adversarial critique.
Feasibility
This is feasible with existing PlanExe architecture:
- We already have structured artifacts and intermediate files.
- We already run multi-step workflows and can add post-generation stages.
- We can gate this feature behind speed/detail mode or an explicit parameter.
Constraints:
- Extra model calls increase cost and latency.
- Prompt design must avoid toxic output while preserving adversarial rigor.
- We need deterministic scoring so users can trust the result.
Proposal
Add a post-generation red-team stage with 3 roles:
- Planner (Original): the model that created the plan.
- Red Team Panel (External): multiple different models/providers that challenge the plan.
- Judge (Arbiter): a separate model that scores arguments for factual grounding, internal consistency, and realism.
Core behavior:
- Red-team prompts should explicitly include hard challenge language (for example, direct claims that the plan is unrealistic or unworkable) to trigger non-sycophantic behavior.
- The planner must respond with evidence-backed defense, partial concession, or revision proposals.
- The judge scores each exchange and records whether the planner:
- defended correctly,
- conceded correctly,
- or agreed incorrectly (sycophancy failure).
Workflow
- Generate baseline plan (existing flow).
- Build challenge packet:
- assumptions
- budget/timeline/resource claims
- risk register summary
- Run adversarial roundtable:
- N red-team critiques (diverse models)
- planner rebuttal to each critique
- Run judge pass:
- score each critique/rebuttal pair
- identify collapses, contradictions, and unsupported agreement
- Produce outputs:
- revised plan sections (if needed)
- red-team report
- anti-sycophancy score
Output Artifacts
redteam/challenges.jsonredteam/rebuttals.jsonredteam/judgments.jsonredteam/summary.mdredteam/anti_sycophancy_score.json
Include a user-facing summary in the report:
- “Most severe realism failures”
- “Where the planner stood firm and was correct”
- “Where the planner caved and was incorrect”
- “Required revisions before execution”
Scoring Model
Suggested metrics:
realism_failure_countcritical_assumption_invalidated_countunsupported_agreement_count(primary anti-sycophancy metric)defensible_pushback_countrevision_impact_score(how much of plan changed after challenge)
Aggregate:
plan_resilience_score(0-100)anti_sycophancy_score(0-100)
Integration Points
- Post-processing stage in
worker_plan_internal.plan.run_plan_pipeline. - Optional config in task parameters (for example:
redteam_mode). - Report generation pipeline to include red-team findings.
- MCP/file outputs for download and auditability.
Phased Implementation
Phase A: Minimal Red-Team Pass
- Add one external model challenge + one rebuttal + one judge.
- Emit basic summary and anti-sycophancy score.
Phase B: Multi-Model Panel
- Expand to 3-5 challengers from different providers.
- Add disagreement clustering and contradiction detection.
Phase C: Enforcement Mode
- Add optional gate: plans with resilience score below threshold are marked “needs revision” before user export.
Success Metrics
- Reduction in downstream plan corrections after user review.
- Increase in detected unrealistic assumptions before execution.
- Stable anti-sycophancy metric across repeated adversarial prompts.
- User-rated trust improvement in final plans.
Risks
- Overly aggressive red-team prompts may degrade quality if not controlled.
- Judge model can introduce bias or inconsistent scoring.
- Added latency may reduce usability for fast iterations.
Mitigations:
- Keep challenge style aggressive but policy-safe.
- Add rubric-based judging with structured outputs.
- Make red-team intensity configurable (
off,standard,aggressive).