Agent-Facing Specification: Drift Evaluation Between Initial Prompt and Generated Plan
- Purpose
This specification defines a strict, operational procedure for evaluating drift between an initial prompt and a generated plan.
It is written for AI agents that must perform the evaluation without prior conversational context.
The evaluator’s job is not to judge whether the plan is impressive, polished, or well-written. The evaluator’s job is to determine whether the generated plan remains faithful to the prompt’s intended meaning, constraints, boundaries, priorities, and uncertainty posture.
This spec is designed to be executable as an internal evaluation procedure, a QA rubric, or a regression test for planning systems.
⸻
- Core Rule
A generated plan passes only if it becomes more usable without becoming less true.
That is the governing principle.
If the output is richer, clearer, more complete, and more structured, but also introduces unsupported claims, weakens constraints, changes the product, changes the customer, or inflates confidence, then it has failed.
⸻
- Definitions
3.1 Initial prompt
The source instruction, description, or concept from which the plan was generated.
3.2 Generated plan
The long-form output artifact produced from the initial prompt. This may be a strategic plan, roadmap, business plan, proposal, implementation plan, report, or similar document.
3.3 Drift
Any meaningful departure from the source prompt’s commitments, exclusions, logic, scope, or uncertainty.
3.4 Fidelity
The degree to which the generated plan preserves the source prompt’s content and posture.
3.5 Prompt contract
A normalized structured representation of what the initial prompt actually commits to.
The prompt contract is mandatory. No drift evaluation may proceed without it.
⸻
- Non-Negotiable Evaluation Rules
The evaluator must follow these rules exactly.
Rule 1: Extract a prompt contract before evaluating
Do not compare the raw prompt and plan loosely. First convert the prompt into a structured contract.
Rule 2: Evaluate meaning, not wording
Rephrasing is not drift unless meaning changes.
Rule 3: Treat explicit exclusions as high-priority
Anything the prompt says not to do, not to claim, not to position, or not to include must be treated as critical.
Rule 4: Unsupported specificity is suspicious by default
Specific tools, numbers, budgets, timelines, roles, locations, legal claims, metrics, and market claims are presumed unsafe unless supported or clearly flagged as assumptions.
Rule 5: Optional features must remain optional
If the prompt says something is optional, deferred, layered, or future-phase, the generated plan must not silently promote it to a core feature.
Rule 6: Preserve uncertainty
If the prompt is cautious, provisional, or incomplete, the generated plan must preserve that. It must not replace uncertainty with smooth confidence.
Rule 7: Negative space matters
A prompt’s non-goals and omitted claims matter. The model must not fill every gap with plausible strategic filler.
Rule 8: One critical contradiction is enough to fail
Even a high weighted score cannot rescue a plan that materially changes the customer, business model, regulatory posture, or explicit non-goals.
⸻
- Required Inputs
The evaluating agent must receive:
• the initial prompt
• the generated plan
• optionally, metadata such as generation settings, section structure, or intermediate notes
The evaluator must not assume any external knowledge beyond what is in these materials.
⸻
- Required Outputs
The evaluator must produce all of the following:
1. Prompt contract
2. Output claim map
3. Drift incident log
4. Dimension scores
5. Pass/fail decision
6. Revision actions
7. Confidence statement about the evaluation
No partial format is acceptable.
⸻
- Mandatory Evaluation Procedure
Step 1: Build the Prompt Contract
Convert the initial prompt into the following structure.
7.1 Prompt Contract Schema
• Core intent
• Primary problem
• Product/tool/system definition
• Primary buyer
• Primary user
• Other relevant entities
• Target context/domain
• Core value claim
• Business model / GTM
• Implementation scope
• Core features
• Optional features
• Deferred features
• Explicit non-goals
• Explicit exclusions
• Hard constraints
• Risk / uncertainty posture
• Success metrics
• Legal / regulatory posture
• Key assumptions allowed by the prompt
• Claims the prompt explicitly avoids
7.2 Contract Extraction Standard
Each field must be written in concise declarative language.
Bad:
• “It seems to maybe be about…”
Good:
• “Primary buyer: institutional funders with high-volume, inconsistent application pipelines.”
7.3 Mandatory Classification
Each contract item must be tagged as one of:
• explicit
• strongly implied
• weakly implied
Only explicit and strongly implied items may be used as safe support for output claims.
Weakly implied items may support only cautious derived claims.
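For concreteness, the contract can be held as a typed structure so that later steps reference fields rather than loose prose. A minimal sketch, assuming Python dataclasses; the type names and the safe_support helper are illustrative, not mandated by this spec:

```python
# Illustrative only: the spec mandates the 7.1 schema, not any language.
from dataclasses import dataclass, field
from typing import Literal

Evidence = Literal["explicit", "strongly_implied", "weakly_implied"]

@dataclass
class ContractItem:
    text: str           # concise declarative statement, per 7.2
    evidence: Evidence  # classification per 7.3

@dataclass
class PromptContract:
    core_intent: ContractItem
    primary_buyer: ContractItem
    primary_user: ContractItem
    explicit_non_goals: list[ContractItem] = field(default_factory=list)
    explicit_exclusions: list[ContractItem] = field(default_factory=list)
    hard_constraints: list[ContractItem] = field(default_factory=list)
    # remaining fields follow the 7.1 schema in the same pattern

    def safe_support(self) -> list[ContractItem]:
        """Items usable as safe support for output claims (7.3)."""
        items = [self.core_intent, self.primary_buyer, self.primary_user]
        items += (self.explicit_non_goals + self.explicit_exclusions
                  + self.hard_constraints)
        return [i for i in items
                if i.evidence in ("explicit", "strongly_implied")]
```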
⸻
Step 2: Extract the Generated Plan Claim Map
The evaluator must identify all major commitments in the generated plan.
At minimum, extract claims about:
• product identity
• target customer
• target user
• domain/context
• problem statement
• mechanism of value
• business model
• GTM
• implementation phases
• team roles
• timelines
• tools/stack
• metrics
• legal/regulatory claims
• expansion paths
• assumptions
• causal claims
• outcome claims
Each extracted claim must be tagged by importance:
• critical
• important
• secondary
Only critical and important claims affect pass/fail directly. Secondary claims mainly affect quality scores.
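A minimal sketch of one claim-map entry, again assuming Python; the claim_id convention ("C1", "C2", ...) matches the output schema in 15.1:

```python
from dataclasses import dataclass
from typing import Literal, Optional

Importance = Literal["critical", "important", "secondary"]

@dataclass
class Claim:
    claim_id: str                        # e.g. "C1"
    claim_text: str                      # the commitment as stated in the plan
    importance: Importance               # critical/important drive pass/fail
    plan_section: str                    # where in the plan the claim appears
    support_label: Optional[str] = None  # assigned in Step 3
```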
⸻
Step 3: Align Output Claims to Prompt Support
For each claim in the generated plan, assign exactly one support label:
• source-stated
• source-derived
• speculative-but-flagged
• unsupported
• contradictory
8.1 Support Label Rules
source-stated: The claim is directly present in the prompt.
source-derived: The claim is a reasonable inference from multiple prompt elements and does not exceed their strength.
speculative-but-flagged: The claim is added, but clearly marked as an assumption, possibility, or item needing validation.
unsupported: The claim is not grounded in the prompt and is presented without clear uncertainty marking.
contradictory: The claim conflicts with the prompt’s content, exclusions, or structure.
8.2 Strictness rule
If there is any doubt whether a claim is source-derived or unsupported, default to unsupported.
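A sketch of the support vocabulary and the 8.2 tie-break, assuming the Claim type above; the default deliberately resolves to "unsupported":

```python
from typing import Literal

SupportLabel = Literal[
    "source-stated",
    "source-derived",
    "speculative-but-flagged",
    "unsupported",
    "contradictory",
]

def label_borderline_claim(confidently_derived: bool) -> SupportLabel:
    """Apply the 8.2 strictness rule: any doubt resolves to unsupported."""
    return "source-derived" if confidently_derived else "unsupported"
```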
⸻
Step 4: Detect Drift Incidents
Each drift incident must be logged individually.
9.1 Drift Incident Schema
For every incident, record: • incident_id • drift_type • severity • plan_section • output_claim • prompt_contract_reference • support_label • explanation • repair_action
9.2 Allowed Drift Types
• scope_expansion
• constraint_erosion
• unsupported_invention
• confidence_inflation
• business_model_drift
• customer_drift
• mechanism_drift
• priority_drift
• regulatory_drift
• style_induced_semantic_drift
• unsupported_metrics
• invented_operational_detail
• optional_to_core_promotion
• uncertainty_erasure
The evaluator must use these drift types and may introduce a new category only when none of the above fits.
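A minimal sketch of one drift-incident record, assuming Python; field names mirror 9.1, and drift_type is restricted to the closed vocabulary in 9.2:

```python
from dataclasses import dataclass
from typing import Literal

DriftType = Literal[
    "scope_expansion", "constraint_erosion", "unsupported_invention",
    "confidence_inflation", "business_model_drift", "customer_drift",
    "mechanism_drift", "priority_drift", "regulatory_drift",
    "style_induced_semantic_drift", "unsupported_metrics",
    "invented_operational_detail", "optional_to_core_promotion",
    "uncertainty_erasure",
]

@dataclass
class DriftIncident:
    incident_id: str                # e.g. "D1"
    drift_type: DriftType
    severity: int                   # 0-4, per the Severity Rules section
    plan_section: str
    output_claim: str
    prompt_contract_reference: str
    support_label: str
    explanation: str
    repair_action: str
```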
⸻
- Severity Rules
Every incident must receive one severity score from 0 to 4.
Severity 0 — No drift
Harmless elaboration or faithful restatement.
Severity 1 — Minor drift
Slight inflation or unnecessary elaboration that does not alter the plan’s core meaning.
Examples:
• mild jargon creep
• superficial strategic language
• harmless extra section titles
Severity 2 — Moderate drift
A meaningful but not fatal change in emphasis, specificity, or interpretation.
Examples:
• optional feature overemphasized
• light unsupported operational detail
• caveat softened but not removed
Severity 3 — Major drift
A substantial change to what the plan claims, how it works, who it is for, or how certain it is.
Examples:
• workflow tool reframed as intelligence engine
• ungrounded specific metrics used as proof
• platform expansion dominates narrow wedge
• unsupported legal certainty
Severity 4 — Critical drift
A central contradiction or material corruption of the prompt.
Examples:
• target customer changed
• business model changed
• explicit non-goal violated
• banned framing reintroduced
• unsupported invented detail repeated across important sections
• uncertainty replaced by fact in core areas
• advisory product inserted where prompt forbids it
⸻
- Scoring Dimensions
The evaluator must score all 10 dimensions on a 0 to 5 scale.
10.1 Scale definition
• 5 = excellent fidelity
• 4 = good fidelity, minor issues only
• 3 = mixed fidelity, notable drift
• 2 = weak fidelity
• 1 = severe drift
• 0 = failed completely
10.2 Required dimensions
A. Scope Fidelity
Did the output stay within the intended scope?
B. Constraint Fidelity
Did the output preserve exclusions, banned concepts, and hard boundaries?
C. Claim Strength Fidelity
Did the output preserve the strength of claims rather than escalating them?
D. Evidence Grounding Fidelity
Are material claims supported by the prompt?
E. Entity Fidelity
Did the output preserve buyer, user, applicant, stakeholder, and product identity?
F. Causal Fidelity
Did the output preserve why the product matters and how it creates value?
G. Epistemic Fidelity
Did the output preserve uncertainty, assumptions, and unresolved issues?
H. Source-Trace Fidelity
Can major claims in the plan be linked back to source content?
I. Structural Priority Fidelity
Did the output preserve what is core, optional, and deferred?
J. Language Posture Fidelity
Did the language remain appropriately restrained and true to the source posture?
⸻
- Weights
The evaluator must compute a weighted fidelity score using these weights.
• Constraint Fidelity: 20%
• Scope Fidelity: 15%
• Evidence Grounding Fidelity: 15%
• Entity Fidelity: 10%
• Causal Fidelity: 10%
• Epistemic Fidelity: 10%
• Structural Priority Fidelity: 8%
• Claim Strength Fidelity: 5%
• Source-Trace Fidelity: 4%
• Language Posture Fidelity: 3%
Total: 100%
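A minimal computation sketch, assuming dimension scores on the 0 to 5 scale from 10.1; the weights are exactly those listed above and must sum to 1.0:

```python
WEIGHTS = {
    "constraint_fidelity": 0.20,
    "scope_fidelity": 0.15,
    "evidence_grounding_fidelity": 0.15,
    "entity_fidelity": 0.10,
    "causal_fidelity": 0.10,
    "epistemic_fidelity": 0.10,
    "structural_priority_fidelity": 0.08,
    "claim_strength_fidelity": 0.05,
    "source_trace_fidelity": 0.04,
    "language_posture_fidelity": 0.03,
}

def weighted_fidelity_score(scores: dict[str, float]) -> float:
    """Weighted mean of the ten dimension scores (result stays on 0-5)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```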
⸻
- Automatic Failure Conditions
Regardless of weighted score, the plan must be marked FAIL if any of the following are true.
11.1 Explicit exclusion violation
A banned concept, forbidden framing, or explicit non-goal is materially reintroduced.
11.2 Customer identity drift
The target customer or buyer materially changes.
11.3 Business model drift
The product’s commercial model, product category, or GTM is materially changed without source basis.
11.4 Regulatory posture drift
The generated plan materially misstates or overstates the legal or regulatory posture.
11.5 Repeated unsupported quantitative claims
The plan introduces unsupported numerical claims in multiple important sections.
Threshold:
• 3 or more important unsupported numeric claims = automatic fail
11.6 Optional-to-core promotion in central product definition
A feature marked optional, deferred, or layered becomes core in the main product narrative.
11.7 Uncertainty erasure in core assumptions
The source identifies major uncertainty, but the generated plan treats it as settled in core reasoning.
11.8 Unsupported invention in critical areas
Unsupported tools, teams, locations, legal assumptions, or implementation requirements appear in critical sections and materially shape the plan.
11.9 Loss of traceability
The generated plan makes critical claims that cannot be linked to source content or clearly flagged assumptions.
⸻
- Pass / Borderline / Fail Thresholds
If no automatic-fail condition is triggered, use these thresholds.
PASS
All of the following:
• weighted fidelity score >= 4.2 / 5
• no dimension below 3
• no severity 4 incidents
• at most 2 severity 3 incidents
• unsupported important claim count <= 3
• confidence inflation count <= 2
BORDERLINE
Any of the following:
• weighted fidelity score between 3.4 and 4.19
• one dimension scored 2
• 3 or 4 severity 3 incidents
• unsupported important claim count between 4 and 7
• confidence inflation count between 3 and 5
A borderline plan requires revision and re-evaluation. It is not approved for final use.
FAIL
Any of the following:
• automatic-fail condition triggered
• weighted fidelity score < 3.4
• any dimension scored 0 or 1
• more than 4 severity 3 incidents
• any severity 4 incident
• unsupported important claim count > 7
• confidence inflation count > 5
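A sketch of the threshold logic, assuming the counts and dimension scores defined earlier; automatic-fail conditions are checked first and cannot be outweighed by the score:

```python
def decide(weighted_score: float,
           dimension_scores: dict[str, int],
           counts: dict[str, int],
           auto_fail_triggered: bool) -> str:
    """Apply the PASS / BORDERLINE / FAIL thresholds in order of severity."""
    if auto_fail_triggered:
        return "FAIL"
    low = min(dimension_scores.values())
    if (weighted_score < 3.4 or low <= 1
            or counts["severity_4_count"] > 0
            or counts["severity_3_count"] > 4
            or counts["unsupported_important_claim_count"] > 7
            or counts["confidence_inflation_count"] > 5):
        return "FAIL"
    if (weighted_score >= 4.2 and low >= 3
            and counts["severity_3_count"] <= 2
            and counts["unsupported_important_claim_count"] <= 3
            and counts["confidence_inflation_count"] <= 2):
        return "PASS"
    return "BORDERLINE"
```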
⸻
- Counting Rules
The evaluator must track these counts.
13.1 Unsupported Important Claim Count
Count all unsupported claims tagged critical or important.
13.2 Unsupported Numeric Claim Count
Count all unsupported numerical values, percentages, estimates, timing claims, budget claims, ROI claims, market share claims, staffing counts, or quantified performance claims.
13.3 Constraint Violation Count
Count all places where the plan weakens, ignores, or bypasses an explicit prompt constraint.
13.4 Confidence Inflation Count
Count each time:
• tentative becomes assertive
• assumption becomes fact
• exploratory becomes settled
• “may/could/conditional” becomes “will/is”
Only count material occurrences, not every wording instance.
13.5 Optional-to-Core Promotion Count
Count each time an optional, deferred, or layered feature becomes part of the central product definition, core GTM, or main value proposition.
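A sketch of count aggregation, assuming the Claim and DriftIncident types sketched earlier. Mapping 13.3 and 13.4 onto drift types is an assumption of this sketch, and the numeric-claim count (13.2) is omitted because it requires inspecting claim text for quantities:

```python
def tally(claims: list[Claim],
          incidents: list[DriftIncident]) -> dict[str, int]:
    """Derive the tracked counts from tagged claims and logged incidents."""
    return {
        "unsupported_important_claim_count": sum(
            1 for c in claims
            if c.support_label == "unsupported"
            and c.importance in ("critical", "important")),
        # assumption: constraint violations are logged as constraint_erosion
        "constraint_violation_count": sum(
            1 for i in incidents if i.drift_type == "constraint_erosion"),
        "confidence_inflation_count": sum(
            1 for i in incidents if i.drift_type == "confidence_inflation"),
        "optional_to_core_promotion_count": sum(
            1 for i in incidents
            if i.drift_type == "optional_to_core_promotion"),
        "severity_3_count": sum(1 for i in incidents if i.severity == 3),
        "severity_4_count": sum(1 for i in incidents if i.severity == 4),
    }
```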
⸻
- Revision Rules
If the output is BORDERLINE or FAIL, the evaluator must provide repair actions.
Each repair action must be concrete and localized.
Bad:
• “Make it more faithful.”
Good:
• “Replace unsupported claim that London office proximity is required for execution with a conditional note that London location may help relationship-building but is not operationally necessary.”
14.1 Allowed repair action types
• delete unsupported claim
• downgrade claim strength
• restore uncertainty
• relabel assumption as speculative
• move feature from core to optional/deferred
• remove reintroduced excluded framing
• restore original customer definition
• restore original business model
• remove invented numbers
• add source-trace note
• split source-provided vs system-derived content
• compress future expansion to deferred roadmap
⸻
- Required Final Output Format
The evaluator must return the result in a structured format that contains all required sections.
15.1 Mandatory JSON-like schema
{ "evaluation_metadata": { "spec_version": "1.0", "evaluation_mode": "strict", "confidence": "high | medium | low" }, "prompt_contract": { "core_intent": "", "primary_problem": "", "product_definition": "", "primary_buyer": "", "primary_user": "", "target_context": "", "core_value_claim": "", "business_model_gtm": "", "implementation_scope": "", "core_features": [], "optional_features": [], "deferred_features": [], "explicit_non_goals": [], "explicit_exclusions": [], "hard_constraints": [], "risk_uncertainty_posture": [], "success_metrics": [], "legal_regulatory_posture": [], "allowed_assumptions": [], "claims_explicitly_avoided": [] }, "claim_map": [ { "claim_id": "C1", "claim_text": "", "importance": "critical | important | secondary", "support_label": "source-stated | source-derived | speculative-but-flagged | unsupported | contradictory", "prompt_reference": "" } ], "dimension_scores": { "scope_fidelity": 0, "constraint_fidelity": 0, "claim_strength_fidelity": 0, "evidence_grounding_fidelity": 0, "entity_fidelity": 0, "causal_fidelity": 0, "epistemic_fidelity": 0, "source_trace_fidelity": 0, "structural_priority_fidelity": 0, "language_posture_fidelity": 0 }, "weighted_fidelity_score": 0.0, "counts": { "unsupported_important_claim_count": 0, "unsupported_numeric_claim_count": 0, "constraint_violation_count": 0, "confidence_inflation_count": 0, "optional_to_core_promotion_count": 0, "severity_3_count": 0, "severity_4_count": 0 }, "drift_incidents": [ { "incident_id": "D1", "drift_type": "", "severity": 0, "plan_section": "", "output_claim": "", "prompt_contract_reference": "", "support_label": "", "explanation": "", "repair_action": "" } ], "automatic_fail_conditions_triggered": [], "decision": { "status": "PASS | BORDERLINE | FAIL", "usable_as_is": false, "requires_revision": true, "rationale": "" }, "revision_actions": [ "" ], "summary": { "preserved_well": [], "major_failures": [], "overall_verdict": "" } }
⸻
- Agent Operating Instructions
The evaluator must follow these behavioral rules.
16.1 Be conservative
Do not give credit for plausibility. Give credit only for support.
16.2 Prefer explicit incompleteness over invented coherence
If the source did not specify something important, the plan should leave it open or label it as an assumption.
16.3 Do not reward strategic polish
A plan does not score higher because it sounds sophisticated.
16.4 Treat invented precision as suspicious
Precise numbers without prompt grounding are almost always drift.
16.5 Distinguish utility from fidelity
A plan may be useful and still drift badly. Fidelity comes first.
16.6 Escalate semantic changes caused by language inflation
If wording inflation changes how strong or broad a claim sounds, treat it as semantic drift, not style only.
⸻
- Heuristics for Common Failure Modes
The evaluator must explicitly look for these.
17.1 Consultant inflation
Watch for:
• ecosystem
• category leader
• transformation
• revolutionize
• market capture
• strategic moat
• substantial market share
These are red flags unless the prompt itself uses them.
17.2 Fabricated concreteness
Watch for invented:
• software tools
• software stack
• exact team composition
• exact office setup
• exact budgets
• exact growth metrics
• exact legal interpretations
• exact operating assumptions
17.3 Confidence laundering
Watch for:
• may -> will
• can help -> drives
• supports -> ensures
• exploratory -> validated
• possible -> definitive
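A minimal illustrative detector for the laundering pairs above, assuming prompt and plan sentences have already been aligned; this is a heuristic aid only, since per 13.4 a materiality judgment still decides whether an occurrence counts:

```python
# Hypothetical helper, not part of the spec: naive substring matching
# will over-trigger (e.g. "may" inside "maybe") and is a screening pass.
LAUNDERING_PAIRS = {
    "may": "will",
    "can help": "drives",
    "supports": "ensures",
    "exploratory": "validated",
    "possible": "definitive",
}

def flags_laundering(prompt_sentence: str, plan_sentence: str) -> bool:
    """True if a weak-form term in the prompt maps to a strong form in the plan."""
    p, q = prompt_sentence.lower(), plan_sentence.lower()
    return any(weak in p and strong in q
               for weak, strong in LAUNDERING_PAIRS.items())
```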
17.4 Future-roadmap takeover
Watch for deferred ambitions dominating present scope.
17.5 Optional-to-core promotion
Watch for optional filters, scoring, analytics, or expansion becoming central product identity.
17.6 Workflow-to-intelligence drift
Watch for a workflow tool being reframed as prediction, recommendation, intelligence, or optimization engine.
17.7 Traceability loss
Watch for generated content that no longer clearly maps to user-provided input.
⸻
- Minimal Acceptability Standard
A generated plan is acceptable only if it satisfies all of the following:
• preserves the prompt’s buyer, product type, and value logic
• preserves hard exclusions and non-goals
• does not introduce critical unsupported invention
• preserves uncertainty where it matters
• does not convert optional layers into core identity
• improves clarity and usability without overstating what is known
• remains defensible when compared line by line against the prompt contract
If it fails any of these, it is not acceptable.
⸻
- Short Decision Template for Agents
When an agent needs to give a concise decision, use this exact structure:
Fidelity verdict
PASS / BORDERLINE / FAIL
Why
One paragraph summarizing whether the plan preserved:
• core intent
• scope
• constraints
• uncertainty
• product identity
Biggest problems
List the top 3 drift issues only.
Required fixes
List the minimum set of changes needed to reach PASS.
⸻
- Final Principle
The evaluator must always ask:
Did the generated plan preserve the source prompt’s commitments, limits, and uncertainty while making the result more usable?
If the answer is no, then the plan failed, even if it sounds smarter, richer, or more complete.
That is the whole point of this spec.