Failure Modes / Reasoning / Error Accumulation

Error Accumulation

Allows small mistakes, approximations, stale assumptions, or unverified intermediate results to compound across a multi-step task until the final output fails. Unlike Reasoning Error, no single inference may be obviously catastrophic. Unlike Tool Context Overload, the drift comes from compounding errors rather than too much intermediate state.

Examples

FAIL

Early approximation compounds

No single step is outrageous — a rounding here, a restated figure there — but thirty steps later the final number is far from the inputs.

User: Work through the quarterly budget from these department figures. Assistant: Step 2: Engineering comes to $1,184k — call it roughly $1,180k. [… steps 3–9 each restate and build on prior subtotals, drifting slightly each time …] Assistant: Final budget: $4,610k. User: The raw figures sum to $4,742k — where did $132k go?
FAIL

Unverified assumption becomes load-bearing

An early guess about the schema is never checked, and every subsequent step is built on it until the final result fails against reality.

Assistant: The users table presumably has a deleted_at column, so I'll filter on that. [… 12 steps of query-building, each conditioned on the assumed column …] Tool: run_query → ERROR: column "deleted_at" does not exist
WARN

Drift visible in restatements

Nothing has broken yet, but a value is mutating as the model restates it from its own output instead of re-reading the source.

User: Work through the quarterly budget from these department figures. Assistant: Step 2: Engineering comes to $1,184k. [… later …] Assistant: Step 6: Adding the engineering total of $1,148k…
PASS

Re-grounding at checkpoints

Intermediate results are periodically re-derived from the source data rather than trusted from the model's own transcript, so drift gets caught mid-chain.

User: Work through the quarterly budget from these department figures. Assistant: Step 6: Before combining, let me re-check subtotals against the original figures… Engineering should be $1,184k — I wrote $1,180k earlier, correcting that before it propagates.

Why it happens

01

Autoregressive generation conditions every step on the model's own prior output. An early mistake becomes trusted context for everything that follows, and no mechanism revisits it once it is in the sequence.

02

Per-step error rates compound multiplicatively. Even high accuracy on individual steps yields low end-to-end success over long chains, and measured error grows with reasoning depth (Dziri et al., 2023, "Faith and Fate: Limits of Transformers on Compositionality").

03

Outcome-based supervision rewards only the final answer. Flawed intermediate steps that happen to reach a correct result go unpenalized, so training never teaches the model to keep intermediate state clean (Lightman et al., 2023, "Let's Verify Step by Step").

04

Models restate earlier intermediate results from their own generated text rather than re-deriving them. Each restatement is another generation step where numbers, names, and qualifiers can mutate slightly.

05

Confidence does not decay across steps. The model asserts step thirty with the same fluency as step one, carrying no running estimate of how much accumulated uncertainty the chain has absorbed.

06

Agent loops surface ground truth only where checkpoints are explicitly designed in. Without intermediate validation against the environment, drift stays invisible until the final output fails.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🪜

Step-level validation

Validate intermediate results against the source data or environment at checkpoints, rather than scoring only the final answer. Drift is invisible end-to-end until it fails; checked per step, the $1,184k that became $1,180k is caught before it propagates.

🔍

Restatement diffing

Diff every restated value, name, or assumption against its earlier occurrence in the chain. A figure that mutates across restatements is the early signature of this mode, detectable mechanically from the transcript before anything downstream breaks.

🧪

Golden-set evals

Maintain multi-step tasks with checkable final answers and chart accuracy against chain length. End-to-end success that falls faster than per-step accuracy predicts is compounding — and identifies how long a chain the system can sustain before requiring a re-grounding checkpoint.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Checkpoint re-grounding

Build periodic returns to ground truth into long chains — re-derive subtotals from the original figures, re-read the schema instead of trusting the assumed column, re-query the environment. Autoregressive generation never revisits its own past output on its own, so the checkpoint has to be a scheduled step, and it's what catches the $1,180k before it propagates.

🗃️

Explicit state tracking

Keep intermediate results in structured external state — a scratchpad file, a variables table, tool-held values — and read them back from there rather than restating them from the transcript. Every prose restatement is another generation step where a figure can mutate; a value fetched from a store cannot drift between uses.

🪜

Task decomposition

Split long chains into segments that are independently validated before the next begins, since per-step error compounds multiplicatively with depth. A verified segment boundary resets accumulated error to zero, converting one thirty-step chain into several short ones the per-step accuracy can actually sustain.