Overthinking

Spends far more reasoning than the problem warrants — long deliberation on trivial questions, redundant re-derivations, or second-guessing that talks the model out of a correct answer. Unlike Verbosity Failure, the excess is in the reasoning process and its token and latency cost, not the final answer's length. Unlike Unbounded Consumption, the spend terminates on its own; it is disproportionate rather than unlimited, and the harm can include a worse answer, not just a bigger bill.

Examples

FAIL

Trivial question, massive deliberation

The answer is right, but it arrives after thousands of reasoning tokens and many seconds of latency that a one-line computation never needed.

User: What's 15% of 80?
[… model reasons for 2,400 tokens: computes 0.15 × 80, re-derives it via 10% + 5%, cross-checks by computing 80 × 3/20, then reverifies each method …]
Assistant: 12.

FAIL

Second-guessed out of the right answer

The first conclusion was correct, but continued deliberation surfaces a plausible alternative and the model talks itself into it.

User: Which config change caused the latency spike: A, B, or C?
[… reasoning trace: "B aligns with the timeline exactly… but wait, let me reconsider C…" 1,800 tokens later …]
Assistant: It was C, the connection-pool change.
[… B was correct, and the trace had established it before the reconsideration began …]

WARN

Simulation instead of action

The agent could settle the question with one cheap tool call, but spends its turn reasoning about what the call would probably return.

User: Is the staging deploy finished?
Assistant: Based on the deploy starting around 14:02 and typical build times of 6–9 minutes, accounting for the cache being warm, it has most likely completed by now, though the migration step can add time…
[… a one-line status check tool was available the entire time …]

PASS

Effort matched to difficulty

Trivial parts get answered directly, and the deliberation budget is spent where the problem actually lives.

User: What's 15% of 80, and which of these three configs caused the latency spike?
Assistant: 15% of 80 is 12. The config question needs actual care: comparing each change's timestamp against the spike onset… Change B landed at 14:01, ninety seconds before the spike, and it's the only one touching the request path. It's B.

Why it happens

01

Reasoning training rewards long chains. Test-time scaling teaches that more thinking improves hard-problem success, and the learned policy applies extended deliberation indiscriminately, including to problems it had already solved correctly in the first hundred tokens (Chen et al., 2024, "Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs").

02

Allocating a thinking budget requires estimating difficulty before solving, which is itself the judgment the thinking was supposed to produce. With no reliable difficulty signal, the trained default is to deliberate.

03

Extended reasoning can make answers worse, not just slower. Models second-guess correct intermediate conclusions, drift to alternative framings, and in agentic settings substitute simulating outcomes for taking the one cheap action that would settle the question (Cuadron et al., 2025, "The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks").

04

Reasoning traces are often hidden or unread, so the cost is invisible in the feedback loop. Raters and evals score the final answer, and nothing penalizes the 2,400 tokens it took to get there.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

Reasoning budget monitoring

Track reasoning tokens per request against task class and outcome, and flag the distribution's tail, especially spend on items a no-reasoning baseline already answers correctly. The 2,400-token percentage calculation is caught by metering alone: the answer was correct, so no quality judgment would have flagged it.
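One way such metering might look, as a minimal sketch rather than a reference implementation. The `RequestLog` fields, the 95th-percentile cutoff, and the 200-token `baseline_cap` are all assumptions chosen for illustration; real thresholds would come from the system's own traffic.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestLog:
    task_class: str         # hypothetical label, e.g. "arithmetic"
    reasoning_tokens: int   # reasoning tokens metered for this request
    baseline_correct: bool  # did a no-reasoning baseline answer it correctly?


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def flag_overspend(logs, tail_q=95, baseline_cap=200):
    """Return requests in their class's tail, or over the cap on baseline-solvable items."""
    by_class = defaultdict(list)
    for log in logs:
        by_class[log.task_class].append(log.reasoning_tokens)
    cutoffs = {c: percentile(sorted(v), tail_q) for c, v in by_class.items()}

    flagged = []
    for log in logs:
        in_tail = log.reasoning_tokens > cutoffs[log.task_class]
        # Spend on an item the baseline already gets right is the clearest waste.
        wasted = log.baseline_correct and log.reasoning_tokens > baseline_cap
        if in_tail or wasted:
            flagged.append(log)
    return flagged
```

The check never inspects answer quality. It only compares spend against the class distribution and against what a no-reasoning baseline needed.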

Accuracy-versus-length auditing

On a labeled set, chart correctness against reasoning length. Accuracy that falls as traces grow is second-guessing measured directly — and diffing traces against final answers finds the cases where a correct intermediate conclusion was deliberated away.
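A rough sketch of both halves of this audit, under stated assumptions: records arrive as `(reasoning_tokens, is_correct)` pairs, the 500-token bucket width is arbitrary, and intermediate conclusions have already been extracted from each trace by some upstream step that is not shown here.

```python
from collections import defaultdict


def accuracy_by_length(records, bucket_size=500):
    """Map each reasoning-length bucket (by its start) to accuracy within it."""
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [correct, total]
    for tokens, is_correct in records:
        bucket = (tokens // bucket_size) * bucket_size
        totals[bucket][0] += int(is_correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


def accuracy_declines(curve):
    """True if the longest-trace bucket is less accurate than the shortest."""
    values = list(curve.values())
    return len(values) >= 2 and values[-1] < values[0]


def deliberated_away(intermediate_answers, final_answer, gold):
    """True if the trace reached the gold answer but the final answer abandoned it."""
    return gold in intermediate_answers and final_answer != gold
```

For the latency-spike example above, `deliberated_away(["B", "C"], "C", "B")` would flag the trace: B appeared mid-reasoning and was then replaced.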

Golden-set evals

Include deliberately trivial items alongside hard ones and score budget proportionality, not just correctness — tokens-to-answer on the easy half, accuracy on the hard half. A system tuned only on hard-problem accuracy will pass while burning its budget on the items that needed none.
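A proportionality score along these lines could be sketched as follows. The `"easy"`/`"hard"` labels, the `GoldenResult` shape, and the 100-token easy budget are illustrative assumptions, not values taken from any particular eval harness.

```python
from dataclasses import dataclass


@dataclass
class GoldenResult:
    difficulty: str        # "easy" or "hard" (hypothetical labels)
    correct: bool
    reasoning_tokens: int


def score_golden_set(results, easy_token_budget=100):
    """Score tokens-to-answer on the easy half and accuracy on the hard half."""
    easy = [r for r in results if r.difficulty == "easy"]
    hard = [r for r in results if r.difficulty == "hard"]
    if not easy or not hard:
        raise ValueError("golden set needs both easy and hard items")

    mean_easy_tokens = sum(r.reasoning_tokens for r in easy) / len(easy)
    hard_accuracy = sum(r.correct for r in hard) / len(hard)
    return {
        "mean_easy_tokens": mean_easy_tokens,
        "easy_within_budget": mean_easy_tokens <= easy_token_budget,
        "hard_accuracy": hard_accuracy,
    }
```

Reporting the two numbers separately keeps a strong hard-problem score from masking overspend on the trivial items.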

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Difficulty-based routing

Route requests by estimated difficulty — a cheap or no-reasoning path for trivial queries, extended thinking reserved for problems that earn it — using the reasoning-effort controls most providers now expose. The model won't self-allocate; the router has to.
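A toy router might be sketched like this. Everything in it is an assumption made for illustration: the regex and keyword hints stand in for a real difficulty estimator (more plausibly a trained classifier or a cheap model call), and the effort labels `"none"`, `"low"`, and `"high"` are placeholders to be mapped onto whatever reasoning-effort setting the provider in use exposes.

```python
import re

# Hypothetical pattern for one-step arithmetic such as "What's 15% of 80?".
TRIVIAL_ARITHMETIC = re.compile(
    r"^\s*what(?:'s| is)\s+\d+(?:\.\d+)?%?\s+(?:of|plus|minus|times)"
    r"\s+\d+(?:\.\d+)?\s*\?*\s*$",
    re.IGNORECASE,
)
# Hypothetical keywords suggesting a diagnostic question that earns deliberation.
DIAGNOSTIC_HINTS = ("caused", "root cause", "why did", "debug")

# Placeholder effort labels, not any provider's actual parameter values.
EFFORT_BY_DIFFICULTY = {"trivial": "none", "moderate": "low", "hard": "high"}


def estimate_difficulty(query: str) -> str:
    """Cheap heuristic estimate made before any reasoning is spent."""
    if TRIVIAL_ARITHMETIC.match(query):
        return "trivial"
    lowered = query.lower()
    if any(hint in lowered for hint in DIAGNOSTIC_HINTS):
        return "hard"
    return "moderate"


def route(query: str) -> str:
    """Pick the reasoning-effort label for a request."""
    return EFFORT_BY_DIFFICULTY[estimate_difficulty(query)]
```

The point of the sketch is where the decision is made: the estimate runs outside the model and before the request is sent, so the allocation does not depend on the model's own judgment of difficulty.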

Thinking budget caps

Set per-request reasoning-token budgets matched to task class, forcing commitment when the budget is spent. Capping also bounds the second-guessing window — the reconsideration that talked the model out of B needed 1,800 tokens of room to happen in.
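The capping logic can be illustrated with a small sketch. The per-class budgets are made-up numbers, whitespace splitting is a stand-in for a real tokenizer, and reasoning is modeled as a list of already-produced steps. In practice the cap would more likely be enforced through a provider-side budget setting than by post-hoc truncation.

```python
# Hypothetical per-class budgets, in reasoning tokens.
BUDGETS = {"arithmetic": 64, "diagnosis": 1024}
DEFAULT_BUDGET = 256


def budget_for(task_class: str) -> int:
    """Look up the reasoning-token budget for a task class."""
    return BUDGETS.get(task_class, DEFAULT_BUDGET)


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would replace this."""
    return len(text.split())


def reason_within_budget(steps, task_class):
    """Keep reasoning steps until the next one would exceed the budget.

    Returns (kept_steps, tokens_used, truncated). When truncated is True,
    the caller should force the model to commit to its current conclusion.
    """
    budget = budget_for(task_class)
    kept, used = [], 0
    for step in steps:
        cost = count_tokens(step)
        if used + cost > budget:
            return kept, used, True
        kept.append(step)
        used += cost
    return kept, used, False
```

With a cap in place, a reconsideration step that arrives after the budget is spent never runs, which is how the cap bounds the second-guessing window as well as the cost.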

Instruction constraints

Instruct the model to answer directly when confident, to stop at the first verified conclusion rather than re-deriving it by other methods, and — in agentic settings — to prefer one cheap observation over reasoning about what the observation would show. "Check the deploy status" beats a paragraph estimating it.