Tool Result Misread

The model misinterprets, ignores, overgeneralizes, or incorrectly transforms the result returned by a tool. Unlike Tool Recovery Failure, the tool may have succeeded; the model mishandles its output. Unlike Source Misrepresentation, the misread source is specifically a tool result.

Examples

FAIL

Expectation overrides the result

The model predicted success while generating the call and reads the failure response as confirming that prediction, even though the status field plainly says "failed".

User: Run the deployment and tell me how it went.
Tool: deploy() → {"status": "failed", "stage": "migration", "error": "lock timeout"}
Assistant: The deployment completed, and everything is live. The migration stage ran as part of the rollout.

FAIL

Empty result confabulated into an answer

The query returned nothing, but fluent continuation wins over reporting an absence, and the gap gets filled with a plausible figure.

User: How many active users did the Berlin region have last week?
Tool: query(region=berlin, metric=active_users) → 0 rows
Assistant: The Berlin region had around 12,000 active users last week, roughly in line with the prior month.

WARN

Salient field read, units dropped

The number is transcribed faithfully but its unit is not, turning 840 milliseconds into 840 seconds: the right field is read, but the reported magnitude is off by a factor of 1,000.

User: What's the p99 latency on the checkout endpoint?
Tool: get_latency(endpoint=checkout, percentile=99) → {"value": 840, "unit": "ms"}
Assistant: p99 latency is 840 seconds, so checkout is severely degraded.

PASS

Result parsed, absence reported

The model reads the structure rather than skimming it — including treating an empty result as a finding, not a gap to fill.

User: How many active users did the Berlin region have last week?
Tool: query(region=berlin, metric=active_users) → 0 rows
Assistant: The query returned no rows for Berlin last week. That's either zero recorded activity or a data pipeline gap. Want me to check whether other regions returned data for the same window?

Why it happens

01

Tool output enters the model as undifferentiated text to condition on. The model does not parse structure the way a program would; it skims JSON, tables, and logs pattern-wise and latches onto salient fields, which fails on dense or deeply nested results.

02

The model forms an expectation of the result before the call returns, because it generated the surrounding reasoning. Output that contradicts that expectation tends to be read as confirming it, since continuing the predicted trajectory is the path of least resistance for next-token prediction.

03

Key values buried in the middle of a long result are retrieved unreliably. The model anchors on the head and tail of the output and misses or misattributes what sits between (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts").

04

Machine-oriented formats such as dense JSON, CSV, and raw logs are rare in pretraining text relative to prose, so the model reads them less reliably, confusing similar keys, misaligning table columns, or dropping units.

05

Empty or partial results invite confabulation. Fluent continuation is rewarded over flagging an absence, so the model fills the gap with a plausible interpretation instead of reporting that the tool returned nothing useful.

06

Function-calling training and evaluation focus on producing a correct call. Whether the model interprets the response correctly is rarely measured, so misreads survive both fine-tuning and benchmarks.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Entailment checking

Check that every claim the assistant makes about a tool result is entailed by the raw output. "The deployment completed" against a status of "failed" is a direct contradiction an NLI model catches mechanically, and a stated figure for a query that returned zero rows is unsupported by construction.
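As a rough illustration of the check described above, the sketch below swaps the NLI model for two deterministic rules, which is a much narrower technique. The function name `find_contradictions`, the keyword list, and the `status` and `rows` keys are assumptions made for this example, not part of any real tool API.

```python
import re

# Hypothetical keyword list: phrases that assert success in a reply.
SUCCESS_WORDS = ("completed", "succeeded", "is live", "successful")


def find_contradictions(tool_result: dict, reply: str) -> list:
    """Return readable contradictions between a tool result and a reply.

    A rule-based stand-in for an NLI model. It covers only two cases:
    a failed status and an empty row set (assumed key: `rows` as a count).
    """
    problems = []
    text = reply.lower()

    # Case 1: the result says "failed" but the reply asserts success.
    if tool_result.get("status") == "failed":
        if any(word in text for word in SUCCESS_WORDS):
            problems.append("reply claims success but status is 'failed'")

    # Case 2: the query returned zero rows but the reply states a figure.
    if tool_result.get("rows") == 0:
        numbers = [n.rstrip(",") for n in re.findall(r"\d[\d,]*", reply)]
        figures = [n for n in numbers if n != "0"]
        if figures:
            problems.append(
                f"reply states {figures[0]} but the query returned 0 rows"
            )

    return problems
```

Keyword rules like these miss paraphrases that an entailment model would catch; they are only meant to show where the comparison against the raw output happens.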

⚖️

LLM-as-judge evaluation

Give the judge the raw tool output next to the assistant's reading and have it verify the load-bearing fields — status values, units, and whether an empty result was reported as an absence or silently filled in with a plausible number.

🧪

Golden-set evals

Plant adversarial results in scripted tool responses — failure statuses after plausible-looking calls, empty result sets, values whose unit field changes the meaning — and score whether the reading survives. Misreads concentrate exactly where the result contradicts the expected trajectory.
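One possible shape for such a golden set is sketched below, reusing the three examples from this entry. `GoldenCase`, `CASES`, `score`, and the `read_result` callable (a stand-in for the model under test) are all hypothetical names, and the substring scoring is deliberately crude; a real harness would more likely score with a judge or an entailment check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class GoldenCase:
    name: str
    tool_output: str                    # scripted response handed to the model
    must_contain: Tuple[str, ...]       # phrases a faithful reading includes
    must_not_contain: Tuple[str, ...]   # phrases that signal a misread


# Adversarial cases mirroring the examples in this entry (illustrative only).
CASES = [
    GoldenCase("failed status", '{"status": "failed", "stage": "migration"}',
               ("failed",), ("completed", "is live")),
    GoldenCase("empty result", "0 rows", ("no rows",), ("12,000",)),
    GoldenCase("unit field", '{"value": 840, "unit": "ms"}',
               ("840 ms",), ("840 seconds",)),
]


def score(read_result: Callable[[str], str]) -> Dict[str, bool]:
    """Run each scripted result through `read_result` and mark pass/fail."""
    outcome = {}
    for case in CASES:
        reply = read_result(case.tool_output).lower()
        has_required = all(p in reply for p in case.must_contain)
        has_forbidden = any(p in reply for p in case.must_not_contain)
        outcome[case.name] = has_required and not has_forbidden
    return outcome
```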

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🧾

Result normalization

Pre-parse tool output in the scaffold and lift the load-bearing fields (status, value with its unit attached, row count) into a compact labeled block before the model reads it. Because the model skims structure pattern-wise, handing it "840 ms" as one adjacent string, instead of a value field and a unit field three keys apart, removes the raw material for the misread. Label empty results explicitly as a returned absence, not missing output.
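A minimal sketch of such a normalizer follows, assuming the tool returns a JSON object. The key names (`status`, `value`, `unit`, `rows`), the label format, and the function name `normalize_result` are illustrative assumptions, not a standard.

```python
import json

# Assumed key names for the load-bearing fields; adjust per tool.
KNOWN_KEYS = ("status", "value", "unit", "rows")


def normalize_result(raw: str) -> str:
    """Lift load-bearing fields from raw tool output into a labeled block.

    Assumes a JSON object in which `rows`, when present, is a list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Leave unparseable output untouched rather than guessing at it.
        return f"RAW (unparsed): {raw}"
    if not isinstance(data, dict):
        return f"RAW (unparsed): {raw}"

    lines = []
    if "status" in data:
        lines.append(f"STATUS: {data['status']}")
    if "value" in data:
        # Attach the unit to the value so the two cannot be read apart.
        unit = data.get("unit", "UNIT MISSING")
        lines.append(f"VALUE: {data['value']} {unit}")
    if "rows" in data:
        count = len(data["rows"])
        if count == 0:
            lines.append("ROWS: 0 (the tool ran and returned an empty result)")
        else:
            lines.append(f"ROWS: {count}")

    # Pass remaining fields through so nothing is silently dropped.
    rest = {k: v for k, v in data.items() if k not in KNOWN_KEYS}
    if rest:
        lines.append(f"OTHER: {json.dumps(rest)}")
    return "\n".join(lines) if lines else f"RAW (no known fields): {raw}"
```

With these assumptions, `normalize_result('{"value": 840, "unit": "ms"}')` returns `VALUE: 840 ms`, so the unit reaches the model attached to the number.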

📝

Instruction constraints

Require the model to quote the status field and units before interpreting any result, and to treat an empty result as a finding to report — zero rows is an answer about Berlin, not a gap to fill with a plausible 12,000. The expectation formed while generating the call is what overrides the actual output; forced transcription makes the contradiction visible before the interpretation commits.
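The transcription requirement can also be verified after the fact. The sketch below pairs an example instruction with a simple check that the reply quotes the fields verbatim; the instruction wording, the function name `missing_transcriptions`, and the `status`, `value`, and `unit` keys are assumptions for illustration.

```python
# Hypothetical instruction text; the exact wording is an assumption.
INSTRUCTION = (
    "Before interpreting a tool result, quote its status field and any "
    "value together with its unit exactly as returned. If the result is "
    "empty, say so and do not estimate a figure."
)


def missing_transcriptions(tool_result: dict, reply: str) -> list:
    """List load-bearing fields that the reply failed to quote verbatim."""
    missing = []
    if "status" in tool_result and str(tool_result["status"]) not in reply:
        missing.append("status")
    if "value" in tool_result and "unit" in tool_result:
        # The value and unit must appear together, e.g. "840 ms".
        quoted = f"{tool_result['value']} {tool_result['unit']}"
        if quoted not in reply:
            missing.append("value with unit")
    return missing
```

A non-empty return value means the reply interpreted the result without first transcribing it, which is the point at which a scaffold could reject the reply or ask again.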

🚧

Entailment gating

Before a response built on tool results ships, check that each claim about a result is entailed by the raw output, and regenerate on contradiction with the discrepancy named. "The deployment completed" against a failed status never reaches the user, and a figure asserted over zero returned rows is unsupported by construction.
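The gate itself can be a small retry loop, sketched here under the assumption that the scaffold supplies two callables: `generate`, which produces a reply from a prompt, and `check`, which returns a list of named discrepancies. Both names, and `gated_reply` itself, are hypothetical.

```python
from typing import Callable, List, Optional


def gated_reply(
    generate: Callable[[str], str],
    check: Callable[[str], List[str]],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first reply that passes `check`, or None if none does."""
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = generate(current_prompt)
        problems = check(reply)
        if not problems:
            return reply
        # Name the discrepancy so the retry is steered, not just re-rolled.
        current_prompt = (
            f"{prompt}\n\nYour previous answer contradicted the tool result: "
            + "; ".join(problems)
        )
    # No reply passed; the caller decides how to fail closed.
    return None
```

Returning `None` instead of the last failing reply keeps a contradicted claim from reaching the user when every attempt fails.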