Query Rewrite Failure

A query rewriter reformulates a user's search, retrieval, or tool query in a way that drops intent, adds false constraints, or searches for the wrong concept. Unlike Retrieval Miss, the failure originates in the rewritten query rather than only in retrieval ranking or coverage. Unlike Goal Misinterpretation, the misunderstanding is expressed specifically through the query sent downstream.

Examples

FAIL

Negation dropped in rewrite

The user's "not caused by timeouts" is a low-probability constraint that the rewrite normalizes away — retrieval then fetches exactly the documents the user excluded.

User: Show me the incidents from last quarter that were not caused by timeouts.
[… rewritten query: "timeout incidents last quarter" …]
Assistant: Here are last quarter's incidents: all five were timeout-related.

FAIL

Wrong antecedent resolved

Expanding "the second one" into a standalone query requires picking the right referent from the conversation, and the rewriter picks the wrong list.

User: Compare the two auth approaches you mentioned. Actually, tell me more about the second one.
[… rewriter resolves "the second one" to the second search result from an earlier turn, not the second auth approach; query becomes "session token expiration" …]
Assistant: Session tokens expire after 24 hours by default…

WARN

Codename "corrected" into generic vocabulary

The internal project name looks like a typo to the rewriter, which normalizes it into common words that no longer match the corpus. Retrieval still returns plausible-looking results.

User: What's the rollout status of Project Kestrel?
[… rewritten query: "bird species rollout status" returns generic deployment docs …]
Assistant: Based on the deployment documentation, rollouts proceed in three phases…

PASS

Constraints carried through

The rewrite expands vocabulary for retrievability but keeps the negation explicit, and the issued query is logged so the step is auditable.

User: Show me the incidents from last quarter that were not caused by timeouts.
Tool: search "Q1 incidents -timeout" with cause filter excluding "timeout" (returns 3 results)
Assistant: Three of last quarter's eight incidents were unrelated to timeouts: a certificate expiry, a bad config push, and a disk-full event.

Why it happens

01

Rewriters optimize for what retrieves well, not for what the user meant. A reformulation is judged by whether it surfaces high-scoring documents, so rewrites drift toward retrievable phrasings of a different question (Ma et al., 2023, "Query Rewriting for Retrieval-Augmented Large Language Models").

02

Rewriting is paraphrase, and paraphrase regenerates meaning through the model's priors. Rare constraints — a negation, an exact version, a jurisdiction, a "but not X" — are low-probability tokens that the rewrite drops or normalizes toward the more common form of the query.

03

The rewriter fills ambiguity with assumptions. Short or underspecified user queries get expanded with plausible details the user never stated, and those invented constraints then exclude the documents that actually answer the real question.

04

Conversational rewriting depends on resolving references across turns. Expanding "what about the second one?" into a standalone query requires correctly identifying the antecedent, and a wrong resolution sends retrieval after the wrong entity entirely.

05

The rewrite is an unobserved intermediate. Users see the final answer, not the query that produced it, and many pipelines log neither the rewrite nor its retrieval quality, so systematic rewrite errors are misdiagnosed as retrieval or generation failures.

06

Rewriters are tuned on benchmark query distributions. Domain jargon, internal codenames, and atypical phrasings sit outside that distribution, and the rewriter "corrects" them into generic vocabulary that no longer matches the corpus.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

📏

Constraint preservation checks

Mechanically diff the original query against the rewrite for load-bearing tokens — negations, version numbers, jurisdictions, quoted names, codenames. A rewrite that drops "not", flips an exclusion, or "corrects" Project Kestrel into common words fails the check before retrieval ever runs.
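
One way to sketch such a diff in Python is shown here. It is an illustration under stated assumptions, not a reference implementation: the negation list, the regular expressions, and the "capitalized word means name or codename" heuristic are all hypothetical choices that would need tuning for a real corpus.

```python
import re

# Hypothetical, deliberately small list of negation cues.
NEGATION_WORDS = {"not", "no", "never", "without", "except", "excluding"}


def load_bearing_tokens(query: str) -> set[str]:
    """Collect tokens whose loss would likely change what the query means."""
    tokens: set[str] = set()
    # Quoted phrases are kept verbatim (lowercased for comparison).
    tokens.update(m.lower() for m in re.findall(r'"([^"]+)"', query))
    # Version-like numbers such as 2.4 or v1.10.3.
    tokens.update(m.lower() for m in re.findall(r"\bv?\d+(?:\.\d+)+\b", query))
    words = re.findall(r"[A-Za-z']+", query)
    # Negation cues.
    tokens.update(w.lower() for w in words if w.lower() in NEGATION_WORDS)
    # Capitalized words after the first word: a crude proxy for names and codenames.
    tokens.update(w.lower() for w in words[1:] if w[0].isupper() and len(w) > 1)
    return tokens


def dropped_constraints(original: str, rewrite: str) -> set[str]:
    """Return load-bearing tokens from the original that the rewrite no longer contains."""
    rewrite_lower = rewrite.lower()
    dropped = set()
    for token in load_bearing_tokens(original):
        # Whole-token match, so "not" is not satisfied by "notification".
        pattern = r"(?<!\w)" + re.escape(token) + r"(?!\w)"
        if not re.search(pattern, rewrite_lower):
            dropped.add(token)
    return dropped
```

Run against the examples above, `dropped_constraints` reports `not` for the timeout rewrite and both `project` and `kestrel` for the codename rewrite. A purely lexical diff like this also flags rewrites that keep the negation in another syntax, such as `-timeout`, so a flag is better treated as a prompt for review than as proof of failure.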

⚖️

LLM-as-judge evaluation

Log every rewrite and run a judge that compares it against the user's original request and the conversation, asking whether intent survived: were constraints dropped, were plausible details invented, were references like "the second one" resolved to the wrong antecedent? None of this is possible unless the rewrite is logged in the first place.
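
A minimal sketch of the logging-plus-judge loop might look like the following. Everything named here is an assumption made for illustration: the `RewriteRecord` fields, the prompt wording, the JSON verdict schema, and the `call_judge` callable, which stands in for whatever model client a pipeline actually uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class RewriteRecord:
    """What has to be logged for a rewrite to be judged later (hypothetical schema)."""
    conversation: list[str]   # prior turns, oldest first
    original_query: str
    rewritten_query: str


JUDGE_TEMPLATE = """You are reviewing a query rewrite.

Conversation so far:
{conversation}

User's original request: {original}
Rewritten query sent downstream: {rewrite}

Reply with JSON only, using the keys "dropped_constraints", "invented_details",
and "wrong_antecedent" (each a list of strings) and "intent_preserved" (true or false)."""


def build_judge_prompt(record: RewriteRecord) -> str:
    return JUDGE_TEMPLATE.format(
        conversation="\n".join(record.conversation) or "(no prior turns)",
        original=record.original_query,
        rewrite=record.rewritten_query,
    )


def judge_rewrite(record: RewriteRecord, call_judge: Callable[[str], str]) -> dict:
    """Send the logged rewrite to a judge and keep the verdict next to the record."""
    verdict = json.loads(call_judge(build_judge_prompt(record)))
    return {"record": asdict(record), "verdict": verdict}
```

Passing the judge in as a callable keeps the sketch independent of any particular model API and makes it easy to substitute a stub in tests. A production version would also need to handle judge replies that are not valid JSON.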

🧪

Golden-set evals

Maintain conversational queries that stress rewriting — negated constraints, internal jargon outside the rewriter's training distribution, multi-turn reference resolution — with gold evidence labeled, so a bad rewrite shows up as a retrieval score drop attributable to the rewrite step.
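
The sketch below shows one possible way to attribute a score drop to the rewrite step: score retrieval once with the user's own query and once with the rewrite, against the same gold labels. The case format, the choice of recall@k as the metric, and the `rewrite` and `search` callables are assumptions for illustration only.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def attribute_to_rewrite(
    cases: list[dict],
    rewrite: Callable[[str], str],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> list[dict]:
    """Compare recall with and without the rewrite step for each golden case.

    Each case is assumed to be {"query": str, "gold": set of document ids}.
    """
    report = []
    for case in cases:
        rewritten = rewrite(case["query"])
        baseline = recall_at_k(search(case["query"]), case["gold"], k)
        after = recall_at_k(search(rewritten), case["gold"], k)
        report.append({
            "query": case["query"],
            "rewritten": rewritten,
            "baseline_recall": baseline,
            "rewrite_recall": after,
            # Negative delta: the rewrite lost evidence the raw query would have found.
            "delta": after - baseline,
        })
    return report
```

The raw-query baseline only makes sense for queries that stand alone. For multi-turn cases such as "the second one", a hand-written gold rewrite is probably a more useful baseline than the unresolved user turn.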

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📏

Constraint preservation checks

Run the same load-bearing-token diff as a gate before the query is issued — a rewrite that drops a negation, loses a version number, or "corrects" Project Kestrel gets repaired or replaced with the original instead of silently sent downstream.
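
As a rough illustration, the gate can be a thin wrapper that decides which query is actually issued. This sketch takes the simplest option, falling back to the original query, and does not attempt repair. The `GateResult` type and the negation-only `dropped_negations` checker are hypothetical stand-ins; a fuller token diff could be passed in as `check`.

```python
import re
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    query: str                # the query that should actually be issued
    used_rewrite: bool
    dropped: frozenset[str]   # tokens the rewrite lost, for logging


def dropped_negations(original: str, rewrite: str) -> frozenset[str]:
    """Minimal stand-in checker: negation words present in the original but absent from the rewrite."""
    negations = {"not", "no", "never", "without", "except"}

    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z']+", text.lower()))

    return frozenset((words(original) & negations) - words(rewrite))


def gate_rewrite(
    original: str,
    rewrite: str,
    check: Callable[[str, str], frozenset[str]] = dropped_negations,
) -> GateResult:
    """Issue the rewrite only if the check finds nothing missing."""
    dropped = check(original, rewrite)
    if dropped:
        # Fall back to the user's own wording rather than send a damaged rewrite downstream.
        return GateResult(original, False, dropped)
    return GateResult(rewrite, True, dropped)
```

Returning the dropped tokens alongside the decision means the fallback is itself logged, so the rate at which the gate fires can be tracked over time.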

📝

Instruction constraints

Prompt the rewriter to expand rather than replace — preserve negations, quoted names, codenames, and version numbers verbatim, and treat unfamiliar terms as domain vocabulary to keep, not typos to normalize toward common words.
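
A prompt along these lines could be assembled as follows. The wording of the instructions and the `protected_terms` parameter are illustrative assumptions; the only part taken from the strategy above is the intent to expand rather than replace and to keep unfamiliar terms verbatim.

```python
# Hypothetical instruction text; the exact wording would need testing against a real rewriter.
REWRITER_INSTRUCTIONS = """\
Rewrite the user's query for retrieval.
- Expand the query with synonyms or related terms; do not replace the user's own terms.
- Copy negations ("not", "without", "except"), quoted names, codenames, and version numbers verbatim.
- Treat unfamiliar words as domain vocabulary. Do not correct them toward common words.
- Do not add constraints the user did not state.
Return only the rewritten query."""


def build_rewriter_prompt(user_query: str, protected_terms: list[str]) -> str:
    """Combine the standing instructions with the query and any terms known to be load-bearing."""
    protected = ", ".join(f'"{term}"' for term in protected_terms) or "(none)"
    return (
        f"{REWRITER_INSTRUCTIONS}\n\n"
        f"Terms that must appear unchanged: {protected}\n"
        f"User query: {user_query}\n"
        "Rewritten query:"
    )
```

Listing protected terms explicitly, for example codenames pulled from an internal glossary, gives the rewriter less room to treat them as typos. Instructions alone do not guarantee compliance, so this is best paired with a check on the output.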

🔀

Query ensembling

Issue the original query alongside the rewrite and merge results, so the rewrite can only add recall, never lose what the user's own phrasing would have found — a dropped "not" or a mangled codename no longer controls the entire retrieved set.
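
The merge step is left open above. One common choice is reciprocal rank fusion, sketched here as an assumption rather than a prescribed method. The document ids are made up, and `k = 60` is the conventional default constant, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) per list it appears in, so a document
    found by both the original query and the rewrite rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for the two queries.
original_hits = ["cert-expiry", "bad-config"]
rewrite_hits = ["timeout-1", "cert-expiry"]
merged = reciprocal_rank_fusion([original_hits, rewrite_hits])
```

Every document returned for the original query is still present in `merged`, which is the property the strategy relies on. That guarantee weakens if the merged list is then truncated to a small top-n, so the cutoff is worth choosing with the original query's results in mind.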