Failure Modes / Faithfulness / Context-Conflicting Hallucination

Context-Conflicting Hallucination

States a claim that contradicts information available to the model: the user's explicit input or supplied data, or facts elsewhere in the active context such as prior turns, retrieved text, summaries, or tool outputs. Unlike Extrinsic Hallucination, the claim contradicts available information rather than merely going beyond it unsupported. Unlike Self-Contradiction, the conflict is against provided input or context the model should have used, not only against its own generated statements.

Examples

FAIL

Prior overrides the document

The uploaded document states the current fact, but the model answers from its stronger pretraining prior and contradicts the source it was given.

[… uploaded company profile reads: "Maria Chen was appointed CEO in January 2025." …] User: According to this profile, who is the CEO? Assistant: The CEO is David Okafor, who has led the company since 2018.
FAIL

Contradicting an earlier turn

A fact the user stated early in the session falls out of effective context, and the model regenerates a detail that flips it.

User: For all of this trip planning — I'm vegetarian, please keep that in mind. [… 30 turns later …] Assistant: For your last night, I'd book Asador El Toro — their signature beef tasting menu is the best in the city.
WARN

Silent side-taking

The supplied document and the model's prior disagree, and the model quietly picks the document without surfacing the conflict — right this time, but by luck rather than policy.

[… uploaded profile reads: "Maria Chen was appointed CEO in January 2025." …] User: Who is the CEO? Assistant: The CEO is Maria Chen.
PASS

Conflict surfaced

Follows the provided source and explicitly flags that it differs from what the model would have said from memory.

[… uploaded profile reads: "Maria Chen was appointed CEO in January 2025." …] User: Who is the CEO? Assistant: According to the profile you uploaded, Maria Chen, appointed in January 2025. Note that this is more recent than my background knowledge, which still associates the role with David Okafor — the document is the better source here.

Why it happens

01

When supplied context disagrees with what the model memorized during pretraining, the model often prefers its parametric knowledge, especially when the context contradicts popular or frequently repeated facts (Longpre et al., 2021, "Entity-Based Knowledge Conflicts in Question Answering").

02

Attention over long contexts is uneven. Information placed in the middle of a long prompt is used far less reliably than information at the start or end, so the model can simply fail to weight the passage it ends up contradicting (Liu et al., 2023, "Lost in the Middle").

03

The training objective rewards producing a likely continuation, not consistency with the prompt. A fluent claim that matches strong pretraining priors can outscore a faithful claim that matches the supplied document (Ji et al., 2023, "Survey of Hallucination in Natural Language Generation").

04

In multi-turn or agentic settings, earlier facts get compressed, summarized, or pushed out of the effective context. The model then regenerates the fact from priors instead of from the original statement, and the regeneration can flip details.

05

Instruction tuning teaches models to answer confidently rather than to flag conflicts. When the prompt and the model's prior disagree, surfacing the tension is rarely rewarded, so the model silently picks one side.

06

Evaluation suites mostly score answers against external ground truth, not against the specific context provided, so context-consistency failures go unmeasured and uncorrected during development.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Entailment checking

Run an NLI model between each output claim and the supplied context — the uploaded document, prior turns, tool results. A contradiction label is a direct hit for this mode, and unlike extrinsic additions, these conflicts are exactly what entailment models are best at catching.

⚖️

LLM-as-judge evaluation

Run a judge with the full supplied context in view and ask it to flag any claim the context contradicts. Prompt it specifically for prior-override patterns — answers that swap in a more famous or more typical fact than the one the document states.

🧪

Golden-set evals

Maintain counterfactual documents whose facts deliberately contradict popular pretraining priors — a different CEO, a changed policy, an updated date — and regression-test whether answers come from the document or from the model's memorized version.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📝

Instruction constraints

Instruct the model to treat supplied context as authoritative over its background knowledge and to surface disagreement rather than silently pick a side — "the document says X; my background knowledge says Y; the document is the better source" instead of letting the stronger prior win.

📌

Context pinning

Keep durable facts — user constraints like "I'm vegetarian," key document claims — restated in the system prompt or near the active end of context rather than buried mid-prompt, where uneven attention lets the model regenerate them from priors and flip details.

Self-check pass

Before answering, have the model re-read the supplied context and check its draft against it claim by claim — specifically asking whether any answer swapped in a more famous or more typical fact than the one the document actually states.