Extrinsic Hallucination

The model adds information that cannot be verified from the provided source material, content the source neither supports nor contradicts, while presenting the answer as source-grounded. Unlike Context-Conflicting Hallucination, the added content does not contradict the source; it is simply absent from it. Unlike Specificity Hallucination, the defining issue is lack of source support rather than fabricated precision.

Examples

FAIL

Ungrounded detail presented as from the source

The answer is framed as a reading of the report, but the cause it states appears nowhere in it. Nothing contradicts the source — the claim is simply not in it.

[… uploaded incident report describes a 40-minute outage; it never states a cause …]
User: Summarize what this report says about the outage.
Assistant: The report explains that a 40-minute outage occurred, caused by a misconfigured load balancer during a routine deploy.

FAIL

Gap filled instead of flagged

Asked about something the source does not cover, the model supplies a plausible answer rather than saying the document is silent.

[… uploaded report does not mention affected customers …]
User: According to the report, how many customers were affected?
Assistant: The report indicates that roughly 1,200 customers experienced degraded service during the window.

WARN

Unmarked background knowledge

Mixes accurate general knowledge into a source-grounded answer without marking where the document ends and the model's own knowledge begins.

[… uploaded report describes the outage but never mentions load balancers in general …]
User: Summarize what this report says about the outage.
Assistant: The report describes a 40-minute outage. Load balancer misconfigurations like this are among the most common causes of such incidents.

PASS

Source boundary kept visible

Answers strictly from the document and labels anything beyond it as the model's own addition.

User: Summarize what this report says about the outage.
Assistant: The report covers a 40-minute outage on March 3 and the recovery steps taken. It does not state a root cause or a customer-impact count. If useful, I can suggest likely causes, but that would be my inference, not the report.

Why it happens

01

Pretraining stores vast world knowledge in the model's weights, and generation draws on it freely. Nothing architecturally separates "what the source said" from "what the model knows," so background knowledge leaks into source-grounded tasks (Maynez et al., 2020, "On Faithfulness and Factuality in Abstractive Summarization").

02

Training and evaluation reward confident answers over abstentions. A model gains nothing by saying the source does not cover something, so it fills gaps with plausible content instead (Kalai et al., 2025, "Why Language Models Hallucinate").

03

Human raters and preference models tend to score complete-sounding, detailed answers above ones that stay strictly within the source, so RLHF tuning pushes models toward elaboration beyond the provided material.

04

Summarization and QA training corpora themselves contain reference outputs with information absent from the source documents, teaching models that adding unstated facts is part of the task (Maynez et al., 2020, "On Faithfulness and Factuality in Abstractive Summarization").

05

Extrinsic additions are among the hardest faithfulness errors to catch automatically. They contradict nothing, so consistency checkers and entailment filters that flag only contradictions pass them, and they survive into deployed systems (Ji et al., 2023, "Survey of Hallucination in Natural Language Generation").

06

Prompts often blur the boundary, asking the model to "use the document" rather than to answer only from it. Without an explicit grounding constraint, mixing in parametric knowledge is the default behavior.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Entailment checking

Run an NLI model between each output claim and the source, and treat the neutral label as the signal — extrinsic content is neither entailed nor contradicted. Pipelines that only alert on contradiction pass this mode by construction, so the neutral bucket is the one to inspect.
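A minimal sketch of the neutral-bucket check, assuming the answer has already been split into claims and that some NLI model sits behind a hypothetical `nli(premise, hypothesis)` callable returning `entailment`, `neutral`, or `contradiction`. The `toy_nli` function is a lookup-table stand-in for illustration only, not a real model, and all names here are assumptions.

```python
from typing import Callable, Dict, List

# Assumed interface (not a real library API): wraps any NLI model and
# returns "entailment", "neutral", or "contradiction".
NliFn = Callable[[str, str], str]

LABELS = ("entailment", "neutral", "contradiction")


def bucket_claims(source: str, claims: List[str], nli: NliFn) -> Dict[str, List[str]]:
    """Group each output claim under the NLI label it gets against the source."""
    buckets: Dict[str, List[str]] = {label: [] for label in LABELS}
    for claim in claims:
        label = nli(source, claim)
        if label not in buckets:
            raise ValueError(f"unexpected NLI label: {label!r}")
        buckets[label].append(claim)
    return buckets


def toy_nli(premise: str, hypothesis: str) -> str:
    """Illustrative stand-in: canned labels, neutral for anything unknown."""
    canned = {
        "A 40-minute outage occurred.": "entailment",
        "The outage lasted three hours.": "contradiction",
    }
    return canned.get(hypothesis, "neutral")


source = "Incident report: service was unavailable for 40 minutes."
claims = [
    "A 40-minute outage occurred.",
    "The outage was caused by a misconfigured load balancer.",
]
buckets = bucket_claims(source, claims, toy_nli)
# The neutral bucket holds the extrinsic candidates to inspect.
print(buckets["neutral"])
```

Keeping all three buckets, rather than returning a single pass/fail flag, makes it easy to review the neutral claims separately from outright contradictions.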

⚖️

LLM-as-judge evaluation

Run a judge that sees only the source document and the answer, is instructed not to rely on world knowledge, and must identify for each claim the passage in the source it comes from. Plausible background facts for which the judge cannot point to a passage are exactly this failure.
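One way this could look in practice, sketched under assumptions: the judge is asked to quote a supporting passage per claim, and its quotes are then checked verbatim against the source instead of being trusted. The prompt wording, the `Verdict` shape, and the function names are hypothetical; the call to the judge model itself is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical judge instructions; adjust the wording to the judge model in use.
JUDGE_INSTRUCTIONS = (
    "You will see a SOURCE and an ANSWER. Use only the SOURCE; do not rely on "
    "outside knowledge. For each claim in the ANSWER, quote the exact SOURCE "
    "passage that supports it, or write NONE if no passage does."
)


@dataclass
class Verdict:
    claim: str
    quote: Optional[str]  # passage the judge quoted, or None for "NONE"


def build_judge_prompt(source: str, answer: str) -> str:
    """Assemble a prompt that shows the judge only the source and the answer."""
    return f"{JUDGE_INSTRUCTIONS}\n\nSOURCE:\n{source}\n\nANSWER:\n{answer}"


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so quote matching is not brittle."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(source: str, verdicts: List[Verdict]) -> List[str]:
    """Return claims whose quoted passage is missing or not actually in the source."""
    haystack = _normalize(source)
    flagged = []
    for verdict in verdicts:
        quote = _normalize(verdict.quote or "")
        if not quote or quote not in haystack:
            flagged.append(verdict.claim)
    return flagged


source = "The outage lasted 40 minutes. Recovery steps were completed."
verdicts = [
    Verdict("The outage lasted 40 minutes.", "The outage lasted 40 minutes."),
    Verdict("A load balancer was misconfigured.", None),
    Verdict("About 1,200 customers were affected.", "1,200 customers were affected"),
]
print(unsupported_claims(source, verdicts))
```

Checking the quote against the source guards against a judge that invents a supporting passage, which would otherwise reproduce the same failure one level up.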

🧪

Golden-set evals

Maintain source documents with known, deliberate gaps — an incident report that never states a cause, a study that never gives a sample size — and regression-test whether summaries fill those gaps from parametric knowledge while staying framed as a reading of the source.
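A rough sketch of such a regression test, assuming each golden case pairs a source with regex patterns that would indicate the known gap was filled. `GoldenCase`, `failing_cases`, and the two fake models are hypothetical names for illustration; a real harness would call the system under test in place of the fakes.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    name: str
    source: str
    prompt: str
    gap_patterns: List[str]  # regexes that match an attempt to fill the known gap


def failing_cases(cases: List[GoldenCase], summarize: Callable[[str, str], str]) -> List[str]:
    """Return the names of cases where the output fills a deliberate gap."""
    failures = []
    for case in cases:
        output = summarize(case.source, case.prompt)
        if any(re.search(p, output, re.IGNORECASE) for p in case.gap_patterns):
            failures.append(case.name)
    return failures


cases = [
    GoldenCase(
        name="incident-report-no-cause",
        source="A 40-minute outage occurred. Recovery steps were taken.",
        prompt="Summarize what this report says about the outage.",
        gap_patterns=[r"\bcaused by\b", r"\bdue to\b"],
    )
]


def gap_filling_model(source: str, prompt: str) -> str:
    # Fake model that invents a cause the source never states.
    return "The report explains that a 40-minute outage occurred, caused by a misconfigured load balancer."


def grounded_model(source: str, prompt: str) -> str:
    # Fake model that flags the gap instead of filling it.
    return "The report covers a 40-minute outage. It does not state a root cause."


print(failing_cases(cases, gap_filling_model))
print(failing_cases(cases, grounded_model))
```

Regex patterns are a crude detector and will miss paraphrases; they are used here only to keep the sketch self-contained, and a claim-level entailment or judge check could take their place.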

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📝

Instruction constraints

Say "answer only from the document," not "use the document," and make "the report doesn't state this" an expected answer shape. Require anything beyond the source to be labeled as the model's own inference, so the source boundary stays visible instead of blurring by default.
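As an illustration, a prompt template along these lines could encode the constraint. The exact wording, the `<document>` delimiters, and the function name are assumptions, not a prescribed format.

```python
# Hypothetical template: the phrasing and the <document> delimiters are
# illustrative choices, not a required format.
GROUNDED_TEMPLATE = """Answer only from the document below.
If the document does not state something, say so explicitly, for example:
"The report doesn't state this."
If you add anything beyond the document, label it as your own inference.

<document>
{document}
</document>

Question: {question}"""


def build_grounded_prompt(document: str, question: str) -> str:
    """Fill the template with the source document and the user's question."""
    return GROUNDED_TEMPLATE.format(document=document, question=question)


prompt = build_grounded_prompt(
    document="A 40-minute outage occurred. Recovery steps were taken.",
    question="According to the report, how many customers were affected?",
)
print(prompt)
```

Spelling out the abstention sentence in the prompt gives the model a concrete answer shape to fall back on when the document is silent.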

Self-check pass

Before answering, have the model point each claim in its draft at the passage it comes from and strip or relabel anything it cannot trace. Plausible background facts are exactly what this catches — they feel source-grounded but have no passage to point to.
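The strip-or-relabel step might be sketched as follows, assuming the model's draft has already been split into claims, each carrying the passage it pointed to (or none). `DraftClaim`, `apply_self_check`, and the label text are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label appended to claims that cannot be traced to the source.
INFERENCE_LABEL = "(my inference, not stated in the document)"


@dataclass
class DraftClaim:
    text: str
    passage: Optional[str]  # passage the model pointed to, or None


def apply_self_check(source: str, draft: List[DraftClaim], relabel: bool = True) -> List[str]:
    """Keep traced claims; relabel untraced ones, or strip them if relabel is False."""
    kept = []
    for claim in draft:
        # A claim counts as traced only if its passage appears verbatim in the source.
        traced = bool(claim.passage) and claim.passage in source
        if traced:
            kept.append(claim.text)
        elif relabel:
            kept.append(f"{claim.text} {INFERENCE_LABEL}")
    return kept


source = "The outage lasted 40 minutes. Recovery steps were completed."
draft = [
    DraftClaim("The outage lasted 40 minutes.", "The outage lasted 40 minutes."),
    DraftClaim("Load balancer misconfigurations are a common cause of outages.", None),
]
relabeled = apply_self_check(source, draft)
stripped = apply_self_check(source, draft, relabel=False)
print(relabeled)
print(stripped)
```

Whether to relabel or strip is a product decision: relabeling keeps potentially useful background while making the source boundary visible, and stripping keeps the answer strictly within the document.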

🚪

Entailment gating

Gate output on an NLI check between each claim and the source, treating the neutral label as a block, not a pass. Pipelines that only gate on contradiction admit this mode by construction, since extrinsic additions contradict nothing.
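To make the difference concrete, the sketch below compares a gate that passes only entailed claims with one that blocks only contradictions. It assumes a hypothetical `nli(premise, hypothesis)` callable; `stub_nli` is a canned stand-in, not a real model.

```python
from typing import Callable, List, NamedTuple

# Assumed interface (not a real library API): returns "entailment",
# "neutral", or "contradiction" for a (premise, hypothesis) pair.
NliFn = Callable[[str, str], str]


class GateResult(NamedTuple):
    allowed: bool
    blocked_claims: List[str]


def entailment_gate(source: str, claims: List[str], nli: NliFn) -> GateResult:
    """Block any claim that is not entailed, so neutral counts as a block."""
    blocked = [c for c in claims if nli(source, c) != "entailment"]
    return GateResult(allowed=not blocked, blocked_claims=blocked)


def contradiction_only_gate(source: str, claims: List[str], nli: NliFn) -> GateResult:
    """Weaker gate shown for contrast: blocks only explicit contradictions."""
    blocked = [c for c in claims if nli(source, c) == "contradiction"]
    return GateResult(allowed=not blocked, blocked_claims=blocked)


def stub_nli(premise: str, hypothesis: str) -> str:
    """Illustrative stand-in: one canned entailment, neutral otherwise."""
    return "entailment" if hypothesis == "A 40-minute outage occurred." else "neutral"


source = "Incident report: service was unavailable for 40 minutes."
claims = [
    "A 40-minute outage occurred.",
    "Roughly 1,200 customers experienced degraded service.",
]
strict = entailment_gate(source, claims, stub_nli)
lenient = contradiction_only_gate(source, claims, stub_nli)
print(strict)
print(lenient)
```

With the same inputs, the contradiction-only gate allows the answer through while the entailment gate blocks the ungrounded customer count.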