Conflicting Source Failure

Fails to detect, compare, qualify, or reconcile retrieved sources that disagree with one another. Unlike Source Misrepresentation, the issue is not a single source being misstated but a failure to handle disagreement across sources. Unlike Calibration Failure, the error occurs in evidence synthesis before the final confidence framing.

Examples

FAIL

Conflict silently resolved

Two retrieved policies disagree, and the model picks one — by passage order, not by which is authoritative — without ever mentioning the other.

User: How long do employees have to submit expense reports?
[… passage A (2023 handbook): "within 30 days"; passage B (2025 policy update): "within 14 days" …]
Assistant: Employees must submit expense reports within 30 days.

FAIL

Disagreement smoothed into false consensus

The sources report contradictory findings, and the synthesis blends them into one unified claim that neither source makes.

User: Does the new onboarding flow improve activation?
[… study A: activation up 12%; study B: no significant change, possible novelty effect …]
Assistant: The research consistently shows the new onboarding flow improves activation by around 12%.

WARN

Conflict noticed, then averaged away

The model registers that the numbers differ but resolves the tension with a vague midpoint instead of surfacing the disagreement for the user to judge.

User: How long do employees have to submit expense reports?
[… passage A: "within 30 days"; passage B: "within 14 days" …]
Assistant: Expense reports should generally be submitted within a few weeks.

PASS

Disagreement surfaced and weighed

Names both sources, uses their dates to weigh them, and gives the user the conflict along with the reasoned resolution.

User: How long do employees have to submit expense reports?
[… passage A (2023 handbook): "within 30 days"; passage B (2025 policy update): "within 14 days" …]
Assistant: The sources disagree: the 2023 handbook says 30 days, but a 2025 policy update says 14. The update is more recent and explicitly supersedes the handbook, so 14 days is the current rule.

Why it happens

01

Training data for grounded generation almost always pairs a question with consistent evidence and a single answer. Models learn to synthesize one fluent response, not to notice that the evidence set itself disagrees (Chen et al., 2022, "Rich Knowledge Sources Bring Complex Knowledge Conflicts").

02

When retrieved sources conflict, models resolve the disagreement silently using shallow cues — passage order, phrasing confidence, surface features of the text — rather than source reliability or evidence quality (Wan et al., 2024, "What Evidence Do Language Models Find Convincing?").

03

The model's own parametric belief acts as a tiebreaker. Evidence agreeing with pretraining priors gets weight and contradicting evidence gets discounted, so the conflict is settled by prior bias instead of being surfaced (Xie et al., 2023, "Adaptive Chameleon or Stubborn Sloth").

04

Preference tuning rewards decisive, unified answers. A response that lays out two contradictory findings and declines to pick one reads as unhelpful to raters, so models are pushed toward smoothing disagreement into false consensus.

05

Retrieval pipelines deliver passages stripped of provenance — no publication date, authority, or methodology — leaving the model without the metadata needed to weigh one source against another even when it does notice the conflict.

06

Detecting conflict requires comparing retrieved passages against each other, but attention over a long stuffed context is uneven. Disagreements between passages placed far apart in the prompt are less likely to be noticed at all.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Entailment checking

Run pairwise NLI across the retrieved passages before generation. A contradiction between passages establishes that a conflict existed in the evidence set. Any answer that then asserts one side without mentioning the other is an instance of this failure, detected mechanically.
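
As a rough illustration of the pairwise step, the sketch below takes the NLI model as an injected function rather than naming a specific library. The `NLI` signature, the three label strings, and the `toy_nli` stand-in are all assumptions made for this example; a real check would wrap an actual NLI model behind the same interface.

```python
import re
from itertools import combinations
from typing import Callable, List, Tuple

# Assumed interface: (premise, hypothesis) -> "entailment" | "neutral" | "contradiction".
NLI = Callable[[str, str], str]


def find_contradictions(passages: List[str], nli: NLI) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), with i < j, whose passages contradict each other."""
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(passages), 2):
        # NLI is directional, so check both orderings of each pair.
        if nli(a, b) == "contradiction" or nli(b, a) == "contradiction":
            conflicts.append((i, j))
    return conflicts


def toy_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a real NLI model: compares 'N days' deadlines only."""
    p = re.search(r"(\d+) days", premise)
    h = re.search(r"(\d+) days", hypothesis)
    if not p or not h:
        return "neutral"
    return "entailment" if p.group(1) == h.group(1) else "contradiction"


passages = [
    "Expense reports are due within 30 days.",
    "Expense reports are due within 14 days.",
    "Receipts must be itemized.",
]
print(find_contradictions(passages, toy_nli))  # [(0, 1)]
```

The number of NLI calls grows quadratically with the number of passages, so this kind of check is most practical on a small top-k retrieved set.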

⚖️

LLM-as-judge evaluation

Run a judge that sees the full retrieved set alongside the answer and asks whether the sources agree, and if not, how the answer handled it — flagging silent resolution, blends into false consensus, and vague midpoints that average the disagreement away.
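
One possible shape for such a judge is sketched below. The verdict labels, the prompt wording, and the `call_llm` parameter are hypothetical names introduced for illustration; `call_llm` stands for whatever function sends a prompt to the judge model and returns its text reply.

```python
from typing import Callable, List

# Hypothetical verdict labels mirroring the failure patterns described above.
VERDICTS = (
    "NO_CONFLICT",
    "SURFACED",
    "SILENT_RESOLUTION",
    "FALSE_CONSENSUS",
    "VAGUE_MIDPOINT",
)
FAILING = {"SILENT_RESOLUTION", "FALSE_CONSENSUS", "VAGUE_MIDPOINT"}


def build_judge_prompt(question: str, passages: List[str], answer: str) -> str:
    """Show the judge the full retrieved set next to the answer being graded."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "You are grading how an answer handled its retrieved sources.\n"
        f"Question: {question}\n"
        f"Retrieved passages:\n{numbered}\n"
        f"Answer: {answer}\n"
        "First decide whether the passages agree with each other. "
        "Then reply with exactly one label:\n" + "\n".join(VERDICTS)
    )


def judge(question: str, passages: List[str], answer: str,
          call_llm: Callable[[str], str]) -> str:
    """Return one of VERDICTS, or 'UNPARSEABLE' if the reply is not a known label."""
    reply = call_llm(build_judge_prompt(question, passages, answer)).strip().upper()
    return reply if reply in VERDICTS else "UNPARSEABLE"


def is_failure(verdict: str) -> bool:
    return verdict in FAILING
```

Restricting the judge to a closed label set keeps its output easy to aggregate, at the cost of losing the judge's free-text reasoning.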

🧪

Golden-set evals

Maintain corpora seeded with documented disagreements — a dated policy and its update, two studies with contradictory findings — and regression-test whether answers surface the conflict and resolve it by provenance, rather than by passage order or confidence of phrasing.
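
A minimal sketch of such a regression check follows, assuming a hypothetical `GoldenCase` record and an `answer_fn(question, passages)` callable that wraps the system under test. It uses plain substring matching to decide whether both sides were mentioned, which is a deliberately crude proxy, and it runs each case with the passages in both orders to expose answers that depend on passage order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    question: str
    passages: List[str]
    must_mention: List[str]  # one marker string per conflicting side


def surfaces_conflict(case: GoldenCase, answer: str) -> bool:
    """Crude proxy: the answer mentions every conflicting value."""
    text = answer.lower()
    return all(marker.lower() in text for marker in case.must_mention)


def run_case(case: GoldenCase, answer_fn: Callable[[str, List[str]], str]) -> bool:
    """Pass only if the conflict is surfaced regardless of passage order."""
    orderings = [case.passages, list(reversed(case.passages))]
    return all(surfaces_conflict(case, answer_fn(case.question, p)) for p in orderings)


case = GoldenCase(
    question="How long do employees have to submit expense reports?",
    passages=["2023 handbook: within 30 days", "2025 policy update: within 14 days"],
    must_mention=["30 days", "14 days"],
)


# Two hypothetical systems under test: one echoes the first passage, one surfaces both.
def first_passage_bot(question: str, passages: List[str]) -> str:
    return passages[0]


def surfacing_bot(question: str, passages: List[str]) -> str:
    return "The sources disagree: " + " versus ".join(passages)


print(run_case(case, first_passage_bot))  # False
print(run_case(case, surfacing_bot))      # True
```

Substring matching cannot tell whether the answer resolved the conflict correctly; pairing this harness with a judge or a human-labeled expected resolution would cover that part.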

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🛡️

Provenance tiering

Carry dates, authority, and document status through retrieval so the model has something better than passage order to resolve conflicts with. The PASS-grade answer ("the 2025 update supersedes the 2023 handbook") is only possible when the passages arrive with their provenance attached.
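
The sketch below shows one way provenance could travel with each passage and be rendered into the prompt. The `Passage` fields, the `AUTHORITY_RANK` tiers, and the ordering rule (current before superseded, then authority, then recency) are assumptions for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical authority tiers; a real deployment would define its own.
AUTHORITY_RANK = {"policy": 2, "handbook": 1, "forum": 0}


@dataclass
class Passage:
    text: str
    source: str
    published: date
    authority: str            # key into AUTHORITY_RANK
    superseded: bool = False  # document status


def rank_by_provenance(passages: List[Passage]) -> List[Passage]:
    """Order passages: current before superseded, then by authority, then by recency."""
    return sorted(
        passages,
        key=lambda p: (not p.superseded, AUTHORITY_RANK.get(p.authority, -1), p.published),
        reverse=True,
    )


def render(p: Passage) -> str:
    """Attach provenance as a header line so the model sees it with the text."""
    status = "superseded" if p.superseded else "current"
    return (
        f"[source={p.source} | published={p.published.isoformat()} | "
        f"authority={p.authority} | status={status}]\n{p.text}"
    )


# Month and day are placeholders; the example above gives only the years.
handbook = Passage("Submit within 30 days.", "2023 handbook", date(2023, 1, 1), "handbook")
update = Passage("Submit within 14 days.", "2025 policy update", date(2025, 1, 1), "policy")

for passage in rank_by_provenance([handbook, update]):
    print(render(passage))
```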

📝

Instruction constraints

Instruct the model to compare retrieved passages against each other before synthesizing, and to surface disagreement rather than smooth it over. Naming both sources and the resolution beats a vague midpoint or a silent pick, even though preference tuning has taught the model that decisive answers rate better.
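
As one illustration, such instructions could be kept in a reusable prompt fragment. The wording of `CONFLICT_INSTRUCTIONS` and the `build_prompt` helper below are hypothetical and would need tuning against a real model.

```python
from typing import List

# Illustrative wording only; not a tested or recommended prompt.
CONFLICT_INSTRUCTIONS = (
    "Before answering, compare the retrieved passages against each other.\n"
    "If two passages disagree, say so explicitly: name each source and what it claims.\n"
    "Resolve the disagreement only when provenance (date, authority, status) supports it, "
    "and state the reason.\n"
    "Do not average conflicting values, and do not answer from one passage "
    "while omitting the other."
)


def build_prompt(question: str, passages: List[str]) -> str:
    """Prepend the conflict-handling instructions to the numbered sources."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{CONFLICT_INSTRUCTIONS}\n\nSources:\n{sources}\n\nQuestion: {question}"
```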

🚪

Entailment gating

Run pairwise NLI across the retrieved set before generation, and when passages contradict, route to a conflict-handling path that requires the answer to address both sides — so the disagreement is established mechanically instead of depending on uneven attention across a stuffed context.
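
The routing step might look like the sketch below. The `contradicts` callable stands in for a real NLI-based check, and the path names and requirement wording are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def route(passages: List[str],
          contradicts: Callable[[str, str], bool]) -> Tuple[str, List[Pair]]:
    """Choose a generation path based on whether any two passages contradict."""
    pairs = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(passages), 2)
        if contradicts(a, b) or contradicts(b, a)  # check both directions
    ]
    return ("conflict_path" if pairs else "default_path", pairs)


def conflict_requirements(pairs: List[Pair]) -> str:
    """Extra instructions for the conflict path, one line per contradicting pair."""
    return "\n".join(
        f"Passages [{i + 1}] and [{j + 1}] contradict each other; "
        "the answer must address both."
        for i, j in pairs
    )


# Hypothetical stand-in for an NLI model, hard-coded for the expense-report example.
def toy_contradicts(a: str, b: str) -> bool:
    return "30 days" in a and "14 days" in b


path, pairs = route(["within 30 days", "within 14 days"], toy_contradicts)
print(path)                          # conflict_path
print(conflict_requirements(pairs))
```

Because the gate runs before generation, the conflict path is taken even when the two passages sit far apart in the prompt.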