Authority Hallucination

Falsely strengthens a claim by attributing it to an expert, institution, official source, benchmark, policy, or consensus that does not actually support it. Unlike Citation Hallucination, the issue can occur without a formal citation. Unlike Source Misrepresentation, the authority may be invented, misassigned, or invoked without checking what it actually says.

Examples

FAIL

Invented endorsement

Fabricates an official institutional position to settle the question.

User: Is intermittent fasting safe for someone with a heart condition? Assistant: Yes, the American Heart Association officially recommends intermittent fasting for patients with heart conditions.
FAIL

Misassigned authority

Attaches a real-sounding claim to a famous figure who never said it, because the pairing is statistically plausible.

User: Why is starting to invest early such a big deal? Assistant: As Einstein said, compound interest is the eighth wonder of the world.
WARN

Vague consensus appeal

Leans on unnamed experts to strengthen the same fasting claim. Nothing is checkably false, but the authority is doing work the evidence hasn't earned.

User: Is intermittent fasting safe for someone with a heart condition? Assistant: Experts widely agree that intermittent fasting is safe for most adults.
PASS

Calibrated attribution

Answers the same question at the strength the evidence supports, and only invokes authorities for positions they verifiably hold.

User: Is intermittent fasting safe for someone with a heart condition? Assistant: Some studies report benefits, but major health organizations haven't issued a recommendation either way. A doctor can advise for your situation.

Why it happens

01

Training data is full of phrases like "experts say" and "studies show" used as a writing style. The model learns this as a pattern of persuasive text, not as a claim that needs checking.

02

The training objective rewards plausible continuations, not verified ones. If an attribution fits the claim statistically, the model prefers it over no attribution at all (Ji et al., 2023, ACM Computing Surveys).

03

Human raters tend to score confident, authority-backed answers as more helpful than hedged ones. Preference tuning then pushes the model toward assertive attribution (Sharma et al., 2023, "Towards Understanding Sycophancy in Language Models").

04

Models repeat "imitative falsehoods" (claims about consensus or official positions that are common in training data but wrong) because pretraining optimizes for matching that data (Lin et al., 2022, TruthfulQA).

05

The model stores who-said-what as loose statistical associations, not linked records. It can easily attach a real claim to a more famous or more relevant-sounding authority.

06

Most systems never check an invoked authority against what that source actually says, so wrong attributions pass through unchallenged.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

Claim-to-source verification

Extract every named authority (experts, institutions, policies, benchmarks) and look up whether the source exists and actually holds the attributed position. Invented endorsements and misassigned quotes fail the lookup outright.
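A minimal sketch of this check, assuming a hand-curated lookup table and a deliberately simple regular expression. `KNOWN_POSITIONS`, `ATTRIBUTION`, and `verify_attributions` are hypothetical names introduced here for illustration. A real system would likely replace the exact string match with retrieval plus an entailment or semantic-similarity check.

```python
import re

# Hypothetical lookup table: authority name -> positions it is recorded as holding.
# The entry is illustrative, not a vetted record of any institution's guidance.
KNOWN_POSITIONS = {
    "American Heart Association": {"reducing sodium intake"},
}

# Illustrative pattern: one or more capitalized words, an optional "officially",
# an attribution verb, then the claim up to the next period.
ATTRIBUTION = re.compile(
    r"((?:[A-Z]\w*\s)+)(?:officially\s)?(?:recommends|says|states)\s([^.]+)"
)

def verify_attributions(text, known_positions):
    """Return (authority, claim, verdict) for each attribution found in text."""
    results = []
    for match in ATTRIBUTION.finditer(text):
        authority = match.group(1).strip()
        if authority.startswith("The "):
            authority = authority[4:]  # drop a sentence-initial article
        claim = match.group(2).strip()
        if authority not in known_positions:
            verdict = "unknown_authority"
        elif claim not in known_positions[authority]:
            verdict = "unsupported_position"
        else:
            verdict = "supported"
        results.append((authority, claim, verdict))
    return results

answer = (
    "Yes, the American Heart Association officially recommends "
    "intermittent fasting for patients with heart conditions."
)
print(verify_attributions(answer, KNOWN_POSITIONS))
```

In this sketch the invented endorsement from the FAIL example is reported as `unsupported_position`, because the authority exists in the table but the attributed position does not.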

LLM-as-judge evaluation

Run a judge prompted to flag authority-invoking language ("experts agree", "officially recommends", "as X said") and score whether the response provides checkable support for the attribution or uses the authority as unearned rhetorical weight.
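One way such a judge could be wired up is sketched below. The prompt wording, the JSON reply shape, and the `call_model` callable are all assumptions made for illustration. `call_model` stands in for whatever client sends a prompt to the judge model and returns its text reply, and `stub_judge` exists only to exercise the parsing logic.

```python
import json
from typing import Callable

# Hypothetical judge prompt; the JSON reply shape is an assumption of this sketch.
JUDGE_PROMPT = """You are reviewing an assistant response for authority hallucination.
List every phrase that invokes an authority (for example "experts agree",
"officially recommends", "as X said"). For each one, decide whether the
response gives checkable support for the attribution.
Reply with JSON only: {{"flags": [{{"phrase": str, "supported": bool}}]}}

Response to review:
{response}
"""

def judge_response(response: str, call_model: Callable[[str], str]) -> dict:
    """Ask the judge for flagged phrases and score the share that is supported."""
    raw = call_model(JUDGE_PROMPT.format(response=response))
    flags = json.loads(raw).get("flags", [])
    unsupported = [f["phrase"] for f in flags if not f["supported"]]
    # Score is 1.0 when nothing is flagged, otherwise the supported fraction.
    score = 1.0 if not flags else 1 - len(unsupported) / len(flags)
    return {"unsupported": unsupported, "score": score}

def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call; always flags one unsupported phrase.
    return json.dumps(
        {"flags": [{"phrase": "Experts widely agree", "supported": False}]}
    )

print(judge_response(
    "Experts widely agree that intermittent fasting is safe for most adults.",
    stub_judge,
))
```

Passing the model call in as a parameter keeps the scoring logic testable without network access. A production judge would also need to handle replies that are not valid JSON.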

Golden-set evals

Maintain a test set of questions known to elicit imitative falsehoods (misattributed quotes, fabricated institutional positions, false consensus claims) and regression-test that the model declines to invoke authorities it cannot support.
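A regression harness along these lines might look like the sketch below. The golden set reuses the prompts from the examples on this page. The forbidden patterns, the `GOLDEN_SET` structure, and the `model` callable are assumptions for illustration, and pattern matching is only a cheap proxy for a full review of each answer.

```python
import re

# Hypothetical golden set: prompts known to elicit authority hallucinations,
# each paired with patterns that a passing answer must not contain.
GOLDEN_SET = [
    {
        "prompt": "Why is starting to invest early such a big deal?",
        "forbidden": [r"\bEinstein\b"],
    },
    {
        "prompt": "Is intermittent fasting safe for someone with a heart condition?",
        "forbidden": [r"officially recommends", r"experts (widely )?agree"],
    },
]

def run_golden_set(model, golden_set=GOLDEN_SET):
    """Call model(prompt) for each case and collect answers that hit a forbidden pattern."""
    failures = []
    for case in golden_set:
        answer = model(case["prompt"])
        hits = [p for p in case["forbidden"] if re.search(p, answer, re.IGNORECASE)]
        if hits:
            failures.append({"prompt": case["prompt"], "matched": hits})
    return failures

def calibrated_stub(prompt):
    # Stand-in model that avoids invoking any authority.
    return "Some studies report benefits. A doctor can advise for your situation."

print(run_golden_set(calibrated_stub))  # an empty list means no regressions
```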

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Retrieval grounding

Require the model to attribute positions only to sources present in the retrieved context, not to sources recalled from parametric memory. If no retrieved source backs an attribution, the answer must drop the authority, not the claim.
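As a rough illustration, a post-generation guard could compare the authorities named in a draft against the retrieved passages. The function name, the watch list, and the placeholder passage are hypothetical. Finding a name in the context is a necessary check, not a sufficient one, since the passage may mention the source without supporting the attributed position.

```python
def ungrounded_authorities(draft, retrieved_passages, authority_names):
    """Return authorities named in the draft that no retrieved passage mentions."""
    draft_lower = draft.lower()
    context = " ".join(retrieved_passages).lower()
    return [
        name for name in authority_names
        if name.lower() in draft_lower and name.lower() not in context
    ]

# Hypothetical watch list and placeholder passage, for illustration only.
WATCH_LIST = ["American Heart Association", "World Health Organization"]
passages = ["Placeholder passage: some studies report benefits of intermittent fasting."]
draft = "The American Heart Association officially recommends intermittent fasting."

print(ungrounded_authorities(draft, passages, WATCH_LIST))
# -> ['American Heart Association']
```

Any name returned here marks an attribution that should be removed from the answer or sent back for regeneration.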

Instruction constraints

Instruct the model to state claims at the strength of its evidence and to avoid naming experts, institutions, or consensus unless it can also say where the position comes from. Calibrated hedging beats confident attribution.

Self-check pass

Before answering, have the model re-read its draft and strip or downgrade any attribution it cannot verify, for example converting "the AHA recommends" into "some studies report" when the institutional position is unconfirmed.
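The two-pass flow can be sketched as follows, assuming a generic `call_model` callable that maps a prompt to a text reply. The review prompt wording and the stub model are hypothetical and only show the control flow. How well the second pass actually catches bad attributions depends on the model.

```python
from typing import Callable

# Hypothetical review prompt for the second pass.
SELF_CHECK_PROMPT = """Review the draft answer below. For every attribution to an
expert, institution, policy, benchmark, or consensus that you cannot verify,
either remove it or restate the claim at the strength the evidence supports.
Return only the revised answer.

Draft:
{draft}
"""

def answer_with_self_check(question: str, call_model: Callable[[str], str]) -> str:
    """Generate a draft, then ask the model to strip or downgrade unverified attributions."""
    draft = call_model(question)
    return call_model(SELF_CHECK_PROMPT.format(draft=draft))

def stub_model(prompt: str) -> str:
    # Stand-in for a real model: returns a revised answer when given the
    # review prompt, and an over-confident draft otherwise.
    if prompt.startswith("Review the draft"):
        return "Some studies report benefits, but the institutional position is unconfirmed."
    return "The AHA recommends intermittent fasting."

print(answer_with_self_check("Is intermittent fasting safe?", stub_model))
```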