Verbosity Failure

Provides more detail, repetition, caveats, background, or explanation than the task, user, medium, or decision requires. Unlike Irrelevance, the extra material may be relevant but excessive. Unlike Audience Mismatch, the problem is length and density rather than reader fit overall.

Examples

FAIL

Padding around a one-word answer

Wraps a single fact in restatement, background, caveats, and a closing summary, none of which was asked for.

User: What's the capital of Australia?
Assistant: Great question! The capital of Australia is a common point of confusion, as many people assume it's Sydney or Melbourne, both larger and better-known cities. In fact, the capital is Canberra, a planned city purpose-built in 1913 as a compromise between the two rivals. So, to summarize: the capital of Australia is Canberra.

WARN

Fixed chat overhead

The content is right-sized, but trained-in framing habits add boilerplate at both ends.

User: What's the capital of Australia?
Assistant: I'd be happy to help with that! The capital of Australia is Canberra. Let me know if you'd like to learn more about Australian geography!

PASS

Sized to the question

Delivers the fact at the length the question warrants.

User: What's the capital of Australia?
Assistant: Canberra.

Why it happens

01

Reward models prefer longer responses. Length correlates strongly with reward scores in RLHF pipelines, so optimization inflates output length even when no new information is added (Singhal et al., 2023, "A Long Way to Go: Investigating Length Correlations in RLHF").

02

Automated judges share the bias. LLM-as-judge evaluation favors longer, more elaborate answers, so model comparisons and fine-tuning loops built on these judges reinforce verbosity rather than penalizing it (Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena").

03

Length reads as effort. Human raters skimming under time pressure use thoroughness as a proxy for quality, rewarding visible coverage over judgment about what the reader actually needs.

04

Padding hedges against unknown needs. The model cannot see which detail matters to this user. Including everything is rarely scored as wrong, while omitting the one needed point is, so the asymmetry favors exhaustive answers.

05

Chat formatting conventions add fixed overhead. Trained-in habits such as restating the question, announcing the structure, listing caveats, and closing with a summary attach to responses regardless of whether the content warrants them.

06

Generation has no revision pass. Tokens are emitted left to right without a step that asks what could be cut, so the first sufficiently complete draft, redundancies included, is what ships.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

Compression testing

Have a model compress the response to its minimal sufficient form and measure what fraction survives. An answer that reduces from 150 words to one with nothing lost was padding by construction; compressibility is a direct, mechanical measure of the overhead.
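As a rough illustration of the measurement, the sketch below computes the surviving fraction by word count. The `compress` callable is a hypothetical stand-in for a model call, and whitespace-separated words are assumed to be an acceptable proxy for tokens.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Count whitespace-separated words; a rough stand-in for tokens."""
    return len(text.split())


def surviving_fraction(response: str, compress: Callable[[str], str]) -> float:
    """Fraction of the response's words that remain after compression.

    `compress` is a hypothetical callable (in practice, a model call) that
    returns the minimal sufficient form of the response.
    """
    original = word_count(response)
    if original == 0:
        return 1.0  # nothing to compress, so nothing counts as padding
    compressed = word_count(compress(response))
    # Cap at 1.0 in case the compressor returns something longer.
    return min(1.0, compressed / original)


# Toy compressor for illustration only: it always answers "Canberra."
padded = (
    "Great question! The capital of Australia is Canberra. "
    "So, to summarize: the capital of Australia is Canberra."
)
fraction = surviving_fraction(padded, lambda _text: "Canberra.")
```

A low fraction flags likely padding. Where to set the threshold is a judgment call that depends on the task and is not specified here.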

LLM-as-judge evaluation

Judges share the length bias that causes this failure, so the rubric has to fight it explicitly: score information per token, or compare the response pairwise against its compressed version and ask what the extra words bought. A judge asked simply to rate quality will reward the verbosity it was meant to catch.
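One way the pairwise comparison might be set up is sketched below. The template wording, the `NOTHING` sentinel, and both function names are illustrative assumptions rather than a tested rubric, and sending the prompt to a judge model is left out.

```python
# Hypothetical rubric text; the wording is an assumption, not a validated prompt.
JUDGE_TEMPLATE = """You are comparing two answers to the same question.

Question:
{question}

Answer A (original):
{original}

Answer B (compressed):
{compressed}

List each piece of information in Answer A that is missing from Answer B
and that the question actually requires. Do not reward length, tone, or
thoroughness by themselves. If the list is empty, reply with exactly: NOTHING
"""


def build_judge_prompt(question: str, original: str, compressed: str) -> str:
    """Fill the pairwise template; the judge call itself is out of scope."""
    return JUDGE_TEMPLATE.format(
        question=question, original=original, compressed=compressed
    )


def is_padding(judge_reply: str) -> bool:
    """True when the judge found nothing required in the extra words."""
    return judge_reply.strip().upper() == "NOTHING"
```

Asking the judge to itemize what the extra words bought, instead of asking for a quality score, gives it less room to reward length directly.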

Golden-set evals

Maintain questions with known minimal sufficient answers and measure token overhead above that floor, tracking boilerplate markers (restated questions, announced structure, closing summaries) separately from content. Fixed chat overhead shows up as a constant tax across every answer regardless of the question's size.
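A minimal sketch of the two measurements follows, assuming whitespace-separated words approximate tokens. The marker patterns are illustrative examples only; a real list would be built from observed model output.

```python
import re

# Illustrative marker patterns (assumptions, not an exhaustive list).
BOILERPLATE_PATTERNS = {
    "opener": re.compile(r"^(great question|i'd be happy to help)", re.IGNORECASE),
    "closing_summary": re.compile(r"\b(to summarize|in summary)\b", re.IGNORECASE),
    "closing_offer": re.compile(r"\blet me know if\b", re.IGNORECASE),
}


def overhead(response: str, minimal_answer: str) -> int:
    """Words in the response above the known minimal answer's length."""
    return max(0, len(response.split()) - len(minimal_answer.split()))


def boilerplate_markers(response: str) -> list[str]:
    """Names of the boilerplate patterns found in the response."""
    text = response.strip()
    return [name for name, pattern in BOILERPLATE_PATTERNS.items() if pattern.search(text)]


warn_example = (
    "I'd be happy to help with that! The capital of Australia is Canberra. "
    "Let me know if you'd like to learn more about Australian geography!"
)
extra_words = overhead(warn_example, "Canberra.")
markers = boilerplate_markers(warn_example)
```

Reporting `markers` separately from `extra_words` keeps the fixed tax distinguishable from content that is merely longer than the floor.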

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Instruction constraints

Size the answer to the question and ban the fixed overhead by name: no restating the question, no announcing the structure, no closing summary, no "great question." The boilerplate is trained-in habit, not judgment, so naming the specific patterns removes them more reliably than a general plea for brevity.
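A small sketch of assembling such an instruction from named patterns is shown below. The constant, the function name, and the exact wording are hypothetical; how the resulting string is passed to a model depends on the API in use.

```python
from typing import Sequence

# Patterns banned by name; the phrasing is illustrative only.
BANNED_PATTERNS: tuple[str, ...] = (
    "restating the question",
    "announcing the structure of the answer",
    "closing with a summary",
    'openers such as "Great question!"',
)


def build_system_prompt(banned: Sequence[str] = BANNED_PATTERNS) -> str:
    """Return a sizing instruction followed by one explicit rule per pattern."""
    rules = "\n".join(f"- No {item}." for item in banned)
    return "Size the answer to the question.\n" + rules


system_prompt = build_system_prompt()
```

Keeping the patterns in a list makes it easy to add a newly observed habit without rewriting the instruction.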

Output post-processing

Run the compression test as a fix instead of a metric: reduce the draft to its minimal sufficient form and ship that version. The padding around "Canberra" survives because generation has no revision pass; a compression pass is that revision, and what it cannot remove without losing information was the actual answer.
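The revision pass could be wired in roughly as follows. `compress` is again a hypothetical stand-in for a model call, and the `must_contain` guard is an added assumption that falls back to the draft if compression drops a known-required string.

```python
from typing import Callable


def revise(draft: str, compress: Callable[[str], str], must_contain: str = "") -> str:
    """Ship the compressed draft, falling back to the original when unsafe.

    `compress` is a hypothetical callable (e.g. a model call). `must_contain`
    is an optional, illustrative safety check, not part of the source method.
    """
    candidate = compress(draft).strip()
    if not candidate:
        return draft  # the compressor returned nothing usable
    if len(candidate.split()) >= len(draft.split()):
        return draft  # no words saved, so keep the original
    if must_contain and must_contain.lower() not in candidate.lower():
        return draft  # compression lost the required content
    return candidate


draft = "Great question! In fact, the capital is Canberra. So, to summarize: Canberra."
shipped = revise(draft, lambda _text: "Canberra.", must_contain="Canberra")
```

The fallback branches matter because a compressor can also fail by over-cutting, which would trade verbosity for an incomplete answer.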

Preference data curation

Break the length-reward correlation at its source: penalize length in the reward model, score information per token rather than coverage, and check judge models for the same bias before trusting their ratings. Verbosity is what optimization against length-loving signals produces; every loop that keeps the bias keeps the failure.
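Two of these checks can be sketched in a few lines. The linear penalty above a token budget is one assumed form among many, not the method of any cited paper, and the Pearson correlation is offered only as a simple first check of whether a judge's scores track length.

```python
import math
from typing import Sequence


def length_penalized(reward: float, num_tokens: int, budget: int,
                     penalty_per_token: float = 0.01) -> float:
    """Subtract a linear penalty for tokens above the budget (assumed form)."""
    return reward - penalty_per_token * max(0, num_tokens - budget)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between response lengths and judge scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Toy data: a judge whose scores rise in lockstep with length.
lengths = [10, 20, 30, 40]
scores = [1.0, 2.0, 3.0, 4.0]
bias = pearson(lengths, scores)
```

A correlation near 1 on answers of comparable quality suggests the judge is rewarding length itself. Separating that from genuinely better long answers requires holding content fixed, which this sketch does not attempt.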