Incompleteness

Leaves out information, constraints, caveats, steps, options, or outputs needed to satisfy the user's task. Unlike Step Omission, the gap is in the final answer, not necessarily in the work process. Unlike Concision Failure, incompleteness can happen even when the answer is not especially brief.

Examples

FAIL

Dropped requirement

Answers two of three explicitly requested parts and never acknowledges the third.

User: Give me a regex for US phone numbers, explain how it works, and show one input it incorrectly rejects. Assistant: Here's the regex: ^\d{3}-\d{3}-\d{4}$. It matches three digits, a hyphen, three digits, a hyphen, and four digits. Let me know if you need anything else!

WARN

Missing implied coverage

Answers what was literally asked but omits the option any complete answer would include.

User: How do I share this 30 GB file with a client? Assistant: Upload it to Google Drive and send them the link. [… no mention that free Drive accounts cap at 15 GB, or of alternatives that handle the size …]

PASS

Full coverage

Satisfies each requested part, in the order asked, and flags the known limitation.

User: Give me a regex for US phone numbers, explain how it works, and show one input it incorrectly rejects. Assistant: Regex: ^\d{3}-\d{3}-\d{4}$. It matches three digits, a hyphen, three digits, a hyphen, and four digits. It incorrectly rejects "(555) 123-4567", a common valid format, because it only accepts hyphenated input.
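
The regex and its limitation from the exchange above can be checked directly. This is a minimal Python sketch; the helper name is_us_phone is illustrative and not part of the original example.

```python
import re

# Pattern copied from the example above: NNN-NNN-NNNN and nothing else.
US_PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def is_us_phone(text: str) -> bool:
    """Return True when the text matches the hyphenated pattern."""
    return US_PHONE.match(text) is not None


print(is_us_phone("555-123-4567"))    # True
print(is_us_phone("(555) 123-4567"))  # False: valid format, wrongly rejected
```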

Why it happens

01

Instruction-following degrades with the number of requirements. As a request stacks parts, constraints, and asked-for outputs, the probability of satisfying all of them falls, and the dropped ones are simply absent rather than flagged.

02

Attention to the request decays over a long generation. Early parts of the answer track the question closely, while later sections increasingly condition on the answer's own momentum, so trailing requirements are the ones that vanish.

03

Nothing checks the answer against the request. Generation ends when a closing pattern becomes likely, not when a completeness check passes, and few scaffolds add a verification pass that re-reads the question before sending.

04

Effort-saving survives preference tuning. Raters rarely cross off every requested element, so a fluent answer covering most of the request scores as well as a complete one, and the shortcut is reinforced wherever checking is expensive.

05

The model answers the modal version of the question. Common phrasings pull toward their most frequent training-data interpretation, and the user's specific additions (the third question, the unusual constraint) fall outside the template the model is completing.

06

Coverage requires knowing what the task implies, not just what it states. Necessary caveats, options, and prerequisites are part of domain pragmatics, and where the model's task model is shallow it does not know what a complete answer would contain.
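
The first cause above, coverage falling as requirements stack, can be illustrated with a toy calculation. It assumes each part is satisfied independently with the same probability, which the text does not claim and real systems need not obey. The 0.95 figure is invented for illustration, not a measurement.

```python
def all_parts_probability(per_part: float, parts: int) -> float:
    """Chance that every part is satisfied, ASSUMING independent parts."""
    return per_part ** parts


# Illustrative per-part success rate of 0.95 (an assumption, not data).
for n in (1, 3, 5, 10):
    print(n, round(all_parts_probability(0.95, n), 3))
# 1 0.95
# 3 0.857
# 5 0.774
# 10 0.599
```

Under these assumptions, a request with ten parts is fully satisfied only about 60% of the time even though each part individually succeeds 95% of the time.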

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

📋

Instruction checklist scoring

Extract the request's enumerable parts (the three questions asked, the constraint stated, the example demanded) and mechanically check each against the response. Two of three parts answered with the third silently absent is countable without judging quality, and per-part scoring shows which positions get dropped.
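
One way to sketch per-part scoring in Python, using the phone-number request as the test case. The CHECKS table and its keyword heuristics are hypothetical stand-ins: real part extraction is task-specific, and crude string matching like this would misfire on many responses.

```python
from typing import Callable, Dict

# Hypothetical detectors, one per requested part. These keyword heuristics
# are assumptions for illustration only.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "regex": lambda r: r"\d{" in r,
    "explanation": lambda r: "digit" in r.lower(),
    "rejected_input": lambda r: "reject" in r.lower(),
}


def score_parts(response: str,
                checks: Dict[str, Callable[[str], bool]] = CHECKS) -> Dict[str, bool]:
    """Return a per-part verdict: was each requested element found?"""
    return {name: check(response) for name, check in checks.items()}


fail_response = (
    r"Here's the regex: ^\d{3}-\d{3}-\d{4}$. It matches three digits, "
    "a hyphen, three digits, a hyphen, and four digits."
)
print(score_parts(fail_response))
# {'regex': True, 'explanation': True, 'rejected_input': False}
```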

⚖️

LLM-as-judge evaluation

Have the judge ask what a complete answer to this task would contain, beyond what was literally enumerated: the 15 GB cap that makes the suggested upload impossible, the option any practitioner would mention. Implied coverage is where checklist extraction runs out and judgment has to take over.
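
A rough sketch of how such a judge call might be wired up. The prompt wording, the JSON reply shape, and the judge callable (any function that takes a prompt string and returns the model's text) are all assumptions for illustration. No particular model API is implied.

```python
import json
from typing import Callable, List

# Hypothetical prompt; the wording and reply format are illustrative.
JUDGE_TEMPLATE = (
    "You are reviewing an answer for completeness.\n"
    "Request:\n{request}\n\nAnswer:\n{answer}\n\n"
    "List what a complete answer would contain, including implied caveats "
    "and options, then report which items are absent.\n"
    'Reply with JSON only: {{"missing": ["..."]}}'
)

UNPARSEABLE = "<judge reply could not be parsed>"


def find_missing(request: str, answer: str,
                 judge: Callable[[str], str]) -> List[str]:
    """Ask the judge which implied items are absent from the answer."""
    raw = judge(JUDGE_TEMPLATE.format(request=request, answer=answer))
    try:
        missing = json.loads(raw).get("missing", [])
    except (json.JSONDecodeError, AttributeError):
        # An unreadable verdict is surfaced rather than treated as a pass.
        return [UNPARSEABLE]
    if not isinstance(missing, list):
        return [UNPARSEABLE]
    return [str(item) for item in missing]
```

Returning a sentinel on parse failure keeps a malformed judge reply from silently counting as "nothing missing".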

🧪

Golden-set evals

Maintain multi-part requests with countable elements and chart per-part coverage against requirement count and position. Trailing requirements vanish first as requests stack up; the curve tells you how many parts a request can carry before the system starts shedding them.
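
The per-position curve can be computed from per-part verdicts with a small aggregation. In this sketch the function name and the sample verdicts are made up for illustration; real verdicts would come from whatever per-part check the eval uses.

```python
from collections import defaultdict
from typing import Dict, List


def coverage_by_position(results: List[List[bool]]) -> Dict[int, float]:
    """Map each 1-based part position to the fraction of runs covering it."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for parts in results:
        for position, covered in enumerate(parts, start=1):
            totals[position] += 1
            hits[position] += int(covered)
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}


# Invented verdicts for four runs of three-part requests (not real data).
runs = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [True, True, False],
]
print(coverage_by_position(runs))  # {1: 1.0, 2: 0.75, 3: 0.25}
```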

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

✅

Self-check pass

Before sending, have the model re-read the request, extract its enumerable parts, and point at where the response satisfies each: regex, explanation, and the incorrectly rejected input all get an address or the answer isn't done. Generation ends when a closing pattern becomes likely; this pass replaces that stopping rule with the request's own list.

🔁

Validate-and-retry loops

Run the checklist extraction mechanically in the scaffold and bounce incomplete responses back with the missing part named, for example "the third requested item, an input the regex incorrectly rejects, is absent." Per-part checking is exactly what raters skip and what made the shortcut survivable; the loop does it on every response.
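
A minimal sketch of such a loop, assuming the scaffold supplies a generate callable (prompt in, response text out) and a table of per-part checks. Both are hypothetical placeholders, as is the wording of the feedback message.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def missing_parts(response: str, checks: Dict[str, Check]) -> List[str]:
    """Names of the requested parts that the response does not satisfy."""
    return [name for name, check in checks.items() if not check(response)]


def generate_with_retry(request: str,
                        generate: Callable[[str], str],
                        checks: Dict[str, Check],
                        max_attempts: int = 3) -> Tuple[str, List[str]]:
    """Regenerate until every check passes or attempts run out.

    Returns the last response and the parts still missing (empty on success).
    """
    prompt = request
    response = ""
    missing = list(checks)
    for _ in range(max_attempts):
        response = generate(prompt)
        missing = missing_parts(response, checks)
        if not missing:
            break
        # Bounce the response back with the absent parts named explicitly.
        named = ", ".join(missing)
        prompt = (f"{request}\n\nYour previous answer omitted: {named}. "
                  "Answer again and include every requested part.")
    return response, missing
```

Returning the leftover missing list lets the caller decide what to do when retries are exhausted, rather than sending an incomplete answer unflagged.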

🪜

Task decomposition

Split multi-part requests so each part is generated and verified on its own before assembly. Instruction-following degrades as requirements stack and trailing parts vanish first; decomposition means no part is ever trailing, which removes the position effect rather than fighting it.
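
Decomposition might look like the following sketch. The generate and verify callables are hypothetical placeholders for a model call and a per-part check, and splitting the request into parts is assumed to have happened upstream.

```python
from typing import Callable, List


def answer_by_parts(parts: List[str],
                    generate: Callable[[str], str],
                    verify: Callable[[str, str], bool]) -> str:
    """Generate and verify each part on its own, then assemble in order."""
    sections = []
    for part in parts:
        section = generate(part)
        if not verify(part, section):
            # Fail loudly instead of assembling an answer with a hole in it.
            raise ValueError(f"part not satisfied: {part!r}")
        sections.append(section)
    return "\n\n".join(sections)
```

Because each part is its own generation, every part is checked before assembly and none sits at the tail of a long answer.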