Failure Modes / Control / Instruction Noncompliance

Instruction Noncompliance

Fails to follow an explicit, applicable instruction from the governing prompt, user request, or task procedure. Unlike Constraint Violation, this covers any instruction, not only limits or requirements. Unlike Context Rot, the failure is observed noncompliance with an applicable instruction, regardless of whether fading context salience caused it.

Examples

FAIL

Standing instruction silently dropped

An explicit, still-applicable instruction from the governing prompt simply doesn't happen, with no acknowledgment that it was skipped.

System: Always begin every reply with a one-line TL;DR. User: Why is the staging deploy failing? Assistant: Looking at the logs, the deploy fails during the asset build step. The node version in CI was bumped last week, which…
WARN

Partial compliance

The instruction is recognizably attempted but not actually followed — the summary exists, but not where the instruction put it.

System: Always begin every reply with a one-line TL;DR. User: Why is the staging deploy failing? Assistant: Looking at the logs, the deploy fails during the asset build step… In short: CI's node bump broke the asset build.
FAIL

Substituting a "better" answer

The model judges the asked-for thing suboptimal and delivers what it thinks the user should have wanted instead.

User: Translate this error message into German — just the translation, please. Assistant: Rather than translating it, I'd suggest rewording the message first — it's quite confusing in English. Here's an improved version: …
PASS

Literal compliance

The instruction is followed exactly as written, even though the model might have preferred a different shape.

System: Always begin every reply with a one-line TL;DR. User: Why is the staging deploy failing? Assistant: TL;DR: CI's node version bump broke the asset build step. The full picture: the deploy fails during asset compilation because…

Why it happens

01

Instruction following is statistical imitation, not rule execution. The model learned a distribution over compliant-looking responses, so any given instruction is honored with some probability rather than guaranteed.

02

Compliance drops as instructions accumulate. Benchmarks of verifiable instructions show models satisfying each one in isolation but dropping some once several apply at once (Zhou et al., 2023, "Instruction-Following Evaluation for Large Language Models").

03

Preference tuning rewarded what raters liked, not literal compliance. Models learn to "improve" on instructions they implicitly judge suboptimal, substituting the response the rater would have preferred over the one that was asked for.

04

Instruction force decays with distance. Directives far from the generation point are underweighted relative to recent context, so instructions issued early in a long prompt or session quietly lose their grip (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts").

05

Unusual phrasings and rare instruction types are undertrained. An instruction that matches common training patterns is followed reliably; one expressed in an uncommon form may simply not be recognized as binding.

06

Aggregate evaluation hides per-instruction failures. A response that satisfies most of a request still scores well on holistic metrics, so partial compliance is consistently underpenalized during development.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

📋

Instruction checklist scoring

Decompose the governing prompt into atomic instructions and verify each one independently — mechanically where the instruction is checkable, with a judge where it is not. Aggregate quality scores hide a single dropped instruction; per-item scoring is the point.

⚖️

LLM-as-judge evaluation

Prompt the judge specifically for substitution and partial compliance — the "better" answer delivered instead of the one asked for, the TL;DR that exists but not where the instruction put it. Holistic quality judging will miss both, because the responses are good.

🧪

Golden-set evals

Maintain verifiable-instruction suites and vary the number of simultaneous instructions, since compliance drops as they stack. Include uncommon phrasings of common instructions, which fail at higher rates than their familiar forms.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🧮

Instruction budgeting

Treat the governing prompt as a budget, since each instruction is honored with a probability that falls as they stack. Cut instructions that don't earn their slot, state the survivors in common phrasings rather than novel ones, and move anything mechanically enforceable out of the prompt into the pipeline so it stops competing for compliance at all.

📌

Context pinning

Keep standing instructions restated near the generation point instead of relying on a directive issued once at the top of a long prompt or session. Instruction force decays with distance, so an always-on rule like the TL;DR-first requirement has to keep traveling with the conversation to keep binding.

Self-check pass

Before responding, have the model re-derive the checklist of applicable instructions and verify its draft against each — including placement, since a TL;DR that exists but not where the instruction put it is still a miss. Make literal compliance the stated standard, so the trained reflex to substitute a "better" answer gets checked rather than indulged.