Index Drift

Lets the retrieval index diverge from its source of truth: the corpus contents, permissions, metadata, embeddings, or current document state. Unlike Outdated Source Reliance, this describes an indexing-system mismatch rather than a model decision to trust an old source. Unlike Retrieval Miss, a missing result here is caused by the index no longer representing the corpus correctly.

Examples

FAIL

Ghost document

The source document was retracted weeks ago, but its tombstoned vectors are still being served, so the answer cites a document that no longer exists.

User: What's our policy on customer data retention?
[… retrieval returns chunks from a policy doc deleted from the wiki three weeks ago after legal review …]
Assistant: Per the data retention policy, customer records are kept for seven years.

FAIL

Edit never reindexed

The document was corrected at the source, but the index still holds the old embedding and old text, so retrieval confidently serves the superseded version.

User: Which regions is the service available in?
[… the docs page was updated yesterday to add two regions; the index last synced last month …]
Assistant: The service is available in us-east and eu-west only.

WARN

Quiet coverage shrinkage

An ingestion failure has been silently dropping new documents for weeks. Nothing errors — the corpus just stops at a date no one chose, and answers skew old without explanation.

User: Summarize the engineering decisions from recent design reviews.
[… the crawler lost access to the design-review space in April; every doc since then is missing from the index …]
Assistant: Recent design reviews focused on the database migration and the queue rewrite.
[… both decisions are months old; four newer reviews exist but were never indexed …]

PASS

Index freshness surfaced

The pipeline compares the index's last sync time against the source system, and the answer carries that bound instead of presenting the index as current.

User: What's our policy on customer data retention?
Tool: search index, last successful sync 11 days ago
Assistant: Based on documents indexed as of 11 days ago, retention is seven years. The source wiki has changed since that sync, so I'd verify against the live policy page before relying on this.

Why it happens

01

Indexing is asynchronous by design. Embedding and ingestion are expensive, so corpora are indexed on batch schedules, and every edit, deletion, or permission change made between runs leaves the index describing a corpus that no longer exists.

02

Deletes and updates are harder than inserts. Many vector stores handle removal by tombstoning or periodic rebuilds, so retracted, superseded, or re-permissioned documents linger in the index long after they changed at the source.

03

Embedding model upgrades silently fork the index. Vectors produced by the old model are not comparable to queries embedded with the new one, and partial re-embedding leaves the corpus split across incompatible representations.

04

Permissions live in the source system, but the index keeps its own copy as metadata. Access revocations propagate on sync schedules or not at all, so the index keeps serving documents the user can no longer see at the source.

05

Ingestion failures are quiet. A parser that starts rejecting a new document format, a crawler blocked by an auth change, or a partially failed batch drops content from the index without any user-visible error, and nothing downstream knows the coverage shrank.

06

Drift is invisible at query time. Retrieval still returns confident, well-ranked results from the stale index, and end-to-end evaluations are run against a frozen snapshot, so divergence accumulates until a user notices a missing or ghost document.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔄

Index-source reconciliation

Periodically diff the index against the source systems — document counts, content checksums, last-modified timestamps, permission states, embedding model versions. Ghost documents, unindexed edits, and half-re-embedded corpora all surface as reconciliation deltas long before a user notices.
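
As one possible illustration of the reconciliation idea, the sketch below diffs a source manifest against an index manifest keyed by document id. The `DocState` record, the `reconcile` function, the bucket names, and the `embed-v1`/`embed-v2` model labels are all hypothetical; a real system would pull these fields from its own source API and vector store, and would likely also compare timestamps and permission states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocState:
    """Hypothetical index-side record for one document."""
    checksum: str         # hash of the content that was embedded
    embedding_model: str  # model version that produced the vectors


def reconcile(
    source: dict[str, str],
    index: dict[str, DocState],
    current_model: str,
) -> dict[str, list[str]]:
    """Diff source checksums against index records, keyed by document id."""
    source_ids, index_ids = set(source), set(index)
    shared = source_ids & index_ids
    return {
        # In the index but gone from the source: ghost documents.
        "ghost": sorted(index_ids - source_ids),
        # In the source but never indexed: coverage gaps.
        "missing": sorted(source_ids - index_ids),
        # On both sides with different content: edits never reindexed.
        "stale": sorted(d for d in shared if index[d].checksum != source[d]),
        # Embedded with a model other than the current one.
        "wrong_model": sorted(
            d for d in index_ids if index[d].embedding_model != current_model
        ),
    }


# Made-up manifests for the example.
source = {"policy-1": "aaa", "regions": "bbb2", "review-9": "ccc"}
index = {
    "policy-1": DocState("aaa", "embed-v2"),
    "regions": DocState("bbb1", "embed-v2"),  # edited at source, not reindexed
    "retired": DocState("ddd", "embed-v1"),   # deleted at source, old model
}
report = reconcile(source, index, current_model="embed-v2")
print(report)
```

Any non-empty bucket is a reconciliation delta worth alerting on; how large a delta is tolerable depends on the sync cadence the system has promised.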

🐤

Canary documents

Plant known documents in the corpus, then edit, delete, and re-permission them on a schedule and query for each state. A canary that still retrieves after deletion, or serves its old text after an edit, measures exactly how far the index lags reality.
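
A minimal sketch of a canary probe, assuming the index can be queried by document id through some `search` callable (a hypothetical interface, stood in for here by a plain dictionary lookup). The state names `ok`, `ghost`, `stale`, and `missing` are illustrative labels, and the re-permissioned case is left out for brevity.

```python
from typing import Callable, Optional

# Hypothetical search interface: returns the indexed text for a document id,
# or None when the index does not serve that document.
Search = Callable[[str], Optional[str]]


def check_canary(search: Search, doc_id: str, expected_text: Optional[str]) -> str:
    """Compare what the index serves for a canary against its known source state.

    expected_text is None when the canary has been deleted at the source.
    """
    served = search(doc_id)
    if expected_text is None:
        return "ok" if served is None else "ghost"
    if served is None:
        return "missing"
    return "ok" if served == expected_text else "stale"


# Stand-in index: one canary was deleted at the source and one was edited,
# but the index has seen neither change; a fourth canary was never ingested.
fake_index = {"canary-deleted": "old text", "canary-edited": "v1", "canary-fresh": "v3"}
results = {
    "canary-deleted": check_canary(fake_index.get, "canary-deleted", None),
    "canary-edited": check_canary(fake_index.get, "canary-edited", "v2"),
    "canary-fresh": check_canary(fake_index.get, "canary-fresh", "v3"),
    "canary-new": check_canary(fake_index.get, "canary-new", "v1"),
}
print(results)
```

Recording when each canary was changed at the source and when it first probes as `ok` again would turn these labels into a lag measurement.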

📉

Ingestion monitoring

Alert on the quiet failures — per-source document throughput dropping to zero, parser rejection rates climbing, batch jobs partially failing, the newest indexed document aging past the sync cadence. Coverage shrinkage produces no query-time error, so the pipeline itself has to raise the signal.
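
The two simplest signals above, zero throughput and an aging newest document, might be checked per source along these lines. `SourceStats`, `ingestion_alerts`, the alert labels, and the one-day cadence are assumptions made for the example, not part of any particular pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SourceStats:
    """Hypothetical per-source counters collected by the ingestion pipeline."""
    docs_last_window: int        # documents indexed in the most recent window
    newest_indexed_at: datetime  # timestamp of the newest indexed document


def ingestion_alerts(
    stats: dict[str, SourceStats],
    sync_cadence: timedelta,
    now: datetime,
) -> list[tuple[str, str]]:
    """Return (source, alert) pairs for sources that look quietly broken."""
    alerts = []
    for source in sorted(stats):
        s = stats[source]
        if s.docs_last_window == 0:
            alerts.append((source, "throughput_zero"))
        if now - s.newest_indexed_at > sync_cadence:
            alerts.append((source, "newest_doc_too_old"))
    return alerts


now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed clock for the example
stats = {
    "wiki": SourceStats(42, now - timedelta(hours=3)),
    "design-reviews": SourceStats(0, now - timedelta(days=40)),
}
alerts = ingestion_alerts(stats, sync_cadence=timedelta(days=1), now=now)
print(alerts)
```

A source that is legitimately quiet will trip the throughput check, so in practice the thresholds would probably be tuned per source rather than shared.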

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🔄

Event-driven reindexing

Sync on source-system change events — edits, deletions, permission revocations — rather than batch schedules alone, and prioritize deletes and re-permissions, since a ghost document that keeps serving is worse than a new one that indexes late.
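
One way to express that prioritization is a priority queue over change events, sketched here with Python's standard `heapq`. The event type names, the `PRIORITY` ordering, and the `ReindexQueue` class are hypothetical; the ordering simply follows the reasoning above that removals and re-permissions should be applied before edits and new documents.

```python
import heapq
from itertools import count

# Lower number = handled first. This ordering is an assumption for the sketch.
PRIORITY = {"delete": 0, "permission_change": 0, "edit": 1, "create": 2}


class ReindexQueue:
    """Orders source change events so risky staleness is cleared first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str, str]] = []
        self._seq = count()  # tie-breaker keeps arrival order within a priority

    def push(self, event_type: str, doc_id: str) -> None:
        entry = (PRIORITY[event_type], next(self._seq), event_type, doc_id)
        heapq.heappush(self._heap, entry)

    def pop(self) -> tuple[str, str]:
        _, _, event_type, doc_id = heapq.heappop(self._heap)
        return event_type, doc_id

    def __len__(self) -> int:
        return len(self._heap)


queue = ReindexQueue()
queue.push("create", "review-12")
queue.push("edit", "regions")
queue.push("delete", "policy-1")
queue.push("permission_change", "salary-bands")

order = []
while queue:
    order.append(queue.pop())
print(order)
```

Even though the delete and the permission change arrived last, they are processed first, which bounds how long a ghost or over-shared document can keep serving.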

🏗️

Versioned index rebuilds

Treat embedding model upgrades and large re-ingestions as atomic swaps — build the new index completely, validate it, then cut over — so queries never run against a corpus split across incompatible vector spaces or a half-finished rebuild.
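
The build, validate, then cut over sequence can be sketched with an in-memory stand-in. `VersionedIndex`, `embed_v2`, and the `complete` validator are hypothetical names; in a real vector store the cutover would more likely be an alias or pointer switch, and validation would cover far more than document coverage.

```python
from typing import Callable

Vectors = dict[str, list[float]]


class VersionedIndex:
    """Toy index whose live state is replaced only by a complete, validated build."""

    def __init__(self, model: str, vectors: Vectors) -> None:
        # One attribute holds (model, vectors) so readers never see a mix.
        self._live: tuple[str, Vectors] = (model, vectors)

    def live(self) -> tuple[str, Vectors]:
        return self._live

    def rebuild(
        self,
        docs: dict[str, str],
        model: str,
        embed: Callable[[str], list[float]],
        validate: Callable[[Vectors], bool],
    ) -> bool:
        # Build the candidate off to the side; the live index is untouched.
        candidate = {doc_id: embed(text) for doc_id, text in docs.items()}
        if not validate(candidate):
            return False  # keep serving the old, internally consistent index
        self._live = (model, candidate)  # single assignment is the cutover
        return True


docs = {"policy-1": "seven years", "regions": "us-east, eu-west"}
index = VersionedIndex("embed-v1", {"policy-1": [0.1], "regions": [0.2]})


def embed_v2(text: str) -> list[float]:
    """Placeholder embedding, not a real model."""
    return [float(len(text)), 1.0]


def complete(candidate: Vectors) -> bool:
    """Accept a build only if it covers every source document."""
    return set(candidate) == set(docs)


# A partial build is rejected, so queries keep hitting the old index.
rejected = index.rebuild({"policy-1": "seven years"}, "embed-v2", embed_v2, complete)
model_after_reject = index.live()[0]

# A complete build passes validation and replaces the live index in one step.
accepted = index.rebuild(docs, "embed-v2", embed_v2, complete)
model_after_accept = index.live()[0]
print(rejected, model_after_reject, accepted, model_after_accept)
```

The point of the design is that there is no moment at which the live index holds vectors from both models.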

🕒

Index freshness surfacing

Attach the last-successful-sync age to retrieval results and let answers carry that bound — "based on documents indexed as of 11 days ago" — so when drift does exist, the user gets a verifiable staleness window instead of an index presented as the live corpus.
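
A minimal sketch of carrying that bound alongside retrieval results, assuming the retriever can report its last successful sync time. `RetrievalResult` and `freshness_note` are hypothetical names, and the wording of the note is only one option.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RetrievalResult:
    """Hypothetical retrieval payload that carries index freshness with it."""
    chunks: list[str]
    last_successful_sync: datetime


def freshness_note(result: RetrievalResult, now: datetime) -> str:
    """Turn the sync age into a staleness bound the answer can quote."""
    age_days = (now - result.last_successful_sync).days
    if age_days == 0:
        return "Based on documents indexed within the last day."
    unit = "day" if age_days == 1 else "days"
    return f"Based on documents indexed as of {age_days} {unit} ago."


now = datetime(2024, 6, 12, tzinfo=timezone.utc)  # fixed clock for the example
result = RetrievalResult(
    chunks=["Customer records are kept for seven years."],
    last_successful_sync=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
note = freshness_note(result, now)
print(note)
```

Keeping the sync time on the result object, rather than looking it up separately, means the bound travels with the chunks wherever they are used.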