
When RAG retrieves wrong chunks: failure modes and containment

RAG pipeline failure modes are containment failures at the retrieval-to-generation boundary: five modes mapped to how each propagates, its detection signal, and the gate that holds it.

By LatentEval Published 2026-04-01 Updated 2026-05-25

Your RAG pipeline pulled ten chunks for a question, and the answer came back fluent, confident, and about the wrong thing. Now you have to decide where the bug lives. The instinct is to blame retrieval and reach for the ranker: add chunks, swap the embedding model, tighten the query. That treats the symptom at the stage where the error started rather than the stage where it shipped. A retrieval error is only a failure once the generation step turns it into an answer a person reads and acts on. The reliability work is to treat the generator as the containment boundary for retrieval faults, and to measure what fraction of those faults it actually holds.

The five ways a RAG pipeline turns a retrieval error into an answer

Retrieval-augmented generation is marketed as the fix for hallucination, and it does lower the rate at which a model invents facts out of nothing. What it introduces in exchange is a hand-off: a retriever selects evidence, a generator writes over it, and the answer is only as reliable as that hand-off. Barnett and colleagues cataloged seven engineering failure points of RAG systems from three production case studies across research, education, and biomedical domains, and their sharper finding for a reliability audience is that “validation of a RAG system is only feasible during operation” (Barnett et al., Seven Failure Points When Engineering a Retrieval Augmented Generation System, CAIN 2024, peer-reviewed; arXiv:2401.05856, as of 2026-07). You cannot design the failure out at the whiteboard, so you have to catch it at the boundary where it crosses.

The table below is the reference the rest of this page builds on. It keys each failure mode on the question a reliability owner actually needs answered: how the error travels from retrieval into generation, what signal reveals it, and which gate holds it. Stale index is added because a snapshot study of a fixed corpus never sees it, and in production it is one of the most common ways a well-grounded answer turns out wrong.

RAG failure mode	How it propagates retrieval → generation	Detection signal	Containment lever	Judgment: where the gate lives
Retrieval miss	The answer-bearing chunk never enters the top-k window; given no refusal path, the generator composes a fluent answer from unrelated chunks or parametric memory, so a recall gap ships as a confident answer	Low recall@k against a known answer set; low top similarity score; a groundedness check finds no supporting span	Negative rejection: let the generator abstain on weak evidence; a recall@k threshold that blocks the answer; query reformulation before generation	Generation. Recall never reaches 1, so the answer step must be allowed to say the evidence is not there
Off-topic drift	Topically adjacent distractor chunks score high and enter the window; the generator reads their presence as endorsement and anchors on the wrong-but-related content, so the answer stays on-domain and misses the question	High answer-to-context similarity with low answer-to-query relevance; low precision@k despite strong scores; the cited span is off-topic	Rerank and a precision filter before the window; a per-chunk relevance gate; require a cited span so topicality is checkable	Retrieval precision. More chunks raise this risk; a reranker plus span attribution lowers it
Hallucinated grounding	With correct chunks present, the generator asserts claims no chunk entails, wrapped in citation formatting; a downstream reader trusts the reference marker and the fabrication ships as sourced	Claim-level entailment failure; a cited sentence its own span does not support; a faithfulness score below threshold	Sentence-level attribution enforcement; an independent faithfulness verifier the generator cannot overrule; abstain when entailment fails	Claim entailment. Treat a citation as formatting; verify that the cited span actually entails the claim
Silent context truncation	Assembled context overruns the window, so the answer-bearing chunk is dropped or buried mid-context where the model under-uses it, and generation proceeds as if the evidence were fully present	Token-budget accounting shows the concatenated context overran the window; the key chunk sits mid-order; a position-swap test moves the answer	Budget-aware assembly that logs every drop; rerank the key chunk to a window edge; raise a signal on truncation instead of dropping silently	Context assembly. The defect is the silence; mid-context placement is a documented model weakness
Stale index	The index lags the source of truth, so retrieval returns correct-looking but outdated chunks; the generator grounds faithfully on stale evidence and returns an answer that is well-supported and wrong-in-time	Index-freshness lag (source update time vs index build time); document-version drift; missing as-of metadata on chunks	A freshness SLA plus a re-index cadence; version and as-of metadata so the answer can flag stale evidence; time-aware retrieval filters	The index. Faithfulness scores this as a pass, so groundedness checks miss it entirely

The five modes do not share a gate, which is the point the last column makes: two are held at generation, one at retrieval precision, one at context assembly, one at the index. “Make RAG reliable” resolves into five gates sitting at four different boundaries, and the reliability question for each is the same one distributed systems ask of any fault, which fraction of faults the boundary actually holds. The mechanics of how a single fault fans out once a gate misses it are worked in detail in the companion analysis on error propagation and cascade containment; this page keys on the RAG-specific modes rather than repeating that arithmetic.
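Two of the detection signals in the table, recall@k and precision@k, are simple enough to sketch directly. The version below is a minimal illustration, assuming you hold a labelled set of answer-bearing chunk ids per query; the function names and the chunk ids are hypothetical, and precision is divided by k, which is one common convention.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the known answer-bearing chunks that appear in the top k."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k slots occupied by answer-bearing chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(retrieved[:k]) & relevant) / k


# Hypothetical ranking: only one of the two answer-bearing chunks is in the top 3.
ranking = ["d1", "d7", "d3", "d9"]
answer_bearing = {"d3", "d4"}
print(recall_at_k(ranking, answer_bearing, 3), precision_at_k(ranking, answer_bearing, 3))
```

Low recall@k points toward the retrieval-miss row, while low precision@k alongside strong similarity scores points toward off-topic drift.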

Retrieval and generation are two hops, and the generator is the gate

A RAG pipeline is the smallest interesting cascade: two components and one hand-off. The retriever passes the generator a set of chunks with no confidence attached, and the generator reads their mere presence as endorsement. That is the entire vulnerability. An upstream stage produces output stripped of its own uncertainty, a downstream stage treats it as ground truth, and a well-formed but wrong result rides across the boundary untouched. This is the same trust-without-verification defect that MAST, the first empirical taxonomy of multi-agent LLM failures, files under inter-agent misalignment. MAST built its 14 failure modes in 3 categories from 150 human-annotated traces (kappa 0.88), then scaled annotation to a 1,600+ trace dataset across 7 frameworks with an LLM-as-judge pipeline (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). Its taxonomy sorts failures into “(i) system design issues, (ii) inter-agent misalignment, and (iii) task verification,” and the middle category is exactly the retriever-to-generator hand-off compressed into a single edge.

The retriever and the generator play asymmetric roles in that hand-off, and the asymmetry decides where you spend. The retriever sets how often a fault starts: how often the answer-bearing chunk fails to land in the window, how often a distractor outranks it. The generator sets how far each fault travels: whether a missing chunk becomes an abstention or a fabrication, whether a distractor becomes a hedge or an anchor.

You cannot drive retrieval to perfect recall on an open corpus, so a residual stream of retrieval faults is guaranteed. What is not guaranteed is that they ship. The generator is the last boundary a retrieval fault crosses before a user reads it, which makes it the natural place to put the gate.
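A minimal sketch of such a gate, assuming the retriever exposes a similarity score for each chunk. The names `ScoredChunk`, `gate`, and `stub_generate`, and the 0.55 threshold, are hypothetical; a top-score cutoff is only one possible weak-evidence signal and would need tuning against a known answer set.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

ABSTAIN = "insufficient evidence"  # hypothetical abstention sentinel


@dataclass(frozen=True)
class ScoredChunk:
    text: str
    score: float  # retriever similarity, assumed to lie in [0, 1]


def gate(
    chunks: Sequence[ScoredChunk],
    generate: Callable[[Sequence[ScoredChunk]], str],
    min_top_score: float = 0.55,  # assumed threshold, not a recommended value
) -> str:
    """Abstain instead of generating when the best evidence is weak."""
    if not chunks or max(c.score for c in chunks) < min_top_score:
        return ABSTAIN
    return generate(chunks)


def stub_generate(chunks: Sequence[ScoredChunk]) -> str:
    # Stand-in for the real generator call.
    return f"answer grounded on {len(chunks)} chunk(s)"


print(gate([ScoredChunk("unrelated text", 0.31)], stub_generate))
print(gate([ScoredChunk("answer-bearing text", 0.82)], stub_generate))
```

The gate does not repair the retrieval miss; it only decides whether the miss leaves the pipeline as an abstention or as an answer.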

Framing the generator as a containment boundary changes the metric you care about. The relevant number is not the answer’s task score on a clean benchmark; it is the share of injected retrieval faults the generation step refuses to turn into a confident answer. That is containment applied to the retrieval-to-generation edge specifically, and it is a different quantity from accuracy: a pipeline can score well on questions whose evidence is present and still convert most of its retrieval misses into fabrications. The general error-propagation mechanism is the same one that governs longer agent chains; what is RAG-specific is that almost all of the leverage sits on one edge.

The retrieval-to-generation gate, in arithmetic

The value of the containment framing is that it turns into numbers you can reason about before you build anything. Model the pipeline as two stages. Let r be the probability that the answer-bearing chunk lands in the retrieved window and survives to the prompt. When it does, the generator grounds correctly with probability g. When it does not, the generator either abstains, returning an honest “insufficient evidence,” with probability a, or fills the gap with a confident answer with probability 1 − a.

The table below is computed from stated assumptions, not measured. It fixes r = 0.7 (the answer-bearing chunk reaches the window on 70% of queries) and g = 0.9 (given good evidence, the generator grounds correctly 90% of the time), then varies the generator’s willingness to abstain, a.

Generator abstain rate on empty evidence (a)	Grounded-correct (r·g)	Retrieval miss caught as “insufficient evidence” ((1−r)·a)	Confident-wrong shipped (r(1−g) + (1−r)(1−a))
0.9	0.63	0.27	0.10
0.5	0.63	0.15	0.22
0.1	0.63	0.03	0.34
0.0	0.63	0.00	0.37

The grounded-correct column never moves. Improving the generator’s willingness to return “insufficient evidence” does nothing for the answers it was already getting right; it converts confident-wrong answers into honest abstentions, one for one, across the retrieval-miss share. At a = 0 every retrieval miss ships as a fabrication and the confident-wrong rate is 0.37. Raise abstention to 0.9 and that rate falls to 0.10 while the correct-answer rate holds. The generator cannot recover a retrieval miss, because the evidence genuinely is not there. Abstention contains the miss, holding the fault at the boundary so it never ships dressed as an answer.

One number in that table refuses to fall. The floor on confident-wrong is r(1 − g) = 0.07: the answers that had good evidence and still came out wrong. Abstention cannot touch that share, because there the generator has good evidence and is failing to stay faithful to it. That floor is the hallucinated-grounding row of the taxonomy, and it needs a different gate, claim-level entailment applied sentence by sentence, since making the generator more willing to abstain on weak evidence cannot catch a well-evidenced answer that is simply unfaithful. The arithmetic is deterministic and the inputs are declared, so treat the shape as real and the specific decimals as illustrative; a measured pipeline would report each of these as a rate with an interval, which is what the later section on measurement is about.
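The two-stage model is small enough to recompute by hand or in a few lines. The sketch below uses the same assumed inputs (r = 0.7, g = 0.9) and the same three formulas as the table; the function name `outcome_shares` is hypothetical, and the numbers remain illustrative rather than measured.

```python
from typing import Dict


def outcome_shares(r: float, g: float, a: float) -> Dict[str, float]:
    """Split all queries into the three outcomes of the two-stage model.

    r: probability the answer-bearing chunk reaches the prompt
    g: probability the generator grounds correctly given good evidence
    a: probability the generator abstains when the evidence is missing
    """
    return {
        "grounded_correct": r * g,
        "miss_caught": (1 - r) * a,
        "confident_wrong": r * (1 - g) + (1 - r) * (1 - a),
    }


for a in (0.9, 0.5, 0.1, 0.0):
    s = outcome_shares(r=0.7, g=0.9, a=a)
    print(
        f"a={a:.1f}  correct={s['grounded_correct']:.2f}  "
        f"caught={s['miss_caught']:.2f}  wrong={s['confident_wrong']:.2f}"
    )

# The floor that abstention cannot move: good evidence, unfaithful answer.
print(f"floor r(1-g) = {0.7 * (1 - 0.9):.2f}")
```

The three shares always sum to 1, which is a quick sanity check when you substitute your own r, g, and a.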

Off-topic drift: why the tenth chunk can make the answer worse

The problem in the opener, ten chunks retrieved and the answer off-topic, is not a recall problem, and reaching for more chunks makes it worse. Off-topic drift is a precision failure that the generator amplifies. Semantic search returns what is near in embedding space, and nearness is a weak proxy for whether a chunk carries the answer. A chunk about the right entity in the wrong time period, or the right topic at the wrong granularity, scores high and enters the window as a distractor. The generator, reading presence as endorsement, anchors on it and produces an answer that is fluent, on-domain, and beside the point.

Adding chunks to fix a drift problem is the wrong direction twice over. It lowers precision, admitting more distractors, and it lengthens the context, which triggers a second, well-documented weakness. Liu and colleagues found that language models under-use evidence buried in the middle of a long context: “performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models” (Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL, peer-reviewed, as of 2026-07). The tenth chunk you add to raise recall can push the one answer-bearing chunk into the position the model reads least reliably.
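One way to act on both points is to assemble the context against an explicit budget, record every dropped chunk, and place the best-ranked chunks at the window edges. The sketch below is an illustration under stated assumptions: whitespace splitting stands in for the model's real tokenizer, the input is already ranked best-first, and `count_tokens` and `assemble` are hypothetical names.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens approximate the model's real tokenizer.
    return len(text.split())


def assemble(ranked: List[str], budget: int) -> Tuple[List[str], List[str]]:
    """Return (context_order, dropped) for chunks ranked best-first."""
    kept: List[str] = []
    dropped: List[str] = []
    used = 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)  # recorded for logging, never discarded silently
    # Alternate between the front and the back of the window: rank 1 opens
    # the context, rank 2 closes it, and the weakest chunks land mid-context.
    front = kept[0::2]
    back = kept[1::2][::-1]
    return front + back, dropped


order, dropped = assemble(["r1 a", "r2 b", "r3 c", "r4 d", "r5 is far too long to fit"], budget=8)
print(order)
print(dropped)
```

Returning the dropped list alongside the context is the point of the design: the caller can log it or raise on it, so truncation becomes a signal instead of a silent loss.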

Drift is also the mode where attributing the fault to the right stage decides everything about the fix. If the answer-bearing chunk was in the window and the generator still drifted, the gate belongs at generation: a stricter grounding instruction, a rerank that surfaces the right chunk to a window edge, an attribution requirement. If the answer-bearing chunk never made it in and a distractor took its slot, the gate belongs at retrieval precision: a reranker, a per-chunk relevance filter, a tighter top-k. One symptom covers both cases while the gate sits at a different stage in each, and guessing wrong spends a sprint on the stage that was not at fault.

The containment lever that generalizes across both cases is span-level attribution. Require the generator to name the chunk and the span it used for each claim, and drift becomes visible instead of silent: a human or a judge can see that the cited span is topically off, where a bare fluent answer hides it. Attribution removes drift’s cover, which is the precondition for measuring how often it happens.
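A minimal sketch of what an attribution record could look like, assuming the generator is prompted to return a chunk id and a verbatim quote for each claim. `Claim` and `broken_attributions` are hypothetical names, and the check is structural only: it confirms that the quoted span exists in the named chunk, and leaves the judgment of whether the span is on-topic to a human or a judge.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Claim:
    text: str
    chunk_id: str
    span: str  # verbatim quote the generator says it used


def broken_attributions(claims: List[Claim], chunks: Dict[str, str]) -> List[Claim]:
    """Claims whose span is empty, or not found verbatim in the named chunk."""
    return [
        c
        for c in claims
        if not c.span.strip() or c.span not in chunks.get(c.chunk_id, "")
    ]


chunks = {"c1": "The 2024 policy caps refunds at 30 days."}
claims = [
    Claim("Refunds are capped at 30 days.", "c1", "caps refunds at 30 days"),
    Claim("Refunds require a receipt.", "c1", "a receipt is required"),
    Claim("Refunds are instant.", "c9", "processed instantly"),
]
for c in broken_attributions(claims, chunks):
    print("unverifiable attribution:", c.text)
```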

The modes a faithfulness score cannot see

Two rows of the taxonomy defeat the metric most teams reach for first. Faithfulness, or groundedness, checks whether the answer is supported by the retrieved context, and it is the right check for hallucinated grounding: a claim with no entailing span is exactly what a faithfulness score is built to catch. The trap is treating it as the whole reliability story, because two failure modes pass it clean while shipping a wrong answer.

Stale index is the sharper case. When the index lags the source of truth, retrieval returns chunks that are correctly formatted, topically on point, and out of date. The generator grounds on them faithfully, so faithfulness reports a pass, and the answer is well-supported and wrong-in-time. This is a construct-validity gap in the metric: faithfulness measures agreement between the answer and the retrieved context, while agreement between the answer and reality is a separate property, and those two come apart the moment the context itself is stale. No amount of grounding discipline at the generator closes it, because the generator is doing its job correctly on bad inputs. The gate has to sit at the index: a freshness SLA, version and as-of metadata on every chunk, a re-index cadence tied to how fast the source changes.
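A minimal sketch of a freshness check on chunk metadata, assuming the pipeline can learn when each source document last changed, for example by polling the source. `ChunkMeta`, `freshness`, the three status labels, and the 24-hour SLA are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    indexed_at: datetime         # when this chunk was built into the index
    source_updated_at: datetime  # last known change to the source document


def freshness(meta: ChunkMeta, now: datetime, sla: timedelta) -> str:
    """Classify a chunk as 'fresh', 'stale', or 'sla_breach'."""
    if meta.source_updated_at <= meta.indexed_at:
        return "fresh"  # the index has seen the latest source version
    lag = now - meta.source_updated_at  # how long the index has been behind
    return "stale" if lag <= sla else "sla_breach"


utc = timezone.utc
meta = ChunkMeta(
    chunk_id="pricing-001",
    indexed_at=datetime(2026, 5, 1, tzinfo=utc),
    source_updated_at=datetime(2026, 5, 10, tzinfo=utc),
)
print(freshness(meta, now=datetime(2026, 5, 12, tzinfo=utc), sla=timedelta(hours=24)))
```

Anything other than "fresh" is a reason for the answer to carry an as-of note or for retrieval to filter the chunk out, since no check at the generator can see the problem.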

Hallucinated grounding passes a naive faithfulness check for the opposite reason: the failure hides in the citation formatting. A model that emits a plausible sentence and appends a reference marker looks grounded to a check that only confirms a citation is present. The containment lever is entailment at the claim level, sentence by sentence, against the specific cited span, run by a verifier the generator cannot overrule. A citation being present and the cited span entailing the claim are two different checks, and only the second certifies a RAG answer, which is why a single aggregate faithfulness score is not enough on its own.
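The difference between the two checks can be sketched with the verifier left as a pluggable function. Everything named here is hypothetical, and `toy_entails` in particular is a placeholder: word containment is not entailment, and a real pipeline would swap in an independent verifier model.

```python
from typing import Callable, List, Tuple

# (cited_span, claim) -> True when the span entails the claim
Entails = Callable[[str, str], bool]


def unsupported_claims(cited: List[Tuple[str, str]], entails: Entails) -> List[str]:
    """Return the claims whose own cited span does not entail them.

    `cited` holds (claim, cited_span) pairs, so every claim here already
    passes a check that only asks whether a citation is present.
    """
    return [claim for claim, span in cited if not entails(span, claim)]


def toy_entails(span: str, claim: str) -> bool:
    # Placeholder only: checks that every word of the claim occurs in the span.
    def words(s: str) -> set:
        return set(s.lower().replace(".", "").split())

    return words(claim) <= words(span)


span = "The limit is 10 requests per minute."
answer = [
    ("The limit is 10 requests", span),
    ("The limit resets hourly", span),  # cited, but the span never says this
]
print(unsupported_claims(answer, toy_entails))
```

Both claims carry a citation, and only the first survives the span-level check, which is the gap a citation-presence check leaves open.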

The general lesson is that a metric contains only the failures it was designed to see. A faithfulness score is a real gate for one row of the taxonomy and blind to two others, which is why containment has to be measured per mode rather than rolled into one headline number, the same way blast radius is read per injection point rather than reported as a single scalar.

Measuring containment at the retrieval-to-generation boundary

Every gate in the taxonomy is a hypothesis until you inject the fault and watch what the pipeline does. Passive observation of production traffic tells you the failures that already happened; it does not tell you how far the next one reaches, because you do not control when the answer-bearing chunk goes missing or when a stale chunk slips in. The measurement method is controlled fault injection: reproduce each mode deliberately and trace it into the generated answer.

The injections map one to one onto the taxonomy. Drop the answer-bearing chunk from the window to seed a retrieval miss, and measure how often the generator abstains versus fabricates. Insert high-scoring distractors to seed off-topic drift, and measure how often the answer anchors on them. Swap in an outdated version of the correct chunk to seed a stale-index fault, and measure how often the answer flags the staleness versus grounding on it silently. For each, the outcome is a containment rate: the fraction of injected faults the boundary held, reported with a confidence interval. A single run gives you one sample of that rate; the interval is what turns a sample into a measurement, and the number of runs it takes scales with how rare the escape is and how tight a bound the decision needs. The interval on a low-frequency fault stays wide until the sample grows, so it is often the number that decides whether you have measured anything at all. The discipline is the same one that governs any honest eval, that a rate without an interval decides nothing, applied to propagation instead of task quality.
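A minimal sketch of the retrieval-miss injection and its containment rate. It uses a Wilson score interval, a standard closed-form interval for a proportion, chosen here only because it is deterministic and behaves sensibly when no escapes are observed; it is a swapped-in choice, not a method this page prescribes. All names, the `ABSTAIN` sentinel, and the stub generator are hypothetical.

```python
import math
from typing import Callable, Dict, Iterable, Tuple

ABSTAIN = "insufficient evidence"  # hypothetical abstention sentinel
Window = Dict[str, str]            # chunk_id -> chunk text


def inject_retrieval_miss(window: Window, answer_chunk_id: str) -> Window:
    """Copy the window without the answer-bearing chunk."""
    return {cid: text for cid, text in window.items() if cid != answer_chunk_id}


def wilson_interval(held: int, injected: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (rate, lower, upper) for held/injected; z=1.96 is roughly 95%."""
    if injected <= 0 or not 0 <= held <= injected:
        raise ValueError("need 0 <= held <= injected and injected > 0")
    p = held / injected
    denom = 1 + z * z / injected
    center = (p + z * z / (2 * injected)) / denom
    half = z * math.sqrt(p * (1 - p) / injected + z * z / (4 * injected**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


def miss_containment(
    cases: Iterable[Tuple[str, Window, str]],  # (query, window, answer_chunk_id)
    generate: Callable[[str, Window], str],
) -> Tuple[float, float, float]:
    """Inject a retrieval miss into every case and score abstentions as held."""
    outcomes = [
        generate(query, inject_retrieval_miss(window, cid)) == ABSTAIN
        for query, window, cid in cases
    ]
    return wilson_interval(sum(outcomes), len(outcomes))


def stub_generate(query: str, window: Window) -> str:
    # Stand-in generator: abstains only when the window is empty.
    return ABSTAIN if not window else "fluent but unsupported answer"


cases = [("q1", {"c1": "answer"}, "c1"), ("q2", {"c1": "answer", "c2": "distractor"}, "c1")]
print(miss_containment(cases, stub_generate))
print(wilson_interval(20, 20))
```

Under this interval, 20 held faults out of 20 injected gives a point estimate of 1.0 with a lower bound of only about 0.84, which shows how little a small clean run establishes.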

Scored across all five modes and both stages, those per-mode containment rates aggregate into a pipeline’s cascade resistance: how reliably this specific retrieval-to-generation path contains its faults instead of shipping them as answers. That is the number worth tracking across versions, and worth running under the same fault-injection regimen you would apply to any agent system.

What this site points toward is a reliability profiler designed to seed a retrieval fault, follow it into the generated answer, and report a containment rate with a bootstrap confidence interval. None of that is measured yet, so no such figure is quoted here, and each external number here points to a source you can open for yourself. If your pipeline answers one well-scoped question over a small, static, hand-curated corpus, most of this does not pay off yet; the containment view earns its keep once retrieval is open-ended and the answers are acted on.

Where to put the gate

RAG reliability becomes tractable when you read the pipeline as a containment problem: for each way retrieval fails, the question is whether the generation step holds the fault or ships it. That question has a different answer for each of the five modes, and the taxonomy on this page is the map of which gate answers which:

A retrieval miss is held at generation, by letting the model abstain on weak evidence rather than fabricate.
Off-topic drift is held at retrieval precision, by reranking and filtering distractors, and made visible by span attribution.
Hallucinated grounding is held by claim-level entailment, because a citation marker is not evidence of grounding.
Silent context truncation is held at context assembly, by making a dropped or mid-buried chunk raise a signal.
Stale index is held at the index, because a faithfulness score certifies a well-grounded answer over outdated evidence.

Each mode on this page names a gate and the metric that scores it. The reliability glossary holds the canonical definitions of blast radius, containment rate, and cascade resistance, and of the fault injection that measures each one. Those per-mode rates, once run, publish to the research index. The broader question of what reliable AI agents demand in production sits one level up from the single-pipeline case, where a RAG stage is one component among many, each with a hand-off of its own to contain.