Not because the technology is flawed. Because RAG is a pipeline, not a feature. It has four stages, each with its own failure modes. And errors at each stage don't just add up. They multiply.
If your organization runs a RAG system and hasn't evaluated each stage independently, there's a specific math problem working against you. Here's what it looks like, why most testing approaches miss it, and what the teams getting RAG right actually measure.
The Compounding Error Problem
A RAG pipeline has four stages: chunking (splitting documents into pieces), embedding (converting those pieces into searchable vectors), retrieval (finding the right pieces for a given question), and generation (using those pieces to produce an answer).
Each stage can fail independently. And the failures cascade.
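To make the stage boundaries concrete, here is a deliberately naive sketch of the four stages. Every component is a toy stand-in (character chunks, word-overlap scoring in place of real embeddings, a string template in place of an LLM), not a production implementation:

```python
import re

def chunk(doc, size=80):
    """Stage 1: chunking -- naive fixed-size splits."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text):
    """Stage 2: 'embedding' -- a bag of words standing in for a real vector."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, chunks, k=2):
    """Stage 3: retrieval -- rank chunks by word overlap with the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: -len(q & embed(c)))[:k]

def generate(question, context):
    """Stage 4: generation -- a real system would call an LLM here."""
    return f"Answer to {question!r}, based on: {' / '.join(context)}"

docs = ["The refund window is 30 days from purchase.",
        "Support hours are 9am to 5pm on weekdays."]
chunks = [c for d in docs for c in chunk(d)]
context = retrieve("What is the refund window?", chunks)
print(generate("What is the refund window?", context))
```

A mistake at any of these four function boundaries flows into the next one, which is the structure the rest of this piece is about testing.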
Here's the math. Suppose each stage operates at 95% accuracy. That sounds strong. But overall system reliability is 0.95 × 0.95 × 0.95 × 0.95 ≈ 0.815. Nearly one in five queries produces a bad result.
Drop one stage to 90%, and the picture gets worse: 0.90 × 0.95 × 0.95 × 0.95 ≈ 0.77. More than one in five queries fail.
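The arithmetic is easy to check directly; the function below just multiplies per-stage accuracies:

```python
def pipeline_reliability(stage_accuracies):
    """Reliability of a sequential pipeline: per-stage accuracies multiply."""
    total = 1.0
    for acc in stage_accuracies:
        total *= acc
    return total

print(f"{pipeline_reliability([0.95] * 4):.3f}")                # 0.815
print(f"{pipeline_reliability([0.90, 0.95, 0.95, 0.95]):.3f}")  # 0.772
```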
This is arithmetic, not opinion.
It applies to every RAG deployment. No downstream stage can compensate for an upstream failure. A model given the wrong context will generate a confident, plausible, wrong answer. Every time.
This is why RAG prototypes work well in demos but struggle in production. With ten test queries, you might not notice a 20% failure rate. With thousands of real users asking unexpected questions, that failure rate becomes the defining characteristic of your system. And once users decide a system can't be trusted, they rarely check back to see if it improved.
The Testing Gap Most Teams Miss
Most teams test RAG end-to-end: they ask a question and evaluate whether the answer "looks right."
This is the wrong test.
A plausible answer built on wrong retrieval is worse than no answer at all, because everyone trusts it. The answer reads well. It's grammatically correct. It addresses the question. But the documents it drew from were the wrong documents, and the conclusions are unsupported by the actual source material. End-to-end testing can't distinguish between a correct answer from correct context and a confident answer from incorrect context.
The fix: verify each stage independently.
Chunking Tests
For your most important documents, inspect the chunks directly. Are related concepts staying together, or are they split across chunk boundaries? If a function's documentation ends up in one chunk and its implementation in another, retrieval will return incomplete information regardless of how good your embedding model is. Production data shows that roughly 80% of RAG failures trace back to chunking decisions, not to embedding quality or retrieval algorithms.
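A manual chunk review needs no tooling, just a loop that prints each chunk's opening so you can see where the boundaries fall. The splitter below is a hypothetical fixed-size one, included only so the example runs; substitute whatever your pipeline actually uses:

```python
def chunk_fixed(text, size=400, overlap=50):
    """Naive fixed-size chunker with overlap. Real splitters should respect
    sentence and section boundaries instead of cutting mid-thought."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Stand-in for a real document: a signature followed by its documentation.
doc = (
    "def rotate(point, angle): ...\n\n"
    "rotate() turns a point around the origin by `angle` radians. "
    "It raises ValueError when the angle is not finite.\n\n"
) * 5

for i, c in enumerate(chunk_fixed(doc)):
    print(f"--- chunk {i} ({len(c)} chars) ---")
    print(c[:100].replace("\n", " "))
```

If a signature lands in one chunk and its documentation in another, as fixed-size cuts eventually guarantee, retrieval returns half the story no matter how good the embeddings are.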
Retrieval Tests
Build a set of 20-30 questions where you know which source documents contain the answer. Run those questions through your retrieval system and check: did the right documents come back? At what rank? If your known-answer questions aren't retrieving the expected sources, everything downstream is built on a bad foundation.
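A known-answer harness can be a dozen lines. `retrieve` below is a placeholder for your actual search call (it should return ranked document IDs), and the toy index exists only so the example runs:

```python
def retrieval_report(eval_set, retrieve, k=5):
    """eval_set: list of (question, expected_doc_id) pairs.
    Returns (hit rate, average rank of the expected document when found)."""
    hits, ranks = 0, []
    for question, expected in eval_set:
        results = retrieve(question, k)
        if expected in results:
            hits += 1
            ranks.append(results.index(expected) + 1)  # 1-based rank
    hit_rate = hits / len(eval_set)
    avg_rank = sum(ranks) / len(ranks) if ranks else float("inf")
    return hit_rate, avg_rank

# Toy retriever standing in for your search system.
def toy_retrieve(question, k):
    index = {"reset password": ["kb-12", "kb-7"], "refund policy": ["kb-3"]}
    return index.get(question, [])[:k]

eval_set = [("reset password", "kb-12"), ("refund policy", "kb-3")]
print(retrieval_report(eval_set, toy_retrieve))  # (1.0, 1.0)
```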
Generation Tests
Given correct context, does the model actually use it? This isolates generation quality from retrieval quality. Feed the model the right documents directly (bypassing retrieval) and check whether the answer is faithful to that context. If the model hallucinates even with correct input, you have a prompt engineering problem, not a retrieval problem.
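The bypass itself is simple: hand the model known-correct context directly. `ask_model` below is a stand-in for your LLM client, and the prompt shape is illustrative; the fake model exists only to make the harness runnable offline:

```python
def generation_test(question, gold_context, ask_model):
    """Skip retrieval entirely: give the model the right documents
    and check whether the answer stays faithful to them."""
    prompt = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{gold_context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)

# Trivial fake model so the harness runs without an API key.
def fake_model(prompt):
    return "The context says the limit is 100 requests per minute."

answer = generation_test("What is the rate limit?",
                         "The API rate limit is 100 requests per minute.",
                         fake_model)
print(answer)
```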
Each test tells you something different. Together, they tell you exactly where your pipeline breaks. Without them, you're guessing.
What Production RAG Actually Looks Like
Beyond testing each stage, production-grade RAG requires attention to details that prototypes skip.
Context placement matters
Research by Liu et al. ("Lost in the Middle," published in Transactions of the ACL, 2024) found that language models use information best when it appears at the beginning or end of the context window. Performance degrades by over 30% when relevant information is buried in the middle. For RAG systems, this means placing your most relevant retrieved chunks first, less relevant ones in the middle, and moderately relevant ones at the end, exploiting both primacy and recency effects.
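Given chunks already sorted most-relevant first, that ordering can be a small interleave: odd ranks walk in from the front, even ranks walk in from the back, leaving the weakest material in the middle. A minimal sketch:

```python
def order_for_context(chunks_by_relevance):
    """Place the best chunks at the start, the next-best at the end,
    and the weakest in the middle. Input is sorted most-relevant first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)   # ranks 1, 3, 5... toward the front
        else:
            back.append(chunk)    # ranks 2, 4, 6... toward the back
    return front + back[::-1]     # weakest lands in the middle

print(order_for_context(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B'] -- top two ranks sit at the edges
```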
Know when NOT to use RAG
If your entire knowledge base is under 50,000 tokens, skip RAG entirely and put everything in the context window. You'll get better results (no retrieval errors, no chunking artifacts, no embedding mismatches) at comparable cost. RAG adds complexity that's only justified when your data is too large to fit in the window, changes frequently, or requires source attribution.
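A back-of-the-envelope check is enough to apply this rule. The ~4 characters-per-token ratio below is a rough heuristic for English prose; use your model's actual tokenizer for a real count:

```python
def fits_in_context(documents, token_budget=50_000):
    """Rough check: roughly 4 characters per token for English text."""
    est_tokens = sum(len(d) for d in documents) // 4
    return est_tokens <= token_budget

small_kb = ["Refund policy: 30 days from purchase."] * 100
print(fits_in_context(small_kb))  # True -> skip RAG, inline everything
```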
Beware irrelevant distractors
Semantic similarity is imprecise. A query about "Python decorators" might retrieve content about "holiday decorations" because the embeddings are close enough. False positive retrieval is especially dangerous because the model generates a confident answer from wrong context, and end-to-end testing often won't catch it. The defense is a combination of hybrid search (combining semantic and keyword matching) and reranking (a second-pass model that scores retrieval quality more precisely).
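One common way to combine the two rankers is reciprocal rank fusion, which needs only the two ranked ID lists, not their raw scores. The document IDs here are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists (e.g. one semantic, one keyword) via RRF.
    score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc-decorators", "doc-decorations", "doc-closures"]
keyword  = ["doc-decorators", "doc-closures"]
print(reciprocal_rank_fusion([semantic, keyword]))
# ['doc-decorators', 'doc-closures', 'doc-decorations']
```

Note what happened to the false positive: "doc-decorations" ranked second semantically, but with no keyword support it drops below the document both rankers agree on.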
The Evaluation Framework That Actually Works
The teams that get RAG right share one habit: they measure retrieval quality separately from answer quality. When something breaks, they know exactly which stage failed. Four metrics cover the full pipeline:
Context precision
Of the chunks you retrieved, how many were actually relevant? This is your retrieval signal-to-noise ratio. Low precision means you're drowning the model in irrelevant context, which degrades answer quality even when the right information is present somewhere in the results.
Context recall
Of all the relevant chunks in your index, how many did you find? Low recall means you're missing important information. The answer might be partially correct but incomplete, or the model might hallucinate the missing pieces rather than acknowledge the gap.
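Both retrieval metrics reduce to set overlap once you have labeled which chunks are relevant for a query. A minimal sketch:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant (signal-to-noise)."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were actually retrieved (coverage)."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]
relevant  = ["c1", "c4", "c9"]
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # ~0.67
```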
Faithfulness
Does the generated answer stick to the retrieved context? Measured as the proportion of claims in the response that are actually supported by the retrieved documents. Low faithfulness means the model is generating content from its training data rather than from your documents, which defeats the purpose of RAG.
Answer relevance
Does the answer actually address the question? You can have perfect retrieval and faithful generation but still miss what the user was asking. This catches cases where the system works technically but fails practically.
The first two metrics diagnose retrieval. The last two diagnose generation. When your RAG system produces bad answers, these four numbers tell you where to look.
You don't need a massive evaluation dataset to start. Twenty to thirty curated question-answer-source triples covering your most important query types will reveal whether your pipeline works and where it breaks.
Concrete targets:
A retrieval hit rate above 80% (the right source document appears in the top results) means your retrieval is solid. Average rank below 2.0 means the right content is usually near the top. Below those thresholds, focus on chunking and search strategy before tuning anything else.
The goal is to make retrieval quality visible, because the most common failure mode in production RAG isn't a dramatic crash. It's a slow drift where retrieval quality degrades and nobody notices until users stop trusting the system.
The Question Your Team Should Be Asking
If your organization runs a RAG system today, whether it's a customer support bot, an internal knowledge base, or a document search tool, there's one diagnostic that separates the teams with reliable AI from the teams with impressive demos:
Have you evaluated each stage of the pipeline independently?
If the answer is no, the compounding error math is probably working against you. Not because your team made bad choices, but because end-to-end testing creates a false sense of reliability that only breaks down at production scale.
Go Deeper: RAG Pipeline Engineering
For the full technical deep dive, including working code for RAG evaluation pipelines and the chunking strategies that matter most, read Chapter 6 of our free book.
Read Chapter 6: RAG Pipeline Engineering

Start With the Retrieval Test
Pick a question your RAG system should be able to answer. Ask it. Then look at which documents it actually retrieved, not just the final answer.
If you've never done that, you've found your first action item. If you have, and the results weren't what you expected, you've found something more important.
We help organizations evaluate and improve their AI systems through our AI-Augmented Assessment. RAG pipeline evaluation, including stage-by-stage testing and the evaluation framework described above, is a standard component. Most of the time, the model is fine. The retrieval isn't. And retrieval is something you can diagnose, measure, and fix.