Every enterprise team we talk to has the same RAG story: the demo was incredible. The production system… less so.
The gap between a RAG demo and a production RAG system is wider than most teams expect. Here are the failure modes we see most often.
Failure Mode 1: Naive Chunking
The default chunking strategy — split every N characters with M overlap — works fine for demos on clean documents. It fails on:
- Documents with tables, where semantic meaning spans rows
- Transcripts and call logs, where context requires full exchanges
- Technical documentation, where a procedure spans multiple sections
The fix is semantic chunking: split on meaning, not character count. For operational documents, we typically use a combination of structural markers (headers, section breaks) and semantic similarity.
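The structural-marker half of that approach can be sketched in a few lines: split on headers, then merge small neighboring sections. This is a minimal illustration only (the semantic-similarity pass for oversized sections is omitted), and the markdown-style header pattern is an assumption about the document format:

```python
import re

def chunk_by_structure(text: str, max_chars: int = 800) -> list[str]:
    """Split on structural markers (markdown-style headers here), then
    merge small neighboring sections so chunks stay near max_chars.
    A real pipeline would add a semantic-similarity pass to split
    oversized sections on topic shifts."""
    # Split at newlines that precede a header line, keeping each
    # header attached to its body via a zero-width lookahead.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Merge with the previous chunk if the combined size still fits.
        if chunks and len(chunks[-1]) + len(section) + 1 <= max_chars:
            chunks[-1] += "\n" + section
        else:
            chunks.append(section)
    return chunks

doc = "# Intro\nShort overview.\n# Procedure\nStep 1. Step 2.\n# Appendix\nTables."
chunks = chunk_by_structure(doc, max_chars=40)
```

The point of the sketch is the split-then-merge shape: boundaries come from document structure, and size limits only decide which structural units travel together.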
Failure Mode 2: Embedding Mismatch
Most teams embed queries and documents with the same model and trust cosine similarity to do the rest. The problem: query language and document language are often very different.
A user asks: “Which carriers have the highest exception rates for frozen goods?”
The documents contain: “Carrier performance report Q4 2024. Frozen category exception rate by carrier…”
The embedding similarity for this query-document pair is lower than you’d expect, because the query is in natural language and the document is in report language.
Solutions:
- Query expansion (generate multiple query variants)
- Hybrid search (combine dense embeddings with BM25)
- Re-ranking after retrieval
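Of the three, hybrid search is the simplest to sketch. One common way to merge a dense result list with a BM25 result list is Reciprocal Rank Fusion, which needs only the two rankings, not their raw scores:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists (e.g. one from dense retrieval,
    one from BM25). Each document scores sum(1 / (k + rank)) across the
    lists it appears in; k=60 is the commonly used default constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # illustrative IDs
sparse = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that rank well in both lists float to the top, which is exactly the behavior you want when the query register matches one retriever better than the other.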
Failure Mode 3: Missing Metadata Filtering
Retrieval without metadata filtering returns documents from across your entire corpus. For operational systems, this is almost always wrong.
An agent answering questions about current inventory levels shouldn’t be surfacing documents from 18 months ago. An agent scoped to the Northeast region shouldn’t be retrieving West Coast operational data.
Metadata filtering is not optional. Every document needs structured metadata, and every query needs to pass filters based on context.
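Concretely, "pass filters based on context" means hard filters applied before any similarity scoring. A minimal sketch, with illustrative field names (`region`, `published`) standing in for whatever structured metadata your corpus carries:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Doc:
    doc_id: str
    region: str
    published: date

def filter_candidates(docs: list[Doc], *, region: str,
                      max_age_days: int, today: date) -> list[Doc]:
    """Hard metadata filters applied before similarity scoring.
    Documents outside the agent's scope never reach the retriever."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.region == region and d.published >= cutoff]

docs = [
    Doc("inv-001", "northeast", date(2025, 1, 10)),
    Doc("inv-002", "west", date(2025, 1, 12)),      # wrong region
    Doc("inv-003", "northeast", date(2023, 6, 1)),  # stale
]
hits = filter_candidates(docs, region="northeast", max_age_days=90,
                         today=date(2025, 2, 1))
```

In a real vector store this becomes a filter clause on the query rather than a Python list comprehension, but the principle is the same: the filter is part of the query contract, not a post-processing step.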
Failure Mode 4: No Evaluation Pipeline
Teams build RAG systems without defining what “good” looks like. They have no way to know if a change made things better or worse.
Before shipping any RAG system to production, you need:
- A golden dataset of question-answer pairs from domain experts
- Automated evaluation metrics (faithfulness, relevance, groundedness)
- A regression test suite that runs before every deployment
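To show where such a gate sits, here is a deliberately crude faithfulness proxy: the fraction of answer tokens that appear in the retrieved context. Production systems use an LLM judge or an NLI model instead; this sketch only illustrates the shape of a CI check that fails the build when grounding drops:

```python
def grounded_fraction(answer: str, context: str) -> float:
    """Toy faithfulness proxy: fraction of answer tokens also present
    in the retrieved context. Real pipelines use an LLM-as-judge or an
    NLI model; token overlap is only a stand-in to show the gate."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# A regression gate over a golden dataset would look like:
#   for q, ctx, gold in GOLDEN:
#       assert grounded_fraction(generate(q, ctx), ctx) >= THRESHOLD
context = "frozen category exception rate by carrier: acme 4.2 percent"
score = grounded_fraction("acme 4.2 percent", context)
```

Swap the metric for a better one later; the important part is that the gate exists and runs before every deployment.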
This takes time. It’s not optional.
Failure Mode 5: Ignoring the “Not in Corpus” Case
What does your agent do when the answer isn’t in your documents?
Most RAG systems hallucinate. They generate a plausible-sounding answer that isn’t grounded in any retrieved document. In operational contexts, this is dangerous.
The system needs explicit handling for “I don’t know” — and that requires detecting when retrieved context is insufficient, which requires its own evaluation layer.
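A minimal shape for that insufficiency check is a threshold on the retrieval scores themselves: abstain when too few chunks clear a similarity bar. The thresholds below are placeholders, not recommendations; they must be tuned against a golden dataset that includes questions the corpus genuinely cannot answer:

```python
def should_abstain(scores: list[float], min_score: float = 0.35,
                   min_hits: int = 2) -> bool:
    """Decide whether retrieved context is too weak to answer from.
    Thresholds are illustrative placeholders; tune both against a
    golden dataset containing known out-of-corpus questions."""
    strong = sum(1 for s in scores if s >= min_score)
    return strong < min_hits

# The agent returns "I don't know" instead of generating:
if should_abstain([0.31, 0.22, 0.18]):
    answer = "I can't find this in the available documents."
```

Score thresholds alone miss cases where chunks are similar but unresponsive, which is why the text above says this check needs its own evaluation layer.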
The Production Checklist
Before you ship a RAG system:
- Semantic chunking tuned to your document types
- Hybrid search (dense + sparse)
- Metadata schema defined and populated
- Query expansion or HyDE implemented
- Re-ranking after initial retrieval
- Faithfulness evaluation in CI
- “Not in corpus” handling
- Monitoring for retrieval quality drift
RAG is one of the highest-leverage tools in the agent stack. But only if you do it right.