Naive RAG (chunk documents, embed them, find nearest neighbors, stuff them into the prompt) works for demos. It fails in production because real queries are ambiguous, real documents have structure, and simple cosine similarity doesn't capture semantic relevance reliably.
The five layers of a production RAG system
1. Smart Chunking (the most underrated fix)
Fixed-size chunking breaks semantic units mid-sentence. Use semantic chunking (split on topic changes), hierarchical chunking (chunk + parent doc), or document-structure-aware chunking (split on headers, sections). Better chunks = better retrieval before you touch embedding models.
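A minimal sketch of structure-aware chunking: split a markdown document on its headings so each chunk is one complete section, and carry the heading along so the chunk stays self-describing after retrieval. The function name and heading format are illustrative assumptions, not a specific library's API.

```python
import re

def chunk_by_headers(markdown_text):
    """Split markdown on headings; return (heading, body) section chunks."""
    chunks = []
    current_heading = ""
    current_lines = []
    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # New section starts: flush the one we were accumulating.
            if current_lines:
                chunks.append((current_heading, "\n".join(current_lines).strip()))
            current_heading = m.group(2)
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append((current_heading, "\n".join(current_lines).strip()))
    # Drop sections with no body (e.g. a heading directly above another).
    return [(h, body) for h, body in chunks if body]

doc = "# Setup\nInstall the package.\n\n# Usage\nCall run() to start.\n"
sections = chunk_by_headers(doc)
```

The same idea extends to hierarchical chunking: store the section chunk for retrieval but keep a pointer to the parent document so the generator can see more context.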
2. Hybrid Search (sparse + dense)
Dense vector search misses exact keyword matches. Sparse BM25 misses semantic similarity. Combine both with RRF (Reciprocal Rank Fusion) or a weighted score combination. Hybrid search improves recall by 15–25% on most document corpora vs. dense-only.
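RRF itself is a few lines: each result list contributes 1 / (k + rank) per document, and documents ranked well by both retrievers float to the top. A sketch, assuming each retriever returns an ordered list of doc IDs; k=60 is the constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; RRF rewards agreement between them.
bm25_hits = ["d1", "d2", "d3", "d4"]
dense_hits = ["d3", "d1", "d5", "d2"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between BM25 and cosine similarity, which is why it is the usual default over weighted combination.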
3. Query Expansion & Rewriting
User queries are often short, ambiguous, or use different vocabulary than source documents. Generate 3–5 query variants with an LLM, retrieve for each, then deduplicate results before reranking. Dramatically improves recall on vague or multi-faceted queries.
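The expand-retrieve-deduplicate loop can be sketched as below. The `rewrite` callable stands in for an LLM prompt ("give me N paraphrases of this question") and `retrieve` for your search backend; both names, and the stub implementations, are illustrative assumptions.

```python
def expand_and_retrieve(query, rewrite, retrieve, n_variants=2):
    """Retrieve for the query plus LLM-generated variants, dedupe by doc ID.

    Keeps each document's earliest (best) position across variant result
    lists, so the merged list is ready for a reranking stage.
    """
    variants = [query] + rewrite(query, n_variants)
    seen, merged = set(), []
    for variant in variants:
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stand-ins for demonstration only.
def fake_rewrite(query, n):
    return [f"{query} (variant {i})" for i in range(n)]

hits = {
    "how do I reset my password": ["d1", "d2"],
    "how do I reset my password (variant 0)": ["d2", "d3"],
    "how do I reset my password (variant 1)": ["d4", "d1"],
}
merged = expand_and_retrieve("how do I reset my password",
                             fake_rewrite, lambda q: hits.get(q, []))
```

Deduplicating before the reranker matters: the reranker's budget is per-pair, so spending it twice on the same chunk wastes both latency and precision.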
4. Reranking
First-stage retrieval over-retrieves (top 50–100 results). A cross-encoder reranker (e.g., Cohere Rerank, BGE-Reranker) scores each (query, chunk) pair much more accurately than embedding similarity. Reranking often adds 10–20% precision on top of good retrieval.
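Structurally, reranking is just "score every pair, sort, truncate". In the sketch below, `score_pair` is where a real cross-encoder would go; the lexical-overlap scorer used here is a toy stand-in so the example runs without a model download.

```python
def rerank(query, chunks, score_pair, top_k=5):
    """Second-stage rerank: score each (query, chunk) pair, keep top_k."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words in chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

chunks = [
    "billing is handled monthly",
    "reset your password in settings",
    "password rules require symbols",
]
top = rerank("how do I reset my password", chunks, overlap_score, top_k=2)
```

The expensive part is that a cross-encoder runs one forward pass per pair, which is why it is applied to the 50–100 first-stage candidates rather than the whole corpus.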
5. Contextual Compression
Long retrieved chunks waste context window and dilute focus. A compression step extracts only the sentence(s) from each chunk that are relevant to the query. Reduces input tokens by 40–60% while improving answer quality.
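A minimal extractive sketch: score each sentence of a chunk by term overlap with the query, keep the best few, and emit them in their original order. A production system would use an LLM or a trained extractor instead of lexical overlap; the function name and scoring rule are assumptions for illustration.

```python
import re

def compress_chunk(query, chunk, keep=2):
    """Keep the `keep` sentences most relevant to the query, in order."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    scored = [(len(q_terms & set(re.findall(r"\w+", s.lower()))), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:keep]  # highest term overlap first
    best.sort(key=lambda item: item[1])         # restore document order
    return " ".join(s for _, _, s in best)

chunk = ("Our API keys expire yearly. "
         "Rotate your API key from the dashboard. "
         "Billing is invoiced monthly.")
compressed = compress_chunk("rotate API key", chunk, keep=1)
```

Preserving document order matters: sentences fed to the generator out of order can change the apparent meaning of the source.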
Production RAG pipeline (MoltBot)
Hybrid retrieval, reranking, and contextual compression, all configured in a single pipeline object. 14-day free trial.