Naive RAG (chunk documents, embed them, find nearest neighbors, stuff them into the prompt) works for demos. It fails in production because real queries are ambiguous, real documents have structure, and simple cosine similarity doesn't capture semantic relevance reliably.
The five layers of a production RAG system
1. Smart Chunking (the most underrated fix)
Fixed-size chunking breaks semantic units mid-sentence. Use semantic chunking (split on topic changes), hierarchical chunking (chunk + parent doc), or document-structure-aware chunking (split on headers, sections). Better chunks = better retrieval before you touch embedding models.
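A minimal sketch of structure-aware chunking: split a markdown document on its headings so each chunk is one complete section, and carry the heading along so the chunk stays self-describing after retrieval. The function name and heading format are illustrative assumptions, not a specific library's API.

```python
import re

def chunk_by_headers(markdown_text):
    """Split markdown on headings; return (heading, body) section chunks."""
    chunks = []
    current_heading = ""
    current_lines = []
    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # New section starts: flush the one we were accumulating.
            if current_lines:
                chunks.append((current_heading, "\n".join(current_lines).strip()))
            current_heading = m.group(2)
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append((current_heading, "\n".join(current_lines).strip()))
    # Drop sections with no body (e.g. a heading directly above another).
    return [(h, body) for h, body in chunks if body]

doc = "# Setup\nInstall the package.\n\n# Usage\nCall run() to start.\n"
sections = chunk_by_headers(doc)
```

The same idea extends to hierarchical chunking: store the section chunk for retrieval but keep a pointer to the parent document so the generator can see more context.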
2. Hybrid Search (sparse + dense)
Dense vector search misses exact keyword matches. Sparse BM25 misses semantic similarity. Combine both with RRF (Reciprocal Rank Fusion) or a weighted score combination. Hybrid search improves recall by 15–25% on most document corpora vs. dense-only.
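RRF itself is a few lines: each result list contributes 1 / (k + rank) per document, and documents ranked well by both retrievers float to the top. A sketch, assuming each retriever returns an ordered list of doc IDs; k=60 is the constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; RRF rewards agreement between them.
bm25_hits = ["d1", "d2", "d3", "d4"]
dense_hits = ["d3", "d1", "d5", "d2"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between BM25 and cosine similarity, which is why it is the usual default over weighted combination.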
3. Query Expansion & Rewriting
User queries are often short, ambiguous, or use different vocabulary than source documents. Generate 3–5 query variants with an LLM, retrieve for each, then deduplicate results before reranking. Dramatically improves recall on vague or multi-faceted queries.
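The expand-retrieve-deduplicate loop can be sketched as below. The `rewrite` callable stands in for an LLM prompt ("give me N paraphrases of this question") and `retrieve` for your search backend; both names, and the stub implementations, are illustrative assumptions.

```python
def expand_and_retrieve(query, rewrite, retrieve, n_variants=2):
    """Retrieve for the query plus LLM-generated variants, dedupe by doc ID.

    Keeps each document's earliest (best) position across variant result
    lists, so the merged list is ready for a reranking stage.
    """
    variants = [query] + rewrite(query, n_variants)
    seen, merged = set(), []
    for variant in variants:
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stand-ins for demonstration only.
def fake_rewrite(query, n):
    return [f"{query} (variant {i})" for i in range(n)]

hits = {
    "how do I reset my password": ["d1", "d2"],
    "how do I reset my password (variant 0)": ["d2", "d3"],
    "how do I reset my password (variant 1)": ["d4", "d1"],
}
merged = expand_and_retrieve("how do I reset my password",
                             fake_rewrite, lambda q: hits.get(q, []))
```

Deduplicating before the reranker matters: the reranker's budget is per-pair, so spending it twice on the same chunk wastes both latency and precision.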
4. Reranking
First-stage retrieval over-retrieves (top 50–100 results). A cross-encoder reranker (e.g., Cohere Rerank, BGE-Reranker) scores each (query, chunk) pair much more accurately than embedding similarity. Reranking often adds 10–20% precision on top of good retrieval.
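Structurally, reranking is just "score every pair, sort, truncate". In the sketch below, `score_pair` is where a real cross-encoder would go; the lexical-overlap scorer used here is a toy stand-in so the example runs without a model download.

```python
def rerank(query, chunks, score_pair, top_k=5):
    """Second-stage rerank: score each (query, chunk) pair, keep top_k."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words in chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

chunks = [
    "billing is handled monthly",
    "reset your password in settings",
    "password rules require symbols",
]
top = rerank("how do I reset my password", chunks, overlap_score, top_k=2)
```

The expensive part is that a cross-encoder runs one forward pass per pair, which is why it is applied to the 50–100 first-stage candidates rather than the whole corpus.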
5. Contextual Compression
Long retrieved chunks waste context window and dilute focus. A compression step extracts only the sentence(s) from each chunk that are relevant to the query. Reduces input tokens by 40–60% while improving answer quality.
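A minimal extractive sketch: score each sentence of a chunk by term overlap with the query, keep the best few, and emit them in their original order. A production system would use an LLM or a trained extractor instead of lexical overlap; the function name and scoring rule are assumptions for illustration.

```python
import re

def compress_chunk(query, chunk, keep=2):
    """Keep the `keep` sentences most relevant to the query, in order."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    scored = [(len(q_terms & set(re.findall(r"\w+", s.lower()))), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:keep]  # highest term overlap first
    best.sort(key=lambda item: item[1])         # restore document order
    return " ".join(s for _, _, s in best)

chunk = ("Our API keys expire yearly. "
         "Rotate your API key from the dashboard. "
         "Billing is invoiced monthly.")
compressed = compress_chunk("rotate API key", chunk, keep=1)
```

Preserving document order matters: sentences fed to the generator out of order can change the apparent meaning of the source.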
Production RAG pipeline (MoltBot)
Hybrid retrieval, reranking, and contextual compression, all configured in a single pipeline object. 14-day free trial.