Jonathan Haas

How RAG Actually Works: Architecture Patterns That Scale

January 6, 2025 · 3 min read

Deep dive into RAG architectures: chunking strategies, retrieval methods, embedding optimization, and production patterns with research-backed analysis.

#ai #research #rag #architecture

Most RAG tutorials skip the parts that matter in production: chunking strategies that preserve context, retrieval that stays fast at 100M+ documents, and architectures that hold up under real load.

The gap between a working RAG demo and a production RAG system comes down to five design decisions.

Chunking Determines Retrieval Quality

Bad chunking is where most RAG systems fail. Fixed-size chunking at 512 characters splits mid-sentence, destroys semantic boundaries, and loses context.

Semantic chunking -- splitting on paragraph breaks with token-aware overlap -- preserves meaning. Optimal chunk sizes are content-dependent: 512 tokens with 50-token overlap for technical docs, 256 tokens for conversational content, 1024 for long-form narrative, 200 for code (respecting function boundaries).

The counterintuitive finding: larger chunks dilute relevance scores. A 2,000-token chunk containing one relevant paragraph and four irrelevant ones will score lower than a 400-token chunk containing just the relevant paragraph.
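A minimal sketch of paragraph-aware chunking with token overlap, assuming whitespace word counts as a stand-in for a real tokenizer (a production system would count tokens with the embedding model's own tokenizer):

```python
def semantic_chunks(text, max_tokens=512, overlap_tokens=50):
    """Split on paragraph breaks, packing paragraphs into chunks of up to
    max_tokens, carrying the last overlap_tokens forward for context.
    Token counts are approximated by whitespace splitting."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # carry the tail of the previous chunk forward as overlap
            tail = " ".join(" ".join(current).split()[-overlap_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than `max_tokens` would still produce an oversized chunk; a fuller version would recursively split on sentence boundaries.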

Hybrid Retrieval Outperforms Either Method Alone

Dense retrieval (vector similarity) handles semantic matching but misses exact keywords. Sparse retrieval (BM25) handles keyword matching but lacks semantic understanding. Hybrid retrieval with Reciprocal Rank Fusion combines both.

Research shows hybrid outperforms dense-only by 15-20%. The fusion weight depends on content type: technical docs favor dense (alpha 0.7), product names favor sparse (alpha 0.3), general Q&A splits evenly.
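The fusion step can be sketched with Reciprocal Rank Fusion over the two ranked lists. Applying the `alpha` weight per list is one reasonable reading of the tuning knob above (not a standard the post specifies); `k=60` is the conventional RRF smoothing constant:

```python
def rrf_fuse(dense_ranking, sparse_ranking, alpha=0.5, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs.
    alpha weights the dense list, (1 - alpha) the sparse list."""
    scores = {}
    for rank, doc in enumerate(dense_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(sparse_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With `alpha=0.7` the dense ranking dominates ties; with `alpha=0.3` the sparse ranking does, matching the content-type guidance above.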

Re-ranking Is the Cheapest Quality Win

Retrieve broadly (top-20), then re-rank precisely (top-5) using a cross-encoder. Dense retrieval alone: 73% accuracy. Dense plus cross-encoder re-ranking: 89% accuracy. Cost: +50ms latency and roughly $0.001 per query.

Worth it for high-stakes domains -- customer support, legal, medical. Skip it for low-stakes chat where latency matters more than precision.
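The retrieve-broad-then-re-rank-precise pattern reduces to a sort over pairwise scores. Here `overlap_score` is a toy stand-in for a real cross-encoder forward pass (production systems would score each query-document pair with a trained cross-encoder model):

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-rank broad retrieval candidates with a precise pairwise scorer,
    keeping only the top_k. score_fn(query, doc) stands in for a
    cross-encoder model call."""
    scored = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_k]

def overlap_score(query, doc):
    """Toy scorer: fraction of query tokens that appear in the doc."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)
```

The latency cost comes from running the scorer once per candidate, which is why the broad stage is capped at top-20 rather than top-100.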

Query Transformation Fixes Ambiguous Input

Users write "how do I make the thing faster." The retrieval system sees garbage. An LLM rewrite step expands this into specific queries: "optimize application performance," "reduce API response latency," "improve database query speed."

This adds one LLM call of latency but improves accuracy 15-20% on ambiguous queries. For well-formed technical queries, it adds cost without benefit. Gate it on query clarity scoring.
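A sketch of the gating idea. Both pieces are hypothetical: `clarity_score` is a crude heuristic (a real system would use a trained classifier or an LLM judge), and `rewrite_fn` stands in for the LLM expansion call:

```python
def clarity_score(query):
    """Crude clarity heuristic: reward length, penalize vague words.
    Returns a value in roughly [0, 1]; higher means clearer."""
    vague = {"thing", "stuff", "it", "this", "that", "faster", "better"}
    tokens = query.lower().split()
    if not tokens:
        return 0.0
    vague_ratio = sum(t in vague for t in tokens) / len(tokens)
    length_bonus = min(len(tokens) / 8, 1.0)
    return max(0.0, length_bonus - 2 * vague_ratio)

def maybe_rewrite(query, rewrite_fn, threshold=0.5):
    """Only pay the extra LLM call when the query looks ambiguous.
    Returns a list of queries to retrieve against."""
    if clarity_score(query) < threshold:
        return rewrite_fn(query)
    return [query]
```

Well-formed technical queries pass through untouched, so the extra latency and cost land only on the ambiguous ones.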

Context Assembly Is Where Tokens Get Wasted

Naive context assembly concatenates five retrieved documents into the prompt. Most of that content is irrelevant. Extract the highest-scoring paragraphs from each document instead. This saves 30-40% of context tokens while maintaining answer quality.
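One way to sketch paragraph-level extraction under a token budget; `score_fn` stands in for whatever relevance scorer the pipeline already has (re-used retrieval scores, or the re-ranker from above), and word counts again approximate tokens:

```python
def assemble_context(docs, score_fn, token_budget=1500):
    """Instead of concatenating whole documents, pull the best-scoring
    paragraphs across all retrieved docs until the budget is spent."""
    paragraphs = []
    for doc in docs:
        for para in doc.split("\n\n"):
            para = para.strip()
            if para:
                paragraphs.append((score_fn(para), para))
    # highest-scoring paragraphs first; Python's sort is stable on ties
    paragraphs.sort(key=lambda p: p[0], reverse=True)
    context, used = [], 0
    for score, para in paragraphs:
        n = len(para.split())
        if used + n > token_budget:
            continue  # skip paragraphs that would bust the budget
        context.append(para)
        used += n
    return "\n\n".join(context)
```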

Production Architecture Tiers

Basic (docs, Q&A): Semantic chunking, dense retrieval, simple assembly. Under 200ms latency.

Production (scale, quality): Adaptive chunking, hybrid retrieval, cross-encoder re-ranking, query transformation, caching layer. Under 500ms. Cache hit rate of 30-40% on typical workloads.

Mission-critical (legal, medical): All of the above plus two-stage retrieval (broad-then-precise), metadata filtering, confidence scoring, and human-in-the-loop for low-confidence answers. Under 1 second.

The best RAG system is not the most complex. It is the one that solves the specific problem at acceptable cost and latency. Start with basic, measure retrieval recall at k (target >90% at k=5), and add complexity only where metrics show gaps.
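Measuring recall at k is straightforward once you have a labeled set of relevant documents per query; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of known-relevant doc IDs that appear in the top-k
    retrieved results. Target: > 0.9 at k = 5."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)
```

Averaging this over a held-out query set gives the metric to watch before and after each added layer of complexity.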


These patterns draw on published research: RAG Survey, RAG for AIGC, Systematic RAG Review.
