Jonathan Haas

How RAG Actually Works: Architecture Patterns That Scale

January 6, 2025 · 3 min read

Deep dive into RAG architectures: chunking strategies, retrieval methods, embedding optimization, and production patterns with research-backed analysis.

#ai #research #rag #architecture

Most RAG tutorials skip the parts that matter in production: chunking strategies that preserve context, retrieval that stays fast at 100M+ documents, and architectures that hold up under real load.

The gap between a working RAG demo and a production RAG system comes down to five design decisions.

Chunking Determines Retrieval Quality

Bad chunking is where most RAG systems fail. Fixed-size chunking at 512 characters splits mid-sentence, destroys semantic boundaries, and loses context.

Semantic chunking -- splitting on paragraph breaks with token-aware overlap -- preserves meaning. Optimal chunk sizes are content-dependent: 512 tokens with 50-token overlap for technical docs, 256 tokens for conversational content, 1024 for long-form narrative, 200 for code (respecting function boundaries).

The counterintuitive finding: larger chunks dilute relevance scores. A 2,000-token chunk containing one relevant paragraph and four irrelevant ones will score lower than a 400-token chunk containing just the relevant paragraph.
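A minimal sketch of paragraph-aware chunking with token overlap, assuming whitespace word counts as a stand-in for a real tokenizer (a production system would count tokens with the embedding model's own tokenizer):

```python
def semantic_chunks(text, max_tokens=512, overlap_tokens=50):
    """Split on paragraph breaks, packing paragraphs into chunks of up to
    max_tokens, carrying the last overlap_tokens forward for context.
    Token counts are approximated by whitespace splitting."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # carry the tail of the previous chunk forward as overlap
            tail = " ".join(" ".join(current).split()[-overlap_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than `max_tokens` would still produce an oversized chunk; a fuller version would recursively split on sentence boundaries.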

Hybrid Retrieval Outperforms Either Method Alone

Dense retrieval (vector similarity) handles semantic matching but misses exact keywords. Sparse retrieval (BM25) handles keyword matching but lacks semantic understanding. Hybrid retrieval with Reciprocal Rank Fusion combines both.

Research shows hybrid outperforms dense-only by 15-20%. The fusion weight depends on content type: technical docs favor dense (alpha 0.7), product names favor sparse (alpha 0.3), general Q&A splits evenly.
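The fusion step can be sketched with Reciprocal Rank Fusion over the two ranked lists. Applying the `alpha` weight per list is one reasonable reading of the tuning knob above (not a standard the post specifies); `k=60` is the conventional RRF smoothing constant:

```python
def rrf_fuse(dense_ranking, sparse_ranking, alpha=0.5, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs.
    alpha weights the dense list, (1 - alpha) the sparse list."""
    scores = {}
    for rank, doc in enumerate(dense_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(sparse_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With `alpha=0.7` the dense ranking dominates ties; with `alpha=0.3` the sparse ranking does, matching the content-type guidance above.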

Re-ranking Is the Cheapest Quality Win

Retrieve broadly (top-20), then re-rank precisely (top-5) using a cross-encoder. Dense retrieval alone: 73% accuracy. Dense plus cross-encoder re-ranking: 89% accuracy. Cost: +50ms latency and roughly $0.001 per query.

Worth it for high-stakes domains -- customer support, legal, medical. Skip it for low-stakes chat where latency matters more than precision.
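The retrieve-broad-then-re-rank-precise pattern reduces to a sort over pairwise scores. Here `overlap_score` is a toy stand-in for a real cross-encoder forward pass (production systems would score each query-document pair with a trained cross-encoder model):

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-rank broad retrieval candidates with a precise pairwise scorer,
    keeping only the top_k. score_fn(query, doc) stands in for a
    cross-encoder model call."""
    scored = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_k]

def overlap_score(query, doc):
    """Toy scorer: fraction of query tokens that appear in the doc."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)
```

The latency cost comes from running the scorer once per candidate, which is why the broad stage is capped at top-20 rather than top-100.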

Query Transformation Fixes Ambiguous Input

Users write "how do I make the thing faster." The retrieval system sees garbage. An LLM rewrite step expands this into specific queries: "optimize application performance," "reduce API response latency," "improve database query speed."

This adds one LLM call of latency but improves accuracy 15-20% on ambiguous queries. For well-formed technical queries, it adds cost without benefit. Gate it on query clarity scoring.
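A sketch of the gating idea. Both pieces are hypothetical: `clarity_score` is a crude heuristic (a real system would use a trained classifier or an LLM judge), and `rewrite_fn` stands in for the LLM expansion call:

```python
def clarity_score(query):
    """Crude clarity heuristic: reward length, penalize vague words.
    Returns a value in roughly [0, 1]; higher means clearer."""
    vague = {"thing", "stuff", "it", "this", "that", "faster", "better"}
    tokens = query.lower().split()
    if not tokens:
        return 0.0
    vague_ratio = sum(t in vague for t in tokens) / len(tokens)
    length_bonus = min(len(tokens) / 8, 1.0)
    return max(0.0, length_bonus - 2 * vague_ratio)

def maybe_rewrite(query, rewrite_fn, threshold=0.5):
    """Only pay the extra LLM call when the query looks ambiguous.
    Returns a list of queries to retrieve against."""
    if clarity_score(query) < threshold:
        return rewrite_fn(query)
    return [query]
```

Well-formed technical queries pass through untouched, so the extra latency and cost land only on the ambiguous ones.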

Context Assembly Is Where Tokens Get Wasted

Naive context assembly concatenates five retrieved documents into the prompt. Most of that content is irrelevant. Extract the highest-scoring paragraphs from each document instead. This saves 30-40% of context tokens while maintaining answer quality.
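One way to sketch paragraph-level extraction under a token budget; `score_fn` stands in for whatever relevance scorer the pipeline already has (re-used retrieval scores, or the re-ranker from above), and word counts again approximate tokens:

```python
def assemble_context(docs, score_fn, token_budget=1500):
    """Instead of concatenating whole documents, pull the best-scoring
    paragraphs across all retrieved docs until the budget is spent."""
    paragraphs = []
    for doc in docs:
        for para in doc.split("\n\n"):
            para = para.strip()
            if para:
                paragraphs.append((score_fn(para), para))
    # highest-scoring paragraphs first; Python's sort is stable on ties
    paragraphs.sort(key=lambda p: p[0], reverse=True)
    context, used = [], 0
    for score, para in paragraphs:
        n = len(para.split())
        if used + n > token_budget:
            continue  # skip paragraphs that would bust the budget
        context.append(para)
        used += n
    return "\n\n".join(context)
```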

Production Architecture Tiers

Basic (docs, Q&A): Semantic chunking, dense retrieval, simple assembly. Under 200ms latency.

Production (scale, quality): Adaptive chunking, hybrid retrieval, cross-encoder re-ranking, query transformation, caching layer. Under 500ms. Cache hit rate of 30-40% on typical workloads.

Mission-critical (legal, medical): All of the above plus two-stage retrieval (broad-then-precise), metadata filtering, confidence scoring, and human-in-the-loop for low-confidence answers. Under 1 second.

The best RAG system is not the most complex. It is the one that solves the specific problem at acceptable cost and latency. Start with basic, measure retrieval recall at k (target >90% at k=5), and add complexity only where metrics show gaps.
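Measuring recall at k is straightforward once you have a labeled set of relevant documents per query; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of known-relevant doc IDs that appear in the top-k
    retrieved results. Target: > 0.9 at k = 5."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)
```

Averaging this over a held-out query set gives the metric to watch before and after each added layer of complexity.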


These patterns draw on published research: RAG Survey, RAG for AIGC, Systematic RAG Review.
