RAG Architectures in Production
Building a RAG demo takes a weekend. Building a RAG system that survives production takes months. The gap between “it works on my laptop” and “it works at scale with real users” is where most projects fail.
RAG systems break when retrieval, scoring, and caching are treated as afterthoughts — a common production AI failure mode. The key insight: treat RAG as a system, not a feature, aligned with the Applied AI delivery model.
Why RAG fails in production
Most RAG implementations suffer from the same core issues:
1. Retrieval quality collapse
Your vector search returns “relevant” documents, but relevance is contextual. A query about “Python exceptions” might retrieve documentation about snake species if your embedding model wasn’t fine-tuned for your domain. Production RAG needs:
- Hybrid search: Combine semantic (vector) search with lexical (BM25) search for better recall; a fusion sketch follows this list
- Re-ranking: Use cross-encoders to re-score the top-k results before sending them to the LLM
- Query expansion: Rewrite or decompose complex queries into multiple retrieval passes
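As a concrete illustration of the hybrid path, here is a minimal sketch that fuses a vector ranking and a BM25 ranking with reciprocal rank fusion, then hands the top candidates to an optional re-ranker. The vector_search, bm25_search, and rerank callables are placeholders for whatever backends you actually run; only the fusion logic is concrete.

```python
# Hybrid-retrieval sketch: reciprocal rank fusion (RRF) over a vector-search
# ranking and a BM25 ranking, followed by an optional re-rank hook.
from typing import Callable, Optional, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs using RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def retrieve_hybrid(
    query: str,
    vector_search: Callable[[str], list[str]],    # placeholder backend
    bm25_search: Callable[[str], list[str]],      # placeholder backend
    rerank: Optional[Callable[[str, list[str]], list[str]]] = None,
    top_k: int = 10,
) -> list[str]:
    fused = reciprocal_rank_fusion([vector_search(query), bm25_search(query)])
    candidates = fused[: top_k * 3]               # over-fetch before re-ranking
    return (rerank(query, candidates) if rerank else candidates)[:top_k]
```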
2. Latency budget violations
Users expect sub-second responses. A naive RAG pipeline easily hits 3-5 seconds:
| Component | Typical Latency |
|---|---|
| Embedding query | 50-100ms |
| Vector search | 100-300ms |
| Re-ranking | 200-500ms |
| LLM generation | 1-3s |
Production systems need aggressive caching, query-level routing, and streaming responses to stay within budget.
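One way to keep those numbers honest is to measure every stage against an explicit per-stage budget. A minimal sketch, assuming the budget figures from the table above; the stage names and the timed helper are illustrative rather than a prescribed API.

```python
# Per-stage timing sketch: record each stage's latency and warn when a stage
# exceeds its budget (budgets mirror the table above, in milliseconds).
import time
from contextlib import contextmanager

BUDGET_MS = {"embed": 100, "search": 300, "rerank": 500, "generate": 3000}

@contextmanager
def timed(stage: str, timings: dict[str, float]):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[stage] = elapsed_ms
        if elapsed_ms > BUDGET_MS.get(stage, float("inf")):
            print(f"[latency] {stage} exceeded budget: {elapsed_ms:.0f}ms")

# Usage inside the pipeline (embed and friends are your own functions):
# timings: dict[str, float] = {}
# with timed("embed", timings):
#     query_vec = embed(query)
```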
3. Context window mismanagement
Stuffing retrieved chunks into the context window without a strategy leads to three problems (a context-packing sketch follows the list):
- Lost in the middle: LLMs struggle with information buried in long contexts
- Irrelevant noise: Low-quality chunks dilute attention
- Token waste: Paying for context that doesn’t help
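A rough sketch of disciplined packing, assuming chunks arrive as (text, score) pairs: drop low-scoring chunks, enforce a token budget, and push the strongest chunks to the edges of the prompt to counter the lost-in-the-middle effect. The whitespace token count and the 0.3 score floor are stand-ins for your tokenizer and a tuned threshold.

```python
# Context-packing sketch: filter by relevance score, cap total tokens, and
# interleave so the best chunks sit at the start and end of the prompt.

def pack_context(chunks: list[tuple[str, float]],
                 token_budget: int = 2000,
                 min_score: float = 0.3) -> str:
    kept, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        tokens = len(text.split())                 # crude proxy for token count
        if score < min_score or used + tokens > token_budget:
            continue
        kept.append(text)
        used += tokens
    # Best chunks to the edges: 1st, 3rd, 5th... at the front; 2nd, 4th... at the back.
    front, back = kept[0::2], kept[1::2]
    return "\n\n".join(front + back[::-1])
```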
What works in production
Based on our experience building RAG systems for Applied AI clients:
Tight retrieval evaluation loops
Before deployment, establish retrieval quality baselines (a metrics sketch follows this list):
- Build a golden set of query-document pairs
- Track Precision@k, Recall@k, and MRR across releases
- Set automated alerts when metrics drop
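A minimal version of that evaluation loop, assuming a golden set of (query, relevant_doc_ids) pairs and a retrieve function that returns ranked document IDs:

```python
# Offline retrieval evaluation sketch: Precision@k, Recall@k, and MRR
# averaged over a golden set.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(golden_set, retrieve, k: int = 10) -> dict[str, float]:
    rows = [(retrieve(query), relevant) for query, relevant in golden_set]
    n = len(rows)
    return {
        f"precision@{k}": sum(precision_at_k(r, rel, k) for r, rel in rows) / n,
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in rows) / n,
        "mrr": sum(mrr(r, rel) for r, rel in rows) / n,
    }
```

Run it on every retrieval change and compare against the previous release's numbers before shipping.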
Latency budgets as contracts
Define latency SLAs upfront and design backwards:
- P50 < 1s: Most users get fast responses
- P95 < 3s: Even slow queries are acceptable
- P99 < 5s: Rare edge cases don’t break UX
If your system must run under strict SLAs, borrow patterns from Trading Systems & Platforms where latency is life.
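A small sketch of turning those targets into an automated gate, assuming you already log end-to-end latencies in milliseconds; where the result feeds (dashboards, pager) is up to your monitoring stack.

```python
# SLA-gate sketch: check logged latencies against the percentile targets above.
import statistics

SLA_MS = {50: 1000, 95: 3000, 99: 5000}

def check_sla(latencies_ms: list[float]) -> dict[int, bool]:
    # statistics.quantiles with n=100 returns 99 cut points;
    # index p-1 approximates the p-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {p: cuts[p - 1] <= limit for p, limit in SLA_MS.items()}

# check_sla(logged_latencies) -> e.g. {50: True, 95: True, 99: False}
```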
Human-in-the-loop feedback
Production RAG improves through feedback:
- Log retrieval results alongside user satisfaction signals
- Build annotation pipelines for relevance judgments
- Continuously fine-tune embeddings on production data
This feedback-driven approach is central to Applied AI delivery and brings the same observability discipline we apply to Computer Vision systems.
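One way to make that loop concrete is to log a single joined record per request, so annotation pipelines and embedding fine-tuning can work from the same event. The schema below is an illustrative sketch; field names such as thumbs_up and clicked_sources are assumptions, not a fixed spec.

```python
# Retrieval feedback record sketch: query, retrieved chunks, and user signals
# logged together as append-only JSONL.
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class RetrievalEvent:
    query: str
    retrieved_ids: list[str]
    answer_id: str
    thumbs_up: bool | None = None                  # explicit user feedback, if any
    clicked_sources: list[str] = field(default_factory=list)
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def log_event(event: RetrievalEvent, sink) -> None:
    sink.write(json.dumps(asdict(event)) + "\n")   # sink: any file-like object
```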
Key architecture patterns
Pattern 1: Semantic caching
Cache not just individual queries but semantic clusters of queries: similar questions should hit the cache even when they are worded differently.
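A minimal in-memory sketch, assuming an embed function that returns unit-normalised vectors; the 0.92 similarity threshold is illustrative and needs tuning per domain.

```python
# Semantic-cache sketch: return a cached answer when a new query embeds close
# enough to a previously answered one.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed                   # query -> unit-normalised vector
        self.threshold = threshold
        self.vectors: list[np.ndarray] = []
        self.answers: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.vectors:
            return None
        q = self.embed(query)
        sims = np.stack(self.vectors) @ q    # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        return self.answers[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.vectors.append(self.embed(query))
        self.answers.append(answer)
```

In production this lives in a vector store rather than memory, with TTLs and invalidation when the underlying corpus changes.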
Pattern 2: Adaptive chunking
Different content types need different chunking strategies (a dispatch sketch follows this list):
- Code: Chunk by function/class boundaries
- Documentation: Chunk by section headers
- Conversations: Chunk by speaker turns
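A dispatch sketch along those lines; the individual chunkers are deliberately naive regex splits, where real implementations would use an AST for code, a proper markdown parser for documentation, and speaker metadata for transcripts.

```python
# Adaptive-chunking sketch: route each content type to its own chunker.
import re

def chunk_code(text: str) -> list[str]:
    # naive: split at top-level def/class boundaries
    return [c for c in re.split(r"\n(?=def |class )", text) if c.strip()]

def chunk_docs(text: str) -> list[str]:
    # naive: split at markdown section headers
    return [c for c in re.split(r"\n(?=#{1,3} )", text) if c.strip()]

def chunk_conversation(text: str) -> list[str]:
    # naive: one chunk per speaker turn, e.g. "alice: ..." lines
    return [c for c in re.split(r"\n(?=\w+: )", text) if c.strip()]

CHUNKERS = {"code": chunk_code, "docs": chunk_docs, "chat": chunk_conversation}

def chunk(text: str, content_type: str) -> list[str]:
    return CHUNKERS.get(content_type, chunk_docs)(text)
```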
Pattern 3: Fallback chains
When primary retrieval fails, degrade in stages (sketched below):
1. Retry with an expanded query
2. Fall back to a broader search scope
3. Let the LLM generate an answer, flagged with lower confidence
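Sketched as a chain, assuming retrieve, expand_query, and generate callables supplied by the surrounding system; the min_hits cutoff and the confidence flag are illustrative.

```python
# Fallback-chain sketch: primary retrieval, then an expanded query, then a
# broader scope, and finally an unretrieved answer flagged as low confidence.

def answer_with_fallbacks(query, retrieve, expand_query, generate, min_hits: int = 3):
    attempts = [
        (query, "default"),                  # 1. primary retrieval
        (expand_query(query), "default"),    # 2. expanded query
        (query, "broad"),                    # 3. broader search scope
    ]
    for attempt_query, scope in attempts:
        chunks = retrieve(attempt_query, scope=scope)
        if len(chunks) >= min_hits:
            return generate(query, chunks), "normal"
    # No usable retrieval: answer from the model alone, flagged low confidence.
    return generate(query, []), "low_confidence"
```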
Key takeaways
- Treat RAG as a system: Retrieval, ranking, and generation are interconnected subsystems
- Measure retrieval quality: You can’t improve what you don’t measure
- Design for latency: Start with latency budget, work backwards
- Build feedback loops: Production data is your best training signal
- Plan for failure: Graceful degradation beats hard failures