RAG Architecture Patterns for Production: Chunking, Hybrid Search, and Access Control
How to build RAG that works in production: chunking strategies, hybrid search, RBAC/ABAC access control, evaluation, and latency optimisation. Visual guide.
December 15, 2025
Updated March 17, 2026
Building a RAG demo takes a weekend. Shipping one that survives production — with real users, evolving documents, latency budgets, and access control — takes months. The gap is a system design problem, not a model problem.
This article walks through every layer: how data flows, where things break, how to pick a chunking strategy, why hybrid search matters, how to lock down retrieval for multi-tenant systems, and how to measure everything. Each section opens with a visual so you can skim the picture first, then read for depth.
The full pipeline: index time and query time
Before diving into each layer, here is the complete picture. Index time transforms raw documents into searchable embeddings. Query time retrieves, ranks, and generates a response. Every concept in this article maps to one of these stages.
Index time: Docs → Chunk → Embed → Vector store (with metadata). Query time: User + roles → Access filter → Hybrid retrieve → Re-rank → Generate → Cache.
Where RAG breaks in production
Most failures come from three layers. Understanding them before you build saves weeks of debugging later.
Every layer has a distinct failure mode. Fix retrieval first, then latency, then context strategy.
Pure vector misses exact terms (error codes, product IDs, names). BM25 catches them. Merge + rerank gives the best of both worlds.
LLM generation dominates end-to-end latency. Semantic cache is the highest-ROI optimisation: 30–50% call reduction on FAQ/support workloads.
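The caching idea fits in a few lines. Below is a minimal in-memory sketch, assuming an embed function that maps text to a vector (any embedding model works; the name SemanticCache is illustrative). The 0.92 cosine threshold matches the figure cited later in this article:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a new query is close enough to a past one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # stand-in: any text -> vector function
        self.threshold = threshold  # 0.92 cosine, per the FAQ/support figure above
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]          # cache hit: skip the LLM call entirely
        return None                 # cache miss: run the full pipeline

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A production version would bound the entry list and use an approximate-nearest-neighbour index instead of a linear scan, but the hit/miss logic is the same.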
Step 1 — Chunking: match strategy to document type
Chunking is the highest-leverage index-time decision. Wrong chunk size or strategy degrades retrieval quality more than almost any other factor.
Four strategies, compared below. Parent-child is the strongest general baseline.
Fixed-size: document split into equal segments with overlap. Simple, fast, but may cut mid-sentence.
Semantic: split where embedding similarity drops. Chunks vary in size; boundaries follow natural structure.
Structure-aware: code by function, markdown by header, transcript by speaker. Respects document structure.
Children are contained in the parent. Child 2 matches the query; LLM receives the full Parent 1 (which contains Child 2). Precision at retrieval, depth at generation.
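A parent-child splitter is short to write. The sketch below approximates tokens by whitespace words for brevity (a real implementation would use the embedding model's tokenizer); parent_child_chunks is an illustrative name:

```python
def parent_child_chunks(text, parent_size=512, child_size=128):
    """Split text into parent chunks, each subdivided into children.

    Children are what gets embedded and retrieved (precision);
    the parent is what gets handed to the LLM (context).
    Tokens are approximated by whitespace words for brevity.
    """
    words = text.split()
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_words))
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "text": " ".join(parent_words[c_start:c_start + child_size]),
                "parent_id": parent_id,  # followed at query time
            })
    return parents, children
```

At query time you embed and search the children, then hand the LLM `parents[hit["parent_id"]]` for each match.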
Step 2 — Hybrid search: combine BM25 and vector retrieval
Pure vector search misses exact matches. Pure BM25 misses semantic similarity. Production RAG uses both, then re-ranks the merged result with a cross-encoder.
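The merge step is commonly implemented with reciprocal-rank fusion before the cross-encoder pass. A sketch, with k=60 as the conventional RRF constant (an assumption, not a figure from this article):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by either BM25 or vector search float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative inputs: BM25 catches the exact error code,
# vector search catches semantically related documents.
bm25_hits = ["ERR-4031-doc", "setup-guide", "pricing"]
vector_hits = ["troubleshooting", "setup-guide", "faq"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

A document found by both retrievers ("setup-guide" here) outranks any single-list hit, which is exactly the behaviour you want before re-ranking.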
Step 3 — Embedding model: off-the-shelf or fine-tuned
Start off-the-shelf. Fine-tune only after you have a golden test set and measured a gap.
Model, coverage, and when to use it:
text-embedding-3-large (OpenAI): general English. Fast start, low friction.
embed-v3 (Cohere): multilingual. Multi-language corpora.
all-MiniLM-L6-v2: open-source. Self-hosted, cost-sensitive deployments.
Fine-tuned (domain): medical, legal, finance, or code. Use after you have >1,000 labeled pairs and retrieval is still below target.
Fine-tuning with contrastive learning on domain-specific pairs typically yields 15–30% retrieval quality improvement on specialised corpora.
RAG vs fine-tuning: different tools for different jobs
RAG and fine-tuning solve different problems. In production, most high-quality systems use both.
RAG: at inference time the vector DB retrieves fresh documents. No retraining — add a document to the index and it’s immediately retrievable.
Fine-tuning bakes reasoning patterns into weights — or distills a large model into a smaller, cheaper one. No retrieval overhead, but knowledge is frozen at the training cutoff.
Production default: RAG for factual grounding and freshness; a distilled model for low-latency domain inference. Tone and format stay in the prompt.
Step 4 — Evaluation: measure before and after you ship
Without measurement you cannot improve. Set up evaluation before launch, not as a post-launch firefight.
Precision@k (target > 0.7): fraction of top-k results that are relevant.
MRR (target > 0.6): Mean Reciprocal Rank, the rank of the first relevant result.
Faithfulness (target > 0.8): does the answer reflect the retrieved context?
Answer relevance (target > 0.75): does the answer address the query?
Offline — golden test set of 100–500 query-document pairs with human relevance labels. Run before every major pipeline change.
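Precision@k and MRR are a few lines each once you have labeled pairs. A sketch, assuming relevance labels are represented as sets of document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(all_retrieved, all_relevant):
    """Mean reciprocal rank of the first relevant result per query.

    Queries with no relevant result retrieved contribute 0.
    """
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run these over the golden set before and after every pipeline change; a drop in either metric localises the regression to retrieval rather than generation.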
Online — log query, retrieved chunks, scores, answer, latency, and implicit user signals. Track the top_score distribution over time: dropping scores mean the corpus has drifted from the embedding model. This is the RAG equivalent of why ML models degrade in production; for the narrower distribution-mismatch mechanics, see model skewing.
Python: log_rag_interaction for production monitoring
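One way such a logger might look. The field names follow the online-logging list above, and sink is a stand-in for whatever receives your structured logs (a file, a queue, an analytics pipeline):

```python
import json
import time

def log_rag_interaction(query, chunks, answer, latency_ms, sink, user_signal=None):
    """Append one structured RAG interaction record for offline analysis.

    `chunks` is a list of (chunk_id, score) pairs from retrieval;
    `sink` is any file-like object with a write() method.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "chunk_ids": [c for c, _ in chunks],
        "scores": [s for _, s in chunks],
        "top_score": max((s for _, s in chunks), default=None),  # watch this for drift
        "answer": answer,
        "latency_ms": latency_ms,
        "user_signal": user_signal,  # e.g. thumbs up/down, click-through
    }
    sink.write(json.dumps(record) + "\n")
```

One JSON line per interaction is enough to reconstruct the top_score distribution over time and catch corpus drift early.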
Step 5 — Query expansion and graceful degradation
One query becomes three variants. Parallel retrieval casts a wider net. Merging + reranking surfaces the best chunks — essential for multi-hop questions.
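Multi-query retrieval is simple to sketch once the paraphraser is treated as a black box. Here paraphrase stands in for an LLM call that rephrases the query, and retrieve for your vector search; both names are illustrative:

```python
def multi_query_retrieve(query, paraphrase, retrieve, n_variants=3, k=5):
    """Expand one query into variants, retrieve for each, merge by best score.

    `paraphrase(query, n)` is a stand-in for an LLM call returning n
    rephrasings; `retrieve(q, k)` returns (chunk_id, score) pairs.
    """
    variants = [query] + paraphrase(query, n_variants - 1)
    best = {}
    for variant in variants:
        for chunk_id, score in retrieve(variant, k):
            # Keep each chunk's best score across all variants.
            best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

In production the per-variant retrievals run in parallel, and the merged list is then re-ranked rather than returned directly.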
Graceful degradation: high confidence → answer; medium → answer + flag; low → escalate. Never silently return a hallucinated answer.
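The routing logic itself is trivial; the work is in choosing thresholds. A sketch with illustrative cutoffs (tune them against your own score distribution rather than copying these numbers):

```python
def route_by_confidence(top_score, high=0.80, low=0.50):
    """Map the top retrieval score to an action tier.

    The 0.80 / 0.50 thresholds are illustrative placeholders,
    not recommendations from this article.
    """
    if top_score >= high:
        return "answer"            # confident: answer directly
    if top_score >= low:
        return "answer_with_flag"  # medium: answer, but surface the uncertainty
    return "escalate"              # low: hand off to a human or fallback flow
```

The key property is that the low branch never returns a bare answer, which is what "never silently return a hallucinated answer" means operationally.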
Step 6 — Access control: RBAC and ABAC
When multiple teams or tenants share the same RAG, a single unguarded index is a compliance failure waiting to happen. The rule is simple: enforce access inside retrieval, not in the UI. If the LLM receives forbidden chunks, it can leak restricted content in the generated answer.
Start from the scenario — choose the access model — enforce it at retrieval.
Scenario
Access model
Enforcement shape
Public FAQ or single-team corpus
No extra layer
Standard RAG
Shared knowledge with stable department or tenant boundaries
RBAC
Metadata filter or partitioned indexes
Regulated access by region, clearance, project, customer attributes
ABAC
Richer metadata predicates or policy engine
Without access control, any user can retrieve any document. With it, the filter runs inside the vector store — forbidden chunks never reach the LLM.
Three enforcement strategies
A is the default. B adds physical isolation for strict compliance. C is a trap: forbidden chunks degrade context before filtering.
Metadata filter: forbidden chunks are blocked inside the DB — they never reach the app or LLM. Lowest overhead. Recommended default for most RAG systems.
Index partitioning: inaccessible indexes are never queried — physically impossible to leak. Higher ops overhead but required for strict compliance or multi-tenant isolation.
Post-retrieval filter is a trap: the vector DB ranks forbidden docs above allowed ones, so after filtering you have 1 relevant chunk instead of 5. Use only when A and B are not feasible.
```python
from typing import Any, Callable

Chunk = Any  # stand-in for your chunk type

def rag_with_rbac(
    user_id: str,
    query: str,
    resolve_roles: Callable[[str], list[str]],
    retrieve: Callable[..., list[Chunk]],
    k: int = 10,
) -> list[Chunk]:
    """Retrieve only chunks the user is allowed to see."""
    roles = resolve_roles(user_id)
    # Build a filter: any chunk whose access_groups overlaps user roles
    permission_filter = {"access_groups": {"$in": roles}}
    return retrieve(query, filter=permission_filter, k=k)
```
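The ABAC variant reuses the same mechanism with richer predicates built from user attributes. A sketch assuming Mongo-style operators ($and, $eq, $lte, $in), which several vector stores support; the exact filter syntax varies by store, so treat this shape as illustrative:

```python
def abac_filter(user_attrs):
    """Build an attribute-based retrieval filter from user attributes.

    Combines region, clearance, and project predicates. The operator
    syntax ($and/$eq/$lte/$in) is illustrative; check your vector
    store's filtering documentation for the exact form.
    """
    return {
        "$and": [
            {"region": {"$eq": user_attrs["region"]}},          # data residency
            {"min_clearance": {"$lte": user_attrs["clearance"]}},  # clearance level
            {"project_ids": {"$in": user_attrs["projects"]}},   # project membership
        ]
    }
```

As with RBAC, the filter is passed to the vector store at query time, so chunks failing any predicate never reach the application or the LLM.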
Key takeaways
Treat RAG as a system. Chunking, retrieval, ranking, generation, and caching are interconnected — a failure in any layer breaks the whole.
Parent-child chunking is the strongest general baseline: 128-token children for precise retrieval, 512-token parents for LLM context.
Hybrid search (BM25 + vector) is non-negotiable. Pure vector misses exact matches — product codes, names, error messages.
Measure offline (golden test set) and online (retrieval score drift). Dropping scores mean the corpus has drifted from the embedding model.
Semantic caching gives the highest ROI on latency: 30–50% LLM call reduction on FAQ/support workloads at a 0.92 cosine threshold.
Enforce access inside retrieval, not in the UI. Metadata filter (strategy A) is the default: forbidden data never leaves the vector store.
Frequently asked questions
What is RAG in AI?
Retrieval-Augmented Generation (RAG) is an architecture pattern where an LLM’s responses are grounded in documents retrieved from a knowledge base at query time. Instead of relying solely on knowledge baked into model weights, the system retrieves relevant context and includes it in the LLM’s prompt. This enables factual grounding, source citations, and up-to-date answers without retraining.
When does RAG fail in production?
RAG fails most often due to retrieval quality problems (wrong chunks retrieved), context window mismanagement (too many chunks diluting attention), latency budget violations (retrieval + generation chain too slow), and corpus drift (documents updated without re-embedding). Each layer needs its own monitoring.
What is the best chunking strategy for RAG?
Parent-child chunking is the strongest general baseline: small chunks (128 tokens) for precise retrieval, parent chunks (512 tokens) returned to the LLM for context. Structure-aware chunking outperforms fixed-size chunking when you have typed document collections (code, transcripts, tables).
RAG vs fine-tuning: which should I choose?
Use RAG when knowledge must stay current, you need source traceability, or the corpus exceeds the context window. Tone and format are prompting concerns — fine-tuning is not needed for them. Fine-tuning is justified when: (1) the task requires reasoning patterns absent from the base model, or (2) you are distilling — compressing a large model’s task-specific capability into a smaller, faster, cheaper model for production. In practice, the strongest systems combine both: RAG for factual grounding with up-to-date sources, and a distilled model for low-latency domain inference.
How do you evaluate RAG quality?
Build a golden test set of 100–500 query-document pairs with human relevance labels. Track Precision@k, Recall@k, and MRR for retrieval quality. Track faithfulness and answer relevance for generation quality. Monitor retrieval score distributions in production for drift.
How do you reduce latency in a RAG pipeline?
In order of impact: semantic caching (30–50% call reduction), response streaming (perceived latency drops immediately), smaller re-ranker model, async embedding + retrieval, and reducing the number of retrieved chunks passed to the LLM.
How do you implement RBAC in RAG?
Tag every chunk with permission metadata (access_groups, tenant_id, or department). At query time, resolve the user’s roles and pass a hard filter to the vector store — only chunks matching the user’s roles are returned. This means forbidden data never reaches the LLM. All major vector stores (Pinecone, Weaviate, pgvector, Qdrant) support metadata filtering natively.