RAG Architecture Patterns for Production: Chunking, Hybrid Search, and Access Control

How to build RAG that works in production: chunking strategies, hybrid search, RBAC/ABAC access control, evaluation, and latency optimisation. Visual guide.

Building a RAG demo takes a weekend. Shipping one that survives production — with real users, evolving documents, latency budgets, and access control — takes months. The gap is a system design problem, not a model problem.

This article walks through every layer: how data flows, where things break, how to pick a chunking strategy, why hybrid search matters, how to lock down retrieval for multi-tenant systems, and how to measure everything. Each section opens with the visual so you can skim the picture first and read for depth.


The full pipeline: index time and query time

Before diving into each layer, here is the complete picture. Index time transforms raw documents into searchable embeddings. Query time retrieves, ranks, and generates a response. Every concept in this article maps to one of these stages.

Index time: Docs → Chunk → Embed → Vector store, with each chunk tagged with metadata (dept, role, tenant). Query time: User + roles → Access filter (RBAC/ABAC) → Hybrid retrieve → Re-rank → Generate → Cache + stream. The access filter is applied inside retrieval, so the LLM never receives forbidden context.

Where RAG breaks in production

Most failures come from three layers. Understanding them before you build saves weeks of debugging later.

Every layer has a distinct failure mode. Fix retrieval first, then latency, then context strategy.

Example: for the query “error code ABC-123 fails”, pure vector search returns a chunk on a similar topic that never mentions “ABC-123”, and the answer misses the exact error term. The hybrid path runs BM25 (exact match) and vector search (semantic match) in parallel, merges the results, and re-ranks them with a cross-encoder, so the chunk containing “ABC-123” surfaces and the answer is grounded and precise. Pure vector misses exact terms (error codes, product IDs, names); BM25 catches them; merge + rerank gives the best of both worlds.
Typical latency budget: embed ~75 ms, vector search ~200 ms, re-rank ~350 ms, LLM generation ~2 000 ms. Generation dominates, and semantic caching plus response streaming cut perceived latency by 30–60%. The semantic cache is the highest-ROI optimisation: a 30–50% reduction in LLM calls on FAQ/support workloads.
Context window bloat: more chunks does not mean better answers. LLMs lose information buried in the middle of the context, and irrelevant chunks dilute attention. Fix: limit retrieved chunks (top-5 after rerank) and use a cross-encoder to filter noise.

Step 1 — Chunking: match strategy to document type

Chunking is the highest-leverage index-time decision. Wrong chunk size or strategy degrades retrieval quality more than almost any other factor.

Four strategies follow; parent-child is the strongest general baseline.

Fixed-size: the document is split into equal segments of N tokens with M% overlap at boundaries. Simple and fast, but may cut mid-sentence.
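A minimal sketch of fixed-size chunking with overlap, approximating tokens by whitespace-separated words for illustration (swap in a real tokenizer in production):

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` tokens, repeating `overlap`
    tokens at each boundary so context is not lost mid-thought.
    Tokens are approximated by whitespace-separated words here."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```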
Semantic: split at sentence boundaries where embedding similarity drops. Chunks vary in size and follow natural topic structure.
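Semantic splitting can be sketched as follows; the embedding function is injected as a parameter (a stand-in for your real embedding model) so the boundary logic stays visible and testable:

```python
import numpy as np

def chunk_semantic(
    sentences: list[str],
    embed_sentences,          # callable: list[str] -> np.ndarray of shape (n, d)
    threshold: float = 0.75,
) -> list[str]:
    """Start a new chunk wherever cosine similarity between adjacent
    sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vecs = embed_sentences(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop marks a topic boundary: close the current chunk.
        if float(vecs[i - 1] @ vecs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```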
Structure-aware: code is split by function, markdown by header, transcripts by speaker turn. Each document type uses its native boundaries, so there are no mid-function or mid-paragraph cuts.
Parent-child: at index time, each 512-token parent contains several 128-token children. At query time a child chunk (say, Child 2) matches the query, and the LLM receives the full parent that contains it. Precision at retrieval, depth at generation.
Parameter | Value | Why
Retrieval chunk | 128 tokens | Small = precise matching signal.
Context chunk | 512 tokens | Large = enough context for generation.
Overlap (fixed) | 10–15% | Prevents context loss at boundaries.
Python: structure-aware chunker (code / markdown / transcript)
def chunk_by_structure(document: str, doc_type: str) -> list[str]:
    # The split_by_* helpers are assumed to be implemented elsewhere,
    # one per document type.
    if doc_type == "code":
        return split_by_function_boundary(document)
    elif doc_type == "markdown":
        return split_by_header_level(document, max_level=2)
    elif doc_type == "transcript":
        return split_by_speaker_turn(document)
    else:
        return split_by_paragraph(document, max_tokens=512)
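The parent-child scheme described above can be sketched as an index-time pass that cuts 512-token parents into 128-token children and records which parent each child belongs to; token counts are again approximated by words, and the names are illustrative:

```python
def build_parent_child_index(
    document: str,
    parent_tokens: int = 512,
    child_tokens: int = 128,
) -> tuple[list[str], list[str], list[int]]:
    """Return (children, parents, child_to_parent), where child_to_parent[i]
    is the index of the parent containing child i. Children are what you
    embed and search; the mapped parent is what the LLM receives."""
    words = document.split()
    parents = [
        " ".join(words[i:i + parent_tokens])
        for i in range(0, len(words), parent_tokens)
    ]
    children, child_to_parent = [], []
    for p_idx, parent in enumerate(parents):
        p_words = parent.split()
        for j in range(0, len(p_words), child_tokens):
            children.append(" ".join(p_words[j:j + child_tokens]))
            child_to_parent.append(p_idx)
    return children, parents, child_to_parent
```

At query time, a child hit at index `i` resolves to `parents[child_to_parent[i]]` before being placed in the LLM context.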

Step 2 — Retrieval: hybrid search + reranking

Pure vector search misses exact matches. Pure BM25 misses semantic similarity. Production RAG uses both, then re-ranks the merged result with a cross-encoder.

Hybrid retrieval: the query runs through BM25 (exact keyword match) and vector search (semantic similarity) in parallel; results are merged and normalised, then re-ranked by a cross-encoder, and the top-k chunks go to the LLM context. An alpha parameter controls the BM25 vs vector weight (default 0.5; tune per domain). BM25 catches exact terms such as error codes and names; vector search catches paraphrases.
Python: hybrid_search combining BM25 + cosine similarity
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(
    query: str,
    documents: list[str],
    embeddings: np.ndarray,
    query_embedding: np.ndarray,
    k: int = 20,
    alpha: float = 0.5,
) -> list[int]:
    """
    Combine BM25 lexical score with vector cosine similarity.
    alpha=0.5 weights them equally; tune based on your domain.
    """
    tokenized = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.lower().split())

    cosine_scores = (embeddings @ query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-9)
    cosine_norm = (cosine_scores - cosine_scores.min()) / (cosine_scores.max() - cosine_scores.min() + 1e-9)

    combined = alpha * cosine_norm + (1 - alpha) * bm25_norm
    return np.argsort(combined)[::-1][:k].tolist()
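The min-max normalisation above is sensitive to score outliers. Reciprocal Rank Fusion is a common alternative merge strategy that uses only ranks, not raw scores; a minimal sketch (the constant k=60 is the conventional damping value):

```python
def reciprocal_rank_fusion(
    rankings: list[list[int]],   # each inner list: doc ids, best first
    k: int = 60,                 # rank damping constant
) -> list[int]:
    """Fuse several rankings: score(d) = sum over rankings of 1 / (k + rank).
    Documents ranked highly by multiple retrievers rise to the top,
    regardless of each retriever's score scale."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```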

Step 3 — Embedding model: off-the-shelf or fine-tuned

Start off-the-shelf. Fine-tune only after you have a golden test set and measured a gap.

Model | Coverage | When
text-embedding-3-large (OpenAI) | General English | Fast start, low friction
embed-v3 (Cohere) | Multilingual | Multi-language corpora
all-MiniLM-L6-v2 | Open-source | Self-hosted, cost-sensitive
Fine-tuned (domain) | Medical / legal / finance / code | After >1 000 labeled pairs and retrieval still below target

Fine-tuning with contrastive learning on domain-specific pairs typically yields 15–30% retrieval quality improvement on specialised corpora.


RAG vs fine-tuning: different tools for different jobs

RAG and fine-tuning solve different problems. In production, most high-quality systems use both.

RAG is an inference-time pipeline: user query → embed query (float[1536]) → vector DB (cosine search + metadata filter) → cross-encoder re-rank → top-5 chunks → LLM (query + chunks in the context window) → cited answer grounded in docs with source references. Use RAG when knowledge updates frequently (docs, policies, products), source citations are required, the corpus is too large for a context window, or you need to update knowledge without retraining: add a document to the index and it is immediately retrievable.
Fine-tuning trains once: a labeled dataset of more than 1 000 query-answer pairs feeds contrastive learning, LoRA, or SFT, producing a specialized model with domain reasoning baked into its weights. Inference then has no retrieval step, so latency is lower, but knowledge is frozen at the training cutoff. Use fine-tuning when reasoning patterns are absent from the base model, or for distillation: compressing a large model's task-specific capability into a small, cheap specialist. Tone and format stay in the prompt.
Both together is the most common production setup: user query → vector DB retrieves fresh chunks → re-rank to top-5 → fine-tuned LLM combines domain style with fresh context → cited, styled answer. Responsibility split: RAG handles freshness, citations, large corpora, and updates without retraining; fine-tuning handles domain reasoning, distillation, and new task patterns.

Step 4 — Evaluation: measure before and after you ship

Without measurement you cannot improve. Set up evaluation before launch, not as a post-launch firefight.

Metric | Target | What it measures
Precision@k | > 0.7 | Fraction of top-k results that are relevant.
MRR | > 0.6 | Mean Reciprocal Rank: rank of the first relevant result.
Faithfulness | > 0.8 | Does the answer reflect the retrieved context?
Answer relevance | > 0.75 | Does the answer address the query?

Offline — golden test set of 100–500 query-document pairs with human relevance labels. Run before every major pipeline change.

Online — log query, retrieved chunks, scores, answer, latency, implicit user signals. Track the top_score distribution over time: dropping scores mean the corpus has drifted from the embedding model. This is the RAG equivalent of why ML models degrade in production; for the narrower distribution-mismatch mechanics, see model skewing.

Python: log_rag_interaction for production monitoring
from datetime import datetime

def log_rag_interaction(
    query: str,
    retrieved_chunks: list[str],
    retrieved_scores: list[float],
    generated_answer: str,
    latency_ms: int,
    session_id: str,
) -> None:
    # metrics_sink is assumed to be a configured metrics/event client.
    event = {
        "query": query,
        "chunk_count": len(retrieved_chunks),
        "top_score": max(retrieved_scores) if retrieved_scores else 0.0,
        "min_score": min(retrieved_scores) if retrieved_scores else 0.0,
        "answer_length_tokens": len(generated_answer.split()),
        "latency_ms": latency_ms,
        "session_id": session_id,
        "timestamp": datetime.utcnow().isoformat(),
    }
    metrics_sink.emit(event)
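Acting on the drift signal can be sketched as a rolling comparison of recent vs baseline top_score means; the window sizes and alert threshold here are illustrative values to tune:

```python
def detect_score_drift(
    top_scores: list[float],
    baseline_window: int = 1000,
    recent_window: int = 100,
    max_drop: float = 0.05,
) -> bool:
    """Alert when the mean top_score over the most recent window drops more
    than `max_drop` below the baseline mean, a signal that the corpus or
    query mix has drifted from the embedding model."""
    if len(top_scores) < baseline_window + recent_window:
        return False  # not enough history yet
    baseline = top_scores[-(baseline_window + recent_window):-recent_window]
    recent = top_scores[-recent_window:]
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return baseline_mean - recent_mean > max_drop
```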

Step 5 — Architecture patterns for production scale

Three patterns that matter most once you are past the prototype stage.

Three patterns that pay off at scale. Start with semantic caching: it has the highest ROI for FAQ workloads.

Semantic cache flow: embed the incoming query, then look it up against cached embeddings by cosine similarity (threshold ≥ 0.92). A hit returns the cached answer in ~5 ms at zero LLM cost; a miss runs the full pipeline (hybrid search + rerank, then ~2 000 ms of LLM generation) and stores the (embedding, answer) pair. At the 0.92 threshold, “How to cancel my subscription?” and “Cancel account — how?” hit the same cache entry. FAQ and support workloads benefit most, with 30–50% of calls served from cache, because the same questions arrive phrased differently every time.
Python: semantic cache lookup
def get_cached_response(
    query: str,
    cache: SemanticCache,
    threshold: float = 0.92,
) -> str | None:
    # embed() and the metrics client are assumed to be defined elsewhere.
    query_embedding = embed(query)
    cached = cache.find_similar(query_embedding, threshold=threshold)
    if cached:
        metrics.increment("cache_hit")
        return cached.response
    return None
Multi-query expansion: an LLM expander turns the original query (“Why does RAG fail when docs are stale?”) into 2–3 variants, e.g. a keyword variant (“RAG staleness problem”), a semantic variant (“document freshness RAG”), and a broader variant (“index update latency LLM”). Each variant retrieves top-k in parallel; results are merged, deduplicated, and re-ranked by a cross-encoder, and the top-5 chunks go to the LLM. Casting this wider net is essential for multi-hop questions.
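The fan-out, merge, and dedup can be sketched as follows; retrieve is a placeholder callable returning (doc_id, score) pairs for one query variant, and in production the final sort would be a cross-encoder rerank:

```python
def multi_query_retrieve(
    variants: list[str],
    retrieve,                # callable: str -> list[tuple[int, float]]
    k: int = 5,
) -> list[int]:
    """Retrieve for each query variant in turn, dedup by doc id keeping
    the best score seen, and return the top-k doc ids."""
    best: dict[int, float] = {}
    for variant in variants:
        for doc_id, score in retrieve(variant):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best, key=lambda d: best[d], reverse=True)
    return ranked[:k]
```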
Fallback chain: if the top-1 retrieval score is ≥ 0.80, return a confident answer. Otherwise expand scope (broader query, more chunks); if the score after broader retrieval is ≥ 0.60, answer with a low-confidence flag. Below that, escalate to a human or surface “I don't know”. A confident wrong answer is worse than a transparent escalation, so never silently return a hallucinated answer. Log every fallback: a rising escalation rate signals index drift or an embedding model mismatch.

Step 6 — Access control: RBAC and ABAC

When multiple teams or tenants share the same RAG, a single unguarded index is a compliance failure waiting to happen. The rule is simple: enforce access inside retrieval, not in the UI. If the LLM receives forbidden chunks, it can leak restricted content in the generated answer.

Start from the scenario — choose the access model — enforce it at retrieval.

ScenarioAccess modelEnforcement shape
Public FAQ or single-team corpusNo extra layerStandard RAG
Shared knowledge with stable department or tenant boundariesRBACMetadata filter or partitioned indexes
Regulated access by region, clearance, project, customer attributesABACRicher metadata predicates or policy engine
Without access control there is one index and everyone retrieves everything: a sales rep can pull legal, HR, and finance docs. With access control, retrieval is filtered per user (the sales rep searches sales docs only; legal counsel searches legal + sales), and because the filter runs inside the vector store, forbidden chunks never reach the LLM.

Three enforcement strategies

A is the default. B adds physical isolation for strict compliance. C is a trap: forbidden chunks degrade context before filtering.

A. Metadata filter: one index, access enforced at query time inside the database, e.g. vector_search(query_vector, filter={"access_groups": {"$in": user.roles}}). Only chunks whose access_groups intersect the user's roles are scanned. Chunks are tagged at index time: a chunk “Q4 sales targets…” tagged access_groups: ["sales", "admin"] is returned to a user holding the "sales" role, while “Legal contract…” tagged access_groups: ["legal", "admin"] is filtered and never leaves the DB. Lowest overhead; the recommended default for most RAG systems.
B. Index partitioning: separate collections with physical isolation. An access router resolves the user's roles to the indexes they may query (e.g. a user with role internal_sales queries idx_public and idx_internal_sales, while idx_confidential_legal is never sent a query); results are then merged, deduplicated, and re-ranked across indexes before reaching the LLM. Each index has independent retention, quota, and encryption keys, so leaking from an inaccessible index is physically impossible. Trade-off: more indexes to manage, and the merge step adds latency. Best fit for strict compliance and multi-tenant SaaS.
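The access router for strategy B can be sketched as follows; the index names and the role-to-index mapping are illustrative:

```python
def route_and_query(
    user_roles: list[str],
    query: str,
    role_to_indexes: dict[str, list[str]],
    query_index,             # callable: (index_name, query) -> list[tuple[int, float]]
) -> list[tuple[int, float]]:
    """Query only the indexes the user's roles grant access to, then merge
    results by score. Inaccessible indexes are never sent a query."""
    allowed = sorted(
        {idx for role in user_roles for idx in role_to_indexes.get(role, [])}
    )
    merged: list[tuple[int, float]] = []
    for index_name in allowed:
        merged.extend(query_index(index_name, query))
    # A cross-encoder rerank would normally replace this score sort.
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```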
C. Post-retrieval filter (avoid as a primary strategy): the vector DB returns the top-10 with no filter, mixing documents from all roles; the app then checks the user's roles against each chunk. If 9 of the 10 returned chunks are forbidden (legal, finance, HR) and only 1 is allowed, the LLM receives a single weak chunk and produces a poor answer: the top-k ranking was polluted before filtering even began. With strategy A or B the DB returns only relevant and allowed chunks, so the ranking reflects true relevance within the user's scope. Use C only when A and B are not feasible.

Filter syntax by vector store:

Store | Strategy A metadata filter
Pinecone | filter={"access_groups": {"$in": user_roles}}
Weaviate | where={"path": ["access_groups"], "operator": "ContainsAny", "valueStringArray": user_roles}
pgvector | WHERE metadata->'access_groups' ?| user_roles
Qdrant | filter=FieldCondition(key="access_groups", match=MatchAny(any=user_roles))
Python: rag_with_rbac — resolve roles, build filter, retrieve
from typing import Callable

def rag_with_rbac(
    user_id: str,
    query: str,
    resolve_roles: Callable[[str], list[str]],
    retrieve: Callable[..., list[Chunk]],
    k: int = 10,
) -> list[Chunk]:
    """Retrieve only chunks the user is allowed to see.

    Chunk is assumed to be your retrieval result type.
    """
    roles = resolve_roles(user_id)
    # Build a filter: any chunk whose access_groups overlaps the user's roles
    permission_filter = {"access_groups": {"$in": roles}}
    return retrieve(query, filter=permission_filter, k=k)

Key takeaways
  • Treat RAG as a system. Chunking, retrieval, ranking, generation, and caching are interconnected — a failure in any layer breaks the whole.
  • Parent-child chunking is the strongest general baseline: 128-token children for precise retrieval, 512-token parents for LLM context.
  • Hybrid search (BM25 + vector) is non-negotiable. Pure vector misses exact matches — product codes, names, error messages.
  • Measure offline (golden test set) and online (retrieval score drift). Dropping scores mean the corpus has drifted from the embedding model.
  • Semantic caching gives the highest ROI on latency: 30–50% LLM call reduction on FAQ/support workloads at a 0.92 cosine threshold.
  • Enforce access inside retrieval, not in the UI. Metadata filter (strategy A) is the default: forbidden data never leaves the vector store.

Frequently asked questions

What is RAG in AI? Retrieval-Augmented Generation (RAG) is an architecture pattern where an LLM’s responses are grounded in documents retrieved from a knowledge base at query time. Instead of relying solely on knowledge baked into model weights, the system retrieves relevant context and includes it in the LLM’s prompt. This enables factual grounding, source citations, and up-to-date answers without retraining.

When does RAG fail in production? RAG fails most often due to retrieval quality problems (wrong chunks retrieved), context window mismanagement (too many chunks diluting attention), latency budget violations (retrieval + generation chain too slow), and corpus drift (documents updated without re-embedding). Each layer needs its own monitoring.

What is the best chunking strategy for RAG? Parent-child chunking is the strongest general baseline: small chunks (128 tokens) for precise retrieval, parent chunks (512 tokens) returned to the LLM for context. Structure-aware chunking outperforms fixed-size chunking when you have typed document collections (code, transcripts, tables).

RAG vs fine-tuning: which should I choose? Use RAG when knowledge must stay current, you need source traceability, or the corpus exceeds the context window. Tone and format are prompting concerns — fine-tuning is not needed for them. Fine-tuning is justified when: (1) the task requires reasoning patterns absent from the base model, or (2) you are distilling — compressing a large model’s task-specific capability into a smaller, faster, cheaper model for production. In practice, the strongest systems combine both: RAG for factual grounding with up-to-date sources, and a distilled model for low-latency domain inference.

How do you evaluate RAG quality? Build a golden test set of 100–500 query-document pairs with human relevance labels. Track Precision@k, Recall@k, and MRR for retrieval quality. Track faithfulness and answer relevance for generation quality. Monitor retrieval score distributions in production for drift.

How do you reduce latency in a RAG pipeline? In order of impact: semantic caching (30–50% call reduction), response streaming (perceived latency drops immediately), smaller re-ranker model, async embedding + retrieval, and reducing the number of retrieved chunks passed to the LLM.

How do you implement RBAC in RAG? Tag every chunk with permission metadata (access_groups, tenant_id, or department). At query time, resolve the user’s roles and pass a hard filter to the vector store — only chunks matching the user’s roles are returned. This means forbidden data never reaches the LLM. All major vector stores (Pinecone, Weaviate, pgvector, Qdrant) support metadata filtering natively.


Ready to build production AI systems?

We help teams ship AI that works in the real world. Let's discuss your project.
