Mactores Blog

RAG Explained: Architecture, Evaluation, & Production Systems

Written by Bal Heroor | Mar 11, 2026 9:00:00 AM

For most of computing history, software could only do what it was explicitly programmed to do. Rules had to be written. Logic had to be encoded. Every edge case had to be anticipated. This made software powerful but fundamentally brittle. The moment a situation fell outside the rules, the system broke.

Large language models changed this fundamentally. For the first time, a system could understand — process natural language, reason across domains, synthesize information, generate structured outputs, and do all of this without a single hand-written rule. The surface area of what software could now address exploded. Tasks that previously required human interpretation, like reading a contract, summarizing a report, answering an ambiguous question, and writing code, have become automatable.

This is why LLMs became the central infrastructure bet for enterprise AI. Not because they are the final form of intelligence, but because they eliminated the bottleneck that had constrained automation for decades: the need for explicit programming. An LLM can be directed with natural language, generalize to new situations, and operate across virtually any domain. That flexibility is unprecedented.

But reasoning well requires knowing the right things. This is precisely where their architecture creates constraints that matter enormously in production.

 If you prefer to start with a visual walkthrough, we have also covered this topic in depth on video — watch it here. Otherwise, let's get into the RAG architecture. 

 

 

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation decouples the model's reasoning capability from its knowledge. Instead of encoding knowledge in weights, RAG supplies the model with relevant, current, private information at inference time, which is retrieved from a live external knowledge store and injected directly into the prompt context.

The model's role becomes interpretation and synthesis. The retrieval system's role is evidence supply. This separation is architecturally clean: knowledge stays versioned, updatable, and auditable in the retrieval layer; the model handles the language understanding and generation it was built for.

 

The Five-Step Flow

1. User submits a query

2. The query is embedded into a vector and used to search a knowledge base

3. The top-k most relevant documents are retrieved

4. Those documents are injected into the LLM's prompt as context

5. The LLM generates a response grounded in that context

 

Step 5 is the key differentiator. The LLM is not generating freely from parametric memory. It is grounded by evidence. If the retrieved documents don't contain the answer, a well-instructed LLM will say so rather than confabulate.

 

The Two Core Subsystems

| Subsystem | Role | Technology |
| --- | --- | --- |
| Retriever | Finds the most relevant knowledge from the knowledge base given the query | Embedding models, vector databases, BM25, hybrid search |
| Generator | Takes the query + retrieved context and produces a final, grounded response | LLMs (GPT-4, Claude, Mistral, Llama, etc.) |

 

The retriever and generator are loosely coupled. You can swap one without changing the other. This modularity is one of RAG's greatest engineering strengths.

 

RAG vs. Fine-Tuning vs. Prompt Engineering

These three approaches are frequently treated as interchangeable, but they operate at entirely different layers of the stack and solve fundamentally different problems. Here's how they compare (with knowledge graphs included as a fourth point of reference) across the parameters that matter in production:

| Parameter | Prompt Engineering | Fine-Tuning | RAG | Knowledge Graphs |
| --- | --- | --- | --- | --- |
| What it changes | Input instruction only | Model weights | Retrieval context at inference | Structured entity-relationship store |
| Adds new knowledge? | No — bounded by training data | Weakly — facts are brittle in weights | Yes — dynamic, external, always current | Yes — structured, explicit relationships |
| Handles private data? | No | Only if included in training data | Yes | Yes |
| Knowledge freshness | Static (training cutoff) | Static (retraining required) | Real-time (re-index on update) | Near real-time (graph update) |
| Update cost | None — just edit the prompt | High — requires full or LoRA retraining cycle | Low — re-embed and re-index changed documents | Medium — extract entities, update graph |
| Best for | Tone, format, behavior, task framing | Domain style, reasoning patterns, output structure | Knowledge-intensive Q&A over evolving private data | Highly connected data with explicit relationships |
| Hallucination risk | High — no grounding mechanism | Medium — facts encoded loosely in weights | Low — grounded in retrieved evidence | Low — answers derived from explicit graph facts |
| Auditability | None | None | High — every answer traceable to source chunks | High — every answer traceable to graph nodes |
| Inference cost impact | Minimal | None (cost is in training) | Moderate — embedding + retrieval + LLM tokens | Moderate — graph traversal + LLM tokens |
| Implementation complexity | Low | High | Medium | High |

 

The Complete RAG Pipeline

RAG consists of two distinct pipelines running at different times. Understanding them separately is essential for debugging and optimization.

 

Offline Pipeline: Data Preparation (Runs Periodically)

This pipeline processes raw documents and prepares them for retrieval. It runs when documents are added, updated, or deleted — not on every query.

Stage 1 — Document Ingestion: Raw documents are collected from sources (S3, SharePoint, Confluence, CRM APIs, databases). Format-specific parsers handle each file type. Metadata is extracted and stored alongside content.

Stage 2 — Document Parsing: Raw bytes are converted into structured text. PDFs are parsed for text layers (or OCR'd if scanned). HTML is stripped of navigation and boilerplate. Tables are extracted as structured content. Parsing quality directly determines the ceiling of your RAG system.

Stage 3 — Chunking: Parsed documents are split into retrievable units. The strategy used here (covered in depth in Section 5) is arguably the most impactful decision in the entire pipeline.

Stage 4 — Embedding Generation: Each chunk is passed through an embedding model (e.g., Amazon Titan Embeddings, OpenAI text-embedding-3-large, or open-source models like bge-large-en) to produce a dense vector representation, typically 768 to 3072 dimensions depending on the model.

Stage 5 — Vector Indexing: Vectors are stored in a vector database. An index structure (HNSW, IVF, etc.) is built to enable fast approximate nearest neighbor search at query time.

 

Online Pipeline: Query Flow (Runs on Every Request)

Stage 1 — Query Embedding: The user's query is embedded using the same model used to embed documents. This is critical — using different models breaks the semantic similarity space.

Stage 2 — Vector Retrieval: The query vector is used to search the index for the top-k nearest neighbors (typically k=20 to 50 before re-ranking).

Stage 3 — Re-Ranking: Retrieved candidates are re-scored by a cross-encoder model to select the most relevant k' documents (typically k'=3 to 10).

Stage 4 — Context Assembly: Retrieved chunks are deduplicated, ordered by relevance, and formatted into a prompt template.

Stage 5 — LLM Generation: The assembled prompt (system instructions + retrieved context + user query) is sent to the LLM, which generates a grounded response.

Understanding both pipelines lets you localize failures precisely. A bad answer is either a retrieval failure (offline pipeline problem) or a generation failure (online pipeline problem). Knowing which determines your fix.



Vector Search: The Engine Behind RAG

When an embedding model processes text, it maps that text to a point in a high-dimensional vector space. The position of that point encodes semantic meaning. Texts that mean similar things map to nearby points. This is the geometric foundation of RAG retrieval.

 

Embeddings: A Brief Technical Grounding

An embedding model is a transformer-based neural network trained to produce representations where semantic similarity correlates with geometric proximity. The model outputs a vector of floats — 768, 1536, or 3072 dimensions depending on the model.

The training objective typically involves contrastive learning: similar text pairs (e.g., a question and its answer) are pushed together in the vector space. Dissimilar pairs are pushed apart. After training, the model generalizes this learned geometry to new text.

 

Similarity Metrics: The Geometry of Relevance

Once documents and queries are embedded, retrieval is a geometric problem: find the document vectors closest to the query vector.

Cosine similarity measures the angle between two vectors, independent of their magnitudes. This is the most common metric for text embeddings because it captures directional similarity (semantic meaning) without being dominated by vector magnitude.

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

 

Range: [-1, 1]

1 = identical direction (maximum semantic similarity)

0 = orthogonal (no semantic relationship)

-1 = opposite direction

 

The dot product (inner product) is equivalent to cosine similarity when vectors are normalized to unit length. Many embedding models normalize their outputs, making dot product and cosine similarity interchangeable. Dot product is computationally cheaper, which matters at scale.

Euclidean distance (L2 distance) measures absolute distance between vector endpoints. Less commonly used for text embeddings because it conflates magnitude with direction.
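The equivalence between cosine similarity and the dot product on normalized vectors can be checked in a few lines of plain Python (a minimal sketch using toy vectors; no external libraries assumed):

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return sqrt(dot(v, v))

def cosine_similarity(a, b):
    # Angle between vectors, independent of magnitude
    return dot(a, b) / (norm(a) * norm(b))

def normalize(v):
    # Scale to unit length so dot product equals cosine similarity
    n = norm(v)
    return [x / n for x in v]

a = [0.3, 0.7, 0.1]
b = [0.2, 0.9, 0.05]

# For unit-length vectors, the two metrics coincide
assert abs(cosine_similarity(a, b) - dot(normalize(a), normalize(b))) < 1e-9
```

This is why many vector databases store pre-normalized embeddings and use the cheaper dot product at query time.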

 

Enterprise Vector Database Options

The vector database market has fragmented rapidly. Choosing the right one requires understanding your scale, latency requirements, filtering needs, and infrastructure constraints.

 

AWS-Native Options

1. Amazon OpenSearch (with vector search)

OpenSearch's k-NN plugin adds HNSW-based vector search to a battle-tested distributed search engine. The critical advantage is that OpenSearch unifies full-text BM25 search with vector similarity search in a single system — hybrid retrieval without a separate pipeline. It also supports rich metadata filtering via its mature query DSL.

Suitable for: teams already operating OpenSearch for log analytics or full-text search who want to add vector retrieval without introducing a new system.

 

2. Amazon Aurora PostgreSQL with pgvector

pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a relational database. Vectors are stored as columns alongside structured data. This enables exact SQL joins between vector retrieval and structured metadata filtering.

-- Retrieve top-5 chunks for a query vector, filtered by department and year
SELECT content, metadata, embedding <=> $1::vector AS distance
FROM document_chunks
WHERE metadata->>'department' = 'legal'
  AND (metadata->>'year')::int > 2022
ORDER BY distance
LIMIT 5;

pgvector supports exact search (a sequential scan with no index) as well as approximate index types (IVFFlat and HNSW). It is not the fastest vector search engine at massive scale, but for datasets under 10M vectors with complex metadata requirements, its ability to run SQL-native vector queries is operationally valuable.

 

3. Amazon Kendra

Kendra is a managed enterprise search service that abstracts the indexing, retrieval, and ranking pipeline. It natively connects to S3, SharePoint, Confluence, Salesforce, and other enterprise content sources. Retrieval uses a combination of semantic and keyword matching without requiring you to manage embeddings.

Kendra is best for enterprise customers who need fast deployment without deep customization. The trade-off is reduced flexibility — you cannot control chunking strategies, embedding models, or re-ranking in the way a custom pipeline would allow.

 

4. Amazon Bedrock Knowledge Bases

AWS Bedrock Knowledge Bases is the highest-level abstraction for RAG on AWS. You connect an S3 bucket, select an embedding model, and AWS handles the full offline pipeline: parsing, chunking, embedding, and indexing into a managed vector store (backed by OpenSearch Serverless). The query API handles embedding + retrieval automatically.

For teams building on Bedrock LLMs, this is the lowest-friction path to a working RAG system. The cost is customization: chunking strategy and retrieval logic are partially configurable but not fully open.

 

Naive RAG Architecture

Before building complex systems, understand the simplest possible RAG implementation. Naive RAG is what most tutorials demonstrate. Knowing its limits is what motivates every optimization in Section 9 and beyond.

 

The Pipeline

User Query

Embedding Model (query → vector)

Vector Database (vector similarity search)

Top-k Retrieved Chunks (e.g., top 3-5)

Prompt Template (system prompt + chunks + query)

LLM

Response

 

The implementation is genuinely simple:

# Simplified naive RAG
def query_naive_rag(user_query: str, k: int = 5) -> str:
    # 1. Embed the query
    query_vector = embedding_model.embed(user_query)

    # 2. Retrieve top-k chunks
    results = vector_db.similarity_search(query_vector, k=k)
    chunks = [r.content for r in results]

    # 3. Build prompt
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: {user_query}

Answer:"""

    # 4. Generate
    return llm.generate(prompt)

 

Why Naive RAG Breaks in Production

  • Irrelevant retrieval — Vector search returns semantically similar chunks, not necessarily the ones that answer the question; without re-ranking, the LLM receives noise.
  • Poor chunking destroys context — Fixed-size splits produce mid-argument fragments where the retrieved chunk references "the factors discussed in the previous section" — a section the LLM never sees.
  • Context window overload — The "lost in the middle" paper shows LLM performance degrades when relevant content is buried in a long context; naive RAG dumps chunks in retrieval order, not importance order.
  • Single-query retrieval fails multi-part questions — A question spanning Q3 revenue and a five-year trend requires two distinct retrievals; a single vector query will capture one and miss the other.
  • No query understanding — Conversational queries like "What about the new policy?" are embedded as-is, with no resolution of what "new policy" refers to in context.

 

Advanced RAG Architecture

Advanced RAG is not a single architecture — it is a collection of targeted improvements to naive RAG, each solving a specific failure mode. A production system selectively applies these improvements based on the requirements of the use case.

 

The Components

User Query

[1] Query Understanding & Rewriting

[2] Hybrid Retrieval (vector + BM25)

[3] Metadata Filtering

Candidate Set (top-50 to top-100 chunks)

[4] Cross-Encoder Re-Ranking

Refined Set (top-3 to top-10 chunks)

[5] Context Compression & Assembly

[6] Prompt Construction (with citations)

LLM

[7] Response with Source Attribution

 

Each numbered stage is an optional enhancement. You don't implement all of them at once. You diagnose failure modes and add the relevant fix.

[1] Query Understanding & Rewriting: The raw user query is analyzed and transformed before retrieval. Techniques include query expansion (adding synonyms or related terms), query decomposition (splitting multi-part questions into sub-queries), and HyDE (Hypothetical Document Embedding — generating a hypothetical answer and using its embedding to retrieve documents that look like the answer).
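HyDE itself fits in a few lines. In this sketch, `llm_generate`, `embed`, and `vector_search` are placeholder callables standing in for your LLM, embedding model, and vector store; the toy components below exist only to make the example runnable:

```python
def hyde_retrieve(query, llm_generate, embed, vector_search, k=20):
    # HyDE: embed a hypothetical ANSWER rather than the query itself,
    # so retrieval matches documents that look like answers.
    hypothetical_answer = llm_generate(
        f"Write a short passage that plausibly answers this question:\n{query}"
    )
    return vector_search(embed(hypothetical_answer), k=k)

# Stand-in components for illustration only:
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}

def toy_llm(prompt):
    return "a hypothetical answer passage"

def toy_embed(text):
    return [1.0, 0.0]

def toy_search(vec, k):
    score = lambda d: sum(x * y for x, y in zip(docs[d], vec))
    return sorted(docs, key=score, reverse=True)[:k]
```

The key design point: the query and the documents live in different linguistic registers (questions vs. answers), and HyDE bridges that gap by searching answer-space with an answer-shaped vector.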

[2] Hybrid Retrieval: Combines vector similarity search with BM25 keyword search (see Section 10). Dense retrieval captures semantic meaning. Sparse retrieval captures exact keyword matches. Together they outperform either alone across virtually every benchmark.

[3] Metadata Filtering: Before or after vector search, filter by structured attributes. Retrieval scoped to department=engineering AND date>2024-01-01 is dramatically more precise than full-corpus retrieval.

[4] Re-Ranking: A cross-encoder model re-scores candidates with full query-document attention (unlike bi-encoder embedding models). This second-pass scoring is expensive but dramatically improves precision.

[5] Context Compression: Remove portions of retrieved chunks that are not relevant to the query. Reduces noise in the context window and leaves more room for additional documents.

[6] Citation-Aware Prompting: Instruct the model to attribute every factual claim to its source chunk. This enables post-generation grounding verification.

 

Retrieval Improvements

The effectiveness of a RAG system depends heavily on the quality of its retrieval stage. Advanced retrieval techniques help ensure that the most relevant documents reach the LLM before answer generation.

 

BM25: Classical Information Retrieval

BM25 (Best Match 25) is a probabilistic ranking function developed in the 1990s that remains one of the strongest baseline retrieval algorithms for exact keyword matching.

The BM25 score for a document D given query Q is:

BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]

 

Where:

  • f(qᵢ, D) = term frequency of query term i in document D
  • IDF(qᵢ) = inverse document frequency (rare terms score higher)
  • |D| = document length, avgdl = average document length
  • k₁ = term frequency saturation parameter (typically 1.2-2.0)
  • b = length normalization parameter (typically 0.75)

 

BM25 excels at exact match retrieval: product codes, error codes, proper nouns, legal terms, abbreviations. A query for "SOX Section 404 compliance" will retrieve documents containing "SOX Section 404" with high precision via BM25. A dense vector search on the same query might retrieve semantically related documents about regulatory compliance that don't mention SOX at all.

BM25 fails on semantic paraphrase. A query about "revenue growth" won't retrieve a document that discusses "top-line expansion" unless those exact words also appear.
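The formula above can be sanity-checked with a minimal pure-Python scorer (a sketch using the common smoothed-IDF variant; tokenization and stopword handling are omitted):

```python
from math import log

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    # doc and every corpus entry are token lists; corpus includes doc.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = log((N - df + 0.5) / (df + 0.5) + 1)     # smoothed IDF: rare terms score higher
        tf = doc.count(term)                           # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    ["sox", "section", "404", "compliance"],
    ["revenue", "growth", "report"],
]
# The document containing the exact terms scores higher than one that doesn't
assert bm25_score(["sox", "404"], corpus[0], corpus) > bm25_score(["sox", "404"], corpus[1], corpus)
```

Note how a document containing none of the query terms scores exactly zero: this is BM25's exact-match behavior, and precisely why it misses paraphrases.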

 

Hybrid Search: Reciprocal Rank Fusion

Hybrid search combines BM25 and dense vector retrieval results into a single ranked list. The standard merging algorithm is Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σₛ 1 / (k + rankₛ(d))

Where rankₛ(d) is the rank of document d in retrieval system s, and k is a smoothing constant (typically 60).

def reciprocal_rank_fusion(bm25_results, dense_results, k=60):
    scores = {}
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

 

The elegance of RRF is that it doesn't require score normalization. BM25 scores and cosine similarities live in different ranges — RRF operates only on ranks, making them directly combinable.

Empirically, hybrid search with RRF outperforms either pure vector or pure BM25 retrieval across most benchmarks. The intuition: if a document ranks high in both BM25 and vector retrieval, it almost certainly matches the query on both exact and semantic dimensions. RRF amplifies this agreement.

 

Metadata Filtering

Metadata filtering constrains the retrieval search space using structured attributes stored alongside vectors. This is the RAG equivalent of a SQL WHERE clause before a JOIN.

Every document chunk should carry metadata extracted at ingestion time:

{
  "content": "The company's Q3 2024 revenue was $4.2 billion...",
  "embedding": [0.023, -0.144, 0.891, ...],
  "metadata": {
    "source": "Q3-2024-earnings-report.pdf",
    "document_type": "earnings_report",
    "department": "finance",
    "year": 2024,
    "quarter": "Q3",
    "created_at": "2024-10-15",
    "author": "CFO Office"
  }
}

 

Pre-filtering (filter before vector search) is more efficient but may miss relevant documents if filters are too restrictive. Post-filtering (filter after vector search) is safer but wastes compute on irrelevant vectors. Most production systems use pre-filtering at the category level (document type, department) and rely on re-ranking for finer relevance judgments.
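Post-filtering is essentially a predicate over the result set. A minimal sketch, assuming each result is a dict carrying a `metadata` key (the shape shown in the example above):

```python
def post_filter(results, filters):
    # Keep only results whose metadata matches every filter key/value pair.
    return [
        r for r in results
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]

results = [
    {"id": 1, "metadata": {"department": "legal", "year": 2024}},
    {"id": 2, "metadata": {"department": "hr", "year": 2024}},
]
legal_only = post_filter(results, {"department": "legal"})
```

Pre-filtering pushes the same predicate into the vector store's query instead, which is why it saves compute but risks over-constraining the candidate set.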

 

Re-Ranking: Cross-Encoders

The retrieval pipeline's first stage uses bi-encoders — models that embed query and document independently, then compare vectors. Bi-encoders are fast (queries and documents can be pre-computed and cached) but their independence assumption limits relevance accuracy.

Cross-encoders process the query and document together in a single forward pass, enabling full attention between every token in the query and every token in the document. This is far more accurate but requires running the model fresh for every query-document pair — you can't pre-compute document representations.

The practical pattern: use bi-encoders to retrieve 50-100 candidates, use cross-encoders to re-rank to top 5-10.

# Cross-encoder re-ranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each query-document pair
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Sort by score, take top-k
ranked = sorted(zip(scores, candidate_chunks), key=lambda x: x[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]

 

Popular cross-encoder models for re-ranking include ms-marco-MiniLM variants for general English retrieval and bge-reranker models for multilingual use cases. Cohere Rerank is a managed API option for teams that don't want to self-host.

 

Context Assembly and Prompt Construction

Retrieval produces a set of candidate chunks. What you do with those chunks before sending them to the LLM significantly affects response quality. Context assembly is an under-appreciated engineering discipline.

 

The Context Assembly Problem

The LLM's context window is finite and expensive. Every token used for context is a token not available for the response. More critically, LLMs have a documented "lost in the middle" problem: performance degrades for relevant information placed in the center of a long context window. The model reliably attends to the beginning and end of context but under-attends to the middle.

This means context assembly is not just packing — it's strategic ordering.

 

Deduplication

After retrieval and re-ranking, multiple chunks from the same document section may appear in the candidate set. Overlapping sliding window chunks, in particular, produce near-duplicate content. Deduplicate before prompt construction using:

  • Exact deduplication: remove identical chunks
  • Near-deduplication: compute pairwise cosine similarity between chunks; remove chunks above a similarity threshold (e.g., 0.95)

Duplicates inflate context length without adding information and increase cost.
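A greedy near-deduplication pass can be sketched directly from the bullet above (embeddings assumed unit-normalized, so dot product equals cosine similarity):

```python
def near_deduplicate(chunks, embeddings, threshold=0.95):
    # Greedy pass: keep a chunk only if its similarity to every
    # already-kept chunk stays below the threshold.
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        if all(sum(a * b for a, b in zip(vec, kv)) < threshold for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept

chunks = ["alpha", "alpha copy", "beta"]
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
unique = near_deduplicate(chunks, vecs)  # drops "alpha copy"
```

Because the pass is greedy, the earlier (higher-ranked) of two near-duplicates survives, which is exactly the behavior you want after re-ranking.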

 

Context Compression

Even after deduplication, retrieved chunks may contain irrelevant sentences. LLMLingua and similar context compression techniques use a small language model to identify and remove sentences from chunks that are not relevant to the query, reducing context length by 2-10x with minimal accuracy degradation.

Original chunk (400 tokens): "The company was founded in 1995. [CEO history, founding story, early products...]

Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year. [additional context not related to query]"

 

Compressed chunk (80 tokens): "Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year."

Context compression is especially valuable when dealing with long-tail enterprise documents where relevant information is sparse within each document.

 

Prompt Construction with Citation Anchors

Instruct the model to cite sources using reference markers in the context:

SYSTEM: You are a research assistant. Answer questions using ONLY the provided context.

For every factual claim, include a citation in the format [Source N].

If the answer is not in the context, say "I cannot find this in the available documents."

 

CONTEXT:

[Source 1] (Q3 2024 Earnings Report, p.4)

Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year...

 

[Source 2] (Annual Report 2024, p.12)

The growth was driven primarily by expansion in the APAC region...

 

QUESTION: What drove revenue growth in Q3 2024?

This prompt structure serves two purposes: it anchors the model's response to retrievable evidence, and it enables post-generation grounding verification — you can programmatically check whether cited [Source N] passages actually support the claims made.
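Assembling this template programmatically is straightforward. A minimal sketch, where the chunk shape (label, text pairs) is an assumption for illustration:

```python
def build_cited_prompt(question, chunks):
    # chunks: list of (source_label, text) pairs, best-ranked first.
    # Numbered [Source N] markers give the model anchors it can cite.
    context = "\n\n".join(
        f"[Source {i}] ({label})\n{text}"
        for i, (label, text) in enumerate(chunks, start=1)
    )
    return (
        "You are a research assistant. Answer questions using ONLY the provided context.\n"
        "For every factual claim, include a citation in the format [Source N].\n"
        'If the answer is not in the context, say "I cannot find this in the available documents."\n\n'
        f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    )
```

Keeping the source labels machine-parseable is what makes the post-generation verification step programmatic rather than manual.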

 

Ordering Chunks for Maximum Attention

Given the "lost in the middle" phenomenon, place the most relevant chunks at the beginning and end of your context window, with less relevant chunks in the middle. If you have 5 chunks ranked 1-5, order them: [1, 3, 5, 4, 2] — rank 1 at position 0, rank 2 at the last position.
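This reordering is a simple interleave (a minimal sketch reproducing the [1, 3, 5, 4, 2] example above):

```python
def reorder_for_attention(chunks_by_rank):
    # Place the best-ranked chunks at the edges of the context window:
    # even positions fill from the front, odd positions from the back.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks [1, 2, 3, 4, 5] become [1, 3, 5, 4, 2]:
# rank 1 first, rank 2 last, weakest chunks buried in the middle.
```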

 

The 5 Types of RAG Architectures

RAG is not a single pattern — it is a design space. These five architectures represent distinct points in that space, each suited to different problem types.

| Parameter | Naive RAG | Advanced RAG | Agentic RAG | Multi-Hop RAG | Graph RAG |
| --- | --- | --- | --- | --- | --- |
| Core pattern | Single query → retrieve → generate | Query rewriting → hybrid retrieval → re-rank → generate | LLM agent decides when and what to retrieve dynamically | Chain of retrievals, each refining the next | Knowledge graph traversal + vector search |
| Retrieval strategy | Single vector search | Hybrid BM25 + vector with re-ranking | Dynamic, tool-driven, iterative | Sequential, context-informed hops | Graph traversal + vector similarity |
| Query understanding | Raw query embedded as-is | HyDE rewriting, query decomposition | Agent plans and reformulates queries autonomously | LLM evaluates sufficiency and generates the next sub-query | Entity extraction drives the traversal path |
| Handles multi-part questions? | No | Partially | Yes | Yes — by design | Yes, via graph relationships |
| Handles entity relationships? | No | No | Partially | No | Yes — core strength |
| Implementation complexity | Low | Medium | High | High | Very High |
| Latency | Lowest | Low–Medium | High (multiple LLM calls) | High (multiple retrieval hops) | Medium–High |
| Hallucination risk | High | Low | Low | Low | Very Low |
| Cost | Lowest | Moderate | High | High | High |
| AWS tooling | Bedrock Knowledge Bases | Bedrock KB + OpenSearch | Amazon Bedrock Agents | Custom Lambda + Step Functions | Neptune + OpenSearch + Bedrock |
| Best for | Demos, simple FAQ, POCs | Enterprise search, support copilots, research | Complex reasoning, multi-domain Q&A, ambiguous queries | Evidence chaining, legal discovery, literature review | Regulatory compliance, supply chain, interconnected enterprise data |
| Production-ready? | Not recommended | Yes | Yes, with orchestration framework | Yes, for specific use cases | Yes, for graph-native domains |



AWS-Native RAG Architectures

AWS provides two primary paths to building production RAG: a managed path with Amazon Bedrock Knowledge Bases, and a custom path for teams requiring architectural control.

 

Architecture Option 1: Amazon Bedrock Knowledge Bases (Managed)

This is the fastest path from zero to a working RAG pipeline on AWS. Bedrock Knowledge Bases abstracts the entire offline pipeline.

Architecture:

 

[Amazon S3]

→ Documents (PDF, DOCX, HTML, CSV, XLSX)

[Bedrock Knowledge Bases]

→ Automatic chunking (fixed or semantic)

→ Automatic embedding (Amazon Titan Embeddings or Cohere)

→ Automatic indexing

[Vector Store] (managed OpenSearch Serverless, Aurora pgvector, or Pinecone)

[Bedrock Retrieve & Generate API]

→ Handles query embedding, retrieval, and LLM invocation in one call

[Application Layer] (Lambda, ECS, API Gateway)

 

Key API call:

import boto3

 

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

 

response = bedrock_agent_runtime.retrieve_and_generate(

input={'text': 'What is our Q3 2024 revenue?'},

retrieveAndGenerateConfiguration={

'type': 'KNOWLEDGE_BASE',

'knowledgeBaseConfiguration': {

'knowledgeBaseId': 'YOUR_KB_ID',

'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',

'retrievalConfiguration': {

'vectorSearchConfiguration': {

'numberOfResults': 5,

'overrideSearchType': 'HYBRID' # BM25 + vector

}

}

}

}

)

 

print(response['output']['text'])

print(response['citations']) # Sources returned with every response

 

When to use Bedrock Knowledge Bases: Teams without dedicated ML engineers, fast deployment requirements, standard document types, and acceptable customization constraints.

 

Architecture Option 2: Custom RAG Pipeline (Full Control)

When Bedrock Knowledge Bases' abstractions are insufficient — custom chunking strategies, specialized embedding models, non-standard document types, complex re-ranking pipelines — build a custom pipeline using AWS primitives.

Architecture:

 

OFFLINE PIPELINE:

S3 (raw documents)

↓ [S3 Event → Lambda trigger]

Lambda (document parsing: PyMuPDF, Unstructured.io)

Lambda (chunking: semantic chunker)

Bedrock Embedding API (Titan Embeddings V2 or Cohere)

OpenSearch Service (vector index with metadata)

↓ [Step Functions orchestrates the above]

 

ONLINE PIPELINE:

API Gateway

Lambda (query handler)

↓ [parallel]

Bedrock Embedding API (query vector) | OpenSearch BM25 query

↓ [merge with RRF]

Lambda (cross-encoder re-ranking)

Lambda (context assembly + prompt construction)

Bedrock LLM API (Claude, Llama 3, etc.)

API Gateway Response

 

AWS Service Mapping:

| Pipeline Stage | AWS Service |
| --- | --- |
| Document storage | Amazon S3 |
| Pipeline orchestration | AWS Step Functions |
| Document parsing & chunking | AWS Lambda |
| Embedding generation | Amazon Bedrock (Titan Embeddings V2) |
| Vector storage + hybrid search | Amazon OpenSearch Service |
| Re-ranking | Lambda + self-hosted cross-encoder (or Cohere Rerank API) |
| LLM generation | Amazon Bedrock (Claude 3.5, Llama 3.1, etc.) |
| API serving | API Gateway + Lambda |
| Caching | Amazon ElastiCache (Redis) |
| Monitoring | Amazon CloudWatch + AWS X-Ray |

 

When to use custom pipelines: Domain-specific embedding models, custom chunking for specialized document types (e.g., legal contracts, code, medical records), multi-language requirements, custom re-ranking models, and complex multi-hop retrieval logic.

 

Evaluating a RAG System (The Most Critical Section)

Evaluation is not optional. It is the engineering practice that separates a working system from a broken one. The optimization levers below target the failure modes that evaluation most commonly surfaces, grouped by the part of the system they improve:

 

Retrieval Optimizations

  • Better chunking: If your retrieval recall is below 0.80, investigate your chunking before anything else. Move from fixed-size to semantic chunking for prose documents. Implement hierarchical indexing for long-form documents.
  • Hybrid search: It should be the default for any system handling specialized terminology, proper nouns, or regulatory/legal language. The RRF implementation is 30 lines of code and consistently improves retrieval metrics by 5-15%.
  • Query rewriting with HyDE: Doing this improves retrieval for short or ambiguous queries. HyDE (Hypothetical Document Embedding) works by asking the LLM to generate a hypothetical answer to the query, embedding that answer, and using the resulting vector to search.
  • Metadata extraction and filtering reduce the retrieval search space dramatically. A well-tagged corpus with document type, department, and date metadata allows scoping retrieval to the relevant subset before vector search runs.
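Knowing whether recall is "below 0.80" requires measuring it. A minimal recall@k harness, assuming you have a labeled evaluation set of queries paired with the IDs of documents known to be relevant (the `retrieve` callable is a placeholder for your retrieval pipeline):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of known-relevant documents found in the top-k results.
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(eval_set, retrieve, k=5):
    # eval_set: list of (query, relevant_ids); retrieve(query) -> ranked id list.
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Run this on every chunking or retrieval change; without it, "better chunking" is a guess rather than a measurement.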

 

Generation Optimizations

  • System prompt engineering: Include explicit grounding instructions: "Answer only using the provided context. If the context does not contain the answer, say so explicitly." Include citation instructions: "Cite every factual claim with [Source N]." Include uncertainty instructions: "If context is ambiguous or contradictory, note the uncertainty."
  • Context compression: Reduces noise in the context window. LLMLingua, Selective Context, or simple extractive compression methods can reduce context length by 2-5x. Shorter context → lower cost, better attention focus, reduced hallucination from irrelevant passages.

 

Cost Optimizations

  • Query caching: Cache responses for identical or near-identical queries. Enterprise environments often have repetitive query patterns — "What is our PTO policy?" gets asked hundreds of times daily. Cache hit rates of 20-40% are common.
  • Dynamic retrieval: Implement confidence-based retrieval: if the top retrieved chunk scores above a similarity threshold and re-ranker score exceeds another threshold, send only that chunk to the LLM.
  • Model routing: Classify query complexity. Route simple factual queries to Claude Haiku or Llama 3.1 8B. Reserve Claude 3 Sonnet or Opus for complex multi-hop or analytical queries.
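The caching bullet above can be sketched as an exact-match cache keyed on a normalized query string (a simplified illustration; a production system might add embedding-based matching for near-duplicate queries and a TTL for freshness):

```python
import hashlib

class QueryCache:
    # Exact-match cache: whitespace and case differences hit the same entry.
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, response):
        self._store[self._key(query)] = response

cache = QueryCache()
cache.put("What is our PTO policy?", "Employees accrue PTO monthly...")
hit = cache.get("  what is our  PTO policy? ")  # normalization produces a hit
```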



Conclusion

There is a persistent myth in the enterprise AI market: that deploying an LLM is the hard part. It isn't. The hard part is connecting that LLM to your knowledge reliably, evaluating whether the connection is working, and maintaining that connection at production scale as your knowledge evolves.

RAG is not a prompt pattern. It is not a hack around LLM limitations. It is a full information retrieval system — built on decades of IR research — that happens to use an LLM as its synthesis layer. The companies that understand this are building AI systems that work. The companies that don't are shipping demos that break in production.

On AWS, the practical path is this: start with Amazon Bedrock Knowledge Bases for fast deployment and validation. Once you understand your requirements — your failure modes, your performance bottlenecks, your cost structure — move to a custom pipeline using OpenSearch for hybrid retrieval, Lambda for orchestration, and Bedrock LLMs for generation. Use Step Functions to make the offline pipeline auditable and replayable.

The difference between a production AI assistant and a demo is retrieval architecture, evaluation frameworks, and cost optimization. You now have the foundation for all three.

If you are building on AWS and want to get the architecture right from the start, this is exactly what we do at Mactores. We consult with enterprises on AWS AI architecture — from selecting the right RAG pattern for your use case to building the evaluation infrastructure that keeps it working in production. If you are at that stage, we would be glad to talk.