For most of computing history, software could only do what it was explicitly programmed to do. Rules had to be written. Logic had to be encoded. Every edge case had to be anticipated. This made software powerful but brittle. The moment a situation fell outside the rules, the system broke.
Large language models changed this fundamentally. For the first time, a system could understand. It could process natural language, reason across domains, synthesize information, generate structured outputs, and do all of this without a single hand-written rule. Tasks that previously required human interpretation, like reading a contract, summarizing a report, answering an ambiguous question, and writing code, have become automatable.
This is why LLMs became the central infrastructure bet for enterprise AI. Not because they are the final form of intelligence, but because they eliminated the bottleneck that had constrained automation for decades: the need for explicit programming.
An LLM can be directed with natural language, generalize to new situations, and operate across virtually any domain. That flexibility is unprecedented. But reasoning well requires knowing the right things. This is precisely where their architecture creates constraints that matter enormously in production.
If you prefer to start with a visual walkthrough, we have also covered this topic in depth on video — watch it here. Otherwise, let's get into the RAG architecture.
What Is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation decouples the model's reasoning capability from its knowledge. Instead of encoding knowledge in weights, RAG supplies the model with relevant, current, private information at inference time, which is retrieved from a live external knowledge store and injected directly into the prompt context.
The model's role becomes interpretation and synthesis. The retrieval system's role is evidence supply. This separation is architecturally clean: knowledge stays versioned, updatable, and auditable in the retrieval layer; the model handles the language understanding and generation it was built for.
The Five-Step Flow
At a high level, every RAG request moves through five steps: (1) the user submits a query, (2) the query is embedded into a vector, (3) the most relevant chunks are retrieved from the knowledge store, (4) those chunks are assembled into the prompt alongside the query, and (5) the LLM generates a response.
The last step is the key differentiator. The LLM is not generating freely from parametric memory. It is grounded in evidence. If the retrieved documents don't contain the answer, a well-instructed LLM will say so rather than confabulate.
The Two Core Subsystems
| Subsystem | Role | Technology |
|---|---|---|
| Retriever | Finds the most relevant knowledge from the knowledge base given the query | Embedding models, vector databases, BM25, hybrid search |
| Generator | Takes the query + retrieved context and produces a final, grounded response | LLMs (GPT-4, Claude, Mistral, Llama, etc.) |
The retriever and generator are loosely coupled. You can swap one without changing the other. This modularity is one of RAG's greatest engineering strengths.
RAG vs. Fine-Tuning vs. Prompt Engineering
These three approaches are frequently treated as interchangeable, but they operate at entirely different layers of the stack and solve fundamentally different problems. Here's how they compare across the parameters that matter in production:
| Parameter | RAG | Fine-Tuning | Prompt Engineering |
|---|---|---|---|
| What it changes | Retrieval context at inference | Model weights | Input instruction only |
| Adds new knowledge? | Yes — dynamic, external, always current | Weakly — facts are brittle in weights | No — bounded by training data |
| Handles private data? | Yes | Only if included in training data | No |
| Knowledge freshness | Real-time (re-index on update) | Static (retraining required) | Static (training cutoff) |
| Update cost | Low — re-embed and re-index changed documents | High — requires full or LoRA retraining cycle | None — just edit the prompt |
| Best for | Knowledge-intensive Q&A over evolving private data | Domain style, reasoning patterns, output structure | Tone, format, behavior, task framing |
| Hallucination risk | Low — grounded in retrieved evidence | Medium — facts encoded loosely in weights | High — no grounding mechanism |
| Auditability | High — every answer traceable to source chunks | None | None |
| Inference cost impact | Moderate — embedding + retrieval + LLM tokens | None (cost is in training) | Minimal |
| Implementation complexity | Medium | High | Low |
The Complete RAG Pipeline
RAG consists of two distinct pipelines running at different times. Understanding them separately is essential for debugging and optimization.
Offline Pipeline: Data Preparation (Runs Periodically)
This pipeline processes raw documents and prepares them for retrieval. It only runs when documents are added, updated, or deleted.

Here's a breakdown of the five stages of the Offline pipeline:
Stage 1 — Document Ingestion: Raw documents are collected from sources (S3, SharePoint, Confluence, CRM APIs, databases). Format-specific parsers handle each file type. Metadata is extracted and stored alongside content.
Stage 2 — Document Parsing: Raw bytes are converted into structured text. PDFs are parsed for text layers (or OCR'd if scanned). HTML is stripped of navigation and boilerplate. Tables are extracted as structured content. Parsing quality directly determines the ceiling of your RAG system.
Stage 3 — Chunking: Parsed documents are split into retrievable units. The strategy used here (covered in depth in Section 5) is arguably the most impactful decision in the entire pipeline.
Stage 4 — Embedding Generation: Each chunk is passed through an embedding model to produce a dense vector representation, typically 768 to 3072 dimensions depending on the model.
Stage 5 — Vector Indexing: Vectors are stored in a vector database. An index structure (HNSW, IVF, etc.) is built to enable fast approximate nearest neighbor search at query time.
Online Pipeline: Query Flow (Runs on Every Request)

Here's a breakdown of the Online pipeline:
Stage 1 — Query Embedding: The user's query is embedded using the same model used to embed documents. This is critical — using different models breaks the semantic similarity space.
Stage 2 — Vector Retrieval: The query vector is used to search the index for the top-k nearest neighbors (typically k=20 to 50 before re-ranking).
Stage 3 — Re-Ranking: Retrieved candidates are re-scored by a cross-encoder model to select the most relevant k' documents (typically k'=3 to 10).
Stage 4 — Context Assembly: Retrieved chunks are deduplicated, ordered by relevance, and formatted into a prompt template.
Stage 5 — LLM Generation: The assembled prompt (system instructions + retrieved context + user query) is sent to the LLM, which generates a grounded response.
Understanding both pipelines lets you localize failures precisely. A bad answer is either a retrieval failure (the wrong chunks reached the model, which usually traces to offline decisions like parsing, chunking, or indexing) or a generation failure (the right chunks arrived, but prompt assembly or the model let them down). Knowing which determines your fix.
Vector Search: The Engine Behind RAG
When an embedding model processes text, it maps that text to a point in a high-dimensional vector space. The position of that point encodes semantic meaning. Texts that mean similar things map to nearby points. This is the geometric foundation of RAG retrieval.
Embeddings: A Brief Technical Grounding
An embedding model is a transformer-based neural network trained to produce representations where semantic similarity correlates with geometric proximity. The model outputs a vector of floats — 768, 1536, or 3072 dimensions depending on the model.
The training objective typically involves contrastive learning: similar text pairs (e.g., a question and its answer) are pushed together in the vector space. Dissimilar pairs are pushed apart. After training, the model generalizes this learned geometry to new text.
Similarity Metrics: The Geometry of Relevance
Once documents and queries are embedded, retrieval is a geometric problem: find the document vectors closest to the query vector.
Cosine similarity measures the angle between two vectors, independent of their magnitudes. This is the most common metric for text embeddings because it captures directional similarity (semantic meaning) without being dominated by vector magnitude. Cosine similarity can be measured by the following formula:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Range: [-1, 1], where:
- 1 = identical direction (maximum semantic similarity)
- 0 = orthogonal (no semantic relationship)
- -1 = opposite direction
The dot product (inner product) is equivalent to cosine similarity when vectors are normalized to unit length. Many embedding models normalize their outputs, making dot product and cosine similarity interchangeable. Dot product is computationally cheaper, which matters at scale.
Euclidean distance (L2 distance) measures absolute distance between vector endpoints. Less commonly used for text embeddings because it conflates magnitude with direction.
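To make the geometry concrete, here is a minimal sketch of the three metrics using NumPy. The vectors are toy values, not real embeddings; note how the parallel pair scores 1.0 on cosine similarity while still being separated by Euclidean distance:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # Same direction as a, double the magnitude

# Cosine similarity: angle only, magnitude ignored
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosine, 4))  # ≈ 1.0 (identical direction)

# Dot product equals cosine similarity once vectors are unit-normalized
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(round(float(np.dot(a_unit, b_unit)), 4))  # ≈ 1.0

# Euclidean (L2) distance: sensitive to magnitude, so it separates a and b
print(round(float(np.linalg.norm(a - b)), 4))  # ≈ 3.7417 despite identical direction
```

This is why magnitude-normalized metrics dominate for text embeddings: two texts with the same meaning should score as similar regardless of how "long" their vectors are.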
Enterprise Vector Database Options
The vector database market has fragmented rapidly. Choosing the right one requires understanding your scale, latency requirements, filtering needs, and infrastructure constraints.
I recommend choosing one of the following four AWS-native options. AWS offers strong security, built-in compliance, and predictable cost control, along with long-term scalability and architectural flexibility, making these services well suited to enterprise-grade systems.
1. Amazon OpenSearch (with vector search)
OpenSearch's k-NN plugin adds HNSW-based vector search to a battle-tested distributed search engine. The critical advantage is that OpenSearch unifies full-text BM25 search with vector similarity search in a single system. This enables hybrid retrieval without a separate pipeline. It also supports rich metadata filtering via its mature query DSL.
Amazon OpenSearch is suitable for teams already operating OpenSearch for log analytics or full-text search and want to add vector retrieval without introducing a new system.
2. Amazon Aurora PostgreSQL with pgvector
pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a relational database. Vectors are stored as columns alongside structured data. This enables exact SQL joins between vector retrieval and structured metadata filtering.
pgvector supports exact search (a sequential scan, no index required) as well as approximate indexes (IVFFlat and HNSW). It is not the fastest vector search engine at massive scale, but for datasets under 10M vectors with complex metadata requirements, its ability to run SQL-native vector queries is operationally valuable.
3. Amazon Kendra
Kendra is a managed enterprise search service that abstracts the indexing, retrieval, and ranking pipeline. It natively connects to S3, SharePoint, Confluence, Salesforce, and other enterprise content sources. Retrieval uses a combination of semantic and keyword matching without requiring you to manage embeddings.
Kendra is best for enterprise customers who need fast deployment without deep customization. The trade-off is reduced flexibility. That means you cannot control chunking strategies, embedding models, or re-ranking in the way a custom pipeline would allow.
4. Amazon Bedrock Knowledge Bases
AWS Bedrock Knowledge Bases is the highest-level abstraction for RAG on AWS. You connect an S3 bucket, select an embedding model, and AWS handles the full offline pipeline: parsing, chunking, embedding, and indexing into a managed vector store (backed by OpenSearch Serverless). The query API handles embedding + retrieval automatically.
For teams building on Bedrock LLMs, this is the lowest-friction path to a working RAG system. The cost of customization is that the chunking strategy and retrieval logic are partially configurable but not fully open.
Naive RAG Architecture
Before building complex systems, understand the simplest possible RAG implementation.
Naive RAG is the most basic, straightforward implementation of RAG, where there is no re-ranking, no query rewriting, no hybrid search, and no feedback loops. Just the raw, fundamental pipeline. It's the starting point before you add any advanced optimizations.

How Does It Work?
Naive RAG operates in two phases: the Offline Indexing phase and the Online Query phase.
Offline Indexing Phase:
This is the preparation phase. You essentially build a searchable knowledge base from your documents. Here's how it works:
- Collect your documents: PDFs, Word files, web pages, etc.
- Chunk them: Break each document into smaller pieces, because LLMs can't process huge documents at once.
- Embed each chunk: Run every chunk through an embedding model, which converts the text into a vector (a list of numbers that captures the meaning of the text).
- Store the vectors: Save all those vectors in a vector database, indexed and ready to be searched.
Online Query Phase:
This is the live, real-time flow when a user has a query.
- User asks a question: e.g., "What is our refund policy?"
- Embed the query: The question is converted into a vector using the same embedding model used during indexing.
- Vector search: The query vector is compared against all the stored vectors in the database. The most semantically similar chunks are retrieved (these are called Top-K results).
- Build the prompt: The retrieved chunks are combined with the original question into a structured prompt: system instruction + retrieved context + user query.
- LLM generates the answer: The full prompt is sent to the LLM, which reads the context and generates a grounded, informed response.
- Return to user: The answer is sent back.
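The two phases above can be sketched end-to-end in a few lines. This is a deliberately toy version: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and the final prompt would be sent to an LLM rather than printed.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a real embedding model
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# --- Offline indexing phase: chunk, embed, store ---
chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Our headquarters are located in Berlin.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# --- Online query phase: embed the query, retrieve top-k, build the prompt ---
query = "When are refunds issued?"
qvec = embed(query)
top_k = sorted(index, key=lambda item: cosine(qvec, item[1]), reverse=True)[:1]

prompt = (
    "Answer using ONLY the context below.\n"
    f"CONTEXT: {top_k[0][0]}\n"
    f"QUESTION: {query}"
)
print(prompt)  # In a real system, this prompt is sent to the LLM
```

Every production RAG system is an elaboration of exactly this skeleton; the failure modes below explain why the elaborations are necessary.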
Why Naive RAG Breaks in Production
- Irrelevant retrieval: Vector search returns semantically similar chunks, not necessarily the ones that answer the question.
- Poor chunking destroys context: Fixed-size splits produce mid-argument fragments where the retrieved chunk references "the factors discussed in the previous section" — a section the LLM never sees.
- Context window overload: The "lost in the middle" paper shows LLM performance degrades when relevant content is buried in a long context; naive RAG dumps chunks in retrieval order, not importance order.
- Single-query retrieval fails multi-part questions: A question spanning Q3 revenue and a five-year trend requires two distinct retrievals; a single vector query will capture one and miss the other.
- No query understanding: Conversational queries like "What about the new policy?" are embedded as-is, with no resolution of what "new policy" refers to in context.
Advanced RAG Architecture
Advanced RAG is not a single architecture. It is a collection of targeted improvements to naive RAG, where each improvement solves a specific failure mode. A production system selectively applies these improvements based on the requirements of the use case.

How Does It Work?
Advanced RAG directly addresses the weaknesses of Naive RAG. It doesn't change the fundamental skeleton (offline indexing + online query) but adds intelligence and optimization layers inside each phase. It improves the quality and relevance of retrieved information and generated responses.
Offline Indexing Phase
- Better Document Processing: Unlike Naive RAG, which simply dumps raw text, Advanced RAG cleans and enriches documents before processing. It fixes formatting, extracts metadata (author, date, topic, source), and handles different file types more intelligently.
- Smarter Chunking Strategies: Instead of blindly splitting documents into fixed-size chunks, Advanced RAG uses context-aware chunking methods such as:
  - Semantic chunking, which splits at natural meaning boundaries.
  - Sentence-window chunking, which stores small chunks for precise retrieval but keeps surrounding sentences as context.
  - Parent-child chunking, which stores small chunks for searching but retrieves the larger parent chunk for richer context.
- Enriched Metadata & Tagging: Each chunk gets tagged with metadata, including the source document, section title, date, topic category, etc. This allows filtering during retrieval, so the search isn't purely vector-based.
- Hybrid Indexing: Instead of only storing vectors, Advanced RAG builds two indexes in parallel: a vector index for semantic, meaning-based search and a keyword index (BM25) for exact lexical search. This gives the retrieval step the best of both worlds.
- Index Optimization: Vectors are organized using advanced indexing structures (like HNSW) to make similarity search faster and more accurate at scale.
Online Query Phase
- Pre-Retrieval (Query Optimization): This is something Naive RAG skips entirely. Before even touching the vector database, Advanced RAG improves the query itself:
  - Query rewriting: Rephrases the user's question to be clearer and more search-friendly.
  - Query expansion: Generates multiple variations of the same question to cast a wider search net.
  - Query decomposition: Breaks a complex question into smaller sub-questions that are easier to retrieve individually.
  - HyDE (Hypothetical Document Embedding): Generates a hypothetical ideal answer first, then uses that as the search query, which often finds better matches.
- Retrieval (Hybrid Search): Instead of just vector search, Advanced RAG combines semantic search, keyword search, and metadata filtering. The result sets are merged using a scoring technique called Reciprocal Rank Fusion (RRF).
- Post-Retrieval (Re-ranking): After retrieving the Top-K chunks, Advanced RAG re-ranks them using a more powerful cross-encoder model that scores the query and each chunk together to measure true relevance. The weakest chunks are discarded before building the prompt.
- Context Compression: Even after re-ranking, some retrieved chunks may still contain irrelevant sentences. Advanced RAG compresses the context to extract only the most relevant parts of each chunk.
- Prompt Engineering: Instead of a simple "here's the context, answer the question" prompt, Advanced RAG uses structured, carefully engineered prompts with:
  - Clear system instructions
  - A role definition for the LLM
  - Instructions on how to handle missing information
  - Formatting guidelines for the response
- LLM Generation: The LLM receives a much cleaner, more focused prompt than in Naive RAG, leading to more accurate, grounded, and relevant answers.
- Response Return + Logging: The answer is returned to the user, and the full interaction (query, retrieved chunks, response, latency) is logged for monitoring and future optimization.
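Of the pre-retrieval techniques, HyDE is the least intuitive, so a control-flow sketch may help. Everything here is a stand-in: `generate`, `embed`, and `search` represent your LLM call, embedding model, and vector store, and the toy lambdas exist only to exercise the flow.

```python
def hyde_retrieve(query, generate, embed, search, top_k=5):
    # Step 1: ask the LLM for a hypothetical ideal answer to the query
    hypothetical = generate(f"Write a short passage that answers: {query}")
    # Steps 2-3: embed the hypothetical answer (not the raw query)
    # and search the vector store with that embedding
    return search(embed(hypothetical), top_k)

# Toy stand-ins so the flow can be exercised end-to-end
fake_generate = lambda prompt: "Refunds are issued within 30 days of purchase."
fake_embed = lambda text: text.lower()
fake_search = lambda vector, k: [f"top-{k} match for: {vector}"]

results = hyde_retrieve("What is the refund policy?", fake_generate, fake_embed, fake_search)
print(results[0])
```

The design intuition: a hypothetical answer usually shares more vocabulary and structure with the documents you want to retrieve than the question does, so its embedding lands closer to the right chunks.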
Retrieval Improvements
The effectiveness of a RAG system depends heavily on the quality of its retrieval stage. Advanced retrieval techniques help ensure that the most relevant documents reach the LLM before answer generation.
BM25: Classical Information Retrieval
BM25 (Best Match 25) is a probabilistic ranking function developed in the 1990s that remains one of the strongest baseline retrieval algorithms for exact keyword matching.
The BM25 score for a document D given query Q is:
BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]
Where:
- f(qᵢ, D) = term frequency of query term i in document D
- IDF(qᵢ) = inverse document frequency (rare terms score higher)
- |D| = document length; avgdl = average document length
- k₁ = term frequency saturation parameter (typically 1.2–2.0)
- b = length normalization parameter (typically 0.75)
BM25 excels at exact match retrieval: product codes, error codes, proper nouns, legal terms, and abbreviations. A query for "SOX Section 404 compliance" will retrieve documents containing "SOX Section 404" with high precision via BM25. A dense vector search on the same query might retrieve semantically related documents about regulatory compliance that don't mention SOX at all.
BM25 fails on semantic paraphrase. A query about "revenue growth" won't retrieve a document that discusses "top-line expansion" unless those exact words also appear.
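The formula above translates into a compact scorer. This is an illustrative from-scratch sketch (production systems use tuned implementations such as Lucene's or OpenSearch's), with naive whitespace tokenization and the Okapi IDF variant:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.
    `corpus` is the full list of tokenized documents (needed for IDF and avgdl)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        # Okapi IDF: rare terms score higher
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)
        # Term-frequency saturation (k1) and length normalization (b)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    "sox section 404 compliance requirements".split(),
    "general regulatory compliance overview".split(),
]
query = "sox section 404".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)  # The document containing the exact terms scores higher
```

Note how the second document scores zero: none of the query terms appear in it, which is precisely the exact-match behavior (and the paraphrase blindness) described above.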
Hybrid Search: Reciprocal Rank Fusion
Hybrid search combines keyword-based retrieval (BM25) with vector search to deliver both exact matches and semantic relevance in a single result set.
Instead of relying on raw scores, Reciprocal Rank Fusion (RRF) merges results based on ranking positions, making it a simple and robust approach for hybrid retrieval.
Here’s a minimal implementation of RRF that combines BM25 and dense retrieval outputs:
```python
def reciprocal_rank_fusion(bm25_results, dense_results, k=60):
    scores = {}  # Stores cumulative RRF scores for each document

    # Process BM25 ranked results
    for rank, doc_id in enumerate(bm25_results):
        # Add reciprocal rank score; higher-ranked docs contribute more
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Process dense (vector) ranked results
    for rank, doc_id in enumerate(dense_results):
        # Add score; documents appearing in both lists accumulate
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort documents by final combined score (highest first)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```
How this implementation works:
This function takes two ranked lists of document IDs, one from BM25 and one from dense vector retrieval, and merges them into a single relevance-ranked output.
It begins by initializing a dictionary to store cumulative scores for each document. The function then iterates over both input lists. For each document, it calculates a score using a reciprocal formula, where higher-ranked documents contribute more and lower-ranked ones contribute less. The constant k acts as a smoothing factor, preventing top-ranked items from dominating too aggressively.
If a document appears in both lists, its scores are accumulated, effectively boosting documents that perform well across both retrieval methods. This is the key strength of RRF. It naturally prioritizes agreement between systems without needing to normalize their scoring mechanisms.
Finally, the function sorts all documents based on their aggregated scores in descending order and returns the combined ranking.
Metadata Filtering
Metadata filtering narrows the search space using structured attributes stored alongside each vector. Each document chunk is stored with its content, embedding, and associated metadata:
```json
{
  "content": "The company's Q3 2024 revenue was $4.2 billion...",
  "embedding": [0.023, -0.144, 0.891, "..."],
  "metadata": {
    "source": "Q3-2024-earnings-report.pdf",
    "document_type": "earnings_report",
    "department": "finance",
    "year": 2024,
    "quarter": "Q3",
    "created_at": "2024-10-15",
    "author": "CFO Office"
  }
}
```
Each document chunk is represented as a combination of raw text, its vector embedding, and structured metadata.
The content field contains the actual text that will be retrieved and passed to the language model. The embedding field stores the numerical vector representation of that text, enabling semantic similarity search.
The metadata field provides structured context about the document, such as its source, type, department, and time attributes. While embeddings handle semantic matching, metadata is used to apply constraints before or during retrieval.
For example, a query can be restricted to financial documents from a specific quarter or department, reducing the search space and improving relevance. This ensures that retrieval is not only semantically accurate but also contextually aligned with business constraints.
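A minimal pre-filter-then-rank sketch, assuming chunks are stored as dictionaries shaped like the example above and using a plain cosine function in place of a real vector database's filtered search:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_with_filter(query_vector, index, filters, top_k=5):
    # Pre-filter: keep only chunks whose metadata satisfies every constraint
    candidates = [
        rec for rec in index
        if all(rec["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Rank the reduced candidate set by vector similarity
    candidates.sort(key=lambda rec: cosine(query_vector, rec["embedding"]), reverse=True)
    return candidates[:top_k]

index = [
    {"content": "Q3 revenue was $4.2B", "embedding": [1.0, 0.0],
     "metadata": {"department": "finance", "quarter": "Q3"}},
    {"content": "Hiring plan for 2025", "embedding": [1.0, 0.0],
     "metadata": {"department": "hr", "quarter": "Q3"}},
]
results = retrieve_with_filter([1.0, 0.0], index, {"department": "finance"})
print([r["content"] for r in results])  # Only the finance chunk survives the filter
```

Production vector databases apply the filter inside the index traversal rather than as a Python list comprehension, but the contract is the same: constraints first, similarity ranking second.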
Re-Ranking: Cross-Encoders
Initial retrieval uses bi-encoders, which independently embed queries and documents for fast similarity search. While efficient, this approach prioritizes speed over precision. To improve relevance, a second stage re-ranks the retrieved candidates using cross-encoders, which evaluate each query–document pair more accurately.
In practice, bi-encoders retrieve a broad set of candidates (e.g., top 50–100), and cross-encoders refine them to the most relevant results.
```python
# Cross-encoder re-ranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each query-document pair
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Sort by score, take top-k
ranked = sorted(zip(scores, candidate_chunks), reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]
```
This code applies a cross-encoder model to re-rank a set of candidate document chunks based on their relevance to a given query.
It begins by initializing a pre-trained cross-encoder model, which evaluates query–document pairs jointly rather than independently. This allows the model to capture deeper contextual relationships between the query and each candidate chunk.
The scoring is performed using the model’s predict function, which takes a list of query–chunk pairs and returns a relevance score for each pair. Since all scores are generated by the same model in the same context, they are directly comparable.
These scores are then paired with their corresponding chunks using zip and sorted in descending order using sorted, ensuring that the most relevant results move to the top. Finally, the top-k chunks are selected from the ranked list and passed to the next stage of the pipeline for response generation.
Context Assembly and Prompt Construction
Retrieval produces a set of candidate chunks. What you do with those chunks before sending them to the LLM significantly affects response quality. Context assembly is an under-appreciated engineering discipline.
The Context Assembly Problem
The LLM's context window is finite and expensive. Every token used for context is a token not available for the response. More critically, LLMs have a documented "lost in the middle" problem: performance degrades for relevant information placed in the center of a long context window. The model reliably attends to the beginning and end of context but under-attends to the middle.
Deduplication
After retrieval and re-ranking, multiple chunks from the same document section may appear in the candidate set. Overlapping sliding window chunks, in particular, produce near-duplicate content. Deduplicate before prompt construction using:
- Exact deduplication: remove identical chunks
- Near-deduplication: compute pairwise cosine similarity between chunks; remove chunks above a similarity threshold (e.g., 0.95)
Duplicates inflate context length without adding information and increase cost.
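A minimal sketch of the near-deduplication pass. The embeddings here are two-dimensional toys, and the 0.95 threshold follows the text; in practice you would reuse the chunk embeddings you already have from retrieval:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def near_dedup(chunks, embeddings, threshold=0.95):
    # Greedily keep a chunk only if it is not too similar to any already-kept chunk
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept

chunks = ["Revenue grew 12%.", "Revenue grew 12 percent.", "HQ is in Berlin."]
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # Toy 2-D embeddings
print(near_dedup(chunks, vecs))  # → ['Revenue grew 12%.', 'HQ is in Berlin.']
```

The greedy keep-first policy favors higher-ranked chunks when the input list is already relevance-ordered, which is exactly what you want after re-ranking.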
Context Compression
Even after deduplication, retrieved chunks may contain irrelevant sentences. LLMLingua and similar context compression techniques use a small language model to identify and remove sentences from chunks that are not relevant to the query. This reduces context length by 2-10x with minimal accuracy degradation.
Here's an example of how Context Compression works:
Query:
What was the company’s revenue growth in Q3 2024?
Original chunk (400 tokens):
The company was founded in 1995 and initially focused on enterprise software solutions. Over the years, it expanded into cloud computing, AI services, and global markets. The leadership team has evolved significantly, with multiple CEO transitions shaping its strategy.
In recent financial updates, the company reported that revenue in Q3 2024 reached $4.2 billion, reflecting strong demand across its cloud division. This marks a 12% increase year-over-year, driven by enterprise adoption and subscription growth. Additional investments were made in AI capabilities and infrastructure.
Compressed chunk (80 tokens):
Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year.
Context compression is especially valuable when dealing with long-tail enterprise documents where relevant information is sparse within each document.
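Model-based compressors like LLMLingua use a small language model, but the core idea can be illustrated with a purely extractive sketch that keeps only sentences sharing enough content words with the query. The stopword list and overlap threshold here are arbitrary choices for the example:

```python
import re

def compress_chunk(query, chunk, min_overlap=2):
    """Extractive compression sketch: keep only sentences that share at least
    `min_overlap` content words with the query. A toy stand-in for model-based
    compressors such as LLMLingua."""
    stopwords = {"the", "was", "in", "a", "an", "of", "to", "and", "s", "what", "is"}
    def content_words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower())) - stopwords

    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [s for s in sentences if len(content_words(s) & query_words) >= min_overlap]
    return " ".join(kept)

query = "What was the company's revenue growth in Q3 2024?"
chunk = ("The company was founded in 1995 and expanded into cloud computing. "
         "Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year.")
print(compress_chunk(query, chunk))  # Only the Q3 2024 revenue sentence survives
```

Even this crude heuristic reproduces the compression shown above: the company-history sentence is dropped, the revenue sentence is kept.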
Prompt Construction with Citation Anchors
Prompt construction with citation anchors is a technique used in Retrieval-Augmented Generation (RAG) to make model responses more accurate, transparent, and verifiable. Instead of allowing the model to generate answers freely, you explicitly instruct it to rely only on the provided context and to support every factual statement with a reference marker such as [Source 1] or [Source 2]. This ensures that the output is grounded in real data rather than assumptions.
Here is a prompt example:
```
SYSTEM INSTRUCTIONS:
You are a research assistant. Answer questions using ONLY the provided context.
For every factual claim, include a citation in the format [Source N].
If the answer is not in the context, say "I cannot find this in the available documents."

CONTEXT:
[Source 1] (Q3 2024 Earnings Report, p.4)
Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year...

[Source 2] (Annual Report 2024, p.12)
The growth was driven primarily by expansion in the APAC region...

QUESTION:
What drove revenue growth in Q3 2024?
```
This prompt structure serves two purposes: it anchors the model's response to retrievable evidence, and it enables post-generation grounding verification. You can programmatically check whether cited [Source N] passages actually support the claims made.
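A minimal sketch of that post-generation check: extract the [Source N] markers, confirm each refers to a real source, and flag sentences that carry no citation at all. A full grounding verifier would additionally check entailment between claim and cited passage:

```python
import re

def check_citations(answer, sources):
    """Verify that every [Source N] marker refers to a known source,
    and flag sentences with no citation at all."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    unknown = cited - set(sources)
    uncited = [
        s for s in re.split(r"(?<=[.!?])\s+", answer)
        if s.strip() and "[Source" not in s
    ]
    return {"cited": sorted(cited), "unknown": sorted(unknown), "uncited_sentences": uncited}

sources = {1: "Revenue in Q3 2024 reached $4.2 billion...",
           2: "The growth was driven primarily by expansion in the APAC region..."}
answer = "Revenue grew 12% in Q3 2024 [Source 1]. Growth came from APAC [Source 2]."
report = check_citations(answer, sources)
print(report["cited"], report["unknown"])  # → [1, 2] []
```

Failing answers (a citation to a nonexistent source, or a factual sentence with no citation) can then be rejected or regenerated before reaching the user.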
Ordering Chunks for Maximum Attention
Given the "lost in the middle" phenomenon, place the most relevant chunks at the beginning and end of your context window, with less relevant chunks in the middle. If you have 5 chunks ranked 1-5, order them: [1, 3, 5, 4, 2] — rank 1 at position 0, rank 2 at the last position.
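The interleaving described above generalizes to any number of chunks. This sketch alternates ranks between the front and back of the context and reproduces the [1, 3, 5, 4, 2] ordering for five ranked chunks:

```python
def order_for_attention(chunks_ranked):
    """Place the highest-ranked chunks at the edges of the context window,
    pushing the weakest ranks toward the middle ("lost in the middle" mitigation).
    `chunks_ranked` must be ordered best-first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked):
        # Alternate: even positions go to the front, odd positions to the back
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the best of it sits at the very end
    return front + back[::-1]

print(order_for_attention([1, 2, 3, 4, 5]))  # → [1, 3, 5, 4, 2]
```

Apply this after re-ranking and deduplication, immediately before rendering the prompt template.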
The 5 Types of RAG Architectures
RAG is not a single pattern but a design space. Apart from Naive RAG and Advanced RAG, you should also know about Agentic RAG, Multi-Hop RAG, and Graph RAG. These five architectures represent distinct points in that space, each suited to different problem types.
| Parameter | Naive RAG | Advanced RAG | Agentic RAG | Multi-Hop RAG | Graph RAG |
|---|---|---|---|---|---|
| Core pattern | Single query → retrieve → generate | Query rewriting → hybrid retrieval → re-rank → generate | LLM agent decides when and what to retrieve dynamically | Chain of retrievals, each refining the next | Knowledge graph traversal + vector search |
| Retrieval strategy | Single vector search | Hybrid BM25 + vector with re-ranking | Dynamic, tool-driven, iterative | Sequential, context-informed hops | Graph traversal + vector similarity |
| Query understanding | Raw query embedded as-is | HyDE rewriting, query decomposition | Agent plans and reformulates queries autonomously | LLM evaluates sufficiency and generates the next sub-query | Entity extraction drives the traversal path |
| Handles multi-part questions? | No | Partially | Yes | Yes — by design | Yes, via graph relationships |
| Handles entity relationships? | No | No | Partially | No | Yes — core strength |
| Implementation complexity | Low | Medium | High | High | Very High |
| Latency | Lowest | Low–Medium | High (multiple LLM calls) | High (multiple retrieval hops) | Medium–High |
| Hallucination risk | High | Low | Low | Low | Very Low |
| Cost | Lowest | Moderate | High | High | High |
| AWS tooling | Bedrock Knowledge Bases | Bedrock KB + OpenSearch | Amazon Bedrock Agents | Custom Lambda + Step Functions | Neptune + OpenSearch + Bedrock |
| Best for | Demos, simple FAQ, POCs | Enterprise search, support copilots, research | Complex reasoning, multi-domain Q&A, ambiguous queries | Evidence chaining, legal discovery, literature review | Regulatory compliance, supply chain, interconnected enterprise data |
| Production-ready? | Not recommended | Yes | Yes, with an orchestration framework | Yes, for specific use cases | Yes, for graph-native domains |
AWS-Native RAG Architectures
AWS provides two primary paths to building production RAG: a managed path with Amazon Bedrock Knowledge Bases, and a custom path for teams requiring architectural control.
Architecture Option 1: Amazon Bedrock Knowledge Bases (Managed)
Amazon Bedrock Knowledge Bases provide a fully managed approach to building RAG systems, abstracting the complexity of ingestion, indexing, and retrieval into a single integrated workflow. This approach is designed for teams that want to move quickly without managing the underlying retrieval infrastructure.
The following example shows how retrieval and generation are combined into a single API call:
```python
import boto3

# Initialize Bedrock Agent Runtime client
bedrock_agent_runtime = boto3.client(
    'bedrock-agent-runtime',
    region_name='us-east-1'
)

# Execute the full RAG pipeline: retrieve + generate in one call
response = bedrock_agent_runtime.retrieve_and_generate(
    # User query input
    input={'text': 'What is our Q3 2024 revenue?'},
    # Configuration for retrieval + generation
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'YOUR_KB_ID',  # Your Bedrock Knowledge Base ID
            # Foundation model used for response generation
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',
            # Retrieval configuration
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {
                    # Number of documents to retrieve
                    'numberOfResults': 5,
                    # Use hybrid search (BM25 + vector similarity)
                    'overrideSearchType': 'HYBRID'
                }
            }
        }
    }
)

# Generated response text
print(response['output']['text'])

# Source documents used for the answer (traceability)
print(response['citations'])
```
How this implementation works:
This code uses the retrieve_and_generate function to execute the full RAG pipeline within a single API call. The client is first initialized using boto3, which provides access to the Bedrock Agent Runtime. The query is passed through the input parameter, which represents the user’s question.
The retrieveAndGenerateConfiguration block defines how retrieval and generation should be handled. Within this, the knowledgeBaseId specifies the data source, while the modelArn determines which foundation model is used to generate the response.
The vectorSearchConfiguration controls retrieval behavior. The numberOfResults parameter limits how many documents are retrieved, and overrideSearchType enables hybrid search, combining keyword and semantic retrieval.
Once the request is executed, the system retrieves relevant documents, generates a response using the selected model, and returns both the answer and its supporting citations. The output text contains the generated response, while the citations provide traceability to the original source documents.
When to use this approach:
This architecture is best suited for teams that prioritize speed and simplicity over deep customization. It enables rapid deployment of production-ready RAG systems without requiring dedicated ML infrastructure or pipeline management.
Architecture Option 2: Custom RAG Pipeline (Full Control)
When Bedrock Knowledge Bases' abstractions are insufficient, build a custom pipeline using AWS primitives.

AWS Service Mapping:
| Pipeline Stage | AWS Service |
| --- | --- |
| Document storage | Amazon S3 |
| Pipeline orchestration | AWS Step Functions |
| Document parsing & chunking | AWS Lambda |
| Embedding generation | Amazon Bedrock (Titan Embeddings V2) |
| Vector storage + hybrid search | Amazon OpenSearch Service |
| Re-ranking | Lambda + self-hosted cross-encoder (or Cohere Rerank API) |
| LLM generation | Amazon Bedrock (Claude 3.5, Llama 3.1, etc.) |
| API serving | API Gateway + Lambda |
| Caching | Amazon ElastiCache (Redis) |
| Monitoring | Amazon CloudWatch + AWS X-Ray |
When to use custom pipelines: reach for this path when you need domain-specific embedding models, custom chunking for specialized document types (e.g., legal contracts, code, medical records), multi-language support, custom re-ranking models, or complex multi-hop retrieval logic.
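To make the retrieval stage of a custom pipeline concrete, here is a minimal sketch of the hybrid query body that Amazon OpenSearch Service accepts. The `text` and `embedding` field names are illustrative assumptions, and OpenSearch hybrid queries also require a search pipeline with a normalization processor configured on the cluster.

```python
def build_hybrid_query(query_text: str, query_vector: list, k: int = 10) -> dict:
    """Build an OpenSearch hybrid query combining BM25 keyword match
    with k-NN vector similarity over the same index.

    Assumes an index with a full-text `text` field and a knn_vector
    `embedding` field (both names are illustrative).
    """
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical leg: exact terms, proper nouns, part numbers
                    {"match": {"text": {"query": query_text}}},
                    # Semantic leg: embedding similarity
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }

query_body = build_hybrid_query("Q3 2024 revenue", [0.12, -0.04, 0.33], k=5)
```

The body would be sent with the opensearch-py client's `search` call; the query vector must come from the same embedding model used at ingest time (Titan Embeddings V2 in the service mapping above).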
Evaluating a RAG System (The Most Critical Section)
Evaluation is not optional. It is the engineering practice that separates a working system from a broken one. Once you can measure retrieval and generation quality, you can improve the system along the following dimensions:
Retrieval Optimizations

- Better chunking: If your retrieval recall is below 0.80, investigate your chunking before anything else. Move from fixed-size to semantic chunking for prose documents. Implement hierarchical indexing for long-form documents.
- Hybrid search: This should be the default for any system handling specialized terminology, proper nouns, or regulatory/legal language. The RRF implementation is about 30 lines of code and consistently improves retrieval metrics by 5-15%.
- Query rewriting with HyDE: This improves retrieval for short or ambiguous queries. HyDE (Hypothetical Document Embeddings) works by asking the LLM to generate a hypothetical answer to the query, embedding that answer, and using the resulting vector to search.
- Metadata extraction and filtering: These reduce the retrieval search space dramatically. A well-tagged corpus with document type, department, and date metadata allows scoping retrieval to the relevant subset before vector search runs.
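The RRF implementation referenced above is indeed small. A minimal sketch, assuming each retriever returns a best-first list of document IDs (the constant k=60 is the conventional default from the RRF literature):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists (best-first lists of document IDs).

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items ranked well by several retrievers rise.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 ranking with a vector-search ranking
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_c", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # doc_b ranks first: it scores highly in both lists
```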
Generation Optimizations

- System prompt engineering: Include explicit grounding instructions: "Answer only using the provided context. If the context does not contain the answer, say so explicitly." Include citation instructions: "Cite every factual claim with [Source N]." Include uncertainty instructions: "If context is ambiguous or contradictory, note the uncertainty."
- Context compression: Reduces noise in the context window. LLMLingua, Selective Context, or simple extractive compression methods can reduce context length by 2-5x. Shorter context → lower cost, better attention focus, reduced hallucination from irrelevant passages.
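A minimal sketch of turning those grounding instructions into an assembled prompt; the function name and exact wording are illustrative and should be tuned per deployment:

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that enforces grounding and per-claim citations.

    `chunks` is a list of retrieved passages; each is numbered so the
    model can cite it as [Source N].
    """
    context = "\n\n".join(
        f"[Source {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    system = (
        "Answer only using the provided context. "
        "If the context does not contain the answer, say so explicitly. "
        "Cite every factual claim with [Source N]. "
        "If the context is ambiguous or contradictory, note the uncertainty."
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt(
    "What is the PTO policy?",
    ["Employees accrue 20 days of PTO annually.", "Unused PTO carryover is capped."],
)
```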
Cost Optimizations

- Query caching: Cache responses for identical or near-identical queries. Enterprise environments often have repetitive query patterns — "What is our PTO policy?" gets asked hundreds of times daily. Cache hit rates of 20-40% are common.
- Dynamic retrieval: Implement confidence-based retrieval: if the top retrieved chunk scores above a similarity threshold and the re-ranker score exceeds another threshold, send only that chunk to the LLM.
- Model routing: Classify query complexity. Route simple factual queries to Claude Haiku or Llama 3.1 8B. Reserve Claude 3 Sonnet or Opus for complex multi-hop or analytical queries.
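As a sketch of the caching idea, the following normalizes queries before hashing so that trivially different phrasings share a cache entry. The in-memory dict stands in for ElastiCache (Redis), which would add a TTL; catching truly near-identical queries in production would typically use embedding similarity rather than exact hashes.

```python
import hashlib

def cache_key(query: str) -> str:
    """Lowercase and collapse whitespace, then hash, so identical
    questions with cosmetic differences share one key."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class QueryCache:
    """In-process stand-in for ElastiCache (Redis); same get/set pattern."""

    def __init__(self):
        self._store = {}

    def get(self, query):
        # Returns the cached answer, or None on a cache miss
        return self._store.get(cache_key(query))

    def set(self, query, answer):
        self._store[cache_key(query)] = answer

cache = QueryCache()
cache.set("What is our PTO policy?", "Employees accrue 20 days annually.")
print(cache.get("  what is our  PTO policy? "))  # hit despite different casing/spacing
```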
Conclusion
There is a persistent myth in the enterprise AI market that deploying an LLM is the hard part. It isn't. The hard part is connecting that LLM to your knowledge, reliably evaluating whether the connection is working, and maintaining that connection at production scale as your knowledge evolves.
RAG is a full information retrieval system that uses an LLM as its synthesis layer. The companies that understand this are building AI systems that work. The companies that don't are shipping demos that break in production.
On AWS, the practical path is this: start with Amazon Bedrock Knowledge Bases for fast deployment and validation. Once you understand your requirements, move to a custom pipeline using OpenSearch for hybrid retrieval, Lambda for orchestration, and Bedrock LLMs for generation. Use Step Functions to make the offline pipeline auditable and replayable.
If you are building on AWS and want to get the architecture right from the start, this is exactly what we do at Mactores. We consult with enterprises on AWS AI architecture, from selecting the right RAG pattern for your use case to building the evaluation infrastructure that keeps it working in production. If you are at that stage, we would be glad to talk.

