For most of computing history, software could only do what it was explicitly programmed to do. Rules had to be written. Logic had to be encoded. Every edge case had to be anticipated. This made software powerful but fundamentally brittle. The moment a situation fell outside the rules, the system broke.
Large language models changed this fundamentally. For the first time, a system could understand — process natural language, reason across domains, synthesize information, generate structured outputs, and do all of this without a single hand-written rule. The surface area of what software could address exploded. Tasks that previously required human interpretation, like reading a contract, summarizing a report, answering an ambiguous question, or writing code, became automatable.
This is why LLMs became the central infrastructure bet for enterprise AI. Not because they are the final form of intelligence, but because they eliminated the bottleneck that had constrained automation for decades: the need for explicit programming. An LLM can be directed with natural language, generalize to new situations, and operate across virtually any domain. That flexibility is unprecedented.
But reasoning well requires knowing the right things, and an LLM's knowledge is frozen into its weights at training time. This is precisely where their architecture creates constraints that matter enormously in production.
Let's get into the RAG architecture.
What Is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation decouples the model's reasoning capability from its knowledge. Instead of encoding knowledge in weights, RAG supplies the model with relevant, current, private information at inference time, which is retrieved from a live external knowledge store and injected directly into the prompt context.
The model's role becomes interpretation and synthesis. The retrieval system's role is evidence supply. This separation is architecturally clean: knowledge stays versioned, updatable, and auditable in the retrieval layer; the model handles the language understanding and generation it was built for.
The Five-Step Flow
1. User submits a query
2. The query is embedded into a vector and used to search a knowledge base
3. The top-k most relevant documents are retrieved
4. Those documents are injected into the LLM's prompt as context
5. The LLM generates a response grounded in that context
Step 5 is the key differentiator. The LLM is not generating freely from parametric memory. It is grounded by evidence. If the retrieved documents don't contain the answer, a well-instructed LLM will say so rather than confabulate.
The Two Core Subsystems
| Subsystem | Role | Technology |
|---|---|---|
| Retriever | Finds the most relevant knowledge from the knowledge base given the query | Embedding models, vector databases, BM25, hybrid search |
| Generator | Takes the query + retrieved context and produces a final, grounded response | LLMs (GPT-4, Claude, Mistral, Llama, etc.) |
The retriever and generator are loosely coupled. You can swap one without changing the other. This modularity is one of RAG's greatest engineering strengths.
RAG vs. Fine-Tuning vs. Prompt Engineering
These three approaches are frequently treated as interchangeable, but they operate at entirely different layers of the stack and solve fundamentally different problems. Here's how they compare across the parameters that matter in production, with knowledge graphs included as a fourth point of reference:
| Parameter | Prompt Engineering | Fine-Tuning | RAG | Knowledge Graphs |
|---|---|---|---|---|
| What it changes | Input instruction only | Model weights | Retrieval context at inference | Structured entity-relationship store |
| Adds new knowledge? | No — bounded by training data | Weakly — facts are brittle in weights | Yes — dynamic, external, always current | Yes — structured, explicit relationships |
| Handles private data? | No | Only if included in training data | Yes | Yes |
| Knowledge freshness | Static (training cutoff) | Static (retraining required) | Real-time (re-index on update) | Near real-time (graph update) |
| Update cost | None — just edit the prompt | High — requires full or LoRA retraining cycle | Low — re-embed and re-index changed documents | Medium — extract entities, update graph |
| Best for | Tone, format, behavior, task framing | Domain style, reasoning patterns, output structure | Knowledge-intensive Q&A over evolving private data | Highly connected data with explicit relationships |
| Hallucination risk | High — no grounding mechanism | Medium — facts encoded loosely in weights | Low — grounded in retrieved evidence | Low — answers derived from explicit graph facts |
| Auditability | None | None | High — every answer traceable to source chunks | High — every answer traceable to graph nodes |
| Inference cost impact | Minimal | None (cost is in training) | Moderate — embedding + retrieval + LLM tokens | Moderate — graph traversal + LLM tokens |
| Implementation complexity | Low | High | Medium | High |
The Complete RAG Pipeline
RAG consists of two distinct pipelines running at different times. Understanding them separately is essential for debugging and optimization.
Offline Pipeline: Data Preparation (Runs Periodically)
This pipeline processes raw documents and prepares them for retrieval. It runs when documents are added, updated, or deleted — not on every query.
Stage 1 — Document Ingestion: Raw documents are collected from sources (S3, SharePoint, Confluence, CRM APIs, databases). Format-specific parsers handle each file type. Metadata is extracted and stored alongside content.
Stage 2 — Document Parsing: Raw bytes are converted into structured text. PDFs are parsed for text layers (or OCR'd if scanned). HTML is stripped of navigation and boilerplate. Tables are extracted as structured content. Parsing quality directly determines the ceiling of your RAG system.
Stage 3 — Chunking: Parsed documents are split into retrievable units. The strategy used here (covered in depth in Section 5) is arguably the most impactful decision in the entire pipeline.
Stage 4 — Embedding Generation: Each chunk is passed through an embedding model (e.g., Amazon Titan Embeddings, OpenAI text-embedding-3-large, or open-source models like bge-large-en) to produce a dense vector representation, typically 768 to 3072 dimensions depending on the model.
Stage 5 — Vector Indexing: Vectors are stored in a vector database. An index structure (HNSW, IVF, etc.) is built to enable fast approximate nearest neighbor search at query time.
Online Pipeline: Query Flow (Runs on Every Request)
Stage 1 — Query Embedding: The user's query is embedded using the same model used to embed documents. This is critical — using different models breaks the semantic similarity space.
Stage 2 — Vector Retrieval: The query vector is used to search the index for the top-k nearest neighbors (typically k=20 to 50 before re-ranking).
Stage 3 — Re-Ranking: Retrieved candidates are re-scored by a cross-encoder model to select the most relevant k' documents (typically k'=3 to 10).
Stage 4 — Context Assembly: Retrieved chunks are deduplicated, ordered by relevance, and formatted into a prompt template.
Stage 5 — LLM Generation: The assembled prompt (system instructions + retrieved context + user query) is sent to the LLM, which generates a grounded response.
Understanding both pipelines lets you localize failures precisely. A bad answer is either a retrieval failure (the right evidence never reached the prompt, whether due to offline chunking and indexing or online search) or a generation failure (the evidence was there, but the model misused it). Knowing which determines your fix.
Vector Search: The Engine Behind RAG
When an embedding model processes text, it maps that text to a point in a high-dimensional vector space. The position of that point encodes semantic meaning. Texts that mean similar things map to nearby points. This is the geometric foundation of RAG retrieval.
Embeddings: A Brief Technical Grounding
An embedding model is a transformer-based neural network trained to produce representations where semantic similarity correlates with geometric proximity. The model outputs a vector of floats — 768, 1536, or 3072 dimensions depending on the model.
The training objective typically involves contrastive learning: similar text pairs (e.g., a question and its answer) are pushed together in the vector space. Dissimilar pairs are pushed apart. After training, the model generalizes this learned geometry to new text.
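One common instantiation of this objective (among several used in practice) is an InfoNCE-style contrastive loss. For a query q, its positive document d⁺, and in-batch negatives d₁ … dₙ:

L = -log [ exp(sim(q, d⁺) / τ) / Σᵢ exp(sim(q, dᵢ) / τ) ]

Where sim is typically cosine similarity, τ is a temperature hyperparameter, and the sum in the denominator runs over the positive plus all negatives. Minimizing L pulls q toward d⁺ and pushes it away from the negatives — exactly the geometry retrieval relies on.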
Similarity Metrics: The Geometry of Relevance
Once documents and queries are embedded, retrieval is a geometric problem: find the document vectors closest to the query vector.
Cosine similarity measures the angle between two vectors, independent of their magnitudes. This is the most common metric for text embeddings because it captures directional similarity (semantic meaning) without being dominated by vector magnitude.
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
1 = identical direction (maximum semantic similarity)
0 = orthogonal (no semantic relationship)
-1 = opposite direction
The dot product (inner product) is equivalent to cosine similarity when vectors are normalized to unit length. Many embedding models normalize their outputs, making dot product and cosine similarity interchangeable. Dot product is computationally cheaper, which matters at scale.
Euclidean distance (L2 distance) measures absolute distance between vector endpoints. Less commonly used for text embeddings because it conflates magnitude with direction.
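These three metrics are easy to verify numerically. A minimal sketch with NumPy (the vectors are illustrative, chosen so the geometry is obvious):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_similarity(a, b))  # 1.0: identical direction despite different magnitude
print(cosine_similarity(a, c))  # -1.0: opposite direction

# For unit-normalized vectors, dot product equals cosine similarity
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(np.dot(a_n, b_n))  # 1.0 (within floating-point error)

# Euclidean distance conflates magnitude with direction:
# a and b point the same way yet sit far apart
print(np.linalg.norm(a - b))  # ~3.74
```

Note how a and b score identically under cosine similarity but are distant under L2 — this is why cosine is the default for text embeddings.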
Enterprise Vector Database Options
The vector database market has fragmented rapidly. Choosing the right one requires understanding your scale, latency requirements, filtering needs, and infrastructure constraints.
AWS-Native Options
1. Amazon OpenSearch (with vector search)
OpenSearch's k-NN plugin adds HNSW-based vector search to a battle-tested distributed search engine. The critical advantage is that OpenSearch unifies full-text BM25 search with vector similarity search in a single system — hybrid retrieval without a separate pipeline. It also supports rich metadata filtering via its mature query DSL.
Suitable for: teams already operating OpenSearch for log analytics or full-text search who want to add vector retrieval without introducing a new system.
2. Amazon Aurora PostgreSQL with pgvector
pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a relational database. Vectors are stored as columns alongside structured data. This enables exact SQL joins between vector retrieval and structured metadata filtering.
-- Retrieve top-5 chunks for a query vector, filtered by department and year
SELECT content, metadata, embedding <=> $1::vector AS distance
FROM document_chunks
WHERE metadata->>'department' = 'legal'
  AND (metadata->>'year')::int > 2022
ORDER BY distance
LIMIT 5;
pgvector supports exact search (a sequential scan with no index) as well as approximate index types (IVFFlat and HNSW). It is not the fastest vector search engine at massive scale, but for datasets under 10M vectors with complex metadata requirements, its ability to run SQL-native vector queries is operationally valuable.
3. Amazon Kendra
Kendra is a managed enterprise search service that abstracts the indexing, retrieval, and ranking pipeline. It natively connects to S3, SharePoint, Confluence, Salesforce, and other enterprise content sources. Retrieval uses a combination of semantic and keyword matching without requiring you to manage embeddings.
Kendra is best for enterprise customers who need fast deployment without deep customization. The trade-off is reduced flexibility — you cannot control chunking strategies, embedding models, or re-ranking in the way a custom pipeline would allow.
4. Amazon Bedrock Knowledge Bases
AWS Bedrock Knowledge Bases is the highest-level abstraction for RAG on AWS. You connect an S3 bucket, select an embedding model, and AWS handles the full offline pipeline: parsing, chunking, embedding, and indexing into a managed vector store (backed by OpenSearch Serverless). The query API handles embedding + retrieval automatically.
For teams building on Bedrock LLMs, this is the lowest-friction path to a working RAG system. The cost is customization: chunking strategy and retrieval logic are partially configurable but not fully open.
Naive RAG Architecture
Before building complex systems, understand the simplest possible RAG implementation. Naive RAG is what most tutorials demonstrate. Knowing its limits is what motivates every optimization in Section 9 and beyond.
The Pipeline
User Query
↓
Embedding Model (query → vector)
↓
Vector Database (vector similarity search)
↓
Top-k Retrieved Chunks (e.g., top 3-5)
↓
Prompt Template (system prompt + chunks + query)
↓
LLM
↓
Response
The implementation is genuinely simple:
# Simplified naive RAG
def query_naive_rag(user_query: str, k: int = 5) -> str:
    # 1. Embed the query
    query_vector = embedding_model.embed(user_query)

    # 2. Retrieve top-k chunks
    results = vector_db.similarity_search(query_vector, k=k)
    chunks = [r.content for r in results]

    # 3. Build prompt
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: {user_query}

Answer:"""

    # 4. Generate
    return llm.generate(prompt)
Why Naive RAG Breaks in Production
- Irrelevant retrieval — Vector search returns semantically similar chunks, not necessarily the ones that answer the question; without re-ranking, the LLM receives noise.
- Poor chunking destroys context — Fixed-size splits produce mid-argument fragments where the retrieved chunk references "the factors discussed in the previous section" — a section the LLM never sees.
- Context window overload — The "lost in the middle" paper shows LLM performance degrades when relevant content is buried in a long context; naive RAG dumps chunks in retrieval order, not importance order.
- Single-query retrieval fails multi-part questions — A question spanning Q3 revenue and a five-year trend requires two distinct retrievals; a single vector query will capture one and miss the other.
- No query understanding — Conversational queries like "What about the new policy?" are embedded as-is, with no resolution of what "new policy" refers to in context.
Advanced RAG Architecture
Advanced RAG is not a single architecture — it is a collection of targeted improvements to naive RAG, each solving a specific failure mode. A production system selectively applies these improvements based on the requirements of the use case.
The Components
User Query
↓
[1] Query Understanding & Rewriting
↓
[2] Hybrid Retrieval (vector + BM25)
↓
[3] Metadata Filtering
↓
Candidate Set (top-50 to top-100 chunks)
↓
[4] Cross-Encoder Re-Ranking
↓
Refined Set (top-3 to top-10 chunks)
↓
[5] Context Compression & Assembly
↓
[6] Prompt Construction (with citations)
↓
LLM
↓
[7] Response with Source Attribution
Each numbered stage is an optional enhancement. You don't implement all of them at once. You diagnose failure modes and add the relevant fix.
[1] Query Understanding & Rewriting: The raw user query is analyzed and transformed before retrieval. Techniques include query expansion (adding synonyms or related terms), query decomposition (splitting multi-part questions into sub-queries), and HyDE (Hypothetical Document Embedding — generating a hypothetical answer and using its embedding to retrieve documents that look like the answer).
[2] Hybrid Retrieval: Combines vector similarity search with BM25 keyword search (see Section 10). Dense retrieval captures semantic meaning. Sparse retrieval captures exact keyword matches. Together they outperform either alone across virtually every benchmark.
[3] Metadata Filtering: Before or after vector search, filter by structured attributes. Retrieval scoped to department=engineering AND date>2024-01-01 is dramatically more precise than full-corpus retrieval.
[4] Re-Ranking: A cross-encoder model re-scores candidates with full query-document attention (unlike bi-encoder embedding models). This second-pass scoring is expensive but dramatically improves precision.
[5] Context Compression: Remove portions of retrieved chunks that are not relevant to the query. Reduces noise in the context window and leaves more room for additional documents.
[6] Citation-Aware Prompting: Instruct the model to attribute every factual claim to its source chunk. This enables post-generation grounding verification.
Retrieval Improvements
The effectiveness of a RAG system depends heavily on the quality of its retrieval stage. Advanced retrieval techniques help ensure that the most relevant documents reach the LLM before answer generation.
BM25: Classical Information Retrieval
BM25 (Best Match 25) is a probabilistic ranking function developed in the 1990s that remains one of the strongest baseline retrieval algorithms for exact keyword matching.
The BM25 score for a document D given query Q is:
BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]
Where:
- f(qᵢ, D) = term frequency of query term i in document D
- IDF(qᵢ) = inverse document frequency (rare terms score higher)
- |D| = document length, avgdl = average document length
- k₁ = term frequency saturation parameter (typically 1.2-2.0)
- b = length normalization parameter (typically 0.75)
BM25 excels at exact match retrieval: product codes, error codes, proper nouns, legal terms, abbreviations. A query for "SOX Section 404 compliance" will retrieve documents containing "SOX Section 404" with high precision via BM25. A dense vector search on the same query might retrieve semantically related documents about regulatory compliance that don't mention SOX at all.
BM25 fails on semantic paraphrase. A query about "revenue growth" won't retrieve a document that discusses "top-line expansion" unless those exact words also appear.
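To make the formula concrete, here is a minimal, self-contained BM25 scorer. The whitespace tokenization and two-document corpus are purely illustrative; in production you would rely on OpenSearch's BM25 implementation or a library such as rank_bm25 rather than this sketch:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25. Toy tokenizer: lowercase + split."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N

    # Document frequency: how many documents contain each term
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF: rare terms contribute more
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturation + document length normalization
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "sox section 404 compliance requires internal control audits",
    "regulatory compliance frameworks vary across jurisdictions",
]
print(bm25_scores("SOX Section 404", docs))  # first doc scores higher: exact term matches
```

Running this shows the behavior described above: the document containing the exact tokens "sox", "section", and "404" receives a positive score, while the semantically related compliance document scores zero.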
Hybrid Search: Reciprocal Rank Fusion
Hybrid search combines BM25 and dense vector retrieval results into a single ranked list. The standard merging algorithm is Reciprocal Rank Fusion (RRF):
RRF_score(d) = Σᵣ 1 / (k + rankᵣ(d))
Where rankᵣ(d) is the 1-based rank of document d in retrieval system r, and k is a smoothing constant (typically 60) that damps the advantage of top-ranked documents.
def reciprocal_rank_fusion(bm25_results, dense_results, k=60):
    scores = {}
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
The elegance of RRF is that it doesn't require score normalization. BM25 scores and cosine similarities live in different ranges — RRF operates only on ranks, making them directly combinable.
Empirically, hybrid search with RRF outperforms either pure vector or pure BM25 retrieval across most benchmarks. The intuition: if a document ranks high in both BM25 and vector retrieval, it almost certainly matches the query on both exact and semantic dimensions. RRF amplifies this agreement.
Metadata Filtering
Metadata filtering constrains the retrieval search space using structured attributes stored alongside vectors. This is the RAG equivalent of a SQL WHERE clause before a JOIN.
Every document chunk should carry metadata extracted at ingestion time:
{
  "content": "The company's Q3 2024 revenue was $4.2 billion...",
  "embedding": [0.023, -0.144, 0.891, ...],
  "metadata": {
    "source": "Q3-2024-earnings-report.pdf",
    "document_type": "earnings_report",
    "department": "finance",
    "year": 2024,
    "quarter": "Q3",
    "created_at": "2024-10-15",
    "author": "CFO Office"
  }
}
Pre-filtering (filter before vector search) is more efficient but may miss relevant documents if filters are too restrictive. Post-filtering (filter after vector search) is safer but wastes compute on irrelevant vectors. Most production systems use pre-filtering at the category level (document type, department) and rely on re-ranking for finer relevance judgments.
Re-Ranking: Cross-Encoders
The retrieval pipeline's first stage uses bi-encoders — models that embed query and document independently, then compare vectors. Bi-encoders are fast (queries and documents can be pre-computed and cached) but their independence assumption limits relevance accuracy.
Cross-encoders process the query and document together in a single forward pass, enabling full attention between every token in the query and every token in the document. This is far more accurate but requires running the model fresh for every query-document pair — you can't pre-compute document representations.
The practical pattern: use bi-encoders to retrieve 50-100 candidates, use cross-encoders to re-rank to top 5-10.
# Cross-encoder re-ranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each query-document pair
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Sort by score (highest first), take top-k
ranked = sorted(zip(scores, candidate_chunks), key=lambda pair: pair[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]
Popular cross-encoder models for re-ranking include ms-marco-MiniLM variants for general English retrieval and bge-reranker models for multilingual use cases. Cohere Rerank is a managed API option for teams that don't want to self-host.
Context Assembly and Prompt Construction
Retrieval produces a set of candidate chunks. What you do with those chunks before sending them to the LLM significantly affects response quality. Context assembly is an under-appreciated engineering discipline.
The Context Assembly Problem
The LLM's context window is finite and expensive. Every token used for context is a token not available for the response. More critically, LLMs have a documented "lost in the middle" problem: performance degrades for relevant information placed in the center of a long context window. The model reliably attends to the beginning and end of context but under-attends to the middle.
This means context assembly is not just packing — it's strategic ordering.
Deduplication
After retrieval and re-ranking, multiple chunks from the same document section may appear in the candidate set. Overlapping sliding window chunks, in particular, produce near-duplicate content. Deduplicate before prompt construction using:
- Exact deduplication: remove identical chunks
- Near-deduplication: compute pairwise cosine similarity between chunks; remove chunks above a similarity threshold (e.g., 0.95)
Duplicates inflate context length without adding information and increase cost.
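A near-deduplication pass can be sketched with NumPy, assuming each chunk already carries a unit-normalized embedding (so dot product equals cosine similarity); the threshold and toy vectors below are illustrative:

```python
import numpy as np

def near_dedup(chunks: list[str], embeddings: np.ndarray, threshold: float = 0.95) -> list[str]:
    """Keep a chunk only if it is below the similarity threshold against every
    already-kept chunk. Assumes rows of `embeddings` are unit-normalized."""
    kept_idx: list[int] = []
    for i in range(len(chunks)):
        if all(float(embeddings[i] @ embeddings[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [chunks[i] for i in kept_idx]

# Toy example: chunks 0 and 1 are near-duplicates, chunk 2 is distinct
emb = np.array([[1.0, 0.0], [0.999, 0.0447], [0.0, 1.0]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(near_dedup(["a", "a'", "b"], emb))  # ['a', 'b']
```

Because chunks arrive in relevance order after re-ranking, this greedy pass keeps the higher-ranked member of each near-duplicate pair.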
Context Compression
Even after deduplication, retrieved chunks may contain irrelevant sentences. LLMLingua and similar context compression techniques use a small language model to identify and remove sentences from chunks that are not relevant to the query, reducing context length by 2-10x with minimal accuracy degradation.
Original chunk (400 tokens): "The company was founded in 1995. [CEO history, founding story, early products...]
Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year. [additional context not related to query]"
Compressed chunk (80 tokens): "Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year."
Context compression is especially valuable when dealing with long-tail enterprise documents where relevant information is sparse within each document.
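A simple extractive variant can be sketched without a dedicated compression model: score each sentence in a chunk against the query embedding and keep only the top-scoring sentences. The `embed` argument is a stand-in for whatever embedding model the pipeline already uses, and is assumed to return unit-normalized vectors:

```python
import numpy as np

def compress_chunk(chunk: str, query_vec: np.ndarray, embed, keep: int = 2) -> str:
    """Keep the `keep` sentences most similar to the query, in original order.
    `embed` is assumed to return unit-normalized vectors (so @ = cosine similarity)."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    sims = [float(embed(s) @ query_vec) for s in sentences]
    # Indices of the best `keep` sentences, restored to document order
    top = sorted(np.argsort(sims)[::-1][:keep])
    return ". ".join(sentences[i] for i in top) + "."
```

This naive sentence split and scoring loop is far cruder than LLMLingua, but it captures the core mechanism: drop the sentences that don't pull their weight against the query.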
Prompt Construction with Citation Anchors
Instruct the model to cite sources using reference markers in the context:
SYSTEM: You are a research assistant. Answer questions using ONLY the provided context.
For every factual claim, include a citation in the format [Source N].
If the answer is not in the context, say "I cannot find this in the available documents."
CONTEXT:
[Source 1] (Q3 2024 Earnings Report, p.4)
Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year...
[Source 2] (Annual Report 2024, p.12)
The growth was driven primarily by expansion in the APAC region...
QUESTION: What drove revenue growth in Q3 2024?
This prompt structure serves two purposes: it anchors the model's response to retrievable evidence, and it enables post-generation grounding verification — you can programmatically check whether cited [Source N] passages actually support the claims made.
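The second purpose can be partially automated. A regex pass over the generated answer checks that every citation refers to a source actually present in the prompt (verifying that a cited passage truly supports a claim would additionally require an entailment model, which this sketch omits):

```python
import re

def extract_citations(answer: str) -> set[int]:
    """Return the set of source numbers cited as [Source N] in the answer."""
    return {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}

def validate_citations(answer: str, num_sources: int) -> bool:
    """True if the answer cites at least one source and every citation
    refers to a source that was actually in the prompt."""
    cited = extract_citations(answer)
    return bool(cited) and all(1 <= n <= num_sources for n in cited)

answer = "Revenue grew 12% [Source 1], driven by APAC expansion [Source 2]."
print(extract_citations(answer))               # {1, 2}
print(validate_citations(answer, 2))           # True
print(validate_citations("Revenue grew.", 2))  # False: no citations at all
```

Answers that fail this check can be regenerated or flagged for review before they reach the user.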
Ordering Chunks for Maximum Attention
Given the "lost in the middle" phenomenon, place the most relevant chunks at the beginning and end of your context window, with less relevant chunks in the middle. If you have 5 chunks ranked 1-5, order them: [1, 3, 5, 4, 2] — rank 1 at position 0, rank 2 at the last position.
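This ordering generalizes to any number of chunks by alternately filling the front and back of the context:

```python
def reorder_for_attention(ranked_chunks: list) -> list:
    """Place top-ranked chunks at the edges of the context window, pushing the
    least relevant into the middle ("lost in the middle" mitigation).
    ranked_chunks[0] is assumed to be the most relevant."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)  # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(chunk)   # ranks 2, 4, ... fill from the end
    return front + back[::-1]

print(reorder_for_attention([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

The output matches the ordering above: rank 1 opens the context, rank 2 closes it, and the weakest candidates sit in the under-attended middle.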
The 5 Types of RAG Architectures
RAG is not a single pattern — it is a design space. These five architectures represent distinct points in that space, each suited to different problem types.
| Parameter | Naive RAG | Advanced RAG | Agentic RAG | Multi-Hop RAG | Graph RAG |
|---|---|---|---|---|---|
| Core pattern | Single query → retrieve → generate | Query rewriting → hybrid retrieval → re-rank → generate | LLM agent decides when and what to retrieve dynamically | Chain of retrievals, each refining the next | Knowledge graph traversal + vector search |
| Retrieval strategy | Single vector search | Hybrid BM25 + vector with re-ranking | Dynamic, tool-driven, iterative | Sequential, context-informed hops | Graph traversal + vector similarity |
| Query understanding | Raw query embedded as-is | HyDE rewriting, query decomposition | Agent plans and reformulates queries autonomously | LLM evaluates sufficiency and generates the next sub-query | Entity extraction drives the traversal path |
| Handles multi-part questions? | No | Partially | Yes | Yes — by design | Yes, via graph relationships |
| Handles entity relationships? | No | No | Partially | No | Yes — core strength |
| Implementation complexity | Low | Medium | High | High | Very High |
| Latency | Lowest | Low–Medium | High (multiple LLM calls) | High (multiple retrieval hops) | Medium–High |
| Hallucination risk | High | Low | Low | Low | Very Low |
| Cost | Lowest | Moderate | High | High | High |
| AWS tooling | Bedrock Knowledge Bases | Bedrock KB + OpenSearch | Amazon Bedrock Agents | Custom Lambda + Step Functions | Neptune + OpenSearch + Bedrock |
| Best for | Demos, simple FAQ, POCs | Enterprise search, support copilots, research | Complex reasoning, multi-domain Q&A, ambiguous queries | Evidence chaining, legal discovery, literature review | Regulatory compliance, supply chain, interconnected enterprise data |
| Production-ready? | Not recommended | Yes | Yes, with orchestration framework | Yes, for specific use cases | Yes, for graph-native domains |
AWS-Native RAG Architectures
AWS provides two primary paths to building production RAG: a managed path with Amazon Bedrock Knowledge Bases, and a custom path for teams requiring architectural control.
Architecture Option 1: Amazon Bedrock Knowledge Bases (Managed)
This is the fastest path from zero to a working RAG pipeline on AWS. Bedrock Knowledge Bases abstracts the entire offline pipeline.
Architecture:
[Amazon S3]
→ Documents (PDF, DOCX, HTML, CSV, XLSX)
↓
[Bedrock Knowledge Bases]
→ Automatic chunking (fixed or semantic)
→ Automatic embedding (Amazon Titan Embeddings or Cohere)
→ Automatic indexing
↓
[Vector Store] (managed OpenSearch Serverless, Aurora pgvector, or Pinecone)
↓
[Bedrock Retrieve & Generate API]
→ Handles query embedding, retrieval, and LLM invocation in one call
↓
[Application Layer] (Lambda, ECS, API Gateway)
Key API call:
import boto3

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

response = bedrock_agent_runtime.retrieve_and_generate(
    input={'text': 'What is our Q3 2024 revenue?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'YOUR_KB_ID',
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {
                    'numberOfResults': 5,
                    'overrideSearchType': 'HYBRID'  # BM25 + vector
                }
            }
        }
    }
)

print(response['output']['text'])
print(response['citations'])  # Sources returned with every response
When to use Bedrock Knowledge Bases: Teams without dedicated ML engineers, fast deployment requirements, standard document types, and acceptable customization constraints.
Architecture Option 2: Custom RAG Pipeline (Full Control)
When Bedrock Knowledge Bases' abstractions are insufficient — custom chunking strategies, specialized embedding models, non-standard document types, complex re-ranking pipelines — build a custom pipeline using AWS primitives.
Architecture:
OFFLINE PIPELINE:
S3 (raw documents)
↓ [S3 Event → Lambda trigger]
Lambda (document parsing: PyMuPDF, Unstructured.io)
↓
Lambda (chunking: semantic chunker)
↓
Bedrock Embedding API (Titan Embeddings V2 or Cohere)
↓
OpenSearch Service (vector index with metadata)
↓ [Step Functions orchestrates the above]
ONLINE PIPELINE:
API Gateway
↓
Lambda (query handler)
↓ [parallel]
Bedrock Embedding API (query vector) | OpenSearch BM25 query
↓ [merge with RRF]
Lambda (cross-encoder re-ranking)
↓
Lambda (context assembly + prompt construction)
↓
Bedrock LLM API (Claude, Llama 3, etc.)
↓
API Gateway Response
AWS Service Mapping:
| Pipeline Stage | AWS Service |
|---|---|
| Document storage | Amazon S3 |
| Pipeline orchestration | AWS Step Functions |
| Document parsing & chunking | AWS Lambda |
| Embedding generation | Amazon Bedrock (Titan Embeddings V2) |
| Vector storage + hybrid search | Amazon OpenSearch Service |
| Re-ranking | Lambda + self-hosted cross-encoder (or Cohere Rerank API) |
| LLM generation | Amazon Bedrock (Claude 3.5, Llama 3.1, etc.) |
| API serving | API Gateway + Lambda |
| Caching | Amazon ElastiCache (Redis) |
| Monitoring | Amazon CloudWatch + AWS X-Ray |
When to use custom pipelines: Domain-specific embedding models, custom chunking for specialized document types (e.g., legal contracts, code, medical records), multi-language requirements, custom re-ranking models, and complex multi-hop retrieval logic.
Evaluating a RAG System (The Most Critical Section)
Evaluation is not optional. It is the engineering practice that separates a working system from a broken one. Once evaluation tells you where the system falls short, the optimizations below, grouped into retrieval, generation, and cost, are the levers for fixing it:
Retrieval Optimizations
- Better chunking: If your retrieval recall is below 0.80, investigate your chunking before anything else. Move from fixed-size to semantic chunking for prose documents. Implement hierarchical indexing for long-form documents.
- Hybrid search: It should be the default for any system handling specialized terminology, proper nouns, or regulatory/legal language. The RRF implementation is 30 lines of code and consistently improves retrieval metrics by 5-15%.
- Query rewriting with HyDE: Doing this improves retrieval for short or ambiguous queries. HyDE (Hypothetical Document Embedding) works by asking the LLM to generate a hypothetical answer to the query, embedding that answer, and using the resulting vector to search.
- Metadata extraction and filtering reduce the retrieval search space dramatically. A well-tagged corpus with document type, department, and date metadata allows scoping retrieval to the relevant subset before vector search runs.
Generation Optimizations
- System prompt engineering: Include explicit grounding instructions: "Answer only using the provided context. If the context does not contain the answer, say so explicitly." Include citation instructions: "Cite every factual claim with [Source N]." Include uncertainty instructions: "If context is ambiguous or contradictory, note the uncertainty."
- Context compression: Reduces noise in the context window. LLMLingua, Selective Context, or simple extractive compression methods can reduce context length by 2-5x. Shorter context → lower cost, better attention focus, reduced hallucination from irrelevant passages.
Cost Optimizations
- Query caching: Cache responses for identical or near-identical queries. Enterprise environments often have repetitive query patterns — "What is our PTO policy?" gets asked hundreds of times daily. Cache hit rates of 20-40% are common.
- Dynamic retrieval: Implement confidence-based retrieval: if the top retrieved chunk scores above a similarity threshold and re-ranker score exceeds another threshold, send only that chunk to the LLM.
- Model routing: Classify query complexity. Route simple factual queries to Claude Haiku or Llama 3.1 8B. Reserve Claude 3 Sonnet or Opus for complex multi-hop or analytical queries.
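The dynamic retrieval gate above can be sketched as a small function. The thresholds are illustrative and should be tuned against your evaluation set, not taken as recommended values:

```python
def assemble_context(ranked, rerank_scores, sim_scores,
                     sim_threshold: float = 0.85, rerank_threshold: float = 0.9,
                     max_chunks: int = 5):
    """Confidence-based dynamic retrieval: when the top candidate is a
    high-confidence match on both the similarity and re-ranker scores,
    send it alone to the LLM and save tokens; otherwise fall back to top-k.
    Inputs are assumed sorted by re-ranker score, best first."""
    if sim_scores[0] >= sim_threshold and rerank_scores[0] >= rerank_threshold:
        return ranked[:1]
    return ranked[:max_chunks]

chunks = ["a", "b", "c", "d", "e", "f"]
print(assemble_context(chunks, [0.95, 0.4], [0.9, 0.3]))  # ['a']: confident, one chunk
print(assemble_context(chunks, [0.6, 0.4], [0.9, 0.3]))   # top-5 fallback
```

Combined with query caching and model routing, this kind of gate is where most of the per-query cost reduction comes from.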
Conclusion
There is a persistent myth in the enterprise AI market: that deploying an LLM is the hard part. It isn't. The hard part is connecting that LLM to your knowledge reliably, evaluating whether the connection is working, and maintaining that connection at production scale as your knowledge evolves.
RAG is not a prompt pattern. It is not a hack around LLM limitations. It is a full information retrieval system — built on decades of IR research — that happens to use an LLM as its synthesis layer. The companies that understand this are building AI systems that work. The companies that don't are shipping demos that break in production.
On AWS, the practical path is this: start with Amazon Bedrock Knowledge Bases for fast deployment and validation. Once you understand your requirements — your failure modes, your performance bottlenecks, your cost structure — move to a custom pipeline using OpenSearch for hybrid retrieval, Lambda for orchestration, and Bedrock LLMs for generation. Use Step Functions to make the offline pipeline auditable and replayable.
The difference between a production AI assistant and a demo is retrieval architecture, evaluation frameworks, and cost optimization. You now have the foundation for all three.
If you are building on AWS and want to get the architecture right from the start, this is exactly what we do at Mactores. We consult with enterprises on AWS AI architecture — from selecting the right RAG pattern for your use case to building the evaluation infrastructure that keeps it working in production. If you are at that stage, we would be glad to talk.

