Mactores Blog

RAG Explained: Architecture, Evaluation, & Production Systems

Written by Bal Heroor | Mar 11, 2026 9:00:00 AM

For most of computing history, software could only do what it was explicitly programmed to do. Rules had to be written. Logic had to be encoded. Every edge case had to be anticipated. This made software powerful but fundamentally brittle. The moment a situation fell outside the rules, the system broke.

Large language models changed this fundamentally. For the first time, a system could understand — process natural language, reason across domains, synthesize information, generate structured outputs, and do all of this without a single hand-written rule. The surface area of what software could now address exploded. Tasks that previously required human interpretation, like reading a contract, summarizing a report, answering an ambiguous question, and writing code, have become automatable.

This is why LLMs became the central infrastructure bet for enterprise AI. Not because they are the final form of intelligence, but because they eliminated the bottleneck that had constrained automation for decades: the need for explicit programming. An LLM can be directed with natural language, generalize to new situations, and operate across virtually any domain. That flexibility is unprecedented.

But reasoning well requires knowing the right things. This is precisely where their architecture creates constraints that matter enormously in production.

 If you prefer to start with a visual walkthrough, we have also covered this topic in depth on video — watch it here. Otherwise, let's get into the RAG architecture. 

 

 

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation decouples the model's reasoning capability from its knowledge. Instead of encoding knowledge in weights, RAG supplies the model with relevant, current, private information at inference time, which is retrieved from a live external knowledge store and injected directly into the prompt context.

The model's role becomes interpretation and synthesis. The retrieval system's role is evidence supply. This separation is architecturally clean: knowledge stays versioned, updatable, and auditable in the retrieval layer; the model handles the language understanding and generation it was built for.

 

The Five-Step Flow

1. User submits a query

2. The query is embedded into a vector and used to search a knowledge base

3. The top-k most relevant documents are retrieved

4. Those documents are injected into the LLM's prompt as context

5. The LLM generates a response grounded in that context

 

Step 5 is the key differentiator. The LLM is not generating freely from parametric memory. It is grounded by evidence. If the retrieved documents don't contain the answer, a well-instructed LLM will say so rather than confabulate.

 

The Two Core Subsystems

| Subsystem | Role | Technology |
| --- | --- | --- |
| Retriever | Finds the most relevant knowledge from the knowledge base given the query | Embedding models, vector databases, BM25, hybrid search |
| Generator | Takes the query + retrieved context and produces a final, grounded response | LLMs (GPT-4, Claude, Mistral, Llama, etc.) |

 

The retriever and generator are loosely coupled. You can swap one without changing the other. This modularity is one of RAG's greatest engineering strengths.

 

RAG vs. Fine-Tuning vs. Prompt Engineering

These three approaches are frequently treated as interchangeable, but they operate at entirely different layers of the stack and solve fundamentally different problems. Here's how they compare (with knowledge graphs included as a fourth point of reference) across the parameters that matter in production:

| Parameter | Prompt Engineering | Fine-Tuning | RAG | Knowledge Graphs |
| --- | --- | --- | --- | --- |
| What it changes | Input instruction only | Model weights | Retrieval context at inference | Structured entity-relationship store |
| Adds new knowledge? | No — bounded by training data | Weakly — facts are brittle in weights | Yes — dynamic, external, always current | Yes — structured, explicit relationships |
| Handles private data? | No | Only if included in training data | Yes | Yes |
| Knowledge freshness | Static (training cutoff) | Static (retraining required) | Real-time (re-index on update) | Near real-time (graph update) |
| Update cost | None — just edit the prompt | High — requires full or LoRA retraining cycle | Low — re-embed and re-index changed documents | Medium — extract entities, update graph |
| Best for | Tone, format, behavior, task framing | Domain style, reasoning patterns, output structure | Knowledge-intensive Q&A over evolving private data | Highly connected data with explicit relationships |
| Hallucination risk | High — no grounding mechanism | Medium — facts encoded loosely in weights | Low — grounded in retrieved evidence | Low — answers derived from explicit graph facts |
| Auditability | None | None | High — every answer traceable to source chunks | High — every answer traceable to graph nodes |
| Inference cost impact | Minimal | None (cost is in training) | Moderate — embedding + retrieval + LLM tokens | Moderate — graph traversal + LLM tokens |
| Implementation complexity | Low | High | Medium | High |

 

The Complete RAG Pipeline

RAG consists of two distinct pipelines running at different times. Understanding them separately is essential for debugging and optimization.

 

Offline Pipeline: Data Preparation (Runs Periodically)

This pipeline processes raw documents and prepares them for retrieval. It runs when documents are added, updated, or deleted — not on every query.

Stage 1 — Document Ingestion: Raw documents are collected from sources (S3, SharePoint, Confluence, CRM APIs, databases). Format-specific parsers handle each file type. Metadata is extracted and stored alongside content.

Stage 2 — Document Parsing: Raw bytes are converted into structured text. PDFs are parsed for text layers (or OCR'd if scanned). HTML is stripped of navigation and boilerplate. Tables are extracted as structured content. Parsing quality directly determines the ceiling of your RAG system.

Stage 3 — Chunking: Parsed documents are split into retrievable units. The strategy used here (covered in depth in Section 5) is arguably the most impactful decision in the entire pipeline.

Stage 4 — Embedding Generation: Each chunk is passed through an embedding model (e.g., Amazon Titan Embeddings, OpenAI text-embedding-3-large, or open-source models like bge-large-en) to produce a dense vector representation, typically 768 to 3072 dimensions depending on the model.

Stage 5 — Vector Indexing: Vectors are stored in a vector database. An index structure (HNSW, IVF, etc.) is built to enable fast approximate nearest neighbor search at query time.

 

Online Pipeline: Query Flow (Runs on Every Request)

Stage 1 — Query Embedding: The user's query is embedded using the same model used to embed documents. This is critical — using different models breaks the semantic similarity space.

Stage 2 — Vector Retrieval: The query vector is used to search the index for the top-k nearest neighbors (typically k=20 to 50 before re-ranking).

Stage 3 — Re-Ranking: Retrieved candidates are re-scored by a cross-encoder model to select the most relevant k' documents (typically k'=3 to 10).

Stage 4 — Context Assembly: Retrieved chunks are deduplicated, ordered by relevance, and formatted into a prompt template.

Stage 5 — LLM Generation: The assembled prompt (system instructions + retrieved context + user query) is sent to the LLM, which generates a grounded response.

Understanding both pipelines lets you localize failures precisely. A bad answer is either a retrieval failure (offline pipeline problem) or a generation failure (online pipeline problem). Knowing which determines your fix.



Vector Search: The Engine Behind RAG

When an embedding model processes text, it maps that text to a point in a high-dimensional vector space. The position of that point encodes semantic meaning. Texts that mean similar things map to nearby points. This is the geometric foundation of RAG retrieval.

 

Embeddings: A Brief Technical Grounding

An embedding model is a transformer-based neural network trained to produce representations where semantic similarity correlates with geometric proximity. The model outputs a vector of floats — 768, 1536, or 3072 dimensions depending on the model.

The training objective typically involves contrastive learning: similar text pairs (e.g., a question and its answer) are pushed together in the vector space. Dissimilar pairs are pushed apart. After training, the model generalizes this learned geometry to new text.

 

Similarity Metrics: The Geometry of Relevance

Once documents and queries are embedded, retrieval is a geometric problem: find the document vectors closest to the query vector.

Cosine similarity measures the angle between two vectors, independent of their magnitudes. This is the most common metric for text embeddings because it captures directional similarity (semantic meaning) without being dominated by vector magnitude.

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

 

Range: [-1, 1]

1 = identical direction (maximum semantic similarity)

0 = orthogonal (no semantic relationship)

-1 = opposite direction

 

The dot product (inner product) is equivalent to cosine similarity when vectors are normalized to unit length. Many embedding models normalize their outputs, making dot product and cosine similarity interchangeable. Dot product is computationally cheaper, which matters at scale.

Euclidean distance (L2 distance) measures absolute distance between vector endpoints. Less commonly used for text embeddings because it conflates magnitude with direction.
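The equivalence between cosine similarity and the dot product on normalized vectors can be checked in a few lines of plain Python (a minimal sketch using toy vectors; no external libraries assumed):

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return sqrt(dot(v, v))

def cosine_similarity(a, b):
    # Angle between vectors, independent of magnitude
    return dot(a, b) / (norm(a) * norm(b))

def normalize(v):
    # Scale to unit length so dot product equals cosine similarity
    n = norm(v)
    return [x / n for x in v]

a = [0.3, 0.7, 0.1]
b = [0.2, 0.9, 0.05]

# For unit-length vectors, the two metrics coincide
assert abs(cosine_similarity(a, b) - dot(normalize(a), normalize(b))) < 1e-9
```

This is why many vector databases store pre-normalized embeddings and use the cheaper dot product at query time.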

 

Enterprise Vector Database Options

The vector database market has fragmented rapidly. Choosing the right one requires understanding your scale, latency requirements, filtering needs, and infrastructure constraints.

 

AWS-Native Options

1. Amazon OpenSearch (with vector search)

OpenSearch's k-NN plugin adds HNSW-based vector search to a battle-tested distributed search engine. The critical advantage is that OpenSearch unifies full-text BM25 search with vector similarity search in a single system — hybrid retrieval without a separate pipeline. It also supports rich metadata filtering via its mature query DSL.

Suitable for: teams already operating OpenSearch for log analytics or full-text search who want to add vector retrieval without introducing a new system.

 

2. Amazon Aurora PostgreSQL with pgvector

pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a relational database. Vectors are stored as columns alongside structured data. This enables exact SQL joins between vector retrieval and structured metadata filtering.

-- Retrieve top-5 chunks for a query vector, filtered by department and year
SELECT content, metadata, embedding <=> $1::vector AS distance
FROM document_chunks
WHERE metadata->>'department' = 'legal'
  AND (metadata->>'year')::int > 2022
ORDER BY distance
LIMIT 5;

pgvector supports exact search (a sequential scan with no index) as well as approximate index types (IVFFlat and HNSW). It is not the fastest vector search engine at massive scale, but for datasets under 10M vectors with complex metadata requirements, its ability to run SQL-native vector queries is operationally valuable.

 

3. Amazon Kendra

Kendra is a managed enterprise search service that abstracts the indexing, retrieval, and ranking pipeline. It natively connects to S3, SharePoint, Confluence, Salesforce, and other enterprise content sources. Retrieval uses a combination of semantic and keyword matching without requiring you to manage embeddings.

Kendra is best for enterprise customers who need fast deployment without deep customization. The trade-off is reduced flexibility — you cannot control chunking strategies, embedding models, or re-ranking in the way a custom pipeline would allow.

 

4. Amazon Bedrock Knowledge Bases

AWS Bedrock Knowledge Bases is the highest-level abstraction for RAG on AWS. You connect an S3 bucket, select an embedding model, and AWS handles the full offline pipeline: parsing, chunking, embedding, and indexing into a managed vector store (backed by OpenSearch Serverless). The query API handles embedding + retrieval automatically.

For teams building on Bedrock LLMs, this is the lowest-friction path to a working RAG system. The cost is customization: chunking strategy and retrieval logic are partially configurable but not fully open.

 

Naive RAG Architecture

Before building complex systems, understand the simplest possible RAG implementation. Naive RAG is what most tutorials demonstrate. Knowing its limits is what motivates every optimization in Section 9 and beyond.

 

The Pipeline

User Query

Embedding Model (query → vector)

Vector Database (vector similarity search)

Top-k Retrieved Chunks (e.g., top 3-5)

Prompt Template (system prompt + chunks + query)

LLM

Response

 

The implementation is genuinely simple:

# Simplified naive RAG
def query_naive_rag(user_query: str, k: int = 5) -> str:
    # 1. Embed the query
    query_vector = embedding_model.embed(user_query)

    # 2. Retrieve top-k chunks
    results = vector_db.similarity_search(query_vector, k=k)
    chunks = [r.content for r in results]

    # 3. Build prompt
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: {user_query}

Answer:"""

    # 4. Generate
    return llm.generate(prompt)

 

Why Naive RAG Breaks in Production

  • Irrelevant retrieval — Vector search returns semantically similar chunks, not necessarily the ones that answer the question; without re-ranking, the LLM receives noise.
  • Poor chunking destroys context — Fixed-size splits produce mid-argument fragments where the retrieved chunk references "the factors discussed in the previous section" — a section the LLM never sees.
  • Context window overload — The "lost in the middle" paper shows LLM performance degrades when relevant content is buried in a long context; naive RAG dumps chunks in retrieval order, not importance order.
  • Single-query retrieval fails multi-part questions — A question spanning Q3 revenue and a five-year trend requires two distinct retrievals; a single vector query will capture one and miss the other.
  • No query understanding — Conversational queries like "What about the new policy?" are embedded as-is, with no resolution of what "new policy" refers to in context.

 

Advanced RAG Architecture

Advanced RAG is not a single architecture — it is a collection of targeted improvements to naive RAG, each solving a specific failure mode. A production system selectively applies these improvements based on the requirements of the use case.

 

The Components

User Query

[1] Query Understanding & Rewriting

[2] Hybrid Retrieval (vector + BM25)

[3] Metadata Filtering

Candidate Set (top-50 to top-100 chunks)

[4] Cross-Encoder Re-Ranking

Refined Set (top-3 to top-10 chunks)

[5] Context Compression & Assembly

[6] Prompt Construction (with citations)

LLM

[7] Response with Source Attribution

 

Each numbered stage is an optional enhancement. You don't implement all of them at once. You diagnose failure modes and add the relevant fix.

[1] Query Understanding & Rewriting: The raw user query is analyzed and transformed before retrieval. Techniques include query expansion (adding synonyms or related terms), query decomposition (splitting multi-part questions into sub-queries), and HyDE (Hypothetical Document Embedding — generating a hypothetical answer and using its embedding to retrieve documents that look like the answer).
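HyDE itself fits in a few lines. In this sketch, `llm_generate`, `embed`, and `vector_search` are placeholder callables standing in for your LLM, embedding model, and vector store; the toy components below exist only to make the example runnable:

```python
def hyde_retrieve(query, llm_generate, embed, vector_search, k=20):
    # HyDE: embed a hypothetical ANSWER rather than the query itself,
    # so retrieval matches documents that look like answers.
    hypothetical_answer = llm_generate(
        f"Write a short passage that plausibly answers this question:\n{query}"
    )
    return vector_search(embed(hypothetical_answer), k=k)

# Stand-in components for illustration only:
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}

def toy_llm(prompt):
    return "a hypothetical answer passage"

def toy_embed(text):
    return [1.0, 0.0]

def toy_search(vec, k):
    score = lambda d: sum(x * y for x, y in zip(docs[d], vec))
    return sorted(docs, key=score, reverse=True)[:k]
```

The key design point: the query and the documents live in different linguistic registers (questions vs. answers), and HyDE bridges that gap by searching answer-space with an answer-shaped vector.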

[2] Hybrid Retrieval: Combines vector similarity search with BM25 keyword search (see Section 10). Dense retrieval captures semantic meaning. Sparse retrieval captures exact keyword matches. Together they outperform either alone across virtually every benchmark.

[3] Metadata Filtering: Before or after vector search, filter by structured attributes. Retrieval scoped to department=engineering AND date>2024-01-01 is dramatically more precise than full-corpus retrieval.

[4] Re-Ranking: A cross-encoder model re-scores candidates with full query-document attention (unlike bi-encoder embedding models). This second-pass scoring is expensive but dramatically improves precision.

[5] Context Compression: Remove portions of retrieved chunks that are not relevant to the query. Reduces noise in the context window and leaves more room for additional documents.

[6] Citation-Aware Prompting: Instruct the model to attribute every factual claim to its source chunk. This enables post-generation grounding verification.

 

Retrieval Improvements

The effectiveness of a RAG system depends heavily on the quality of its retrieval stage. Advanced retrieval techniques help ensure that the most relevant documents reach the LLM before answer generation.

 

BM25: Classical Information Retrieval

BM25 (Best Match 25) is a probabilistic ranking function developed in the 1990s that remains one of the strongest baseline retrieval algorithms for exact keyword matching.

The BM25 score for a document D given query Q is:

BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]

 

Where:

  • f(qᵢ, D) = term frequency of query term i in document D
  • IDF(qᵢ) = inverse document frequency (rare terms score higher)
  • |D| = document length, avgdl = average document length
  • k₁ = term frequency saturation parameter (typically 1.2-2.0)
  • b = length normalization parameter (typically 0.75)

 

BM25 excels at exact match retrieval: product codes, error codes, proper nouns, legal terms, abbreviations. A query for "SOX Section 404 compliance" will retrieve documents containing "SOX Section 404" with high precision via BM25. A dense vector search on the same query might retrieve semantically related documents about regulatory compliance that don't mention SOX at all.

BM25 fails on semantic paraphrase. A query about "revenue growth" won't retrieve a document that discusses "top-line expansion" unless those exact words also appear.
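The formula above can be sanity-checked with a minimal pure-Python scorer (a sketch using the common smoothed-IDF variant; tokenization and stopword handling are omitted):

```python
from math import log

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    # doc and every corpus entry are token lists; corpus includes doc.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = log((N - df + 0.5) / (df + 0.5) + 1)     # smoothed IDF: rare terms score higher
        tf = doc.count(term)                           # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    ["sox", "section", "404", "compliance"],
    ["revenue", "growth", "report"],
]
# The document containing the exact terms scores higher than one that doesn't
assert bm25_score(["sox", "404"], corpus[0], corpus) > bm25_score(["sox", "404"], corpus[1], corpus)
```

Note how a document containing none of the query terms scores exactly zero: this is BM25's exact-match behavior, and precisely why it misses paraphrases.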

 

Hybrid Search: Reciprocal Rank Fusion

Hybrid search combines BM25 and dense vector retrieval results into a single ranked list. The standard merging algorithm is Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σₛ 1 / (k + rankₛ(d))

Where rankₛ(d) is the rank of document d in retrieval system s, and k is a smoothing constant (typically 60).

def reciprocal_rank_fusion(bm25_results, dense_results, k=60):
    scores = {}
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

 

The elegance of RRF is that it doesn't require score normalization. BM25 scores and cosine similarities live in different ranges — RRF operates only on ranks, making them directly combinable.

Empirically, hybrid search with RRF outperforms either pure vector or pure BM25 retrieval across most benchmarks. The intuition: if a document ranks high in both BM25 and vector retrieval, it almost certainly matches the query on both exact and semantic dimensions. RRF amplifies this agreement.

 

Metadata Filtering

Metadata filtering constrains the retrieval search space using structured attributes stored alongside vectors. This is the RAG equivalent of a SQL WHERE clause before a JOIN.

Every document chunk should carry metadata extracted at ingestion time:

{
  "content": "The company's Q3 2024 revenue was $4.2 billion...",
  "embedding": [0.023, -0.144, 0.891, ...],
  "metadata": {
    "source": "Q3-2024-earnings-report.pdf",
    "document_type": "earnings_report",
    "department": "finance",
    "year": 2024,
    "quarter": "Q3",
    "created_at": "2024-10-15",
    "author": "CFO Office"
  }
}

 

Pre-filtering (filter before vector search) is more efficient but may miss relevant documents if filters are too restrictive. Post-filtering (filter after vector search) is safer but wastes compute on irrelevant vectors. Most production systems use pre-filtering at the category level (document type, department) and rely on re-ranking for finer relevance judgments.
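Post-filtering is essentially a predicate over the result set. A minimal sketch, assuming each result is a dict carrying a `metadata` key (the shape shown in the example above):

```python
def post_filter(results, filters):
    # Keep only results whose metadata matches every filter key/value pair.
    return [
        r for r in results
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]

results = [
    {"id": 1, "metadata": {"department": "legal", "year": 2024}},
    {"id": 2, "metadata": {"department": "hr", "year": 2024}},
]
legal_only = post_filter(results, {"department": "legal"})
```

Pre-filtering pushes the same predicate into the vector store's query instead, which is why it saves compute but risks over-constraining the candidate set.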

 

Re-Ranking: Cross-Encoders

The retrieval pipeline's first stage uses bi-encoders — models that embed query and document independently, then compare vectors. Bi-encoders are fast (queries and documents can be pre-computed and cached) but their independence assumption limits relevance accuracy.

Cross-encoders process the query and document together in a single forward pass, enabling full attention between every token in the query and every token in the document. This is far more accurate but requires running the model fresh for every query-document pair — you can't pre-compute document representations.

The practical pattern: use bi-encoders to retrieve 50-100 candidates, use cross-encoders to re-rank to top 5-10.

# Cross-encoder re-ranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each query-document pair
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Sort by score, take top-k
ranked = sorted(zip(scores, candidate_chunks), key=lambda x: x[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]

 

Popular cross-encoder models for re-ranking include ms-marco-MiniLM variants for general English retrieval and bge-reranker models for multilingual use cases. Cohere Rerank is a managed API option for teams that don't want to self-host.

 

Context Assembly and Prompt Construction

Retrieval produces a set of candidate chunks. What you do with those chunks before sending them to the LLM significantly affects response quality. Context assembly is an under-appreciated engineering discipline.

 

The Context Assembly Problem

The LLM's context window is finite and expensive. Every token used for context is a token not available for the response. More critically, LLMs have a documented "lost in the middle" problem: performance degrades for relevant information placed in the center of a long context window. The model reliably attends to the beginning and end of context but under-attends to the middle.

This means context assembly is not just packing — it's strategic ordering.

 

Deduplication

After retrieval and re-ranking, multiple chunks from the same document section may appear in the candidate set. Overlapping sliding window chunks, in particular, produce near-duplicate content. Deduplicate before prompt construction using:

  • Exact deduplication: remove identical chunks
  • Near-deduplication: compute pairwise cosine similarity between chunks; remove chunks above a similarity threshold (e.g., 0.95)

Duplicates inflate context length without adding information and increase cost.
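A greedy near-deduplication pass can be sketched directly from the bullet above (embeddings assumed unit-normalized, so dot product equals cosine similarity):

```python
def near_deduplicate(chunks, embeddings, threshold=0.95):
    # Greedy pass: keep a chunk only if its similarity to every
    # already-kept chunk stays below the threshold.
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        if all(sum(a * b for a, b in zip(vec, kv)) < threshold for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept

chunks = ["alpha", "alpha copy", "beta"]
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
unique = near_deduplicate(chunks, vecs)  # drops "alpha copy"
```

Because the pass is greedy, the earlier (higher-ranked) of two near-duplicates survives, which is exactly the behavior you want after re-ranking.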

 

Context Compression

Even after deduplication, retrieved chunks may contain irrelevant sentences. LLMLingua and similar context compression techniques use a small language model to identify and remove sentences from chunks that are not relevant to the query, reducing context length by 2-10x with minimal accuracy degradation.

Original chunk (400 tokens): "The company was founded in 1995. [CEO history, founding story, early products...]

Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year. [additional context not related to query]"

 

Compressed chunk (80 tokens): "Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year."

Context compression is especially valuable when dealing with long-tail enterprise documents where relevant information is sparse within each document.

 

Prompt Construction with Citation Anchors

Instruct the model to cite sources using reference markers in the context:

SYSTEM: You are a research assistant. Answer questions using ONLY the provided context.

For every factual claim, include a citation in the format [Source N].

If the answer is not in the context, say "I cannot find this in the available documents."

 

CONTEXT:

[Source 1] (Q3 2024 Earnings Report, p.4)

Revenue in Q3 2024 reached $4.2 billion, up 12% year-over-year...

 

[Source 2] (Annual Report 2024, p.12)

The growth was driven primarily by expansion in the APAC region...

 

QUESTION: What drove revenue growth in Q3 2024?

This prompt structure serves two purposes: it anchors the model's response to retrievable evidence, and it enables post-generation grounding verification — you can programmatically check whether cited [Source N] passages actually support the claims made.
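Assembling this template programmatically is straightforward. A minimal sketch, where the chunk shape (label, text pairs) is an assumption for illustration:

```python
def build_cited_prompt(question, chunks):
    # chunks: list of (source_label, text) pairs, best-ranked first.
    # Numbered [Source N] markers give the model anchors it can cite.
    context = "\n\n".join(
        f"[Source {i}] ({label})\n{text}"
        for i, (label, text) in enumerate(chunks, start=1)
    )
    return (
        "You are a research assistant. Answer questions using ONLY the provided context.\n"
        "For every factual claim, include a citation in the format [Source N].\n"
        'If the answer is not in the context, say "I cannot find this in the available documents."\n\n'
        f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    )
```

Keeping the source labels machine-parseable is what makes the post-generation verification step programmatic rather than manual.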

 

Ordering Chunks for Maximum Attention

Given the "lost in the middle" phenomenon, place the most relevant chunks at the beginning and end of your context window, with less relevant chunks in the middle. If you have 5 chunks ranked 1-5, order them: [1, 3, 5, 4, 2] — rank 1 at position 0, rank 2 at the last position.
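This reordering is a simple interleave (a minimal sketch reproducing the [1, 3, 5, 4, 2] example above):

```python
def reorder_for_attention(chunks_by_rank):
    # Place the best-ranked chunks at the edges of the context window:
    # even positions fill from the front, odd positions from the back.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks [1, 2, 3, 4, 5] become [1, 3, 5, 4, 2]:
# rank 1 first, rank 2 last, weakest chunks buried in the middle.
```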

 

The 5 Types of RAG Architectures

RAG is not a single pattern — it is a design space. These five architectures represent distinct points in that space, each suited to different problem types.

| Parameter | Naive RAG | Advanced RAG | Agentic RAG | Multi-Hop RAG | Graph RAG |
| --- | --- | --- | --- | --- | --- |
| Core pattern | Single query → retrieve → generate | Query rewriting → hybrid retrieval → re-rank → generate | LLM agent decides when and what to retrieve dynamically | Chain of retrievals, each refining the next | Knowledge graph traversal + vector search |
| Retrieval strategy | Single vector search | Hybrid BM25 + vector with re-ranking | Dynamic, tool-driven, iterative | Sequential, context-informed hops | Graph traversal + vector similarity |
| Query understanding | Raw query embedded as-is | HyDE rewriting, query decomposition | Agent plans and reformulates queries autonomously | LLM evaluates sufficiency and generates the next sub-query | Entity extraction drives the traversal path |
| Handles multi-part questions? | No | Partially | Yes | Yes — by design | Yes, via graph relationships |
| Handles entity relationships? | No | No | Partially | No | Yes — core strength |
| Implementation complexity | Low | Medium | High | High | Very High |
| Latency | Lowest | Low–Medium | High (multiple LLM calls) | High (multiple retrieval hops) | Medium–High |
| Hallucination risk | High | Low | Low | Low | Very Low |
| Cost | Lowest | Moderate | High | High | High |
| AWS tooling | Bedrock Knowledge Bases | Bedrock KB + OpenSearch | Amazon Bedrock Agents | Custom Lambda + Step Functions | Neptune + OpenSearch + Bedrock |
| Best for | Demos, simple FAQ, POCs | Enterprise search, support copilots, research | Complex reasoning, multi-domain Q&A, ambiguous queries | Evidence chaining, legal discovery, literature review | Regulatory compliance, supply chain, interconnected enterprise data |
| Production-ready? | Not recommended | Yes | Yes, with orchestration framework | Yes, for specific use cases | Yes, for graph-native domains |



AWS-Native RAG Architectures

AWS provides two primary paths to building production RAG: a managed path with Amazon Bedrock Knowledge Bases, and a custom path for teams requiring architectural control.

 

Architecture Option 1: Amazon Bedrock Knowledge Bases (Managed)

This is the fastest path from zero to a working RAG pipeline on AWS. Bedrock Knowledge Bases abstracts the entire offline pipeline.

Architecture:

 

[Amazon S3]

→ Documents (PDF, DOCX, HTML, CSV, XLSX)

[Bedrock Knowledge Bases]

→ Automatic chunking (fixed or semantic)

→ Automatic embedding (Amazon Titan Embeddings or Cohere)

→ Automatic indexing

[Vector Store] (managed OpenSearch Serverless, Aurora pgvector, or Pinecone)

[Bedrock Retrieve & Generate API]

→ Handles query embedding, retrieval, and LLM invocation in one call

[Application Layer] (Lambda, ECS, API Gateway)

 

Key API call:

import boto3

 

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

 

response = bedrock_agent_runtime.retrieve_and_generate(

input={'text': 'What is our Q3 2024 revenue?'},

retrieveAndGenerateConfiguration={

'type': 'KNOWLEDGE_BASE',

'knowledgeBaseConfiguration': {

'knowledgeBaseId': 'YOUR_KB_ID',

'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',

'retrievalConfiguration': {

'vectorSearchConfiguration': {

'numberOfResults': 5,

'overrideSearchType': 'HYBRID' # BM25 + vector

}

}

}

}

)

 

print(response['output']['text'])

print(response['citations']) # Sources returned with every response

 

When to use Bedrock Knowledge Bases: Teams without dedicated ML engineers, fast deployment requirements, standard document types, and acceptable customization constraints.

 

Architecture Option 2: Custom RAG Pipeline (Full Control)

When Bedrock Knowledge Bases' abstractions are insufficient — custom chunking strategies, specialized embedding models, non-standard document types, complex re-ranking pipelines — build a custom pipeline using AWS primitives.

Architecture:

 

OFFLINE PIPELINE:

S3 (raw documents)

↓ [S3 Event → Lambda trigger]

Lambda (document parsing: PyMuPDF, Unstructured.io)

Lambda (chunking: semantic chunker)

Bedrock Embedding API (Titan Embeddings V2 or Cohere)

OpenSearch Service (vector index with metadata)

↓ [Step Functions orchestrates the above]

 

ONLINE PIPELINE:

API Gateway

Lambda (query handler)

↓ [parallel]

Bedrock Embedding API (query vector) | OpenSearch BM25 query

↓ [merge with RRF]

Lambda (cross-encoder re-ranking)

Lambda (context assembly + prompt construction)

Bedrock LLM API (Claude, Llama 3, etc.)

API Gateway Response

 

AWS Service Mapping:

| Pipeline Stage | AWS Service |
| --- | --- |
| Document storage | Amazon S3 |
| Pipeline orchestration | AWS Step Functions |
| Document parsing & chunking | AWS Lambda |
| Embedding generation | Amazon Bedrock (Titan Embeddings V2) |
| Vector storage + hybrid search | Amazon OpenSearch Service |
| Re-ranking | Lambda + self-hosted cross-encoder (or Cohere Rerank API) |
| LLM generation | Amazon Bedrock (Claude 3.5, Llama 3.1, etc.) |
| API serving | API Gateway + Lambda |
| Caching | Amazon ElastiCache (Redis) |
| Monitoring | Amazon CloudWatch + AWS X-Ray |

 

When to use custom pipelines: Domain-specific embedding models, custom chunking for specialized document types (e.g., legal contracts, code, medical records), multi-language requirements, custom re-ranking models, and complex multi-hop retrieval logic.

 

Evaluating a RAG System (The Most Critical Section)

Evaluation is not optional. It is the engineering practice that separates a working system from a broken one. The optimization levers below target the failure modes that evaluation most commonly surfaces, grouped by the part of the system they improve:

 

Retrieval Optimizations

  • Better chunking: If your retrieval recall is below 0.80, investigate your chunking before anything else. Move from fixed-size to semantic chunking for prose documents. Implement hierarchical indexing for long-form documents.
  • Hybrid search: It should be the default for any system handling specialized terminology, proper nouns, or regulatory/legal language. The RRF implementation is 30 lines of code and consistently improves retrieval metrics by 5-15%.
  • Query rewriting with HyDE: Doing this improves retrieval for short or ambiguous queries. HyDE (Hypothetical Document Embedding) works by asking the LLM to generate a hypothetical answer to the query, embedding that answer, and using the resulting vector to search.
  • Metadata extraction and filtering reduce the retrieval search space dramatically. A well-tagged corpus with document type, department, and date metadata allows scoping retrieval to the relevant subset before vector search runs.
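Knowing whether recall is "below 0.80" requires measuring it. A minimal recall@k harness, assuming you have a labeled evaluation set of queries paired with the IDs of documents known to be relevant (the `retrieve` callable is a placeholder for your retrieval pipeline):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of known-relevant documents found in the top-k results.
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(eval_set, retrieve, k=5):
    # eval_set: list of (query, relevant_ids); retrieve(query) -> ranked id list.
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Run this on every chunking or retrieval change; without it, "better chunking" is a guess rather than a measurement.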

 

Generation Optimizations

  • System prompt engineering: Include explicit grounding instructions: "Answer only using the provided context. If the context does not contain the answer, say so explicitly." Include citation instructions: "Cite every factual claim with [Source N]." Include uncertainty instructions: "If context is ambiguous or contradictory, note the uncertainty."
  • Context compression: Reduces noise in the context window. LLMLingua, Selective Context, or simple extractive compression methods can reduce context length by 2-5x. Shorter context → lower cost, better attention focus, reduced hallucination from irrelevant passages.

 

Cost Optimizations

  • Query caching: Cache responses for identical or near-identical queries. Enterprise environments often have repetitive query patterns — "What is our PTO policy?" gets asked hundreds of times daily. Cache hit rates of 20-40% are common.
  • Dynamic retrieval: Implement confidence-based retrieval: if the top retrieved chunk scores above a similarity threshold and re-ranker score exceeds another threshold, send only that chunk to the LLM.
  • Model routing: Classify query complexity. Route simple factual queries to Claude Haiku or Llama 3.1 8B. Reserve Claude 3 Sonnet or Opus for complex multi-hop or analytical queries.
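The caching bullet above can be sketched as an exact-match cache keyed on a normalized query string (a simplified illustration; a production system might add embedding-based matching for near-duplicate queries and a TTL for freshness):

```python
import hashlib

class QueryCache:
    # Exact-match cache: whitespace and case differences hit the same entry.
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, response):
        self._store[self._key(query)] = response

cache = QueryCache()
cache.put("What is our PTO policy?", "Employees accrue PTO monthly...")
hit = cache.get("  what is our  PTO policy? ")  # normalization produces a hit
```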



Conclusion

There is a persistent myth in the enterprise AI market: that deploying an LLM is the hard part. It isn't. The hard part is connecting that LLM to your knowledge reliably, evaluating whether the connection is working, and maintaining that connection at production scale as your knowledge evolves.

RAG is not a prompt pattern. It is not a hack around LLM limitations. It is a full information retrieval system — built on decades of IR research — that happens to use an LLM as its synthesis layer. The companies that understand this are building AI systems that work. The companies that don't are shipping demos that break in production.

On AWS, the practical path is this: start with Amazon Bedrock Knowledge Bases for fast deployment and validation. Once you understand your requirements — your failure modes, your performance bottlenecks, your cost structure — move to a custom pipeline using OpenSearch for hybrid retrieval, Lambda for orchestration, and Bedrock LLMs for generation. Use Step Functions to make the offline pipeline auditable and replayable.

The difference between a production AI assistant and a demo is retrieval architecture, evaluation frameworks, and cost optimization. You now have the foundation for all three.

If you are building on AWS and want to get the architecture right from the start, this is exactly what we do at Mactores. We consult with enterprises on AWS AI architecture — from selecting the right RAG pattern for your use case to building the evaluation infrastructure that keeps it working in production. If you are at that stage, we would be glad to talk.