
Module 5: RAG Architecture — Retrieval-Augmented Generation Deep Dive

Duration: 90-120 minutes | Level: Deep-Dive | Audience: Cloud Architects, Platform Engineers, AI Engineers | Last Updated: March 2026


5.1 Why RAG Exists​

Large Language Models are powerful, but they have three fundamental limitations that make them unreliable for enterprise use without additional architecture.

Limitation 1: Knowledge Cutoff​

Every LLM is frozen in time. GPT-4o's training data has a cutoff. Claude's training data has a cutoff. If your organization published a new security policy last Tuesday, no LLM on Earth knows about it. The model will either refuse to answer or, worse, confidently fabricate something plausible.

Limitation 2: Hallucination​

When an LLM does not know the answer, it does not say "I don't know." It generates a plausible-sounding response that may be entirely fabricated. This is not a bug — it is a fundamental property of how autoregressive language models work. They predict the next most likely token, and "likely" does not mean "true."

Limitation 3: No Access to Your Data​

Even if an LLM had perfect knowledge of the entire public internet, it still would not know your internal HR policies, your proprietary product catalog, your customer contracts, your Confluence wiki, or your internal runbooks. Enterprise data is private by definition.

The RAG Solution​

Retrieval-Augmented Generation (RAG) solves all three problems with one architectural pattern: instead of relying on what the model already knows, you retrieve relevant information from your own data sources and inject it into the prompt at query time. The model then generates a response grounded in your actual data.

```
Without RAG: User Question → LLM (uses training data only) → Response (may hallucinate)
With RAG:    User Question → Retrieve from YOUR data → LLM (uses retrieved context) → Grounded Response
```

RAG does not change the model. It does not retrain it. It simply gives the model a reference library to consult before answering.
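In code, the pattern is nothing more than retrieval plus prompt assembly. A minimal sketch (the `build_grounded_prompt` helper and its prompt wording are illustrative, not a prescribed template):

```python
def build_grounded_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved context."""
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is our VPN policy?",
    ["Remote employees must connect through the corporate VPN at all times."],
)
```

The assembled prompt, not the model's weights, is what carries your data to the LLM.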

RAG vs Fine-Tuning vs Prompt Engineering — Decision Matrix

Before building a RAG pipeline, understand where it fits alongside other techniques.

| Dimension | Prompt Engineering | RAG | Fine-Tuning | Pre-Training |
|---|---|---|---|---|
| What it does | Better instructions to the model | Gives the model your data at query time | Adjusts model weights on your data | Trains a model from scratch |
| Cost | Free | $$ (search infra + embeddings) | $$$ (GPU compute + data prep) | $$$$$$ (massive compute) |
| Time to implement | Minutes | Hours to days | Days to weeks | Months |
| Data freshness | N/A | Real-time (index updates) | Stale (requires retraining) | Stale |
| Data volume needed | 0 examples | A document corpus | 100s-1000s of examples | Billions of tokens |
| Handles private data | No | Yes | Partially (baked into weights) | Partially |
| Provides citations | No | Yes (source documents) | No | No |
| Risk of hallucination | High | Low (when well-built) | Medium | Medium |
| Best analogy | Writing better exam questions | Giving the student a textbook during the exam | Sending the student to a specialized course | Building a student from scratch |
When to Use Each Technique
  • Prompt Engineering first β€” always. It is free and high-leverage.
  • RAG when you need factual grounding in private, dynamic, or recent data.
  • Fine-Tuning when you need the model to adopt a specific tone, format, or domain vocabulary.
  • Combine them β€” production systems typically use all three together.

5.2 The RAG Pipeline End-to-End​

Every RAG system has two distinct pipelines that operate independently.

The Two Pipelines​

Ingestion Pipeline (Offline)​

The ingestion pipeline runs periodically or on-demand to process your source documents and make them searchable. It does not involve any LLM calls.

| Step | What Happens | Azure Service | Latency |
|---|---|---|---|
| 1. Document Loading | Read files from storage | Azure Blob Storage, SharePoint, SQL | Seconds |
| 2. Content Extraction | Extract text, tables, images from documents | Azure AI Document Intelligence | Seconds/doc |
| 3. Cleaning | Remove headers, footers, boilerplate, artifacts | Custom code or AI Document Intelligence | Milliseconds |
| 4. Chunking | Split text into retrieval-sized segments | Custom code or Azure AI Search integrated vectorization | Milliseconds |
| 5. Enrichment | Add metadata: title, date, category, source | Azure AI Search skillsets | Seconds |
| 6. Embedding | Convert each chunk to a vector | Azure OpenAI text-embedding-3-large | ~100ms/chunk |
| 7. Indexing | Store vectors + metadata in search index | Azure AI Search, Cosmos DB | Milliseconds |

Query Pipeline (Online)​

The query pipeline runs in real time for every user question. Latency matters here.

| Step | What Happens | Typical Latency |
|---|---|---|
| 1. Query Processing | Rewrite, expand, decompose the user's question | 200-500ms (if using LLM) |
| 2. Query Embedding | Convert question to a vector | 50-100ms |
| 3. Retrieval | Search the index (vector + keyword + filters) | 50-200ms |
| 4. Reranking | Reorder results by semantic relevance | 100-300ms |
| 5. Context Assembly | Build the prompt with retrieved chunks | <10ms |
| 6. LLM Generation | Generate the answer using the model | 500-3000ms |
| Total | End-to-end latency | 1-4 seconds |
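The steps above compose into a single function. A minimal sketch, with `embed`, `search`, `rerank`, and `generate` passed in as stand-ins for your embedding model, index, reranker, and LLM (query processing, step 1, is omitted for brevity):

```python
def answer(question, embed, search, rerank, generate, top_k=5):
    """Online query pipeline: embed -> retrieve -> rerank -> assemble -> generate."""
    query_vector = embed(question)                 # step 2: query embedding
    candidates = search(query_vector, k=50)        # step 3: broad retrieval
    best = rerank(question, candidates)[:top_k]    # step 4: reorder, keep top-k
    context = "\n\n".join(best)                    # step 5: context assembly
    return generate(question, context)             # step 6: grounded generation
```

Keeping each stage behind a plain function boundary makes it easy to swap retrievers or rerankers without touching the rest of the pipeline.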

5.3 Document Ingestion and Processing​

The quality of your RAG system is capped by the quality of your ingestion pipeline. If the extracted text is garbled, no amount of sophisticated retrieval will save it.

Supported Document Types​

| Format | Extraction Method | Complexity | Notes |
|---|---|---|---|
| Markdown | Direct parsing | Low | Best format for RAG — structure is explicit |
| Plain Text | Direct reading | Low | No structure to leverage |
| HTML | HTML parser + cleanup | Medium | Remove nav, scripts, ads |
| PDF (digital) | Text extraction | Medium | Preserves some structure |
| PDF (scanned) | OCR required | High | Quality depends on scan quality |
| Word (.docx) | XML parsing | Medium | Preserves headings, tables |
| Excel (.xlsx) | Cell-level extraction | Medium | Convert rows to text or keep tabular |
| PowerPoint (.pptx) | Slide-by-slide extraction | Medium | Speaker notes often contain key info |
| CSV / TSV | Row-level parsing | Low | Each row can be a chunk |
| Database tables | SQL queries | Medium | Denormalize joins into text |
| Images | Vision models / OCR | High | Diagrams, whiteboard photos |

Azure AI Document Intelligence​

Azure AI Document Intelligence (formerly Form Recognizer) is the recommended service for structured extraction from complex documents. It handles:

  • Layout analysis β€” detects headings, paragraphs, tables, figures
  • Table extraction β€” returns structured table data preserving rows and columns
  • Key-value pair extraction β€” for forms and invoices
  • OCR β€” for scanned documents and handwriting
  • Custom models β€” train on your specific document layouts
```python
import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="https://my-doc-intel.cognitiveservices.azure.com/",
    credential=AzureKeyCredential(os.getenv("DOC_INTEL_KEY"))
)

# Analyze a PDF with the layout model
with open("annual_report.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        model_id="prebuilt-layout",
        body=f,
        content_type="application/pdf"
    )

result = poller.result()

# Extract text by page with structure
for page in result.pages:
    print(f"--- Page {page.page_number} ---")
    for line in page.lines:
        print(line.content)

# Extract tables separately (critical for RAG quality)
for table in result.tables:
    print(f"Table with {table.row_count} rows, {table.column_count} columns")
    for cell in table.cells:
        print(f"  Row {cell.row_index}, Col {cell.column_index}: {cell.content}")
```

Metadata Extraction and Enrichment​

Raw text is not enough. Metadata enables filtered search, which dramatically improves retrieval relevance. Always extract and store:

| Metadata Field | Source | Purpose |
|---|---|---|
| title | Document title or filename | Display in citations |
| source_url | Original location (SharePoint, Blob) | Link back to source |
| created_date | File metadata | Freshness filtering |
| modified_date | File metadata | Freshness filtering |
| author | File metadata | Attribution |
| department | Folder structure or tag | Scope filtering |
| document_type | File extension or classification | Type filtering |
| language | Language detection | Multi-language support |
| chunk_index | Assigned during chunking | Ordering adjacent chunks |
| parent_document_id | Assigned during ingestion | Linking chunks to source |
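A chunk record can carry these fields as a simple structure; a sketch using a dataclass (field names mirror the table above, and the sample values are made up):

```python
from dataclasses import dataclass, asdict

@dataclass
class ChunkMetadata:
    title: str
    source_url: str
    created_date: str
    modified_date: str
    author: str
    department: str
    document_type: str
    language: str
    chunk_index: int
    parent_document_id: str

meta = ChunkMetadata(
    title="Remote Work Policy",
    source_url="https://contoso.sharepoint.com/policies/remote-work.docx",
    created_date="2025-11-02",
    modified_date="2026-01-15",
    author="HR Operations",
    department="hr",
    document_type="docx",
    language="en",
    chunk_index=3,
    parent_document_id="doc-0042",
)
record = asdict(meta)  # dict ready to store alongside the chunk text and vector
```

Storing this record next to each vector is what makes filtered queries ("only `department eq 'hr'`, modified this year") possible at retrieval time.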

5.4 Chunking Strategies — Critical for Relevance

Chunking is the single most impactful design decision in your RAG pipeline. Bad chunking means bad retrieval, which means bad answers. There is no model smart enough to overcome poorly chunked data.

The fundamental tension: chunks must be small enough to be relevant to specific queries, but large enough to contain sufficient context to be useful.

Strategy 1: Fixed-Size Chunking​

Split text into chunks of a fixed number of characters or tokens, with optional overlap.

```python
import tiktoken

# Any tokenizer with encode/decode works; tiktoken's cl100k_base is one common choice.
tokenizer = tiktoken.get_encoding("cl100k_base")

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50):
    """Split text into fixed-size token chunks with overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(tokenizer.decode(chunk_tokens))
        start += chunk_size - overlap  # overlap tokens carried forward
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Split every N tokens regardless of content |
| Overlap | Typically 10-20% (e.g., 50-100 tokens for 512-token chunks) |
| Pros | Simple, predictable chunk sizes, easy to estimate storage |
| Cons | Breaks sentences, paragraphs, and semantic units mid-thought |
| Best for | Homogeneous text without strong structure (e.g., chat logs) |

Strategy 2: Sentence-Based Chunking​

Split text at sentence boundaries, grouping sentences until the chunk reaches a target size.

```python
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer model

def sentence_chunk(text: str, max_tokens: int = 512):
    """Group sentences into chunks up to max_tokens (tokenizer as defined earlier)."""
    sentences = nltk.sent_tokenize(text)
    chunks, current_chunk = [], []
    current_size = 0

    for sentence in sentences:
        sentence_tokens = len(tokenizer.encode(sentence))
        if current_size + sentence_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk, current_size = [], 0
        current_chunk.append(sentence)
        current_size += sentence_tokens

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Accumulate sentences until target size is reached |
| Pros | Never breaks a sentence in half, preserves basic meaning |
| Cons | Variable chunk sizes, may still split related paragraphs |
| Best for | Narrative text, articles, documentation |

Strategy 3: Paragraph-Based Chunking​

Split at paragraph boundaries (double newlines). Each paragraph becomes a chunk, or paragraphs are grouped to reach a target size.

| Aspect | Detail |
|---|---|
| How it works | Use paragraph breaks as natural split points |
| Pros | Preserves author's intended logical groupings |
| Cons | Paragraphs vary wildly in size (some are 1 sentence, some are 500 words) |
| Best for | Well-structured documents with consistent paragraph sizes |
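A paragraph chunker is only a few lines; a sketch that splits on blank lines and greedily groups paragraphs toward a target size (character-based here for simplicity; swap in a token count if you work with token budgets):

```python
def paragraph_chunk(text: str, max_chars: int = 2000):
    """Split on blank lines, then greedily group paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

chunks = paragraph_chunk("First para.\n\nSecond para.\n\nThird.", max_chars=30)
```

The grouping step mitigates the "one-sentence paragraph" problem noted in the table: tiny paragraphs get merged with their neighbors instead of becoming fragment chunks.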

Strategy 4: Semantic Chunking​

Use embedding similarity to detect where the topic shifts, then split at topic boundaries.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunk(sentences: list, embeddings: np.ndarray, threshold: float = 0.75):
    """Split at points where semantic similarity drops below threshold."""
    chunks, current_chunk = [], [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(
            embeddings[i-1].reshape(1, -1),
            embeddings[i].reshape(1, -1)
        )[0][0]

        if similarity < threshold:
            # Topic shift detected — start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Embed each sentence, detect topic boundaries via similarity drops |
| Pros | Produces the most semantically coherent chunks |
| Cons | Expensive (every sentence must be embedded), variable sizes, harder to debug |
| Best for | High-value corpora where retrieval quality justifies the cost |

Strategy 5: Recursive / Hierarchical Chunking​

Split using a hierarchy of separators: first by headers, then by paragraphs, then by sentences, then by characters. Only descend to the next level if the chunk exceeds the target size.

```python
# LangChain's RecursiveCharacterTextSplitter uses this approach
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=[
        "\n## ",   # H2 headings first
        "\n### ",  # H3 headings
        "\n\n",    # Paragraphs
        "\n",      # Lines
        ". ",      # Sentences
        " ",       # Words (last resort)
    ]
)

chunks = splitter.split_text(document_text)
```
| Aspect | Detail |
|---|---|
| How it works | Try the most structural separator first, fall back to smaller units |
| Pros | Respects document hierarchy, good balance of structure and size control |
| Cons | Requires well-formatted source documents |
| Best for | Markdown, HTML, structured documentation (this is the default recommendation) |

Strategy 6: Document-Aware Chunking​

Specialized chunking that understands the document format.

| Sub-Strategy | How It Works | Best For |
|---|---|---|
| Markdown-aware | Split at #, ##, ### headers keeping each section intact | Technical docs, wikis |
| HTML-aware | Split at `<section>`, `<article>`, `<h1>`-`<h6>` tags | Web content |
| Table-aware | Keep entire tables as single chunks (never split a table row) | Financial reports, data sheets |
| Code-aware | Keep entire functions/classes as single chunks | Code repositories |
| Slide-aware | Keep each slide as a chunk with speaker notes | Presentations |
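The Markdown-aware case is straightforward with a regex split; a sketch that keeps each #/##/### section intact (section grouping and maximum-size handling are omitted):

```python
import re

def markdown_section_chunks(md_text: str):
    """Split a Markdown document at H1-H3 headings, keeping each section whole."""
    # Zero-width split before any line starting with 1-3 '#' characters,
    # so the heading stays attached to its section body.
    parts = re.split(r"(?m)^(?=#{1,3} )", md_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nWelcome.\n\n## Setup\nInstall the CLI.\n\n## Usage\nRun it."
sections = markdown_section_chunks(doc)
```

In practice you would combine this with a size check: sections larger than your chunk budget fall through to a paragraph or sentence splitter, which is exactly the recursive strategy above.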

Chunking Strategy Comparison​

| Strategy | Chunk Size Control | Semantic Quality | Implementation Cost | Compute Cost | Best For |
|---|---|---|---|---|---|
| Fixed-Size | Exact | Low | Very Low | Very Low | Uniform text, quick prototypes |
| Sentence-Based | Approximate | Medium | Low | Low | Articles, narratives |
| Paragraph-Based | Variable | Medium-High | Low | Low | Well-structured docs |
| Semantic | Variable | Highest | High | High (embedding calls) | High-value corpora |
| Recursive | Approximate | High | Medium | Low | Markdown, HTML, general docs |
| Document-Aware | Variable | High | Medium-High | Low | Format-specific content |
The Most Common Chunking Mistake

Using fixed-size chunking with no overlap on structured documents. A 512-token boundary that falls in the middle of a table or code block produces two useless chunks. Always use document-aware or recursive chunking for structured content.


5.5 Chunk Size — The Goldilocks Problem

Chunk size is the second most impactful parameter after chunking strategy. The optimal size depends on your content type, embedding model, and retrieval patterns.

The Tradeoff​

| Too Small (< 128 tokens) | Just Right (256-1024 tokens) | Too Large (> 2048 tokens) |
|---|---|---|
| Loses context | Contains enough context | Dilutes relevance |
| Many chunks needed | Manageable number of chunks | Wastes LLM tokens |
| Retrieval returns fragments | Each chunk is self-contained | Hard to rank accurately |
| High embedding costs | Balanced costs | Vector similarity degrades |
| Precise but incomplete | The sweet spot | Complete but noisy |
| Content Type | Recommended Size (tokens) | Overlap | Rationale |
|---|---|---|---|
| Technical documentation | 512-1024 | 10-15% | Sections are self-contained, need full context |
| FAQ / Q&A pairs | 128-256 | 0% | Each Q&A is a natural chunk |
| Legal contracts | 512-768 | 20% | Clauses reference each other, need higher overlap |
| Knowledge base articles | 512-1024 | 10% | Articles have clear topic boundaries |
| Chat transcripts | 256-512 | 15% | Conversations shift topics frequently |
| Source code | Function-level | 0% | Each function is a natural unit |
| Research papers | 768-1024 | 15% | Dense content needs larger context windows |
| Product catalogs | 256-512 | 0% | Each product is independent |
| Email threads | Per-email | 0% | Each email is a natural unit |
| Financial reports | 512-768 (tables intact) | 10% | Keep tables as single chunks |

How to Choose: Empirical Testing​

There is no universally correct chunk size. The right approach is to test multiple sizes against your actual queries and measure retrieval quality.

```python
# Test multiple chunk sizes and measure Hit Rate @ K.
# recursive_chunk, build_index, and load_evaluation_queries are placeholders
# for your own chunker, indexer, and labeled evaluation set.
chunk_sizes = [256, 512, 768, 1024]
test_queries = load_evaluation_queries()  # queries with known relevant docs

for size in chunk_sizes:
    chunks = recursive_chunk(corpus, chunk_size=size, overlap=int(size * 0.1))
    index = build_index(chunks)

    hit_rate = 0
    for query, expected_doc_id in test_queries:
        results = index.search(query, top_k=5)
        if expected_doc_id in [r.doc_id for r in results]:
            hit_rate += 1

    print(f"Chunk Size: {size} | Hit Rate@5: {hit_rate / len(test_queries):.2%}")
```

5.6 Embeddings — Turning Text into Vectors

Embeddings are the mathematical bridge between human language and machine-searchable vectors. Every chunk of text in your index and every user query is converted into a dense vector of floating-point numbers before retrieval can happen.

What Are Embeddings?​

An embedding is a list of numbers (a vector) that represents the semantic meaning of a piece of text. Texts with similar meanings produce vectors that are close together in high-dimensional space.

```
"How do I reset my password?"   → [0.023, -0.041, 0.118, ..., 0.067]  (1536 dimensions)
"I forgot my login credentials" → [0.025, -0.039, 0.121, ..., 0.064]  (very similar vector)
"Azure VM pricing in East US"   → [0.891, 0.234, -0.567, ..., 0.445]  (very different vector)
```

Azure OpenAI Embedding Models​

| Model | Dimensions | Max Input Tokens | Performance | Cost (per 1M tokens) | Recommendation |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 (or 256-3072) | 8,191 | Highest quality | ~$0.13 | Production workloads requiring best accuracy |
| text-embedding-3-small | 1536 (or 256-1536) | 8,191 | Very good | ~$0.02 | Cost-effective production use |
| text-embedding-ada-002 | 1536 (fixed) | 8,191 | Good (legacy) | ~$0.10 | Legacy — migrate to v3 models |

Dimension Selection​

The text-embedding-3-* models support Matryoshka embeddings — you can truncate the vector to fewer dimensions and still retain meaningful similarity. This is a powerful cost/quality tradeoff.

| Dimensions | Storage per Vector | Quality | Use Case |
|---|---|---|---|
| 256 | 1 KB | Good enough for coarse similarity | Prototyping, low-cost deployments |
| 512 | 2 KB | Strong for most use cases | Balanced production deployments |
| 1536 | 6 KB | Very high | Standard production deployments |
| 3072 | 12 KB | Highest | Precision-critical applications |

Distance Metrics​

When comparing vectors, the search engine uses a distance metric to calculate similarity.

| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | cos(A, B) = (A · B) / (‖A‖ ‖B‖) | -1 to 1 (1 = identical) | Most text search (normalized vectors) |
| Dot Product | A · B | -inf to +inf | When vectors are already normalized |
| Euclidean (L2) | sqrt(sum((A-B)^2)) | 0 to +inf (0 = identical) | When magnitude matters |

For Azure AI Search and most text retrieval scenarios, cosine similarity is the standard choice.
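Cosine similarity itself is a one-liner; a pure-Python sketch of the formula from the table above:

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A · B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Note that when both vectors are already unit-length (as embedding APIs typically return), the denominator is 1 and cosine similarity equals the dot product, which is why the two metrics are interchangeable for normalized vectors.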

Generating Embeddings with Azure OpenAI​

```python
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-aoai.openai.azure.com/",
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2025-12-01-preview"
)

def get_embedding(text: str, model: str = "text-embedding-3-large", dimensions: int = 1536):
    """Generate an embedding vector for a text string."""
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions  # Matryoshka: choose your dimension
    )
    return response.data[0].embedding  # list of floats

# Embed a document chunk
chunk = "Azure Private Link enables you to access Azure PaaS services..."
vector = get_embedding(chunk)
print(f"Vector dimensions: {len(vector)}")  # 1536

# Embed a user query (same model, same dimensions!)
query_vector = get_embedding("How does Private Link work?")
```
Critical Rule

You must use the same embedding model and the same dimensions for both your document chunks and your user queries. Mixing models (e.g., indexing with ada-002 but querying with text-embedding-3-large) produces meaningless similarity scores because the vector spaces are incompatible.


5.7 Vector Databases and Indexes​

Once you have vectors, you need a place to store them and search them efficiently. This is the role of the vector database or vector-capable search engine.

What Is a Vector Database?​

A vector database stores high-dimensional vectors alongside their metadata and provides fast approximate nearest neighbor (ANN) search. Given a query vector, it returns the K most similar vectors (and their associated text chunks) in milliseconds.
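What ANN indexes approximate is easy to state exactly; a flat (brute-force) nearest-neighbor sketch using squared Euclidean distance, which indexes such as HNSW and IVF reproduce approximately in sub-linear time:

```python
def exact_nearest(query, vectors, k=3):
    """Flat search: score the query against every stored vector, keep the k closest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = sorted((sq_dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

index = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(exact_nearest([1.0, 0.0], index, k=2))  # -> [0, 2]
```

This exact scan is O(n) per query, which is why it stops being viable past tens of thousands of vectors and ANN indexes take over.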

Azure-Native Options​

| Service | Type | Vector Dims | Hybrid Search | Semantic Ranker | Integrated Embedding | Pricing Model |
|---|---|---|---|---|---|---|
| Azure AI Search | Search-as-a-service | Up to 3072 | Yes (keyword + vector) | Yes (transformer-based) | Yes (built-in vectorization) | Per search unit (SU) |
| Azure Cosmos DB (NoSQL) | Multi-model database | Up to 4096 | Yes (with full-text index) | No | No | Per RU + storage |
| Azure Cosmos DB (PostgreSQL / vCore) | PostgreSQL with pgvector | Up to 2000 | Yes | No | No | Per vCore + storage |
| Azure SQL Database | Relational database | Via extensions | Partial | No | No | Per DTU/vCore |

Dedicated Vector Databases​

| Database | Managed Service | Max Dims | Unique Strength | Azure Integration |
|---|---|---|---|---|
| Pinecone | Fully managed | 20,000 | Purpose-built, serverless option | Via API |
| Weaviate | Cloud or self-hosted | Unlimited | Multi-modal (text, images), GraphQL API | AKS or VM |
| Qdrant | Cloud or self-hosted | 65,536 | Sparse + dense vectors, filtering | AKS or VM |
| Milvus | Cloud (Zilliz) or self-hosted | 32,768 | GPU-accelerated search, massive scale | AKS or VM |
| ChromaDB | Embedded or self-hosted | Unlimited | Simple API, great for prototyping | Embedded in app |
| pgvector (PostgreSQL) | Azure Database for PostgreSQL | 2,000 | Familiar SQL interface, no new infra | Azure PostgreSQL |

Index Types​

The index algorithm determines how vectors are organized for fast search.

| Index Type | How It Works | Search Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| Flat (brute force) | Compare query against every vector | Slowest | 100% exact | Low | Small datasets (< 50K vectors) |
| IVF (Inverted File) | Cluster vectors, search only nearby clusters | Fast | ~95-99% | Medium | Medium datasets |
| HNSW (Hierarchical Navigable Small World) | Multi-layer graph of connections | Fastest | ~95-99% | High (in-memory) | Production workloads (Azure AI Search default) |

Azure AI Search uses HNSW by default, which provides the best balance of speed and accuracy for production workloads. You can configure the HNSW parameters:

```json
{
  "name": "my-vector-index",
  "fields": [
    {
      "name": "content_vector",
      "type": "Collection(Edm.Single)",
      "dimensions": 1536,
      "vectorSearchProfile": "my-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "my-hnsw",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "my-profile",
        "algorithmConfigurationName": "my-hnsw"
      }
    ]
  }
}
```

5.8 Azure AI Search — The Swiss Army Knife

Azure AI Search is the recommended vector store for most Azure RAG architectures. It is not just a vector database — it is a full-featured search platform that combines keyword search, vector search, semantic ranking, and data enrichment in a single managed service.

Core Capabilities​

Integrated Vectorization (Built-In Chunking and Embedding)​

Azure AI Search can handle the entire ingestion pipeline without custom code. You define a data source, a skillset, and an indexer — the service handles the rest.

```
Blob Storage → Indexer → [Document Cracking → Chunking → Embedding → Enrichment] → Index
                          ↑ built into Azure AI Search — no custom code needed ↑
```

Configuration for integrated vectorization:

```json
{
  "name": "my-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "chunk-text",
      "description": "Split documents into chunks",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "context": "/document"
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "generate-embeddings",
      "description": "Generate embeddings for each chunk",
      "resourceUri": "https://my-aoai.openai.azure.com/",
      "deploymentId": "text-embedding-3-large",
      "modelName": "text-embedding-3-large",
      "context": "/document/pages/*"
    }
  ]
}
```

Hybrid Search β€” Why It Matters​

Neither keyword search nor vector search alone is sufficient. Each has blind spots.

| Scenario | Keyword Search | Vector Search | Hybrid |
|---|---|---|---|
| User searches exact error code: ERR_CERT_AUTHORITY_INVALID | Finds it exactly | May miss (error codes do not embed well) | Finds it |
| User asks "how to make my app faster" | Misses docs about "performance optimization" | Finds semantic match | Finds it |
| User searches "Azure PCI DSS compliance" | Finds exact term matches | Finds related compliance docs too | Finds all |
| User searches with a typo: "Kuberntes scaling" | Misses (no fuzzy match) | Finds it (embedding is robust to typos) | Finds it |

Hybrid search in Azure AI Search combines BM25 (keyword) and vector scores using Reciprocal Rank Fusion (RRF):

RRF_score(doc) = sum( 1 / (k + rank_in_list) ) for each result list containing the doc

Where k is a constant (default 60). This elegantly merges ranked lists without needing to normalize scores across different algorithms.
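RRF is only a few lines of Python; a sketch of the fusion formula above, merging a keyword result list with a vector result list (document IDs are made up):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: score = sum of 1/(k + rank) over every list a doc appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # BM25 ranking
vector_hits = ["doc1", "doc4", "doc3"]   # vector-similarity ranking
merged = rrf_merge([keyword_hits, vector_hits])
```

Here doc1 wins because it sits near the top of both lists, even though neither list ranked it first. Rewarding cross-list agreement without comparing raw scores is exactly the property that makes RRF a good fit for hybrid search.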

Semantic Ranker​

After hybrid retrieval returns the top candidates, the Semantic Ranker applies a transformer-based cross-encoder model to reorder results by deep semantic relevance.

Step 1: Hybrid search returns top 50 candidates
Step 2: Semantic Ranker reranks all 50 using a cross-encoder
Step 3: Return top 5 to the LLM as context

The Semantic Ranker also provides:

  • Semantic captions β€” the most relevant sentence or passage within each result
  • Semantic answers β€” a direct extractive answer if one exists in the content

Pricing Tiers​

| Tier | Replicas | Partitions | Vector Index Size | Semantic Ranker | Price (approx/month) |
|---|---|---|---|---|---|
| Free | 1 | 1 | 8 MB | No | $0 |
| Basic | 3 | 1 | 2 GB | No | ~$75 |
| Standard S1 | 12 | 12 | 25 GB per partition | Yes | ~$250/SU |
| Standard S2 | 12 | 12 | 100 GB per partition | Yes | ~$1,000/SU |
| Standard S3 | 12 | 12 | 200 GB per partition | Yes | ~$2,000/SU |
| S3 HD | 12 | 3 | 200 GB per partition | Yes | ~$2,000/SU |
Architect's Guidance

For most production RAG workloads serving up to a few million documents, Standard S1 with 2 replicas provides sufficient capacity with high availability. Start there, monitor query latency, and scale up only when needed.


5.9 Retrieval Strategies​

Retrieval is the step that determines whether the right information reaches the LLM. A brilliant model given the wrong context will produce a confidently wrong answer.

Strategy 1: Keyword Search (BM25 / TF-IDF)​

The classical information retrieval approach. Matches exact terms in the query against exact terms in the documents, weighted by term frequency and inverse document frequency.

```python
# Azure AI Search — keyword search
results = search_client.search(
    search_text="Azure Private Link DNS configuration",
    query_type="simple",
    top=10
)
```
| Strength | Weakness |
|---|---|
| Excellent for exact terms, product names, error codes | Misses synonyms ("VM" vs "virtual machine") |
| Fast and well-understood | Misses semantic similarity ("improve speed" vs "performance optimization") |
| No embedding required | Sensitive to typos and vocabulary mismatch |
| Works with any language out of the box | Cannot understand intent or meaning |

Strategy 2: Vector Search

Converts both the query and documents into embedding vectors and finds the nearest neighbors in vector space.

```python
from azure.search.documents.models import VectorizableTextQuery

# Azure AI Search — vector search
results = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizableTextQuery(
            text="How do I configure DNS for private endpoints?",
            k_nearest_neighbors=10,
            fields="content_vector"
        )
    ]
)
```
| Strength | Weakness |
|---|---|
| Understands semantic meaning and intent | Can miss exact keyword matches |
| Robust to synonyms, paraphrases, typos | Embedding quality depends on model |
| Works across languages (with multilingual models) | Does not work well for codes, IDs, proper nouns |
| Captures conceptual similarity | Computationally more expensive than BM25 |

Strategy 3: Hybrid Search (Keyword + Vector)​

Combines both approaches using Reciprocal Rank Fusion. This is the recommended default for production RAG systems.

```python
from azure.search.documents.models import VectorizableTextQuery

# Azure AI Search — hybrid search (keyword + vector)
results = search_client.search(
    search_text="Azure Private Link DNS configuration",  # keyword component
    vector_queries=[
        VectorizableTextQuery(
            text="Azure Private Link DNS configuration",  # vector component
            k_nearest_neighbors=10,
            fields="content_vector"
        )
    ],
    query_type="semantic",  # enable semantic ranker on top
    semantic_configuration_name="my-semantic-config",
    top=5
)
```

Strategy 4: Semantic Ranking (Cross-Encoder Reranking)​

After initial retrieval (keyword, vector, or hybrid), a cross-encoder model evaluates each query-document pair jointly to produce a more accurate relevance score.

| Strength | Weakness |
|---|---|
| Highest accuracy of any single retrieval method | Cannot be used alone (needs initial retrieval first) |
| Understands nuanced query-document relationships | Adds 100-300ms latency |
| Dramatically improves top-K precision | Additional cost (requires Standard tier or above) |

Retrieval Strategy Comparison​

| Strategy | Semantic Understanding | Exact Match | Speed | Accuracy | Cost | Production Readiness |
|---|---|---|---|---|---|---|
| Keyword (BM25) | None | Excellent | Very fast | Medium | Low | High |
| Vector | Strong | Poor | Fast | High | Medium | High |
| Hybrid | Strong | Excellent | Fast | Very High | Medium | Very High |
| Hybrid + Semantic Ranker | Excellent | Excellent | Moderate (+300ms) | Highest | Medium-High | Highest |
The Default Recipe

For production RAG on Azure, use Hybrid Search (keyword + vector) with Semantic Ranker. This gives you the best of all worlds. There is rarely a reason to use keyword-only or vector-only search in production.


5.10 Reranking — The Relevance Multiplier

Reranking is a post-retrieval step that dramatically improves the quality of the context sent to the LLM. Initial retrieval (whether keyword or vector) is fast but approximate. Reranking is slower but far more accurate.

How Reranking Works​

Initial retrieval casts a wide net, returning dozens of candidates in rough relevance order. A reranking model then rescores each candidate against the query, and only the top few survive into the LLM's context window.

Why Reranking Matters​

Initial retrieval uses bi-encoders — the query and documents are encoded independently and then compared. This is fast but loses nuance because the query and document never "see" each other during encoding.

Reranking uses cross-encoders — the query and document are concatenated and processed together through a transformer. This produces far more accurate relevance scores because the model can attend to the specific relationship between the query and each document.

| Approach | How It Works | Speed | Accuracy |
|---|---|---|---|
| Bi-encoder (retrieval) | Encode query and doc separately, compare vectors | Fast (can search millions in ms) | Good but approximate |
| Cross-encoder (reranking) | Encode query+doc together, single relevance score | Slow (50-100ms per pair) | Excellent, nuanced |
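The orchestration pattern is independent of which cross-encoder you choose; a sketch with the scoring model passed in as `score_fn` (a stand-in for Azure AI Search's semantic ranker, Cohere Rerank, or a self-hosted cross-encoder):

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, doc) pair jointly, then keep only the best few for the LLM."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_k]

# Toy score_fn: word overlap. A real cross-encoder returns a learned relevance score.
overlap = lambda q, d: len(set(q.lower().split()) & set(d.lower().split()))
docs = ["billing FAQ", "configure DNS for private endpoints", "VM pricing table"]
top = rerank("private endpoint DNS configuration", docs, overlap, top_k=1)
```

Because every candidate gets one scoring call, the candidate set (often 50) bounds the added latency, which is why reranking always sits behind a fast first-stage retriever.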

Reranking Options​

| Reranker | Type | Integration | Notes |
|---|---|---|---|
| Azure AI Search Semantic Ranker | Managed cross-encoder | Built into Azure AI Search | Easiest option for Azure-native solutions |
| Cohere Rerank | API-based | REST API call | High quality, multilingual support |
| Jina Reranker | API or self-hosted | REST API or container | Good price/performance ratio |
| BGE Reranker | Open-source model | Self-hosted on AKS | Full control, no API costs |
| Custom cross-encoder | Fine-tuned model | Self-hosted | Highest accuracy for your domain, highest effort |

Impact of Reranking​

Studies and production deployments consistently show:

| Metric | Without Reranking | With Reranking | Improvement |
|---|---|---|---|
| Answer Correctness | ~65-75% | ~80-90% | +10-25% |
| Precision@5 | ~60-70% | ~75-85% | +10-20% |
| User Satisfaction | Moderate | High | Significant |
| Hallucination Rate | ~15-25% | ~5-10% | -10-15% |
Architect's Rule of Thumb

Always use reranking. The 100-300ms of additional latency is almost always worth the 10-25% improvement in answer quality. The only exception is extreme latency-sensitive applications where every millisecond counts.


5.11 Query Processing and Enhancement​

The user's raw question is often a poor search query. Users write incomplete sentences, use ambiguous terms, and ask multi-part questions. Query processing transforms the user's input into one or more optimized search queries.

Technique 1: Query Rewriting​

Use an LLM to rephrase the user's question into a better search query.

```python
def rewrite_query(user_question: str) -> str:
    """Use an LLM to rewrite the user's question for better retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the following user question as a clear, specific "
                           "search query optimized for document retrieval. Output only "
                           "the rewritten query, nothing else."
            },
            {"role": "user", "content": user_question}
        ],
        temperature=0.0,
        max_tokens=200
    )
    return response.choices[0].message.content
```

Example:

  • User: "why is my app slow?" --> Rewritten: "application performance troubleshooting latency optimization"
  • User: "that thing from last meeting about the database" --> Rewritten: "database discussion recent meeting notes decisions"

Technique 2: Query Expansion​

Add synonyms, related terms, or acronym expansions to the query to increase recall.

```python
def expand_query(user_question: str) -> str:
    """Expand the query with synonyms and related terms."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Given this search query, add synonyms and related terms "
                           "to broaden the search. Include the original query. "
                           "Output a single expanded search string."
            },
            {"role": "user", "content": user_question}
        ],
        temperature=0.0,
        max_tokens=300
    )
    return response.choices[0].message.content
```

Technique 3: Hypothetical Document Embedding (HyDE)​

Instead of embedding the question directly, ask the LLM to generate a hypothetical answer, then embed that hypothetical answer and use it as the search query. The intuition: a hypothetical answer is more similar to real answers in your index than the question is.

```python
def hyde_search(user_question: str):
    """HyDE: Generate a hypothetical answer and search with its embedding."""
    # Step 1: Generate a hypothetical answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Write a short, factual paragraph that answers the "
                           "following question. Even if you are not sure, write a "
                           "plausible answer. Do not say 'I don't know.'"
            },
            {"role": "user", "content": user_question}
        ],
        temperature=0.3,
        max_tokens=300
    )
    hypothetical_answer = response.choices[0].message.content

    # Step 2: Embed the hypothetical answer (not the question)
    hyde_vector = get_embedding(hypothetical_answer)

    # Step 3: Search with the hypothetical answer's embedding
    results = search_client.search(
        vector_queries=[
            VectorizedQuery(vector=hyde_vector, k_nearest_neighbors=10, fields="content_vector")
        ]
    )
    return results
```

Technique 4: Multi-Query Retrieval​

Generate multiple variations of the query and search with all of them, then merge the results.

```python
def multi_query_search(user_question: str, num_queries: int = 3):
    """Generate multiple search queries and merge results."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"Generate {num_queries} different search queries that "
                           "would help answer the following question. Each query "
                           "should approach the topic from a different angle. "
                           "Output one query per line."
            },
            {"role": "user", "content": user_question}
        ],
        temperature=0.7,
        max_tokens=500
    )

    # Drop any blank lines the model may emit between queries
    queries = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
    all_results = {}

    # Keep each document's best score across all query variants
    for query in queries:
        results = search_client.search(search_text=query, top=5)
        for result in results:
            doc_id = result["id"]
            if doc_id not in all_results or result["@search.score"] > all_results[doc_id]["@search.score"]:
                all_results[doc_id] = result

    return sorted(all_results.values(), key=lambda x: x["@search.score"], reverse=True)[:10]
```
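Merging by raw score, as above, assumes scores from different query variants are comparable. A common alternative is Reciprocal Rank Fusion (RRF), which merges by rank position instead of score (it is also the fusion method Azure AI Search uses internally for hybrid search). A minimal sketch:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_merge(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list it appears in;
    k=60 is the conventional damping constant.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([
    ["doc1", "doc2", "doc3"],  # results for query variant 1
    ["doc2", "doc4"],          # results for query variant 2
])
```

Documents that appear in multiple result lists (like `doc2` here) are rewarded, which is exactly the behavior you want when the query variants approach the topic from different angles.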

Technique 5: Query Decomposition​

Break complex multi-part questions into simpler sub-queries, retrieve for each, and then synthesize.

Example:

  • Complex: "Compare the networking and pricing differences between AKS and Azure Container Apps"
  • Decomposed into:
    1. "AKS networking features and capabilities"
    2. "Azure Container Apps networking features and capabilities"
    3. "AKS pricing model and costs"
    4. "Azure Container Apps pricing model and costs"
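The decomposition itself comes from an LLM call (similar to the multi-query prompt above); the retrieval side is worth keeping grouped per sub-query so the synthesis prompt can present evidence per aspect of the original question. A minimal sketch with an injected `search_fn` (stubbed here for illustration — in production it would wrap your search index):

```python
from typing import Callable, Dict, List

def decomposed_search(sub_queries: List[str],
                      search_fn: Callable[[str], List[dict]],
                      top_per_query: int = 3) -> Dict[str, List[dict]]:
    """Retrieve separately for each sub-query, keeping results grouped so the
    synthesis step can cite evidence for each part of the question."""
    return {q: search_fn(q)[:top_per_query] for q in sub_queries}

# Usage with the sub-queries from the example above; the lambda is a stub.
sub_queries = [
    "AKS networking features and capabilities",
    "Azure Container Apps networking features and capabilities",
    "AKS pricing model and costs",
    "Azure Container Apps pricing model and costs",
]
grouped = decomposed_search(sub_queries, lambda q: [{"id": q, "content": "..."}])
```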

Query Enhancement Comparison​

| Technique | Latency Added | Recall Improvement | When to Use |
| --- | --- | --- | --- |
| Query Rewriting | 200-500ms | 10-20% | Always (low cost, high impact) |
| Query Expansion | 200-500ms | 5-15% | When vocabulary mismatch is a problem |
| HyDE | 500-1000ms | 15-30% | When questions and documents differ in style |
| Multi-Query | 500-1500ms | 20-40% | Broad or exploratory questions |
| Decomposition | 500-1500ms | 15-30% | Complex multi-part questions |

5.12 Context Assembly and Prompt Construction​

After retrieval and reranking, you have a ranked list of relevant chunks. The next step is assembling them into a prompt that the LLM can use to generate a grounded response.

The Prompt Structure​

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ SYSTEM MESSAGE                                   β”‚
β”‚ - Role definition                                β”‚
β”‚ - Instructions for using context                 β”‚
β”‚ - Output format requirements                     β”‚
β”‚ - Citation format                                β”‚
β”‚ - Guardrails (what NOT to do)                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ RETRIEVED CONTEXT                                β”‚
β”‚ [Source 1: document_title.pdf, page 12]          β”‚
β”‚ <chunk text>                                     β”‚
β”‚                                                  β”‚
β”‚ [Source 2: kb_article_456.md]                    β”‚
β”‚ <chunk text>                                     β”‚
β”‚                                                  β”‚
β”‚ [Source 3: meeting_notes_2025-03.docx]           β”‚
β”‚ <chunk text>                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ USER QUESTION                                    β”‚
β”‚ "How do I configure RBAC for our AKS clusters?"  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

Example System Prompt for RAG​

```python
system_prompt = """You are an internal knowledge assistant for Contoso Corporation.
Your job is to answer employee questions using ONLY the provided context documents.

## Rules
1. ONLY use information from the provided context to answer.
2. If the context does not contain enough information to answer, say:
   "I don't have enough information in my knowledge base to answer this question."
3. NEVER make up information that is not in the context.
4. Always cite your sources using [Source N] notation.
5. If multiple sources provide conflicting information, mention the conflict.

## Output Format
- Start with a direct answer (1-2 sentences).
- Then provide supporting details with citations.
- End with the list of sources cited.

## Citation Format
Use [Source N] inline, then list sources at the end:

Sources:
[Source 1] document_name.pdf β€” page 12
[Source 2] kb_article.md β€” section "Configuration"
"""
```

Building the Context Block​

```python
def build_context(search_results, max_tokens: int = 4000) -> str:
    """Assemble retrieved chunks into a context block with source attribution."""
    context_parts = []
    total_tokens = 0

    for i, result in enumerate(search_results, 1):
        chunk_text = result["content"]
        chunk_tokens = count_tokens(chunk_text)

        # Stop if adding this chunk would exceed the token budget
        if total_tokens + chunk_tokens > max_tokens:
            break

        source_label = f"[Source {i}: {result['title']}]"
        context_parts.append(f"{source_label}\n{chunk_text}\n")
        total_tokens += chunk_tokens

    return "\n---\n".join(context_parts)
```
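The `count_tokens` helper can be implemented with `tiktoken`; a sketch with a crude fallback when the tokenizer is unavailable (the ~4 characters per token ratio is a rough heuristic for English text, not an exact figure):

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken, falling back to a chars/4 estimate."""
    try:
        import tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:  # tiktoken missing or model name unknown
        return max(1, len(text) // 4)
```

Counting with the model's real tokenizer matters at the margins: a budget check based on character counts can overshoot the context window on token-dense text such as code or non-English content.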

Context Window Management​

The total prompt (system + context + question + expected response) must fit within the model's context window. Budget your tokens carefully.

| Component | Token Budget | Notes |
| --- | --- | --- |
| System prompt | 200-500 | Keep it concise but complete |
| Retrieved context | 3,000-6,000 | The bulk of your budget |
| User question | 50-500 | Usually short |
| Expected response | 500-2,000 | Set via `max_tokens` |
| Buffer | 500 | Safety margin |
| **Total** | ~8,000-12,000 | Well within GPT-4o's 128K window |

Relevance Threshold​

Do not include every retrieved result. Set a minimum relevance score and discard chunks below it.

```python
RELEVANCE_THRESHOLD = 0.70  # adjust based on empirical testing

def filter_by_relevance(results, threshold=RELEVANCE_THRESHOLD):
    """Only include results above the relevance threshold."""
    return [r for r in results if r["@search.reranker_score"] >= threshold]
```

A low-relevance chunk inserted into the context is worse than no chunk at all β€” it introduces noise that can lead the LLM to generate incorrect or confused responses.


5.13 Advanced RAG Patterns​

Not all RAG systems are the same. The field has evolved from simple retrieve-and-generate to sophisticated multi-stage architectures.

Pattern 1: Naive RAG​

The simplest implementation. Embed the query, retrieve top-K chunks, concatenate them into the prompt, generate.

| Aspect | Detail |
| --- | --- |
| When to use | Prototyping, simple Q&A over small document sets |
| Strengths | Simple, fast to implement, easy to debug |
| Weaknesses | No query enhancement, no reranking, poor on complex queries |
| Typical accuracy | 50-65% answer correctness |

Pattern 2: Advanced RAG​

Adds query processing, hybrid search, reranking, and metadata filtering to the pipeline.

| Aspect | Detail |
| --- | --- |
| When to use | Production enterprise systems |
| Strengths | High accuracy, citations, handles diverse query types |
| Weaknesses | More complex, higher latency (2-4 seconds) |
| Typical accuracy | 75-90% answer correctness |

Pattern 3: Modular RAG​

A plug-and-play architecture where each component (retriever, reranker, generator, query processor) is an independent module that can be swapped or upgraded independently.

| Aspect | Detail |
| --- | --- |
| When to use | Teams that need to experiment with different components |
| Strengths | Flexible, testable, each module can be independently evaluated and improved |
| Weaknesses | Requires interface contracts between modules, more engineering overhead |

Pattern 4: Agentic RAG​

An AI agent decides when to retrieve, what to retrieve, and whether the retrieved results are sufficient. The agent can iterate β€” perform multiple searches, refine queries, and decide when it has enough context to answer.

| Aspect | Detail |
| --- | --- |
| When to use | Complex questions requiring multi-step reasoning, multi-source queries |
| Strengths | Can handle open-ended questions, decides its own retrieval strategy |
| Weaknesses | Non-deterministic, harder to evaluate, higher latency (5-15 seconds) |
| Typical accuracy | 80-95% on complex queries (but variable) |
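The core agentic loop can be sketched independently of any framework. `retrieve`, `is_sufficient`, and `refine` would be backed by your search index and an LLM in production; here they are injected as plain callables so the control flow is visible (the stubs in the usage are for illustration only):

```python
from typing import Callable, List

def agentic_retrieve(question: str,
                     retrieve: Callable[[str], List[str]],
                     is_sufficient: Callable[[str, List[str]], bool],
                     refine: Callable[[str, List[str]], str],
                     max_rounds: int = 3) -> List[str]:
    """Search, judge sufficiency, refine the query, repeat -- up to max_rounds."""
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        if is_sufficient(question, context):
            break
        query = refine(question, context)  # e.g. an LLM rewrites the query
    return context
```

The `max_rounds` cap is what keeps the non-determinism bounded: without it, a judge that never declares the context sufficient would loop (and bill) indefinitely.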

Pattern 5: Graph RAG​

Combines a knowledge graph with vector retrieval. Documents are processed to extract entities and relationships, which form a graph. Retrieval uses both vector similarity and graph traversal.

| Aspect | Detail |
| --- | --- |
| When to use | Datasets with rich entity relationships (org charts, product catalogs, compliance frameworks) |
| Strengths | Captures relationships that flat vector search misses, enables multi-hop reasoning |
| Weaknesses | Expensive to build, requires entity extraction, graph maintenance |
| Implementation | Microsoft GraphRAG library, Neo4j + vector search, Azure Cosmos DB Gremlin |

Pattern 6: Multi-Modal RAG​

Extends retrieval to include images, tables, charts, and diagrams β€” not just text.

| Modality | How It Works | Azure Service |
| --- | --- | --- |
| Text | Standard text embedding and retrieval | Azure AI Search |
| Images | Vision model generates text descriptions, embed those | Azure OpenAI GPT-4o (vision) |
| Tables | Extract tables as structured data or markdown | Azure AI Document Intelligence |
| Charts | Vision model interprets chart data | Azure OpenAI GPT-4o (vision) |
| Audio | Transcribe to text, then standard RAG | Azure AI Speech |

Pattern Comparison Summary​

| Pattern | Complexity | Accuracy | Latency | Best For |
| --- | --- | --- | --- | --- |
| Naive RAG | Low | 50-65% | 1-2s | Prototypes, simple Q&A |
| Advanced RAG | Medium | 75-90% | 2-4s | Production enterprise systems |
| Modular RAG | Medium-High | 75-90% | 2-4s | Teams needing flexibility |
| Agentic RAG | High | 80-95% | 5-15s | Complex multi-step questions |
| Graph RAG | Very High | 85-95% | 3-8s | Relationship-rich domains |
| Multi-Modal RAG | High | Varies | 3-10s | Documents with images/tables |

5.14 Evaluation β€” Is Your RAG Actually Working?​

You cannot improve what you do not measure. RAG evaluation is notoriously difficult because there are multiple points of failure: retrieval can fail, context assembly can fail, and generation can fail.

Evaluation Framework​

Retrieval Metrics​

| Metric | What It Measures | Formula | Target |
| --- | --- | --- | --- |
| Hit Rate@K | Did at least one relevant doc appear in top K? | (queries with a hit in top K) / (total queries) | > 90% |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | mean(1 / rank_of_first_relevant) | > 0.7 |
| NDCG@K (Normalized Discounted Cumulative Gain) | Quality of the full ranking order | Considers position-weighted relevance | > 0.75 |
| Precision@K | Fraction of top K results that are relevant | (relevant in top K) / K | > 0.6 |
| Recall@K | Fraction of all relevant docs that appear in top K | (relevant in top K) / (total relevant) | > 0.8 |
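Apart from NDCG, these metrics are a few lines of code each once you have relevance judgments per query. A sketch, assuming `results` is the ranked list of retrieved doc IDs for one query and `relevant` the set of known-relevant IDs from your evaluation dataset (averaging each function over all queries gives the dataset-level number; MRR is the mean of `reciprocal_rank`):

```python
from typing import List, Set

def hit_rate_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(r in relevant for r in results[:k]) else 0.0

def reciprocal_rank(results: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none found)."""
    for rank, doc_id in enumerate(results, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def recall_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    return sum(1 for r in results[:k] if r in relevant) / max(len(relevant), 1)
```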

Generation Metrics​

| Metric | What It Measures | How to Evaluate | Target |
| --- | --- | --- | --- |
| Groundedness | Is every claim in the answer supported by the retrieved context? | LLM-as-judge or manual review | > 90% |
| Relevance | Does the answer actually address the user's question? | LLM-as-judge with rubric | > 85% |
| Faithfulness | Does the answer avoid adding information not in the context? | LLM-as-judge (check for fabrication) | > 95% |
| Completeness | Does the answer cover all relevant aspects from the context? | LLM-as-judge or manual | > 80% |
| Correctness | Is the answer factually correct? | Comparison against ground truth | > 85% |

RAGAS Framework​

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework that automates RAG evaluation using LLM-as-judge techniques.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Prepare the evaluation data (ragas expects a Hugging Face Dataset)
eval_dataset = Dataset.from_dict({
    "question": ["How do I configure RBAC for AKS?", ...],
    "answer": ["To configure RBAC for AKS, you need to...", ...],
    "contexts": [["RBAC in AKS is configured by...", "AKS supports Azure RBAC..."], ...],
    "ground_truth": ["RBAC for AKS is configured using...", ...],
})

# Run evaluation
result = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Is the answer faithful to the context?
        answer_relevancy,   # Is the answer relevant to the question?
        context_precision,  # Are the retrieved contexts precise?
        context_recall,     # Do the contexts cover the ground truth?
    ],
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85, 'context_recall': 0.90}
```

Azure AI Foundry Evaluation​

Azure AI Foundry provides built-in evaluation tools for RAG systems:

  • Built-in evaluators β€” Groundedness, Relevance, Coherence, Fluency, Similarity
  • Custom evaluators β€” Define your own LLM-as-judge prompts
  • Automated evaluation pipelines β€” Run evaluations on every deployment
  • Comparative analysis β€” Compare multiple RAG configurations side-by-side
```python
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

# Initialize evaluators
groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)

# Run evaluation over test dataset
results = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
    },
    output_path="eval_results.json"
)
```

Building an Evaluation Dataset​

A good evaluation dataset contains 50-200 question-answer pairs that are representative of real user queries. Each entry should include:

| Field | Description | Example |
| --- | --- | --- |
| `question` | A realistic user question | "What is our company's parental leave policy?" |
| `ground_truth` | The correct answer (written by a domain expert) | "Employees are entitled to 16 weeks of paid leave..." |
| `relevant_doc_ids` | IDs of documents that contain the answer | ["hr_policy_v3.pdf", "benefits_guide.md"] |
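The dataset is typically stored as JSONL (one JSON object per line) so evaluation tools can stream it. A minimal writer, using the field names from the table above:

```python
import json

def write_eval_dataset(entries: list, path: str) -> None:
    """Write evaluation entries as JSONL: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

entries = [{
    "question": "What is our company's parental leave policy?",
    "ground_truth": "Employees are entitled to 16 weeks of paid leave...",
    "relevant_doc_ids": ["hr_policy_v3.pdf", "benefits_guide.md"],
}]
write_eval_dataset(entries, "eval_dataset.jsonl")
```

Keeping the dataset in version control next to the RAG code lets you re-run the same evaluation after every pipeline change.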
**Do Not Skip Evaluation**

Many teams build RAG systems and "eyeball" the results. This is insufficient for production. Invest in building an evaluation dataset β€” it is the single most important quality assurance artifact in your RAG system. Without it, you are flying blind.


5.15 Infrastructure for RAG at Scale​

A production RAG system is not just code β€” it is infrastructure. Architects must plan for compute, storage, networking, availability, and cost at scale.

Reference Architecture​

Compute Sizing​

| Component | Service | Sizing Guidance | Scaling Pattern |
| --- | --- | --- | --- |
| RAG Orchestrator | App Service or Functions | Start with S1/P1v3, scale based on concurrent users | Horizontal (instances) |
| Azure OpenAI | PTU or pay-as-you-go | PTU for predictable load, PAYG for variable | Quota-based |
| Azure AI Search | Standard S1 | 2 replicas for HA, add partitions for data volume | Replicas (throughput) + Partitions (data) |
| Embedding Generation | Azure OpenAI | Batch for ingestion, real-time for queries | Rate limit management |
| Document Processing | AI Document Intelligence | S0 for low volume, S1 for high volume | Per-transaction |

Storage Estimation​

Estimate your vector index size to choose the right Azure AI Search tier.

| Factor | Calculation |
| --- | --- |
| Documents | 10,000 documents, average 5 pages each |
| Chunks per document | ~15-30 chunks (512 tokens each) |
| Total chunks | 10,000 x 20 = 200,000 chunks |
| Vector size (1536 dims) | 200,000 x 1,536 x 4 bytes = ~1.2 GB |
| Text + metadata per chunk | 200,000 x ~2 KB = ~400 MB |
| Total index size | ~1.6 GB |
| Recommended tier | Standard S1 (25 GB per partition) β€” plenty of room |
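The arithmetic in the table generalizes into a small sizing helper (4 bytes per float32 dimension; the per-chunk metadata size is a rough estimate you should adjust for your own schema):

```python
def estimate_index_gb(num_chunks: int,
                      dims: int = 1536,
                      metadata_bytes_per_chunk: int = 2048) -> float:
    """Rough Azure AI Search index size in GB (decimal) for float32 vectors."""
    vector_bytes = num_chunks * dims * 4          # 4 bytes per float32 dim
    metadata_bytes = num_chunks * metadata_bytes_per_chunk
    return (vector_bytes + metadata_bytes) / 1e9

size_gb = estimate_index_gb(200_000)  # the worked example above: ~1.6 GB
```

Re-running the helper with your projected document growth tells you when you will need an additional partition, before the index fills up.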

Cost Estimation Formula​

```
Monthly RAG Cost = Search Cost + Embedding Cost + LLM Cost + Storage Cost + Compute Cost

Where:
  Search Cost    = Azure AI Search tier price (e.g., $250/SU for S1)
  Embedding Cost = (ingestion_chunks + monthly_queries) x embedding_price_per_token
  LLM Cost       = monthly_queries x avg_prompt_tokens x LLM_price_per_token
  Storage Cost   = Blob Storage for source documents (~$0.02/GB)
  Compute Cost   = App Service / Functions plan
```
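The formula translates directly into a helper you can drop into a capacity-planning notebook. Every price is an input taken from the current Azure price sheet, not a hard-coded assumption; the unit prices in the usage below are illustrative placeholders only:

```python
def monthly_rag_cost(search_cost: float,
                     embed_tokens: int, embed_price_per_1k: float,
                     llm_tokens: int, llm_price_per_1k: float,
                     storage_gb: float, storage_price_per_gb: float,
                     compute_cost: float) -> float:
    """Monthly RAG cost per the formula above. Token counts are monthly totals
    (ingestion + queries for embeddings, prompt + completion for the LLM)."""
    embedding_cost = embed_tokens / 1000 * embed_price_per_1k
    llm_cost = llm_tokens / 1000 * llm_price_per_1k
    storage_cost = storage_gb * storage_price_per_gb
    return search_cost + embedding_cost + llm_cost + storage_cost + compute_cost

# Illustrative unit prices -- substitute real figures from the Azure pricing page.
total = monthly_rag_cost(
    search_cost=500.0,
    embed_tokens=5_000_000, embed_price_per_1k=0.00002,
    llm_tokens=250_000_000, llm_price_per_1k=0.0015,
    storage_gb=50, storage_price_per_gb=0.02,
    compute_cost=150.0,
)
```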

Example cost estimate for a mid-size deployment (10K documents, 50K queries/month):

| Component | Monthly Cost |
| --- | --- |
| Azure AI Search (S1, 2 replicas) | ~$500 |
| Azure OpenAI β€” Embeddings (50K queries x 100 tokens) | ~$1 |
| Azure OpenAI β€” GPT-4o (50K queries x 5,000 tokens avg) | ~$375 |
| Azure Blob Storage (50 GB) | ~$1 |
| Azure App Service (P1v3) | ~$150 |
| AI Document Intelligence (initial + updates) | ~$50 |
| **Total** | **~$1,077/month** |

Latency Optimization​

| Technique | Latency Reduction | Implementation Effort |
| --- | --- | --- |
| Response streaming | Perceived latency drops to ~500ms (first token) | Low β€” enable `stream=True` in the API call |
| Search result caching | Eliminates search latency for repeated queries | Medium β€” Redis or in-memory cache |
| Embedding caching | Eliminates embedding latency for repeated queries | Low β€” cache embedding vectors |
| Regional co-location | 20-50ms saved per cross-region call | Low β€” deploy all services in same region |
| Connection pooling | 10-30ms saved per request | Low β€” reuse HTTP connections |
| Async indexing | No impact on query latency (ingestion is async) | Medium β€” event-driven ingestion pipeline |
| Pre-computed embeddings | Eliminates real-time embedding generation | Medium β€” batch embed and cache |
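Embedding caching is often just a dictionary keyed by the normalized query string. A minimal sketch with an injected embed function (the stub in the test stands in for a real embedding API call):

```python
from typing import Callable, Dict, List

class EmbeddingCache:
    """Cache query embeddings so repeated questions skip the embedding call."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    def get(self, text: str) -> List[float]:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key not in self._cache:
            self._cache[key] = self._embed_fn(key)
        return self._cache[key]
```

In production you would typically back this with Redis and add a TTL; the in-process dictionary shown here already removes the embedding round-trip for repeated questions within one instance.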

High Availability​

| Component | HA Strategy | RPO | RTO |
| --- | --- | --- | --- |
| Azure AI Search | 2+ replicas (99.9% SLA with 2 replicas) | 0 | Minutes |
| Azure OpenAI | Multi-region with fallback | 0 | Seconds (failover) |
| Blob Storage | GRS or RA-GRS | ~15 minutes | Minutes |
| App Service | Multi-instance, health probes | 0 | Seconds |
| Cosmos DB | Multi-region writes | 0 | Seconds |

Networking Best Practices​

| Practice | Why | How |
| --- | --- | --- |
| Private Endpoints for all services | No data over public internet | Azure Private Link for AI Search, OpenAI, Blob |
| VNet integration for App Service | Outbound traffic stays in VNet | App Service VNet Integration |
| Managed Identity authentication | No API keys in config | System-assigned MI with RBAC |
| DNS Private Zones | Private endpoint name resolution | Azure DNS Private Zones per service |
| NSG rules | Restrict traffic to necessary paths | Allow only orchestrator to search, orchestrator to OpenAI |

Key Takeaways​

  1. RAG solves the three fundamental LLM limitations β€” knowledge cutoff, hallucination, and lack of private data access β€” by retrieving your data and injecting it into the prompt at query time.

  2. Chunking is the highest-impact design decision. Use recursive or document-aware chunking for structured content. Test multiple chunk sizes empirically. Never split tables, code blocks, or semantic units.

  3. Hybrid search (keyword + vector) with semantic reranking is the production default. There is rarely a reason to use keyword-only or vector-only search in production systems.

  4. Always use reranking. The 100-300ms of additional latency is almost always worth the 10-25% improvement in answer quality.

  5. Invest in evaluation before scaling. Build a 50-200 question evaluation dataset, measure retrieval and generation metrics, and iterate on your pipeline. Without evaluation, you are guessing.

  6. Azure AI Search is the recommended Swiss Army Knife for Azure RAG architectures β€” it combines vector search, keyword search, semantic reranking, integrated vectorization, and skillsets in a single managed service.

  7. Start with Advanced RAG, not Naive RAG. The additional complexity of query rewriting, hybrid search, and reranking is well worth the effort. Naive RAG is for prototypes only.

  8. RAG is infrastructure, not just code. Plan for compute sizing, storage estimation, networking (private endpoints), high availability, and cost management from day one.


Next Module: Module 6: AI Agents Deep Dive β€” understand how AI agents use tools, plan multi-step tasks, maintain memory, and integrate with the Microsoft Agent Framework and AutoGen.