
Module 1: GenAI Foundations — How LLMs Actually Work

Duration: 60--90 minutes | Level: Foundation | Audience: Cloud Architects, Platform Engineers, Infrastructure Engineers | Last Updated: March 2026



1.1 The AI Revolution in 30 Seconds​

The field of artificial intelligence did not arrive overnight. It evolved through distinct waves, each one building on the last, each one demanding more from the infrastructure underneath it.

Timeline: From Machine Learning to the GenAI Explosion​

Why Infrastructure Architects Need to Care — Now

This is not a trend you can observe from the sidelines. Consider what is already happening on the infrastructure you manage:

| What Changed | Infrastructure Impact |
| --- | --- |
| Every SaaS product is adding AI features | Your APIs now route to GPU-backed endpoints |
| RAG pipelines need vector databases | New data tier alongside SQL and NoSQL |
| AI agents make autonomous API calls | Unpredictable traffic patterns, new security surface |
| Copilot integrations are enterprise-mandated | M365, GitHub, Azure — all require AI connectivity |
| Model serving requires GPUs | Capacity planning now includes VRAM, not just vCPUs |
| Token-based pricing | Cost models shift from compute-hours to token volumes |

The bottom line: If you build infrastructure, you are already building AI infrastructure — whether you realize it or not. This module gives you the foundational knowledge to do it deliberately and well.


1.2 What is a Large Language Model (LLM)?​

Strip away the hype and an LLM is a statistical model that predicts the next token in a sequence. That is it. Every response from ChatGPT, every code suggestion from GitHub Copilot, every summary from Copilot for Microsoft 365 — all of it is next-token prediction at massive scale.

The Core Idea​

Given the input: "The capital of France is"

The model assigns probabilities to every token in its vocabulary:

| Token | Probability |
| --- | --- |
| Paris | 0.92 |
| Lyon | 0.03 |
| the | 0.02 |
| a | 0.01 |
| Berlin | 0.005 |
| (50,000+ other tokens) | (remaining probability) |

The model picks one token (influenced by generation parameters, covered in Section 1.5), appends it to the sequence, and repeats. This loop — predict, append, repeat — is called autoregressive generation.
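The predict-append-repeat loop can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_next_token_probs` is a hypothetical stand-in that returns hand-written probabilities, where a real LLM would compute the distribution with a neural network.

```python
import random

# Toy "model": maps a context string to a probability distribution over
# candidate next tokens. All numbers are invented for illustration.
def toy_next_token_probs(context: str) -> dict:
    if context.endswith("France is"):
        return {" Paris": 0.92, " Lyon": 0.03, " the": 0.02, " a": 0.01, ".": 0.02}
    return {".": 1.0}  # otherwise, end the sequence

def generate(prompt: str, max_tokens: int = 5, seed: int = 0) -> str:
    """Autoregressive generation: predict, append, repeat."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(max_tokens):
        probs = toy_next_token_probs(text)
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights)[0]  # sample one token
        text += token                                    # append it to the sequence
        if token == ".":                                 # natural stop token
            break
    return text

print(generate("The capital of France is"))
```

Each iteration runs one full forward pass of the model over the sequence so far, which is why long outputs are slow.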

Training vs Inference — Why Infra Cares About Both

These are two fundamentally different workloads, and they stress your infrastructure in completely different ways.

| Dimension | Training | Inference |
| --- | --- | --- |
| Purpose | Learn patterns from data | Generate responses from prompts |
| Computation | Forward + backward pass | Forward pass only |
| Duration | Weeks to months | Milliseconds to seconds |
| GPU utilization | Sustained 90--100% | Bursty, 20--80% |
| Data movement | TB/PB of training data | KB/MB per request |
| Parallelism | Data parallel + model parallel across 100s--1000s of GPUs | Typically 1--8 GPUs per model instance |
| Who does it | Model providers (OpenAI, Meta, Google) | You (via API or self-hosted) |
| Your concern as architect | Rarely — unless fine-tuning | Always — this runs on your infra |

Parameters — What "7B" and "70B" Actually Mean

When someone says "Llama 3.1 70B," that 70B refers to 70 billion parameters. Parameters are the numerical weights inside the neural network that were learned during training. They encode everything the model "knows."

Think of parameters like this: if the model is a building, parameters are every single brick, beam, wire, and pipe. More parameters = more capacity to store knowledge and handle nuance, but also = more VRAM, more compute, more cost.

| Model | Parameters | VRAM Required (FP16) | VRAM Required (INT4) | Relative Quality |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | ~8 GB | ~3 GB | Good for focused tasks |
| Llama 3.1 8B | 8B | ~16 GB | ~5 GB | Strong for its size |
| Llama 3.1 70B | 70B | ~140 GB | ~40 GB | Very capable |
| Llama 3.1 405B | 405B | ~810 GB | ~230 GB | Frontier-class |
| GPT-4 (estimated) | ~1.8T (MoE) | Not self-hostable | Not self-hostable | Leading benchmark scores |

The infrastructure relationship is direct:

VRAM required (GB) ≈ Parameters (B) × Bytes per parameter
  • FP32 (full precision): 4 bytes per parameter → 70B model = ~280 GB VRAM
  • FP16 (half precision): 2 bytes per parameter → 70B model = ~140 GB VRAM
  • INT8 (8-bit quantized): 1 byte per parameter → 70B model = ~70 GB VRAM
  • INT4 (4-bit quantized): 0.5 bytes per parameter → 70B model = ~35 GB VRAM

This is why quantization (Section 1.9) matters so much for infrastructure planning.
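The VRAM rule of thumb above is easy to encode as a helper. A minimal sketch; the `overhead` fraction is an assumption to account for KV cache, activations, and framework memory on top of raw weights (Section 1.9 suggests budgeting 10--30% extra).

```python
# Bytes per parameter at common precisions (from the table above).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str, overhead: float = 0.0) -> float:
    """Approximate VRAM (GB) needed for model weights.

    overhead: extra fraction for KV cache / activations / framework memory.
    """
    gb = params_billion * BYTES_PER_PARAM[precision]  # 1B params at 1 byte each ~= 1 GB
    return gb * (1.0 + overhead)

for precision in ("fp32", "fp16", "int8", "int4"):
    print(precision, weight_vram_gb(70, precision))  # 280.0, 140.0, 70.0, 35.0
```

Usage: `weight_vram_gb(8, "fp16", overhead=0.25)` budgets ~20 GB for serving an 8B model at FP16 with headroom.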


1.3 Tokens — The Currency of AI

Tokens are the fundamental unit of everything in the LLM world. They determine cost, latency, memory usage, and context limits. As an infrastructure architect, understanding tokens is as essential as understanding packets in networking.

What is a Token?​

A token is a subword unit — not a whole word, not a single character, but something in between. Models use algorithms like Byte-Pair Encoding (BPE) to break text into tokens. Common words become single tokens. Rare words get split into multiple tokens.

Examples of tokenization (using GPT-4's tokenizer):

| Text | Tokens | Token Count |
| --- | --- | --- |
| Hello | Hello | 1 |
| infrastructure | infra + structure | 2 |
| Kubernetes | Kub + ernetes | 2 |
| Azure Application Gateway | Azure + Application + Gateway | 3 |
| antidisestablishmentarianism | ant + idis + establish + ment + arian + ism | 6 |
| こんにちは (Japanese "hello") | こんにちは | 1--3 (varies by model) |
| 192.168.1.1 | 192 + . + 168 + . + 1 + . + 1 | 7 |
| A blank space before a word | Included in the token | (spaces are part of tokens) |

Key insight: A rough rule of thumb for English text is 1 token ≈ 0.75 words, or equivalently, 1 word ≈ 1.33 tokens. Code is typically more token-dense than natural language.

Token Limits and Context Windows​

Every model has a context window — the maximum number of tokens it can process in a single request (input + output combined).

Context Window = Input Tokens + Output Tokens

If a model has a 128K context window and your input is 100K tokens, you have only 28K tokens left for the output.
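Both rules of thumb above are trivial arithmetic, but worth encoding once so capacity estimates stay consistent. A minimal sketch using the 1 word ≈ 1.33 tokens heuristic from this section:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough English-text estimate: 1 word ~= 1.33 tokens."""
    return round(word_count * 1.33)

def max_output_tokens(context_window: int, input_tokens: int) -> int:
    """Tokens left for the response once the input is accounted for."""
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    return context_window - input_tokens

print(estimate_tokens(100))                  # ~133 tokens for a 100-word paragraph
print(max_output_tokens(128_000, 100_000))   # 28000 tokens left for the output
```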

Input Tokens vs Output Tokens vs Total Tokens​

This distinction matters for both cost and infrastructure planning:

| Token Type | What It Includes | Cost (Typical) | Latency Impact |
| --- | --- | --- | --- |
| Input tokens | System prompt + user prompt + any context/documents | Lower cost per token | Processed in parallel (fast) |
| Output tokens | The model's generated response | Higher cost per token (2--4x input) | Generated sequentially (slower) |
| Total tokens | Input + Output | Sum of both | Determines memory usage |

Why Tokens Matter — The Three Dimensions

1. Cost: Every API call costs money per token.

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
| --- | --- | --- |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| Llama 3.1 70B (Azure) | $0.268 | $0.354 |

Prices are illustrative and change frequently. Always check current pricing.

2. Latency: More output tokens = longer response time. Each output token is generated sequentially. A 500-token response takes roughly 5x longer than a 100-token response.

3. Infrastructure Sizing: The context window directly determines how much GPU memory (VRAM) is needed per concurrent request, because the entire context must be held in the KV cache during generation.

Token Count Reference​

To help you estimate token usage for your workloads:

| Content Type | Approximate Token Count |
| --- | --- |
| A short sentence (10 words) | ~13 tokens |
| A paragraph (100 words) | ~133 tokens |
| A full page of text (~500 words) | ~667 tokens |
| A 10-page document | ~6,700 tokens |
| A 100-page technical manual | ~67,000 tokens |
| A full novel (80,000 words) | ~107,000 tokens |
| A Kubernetes YAML file (200 lines) | ~800 tokens |
| A Python script (500 lines) | ~2,500 tokens |
| A Terraform module (1,000 lines) | ~5,500 tokens |

1.4 The Transformer Architecture (Simplified)​

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation of every major LLM. You do not need to understand every mathematical detail, but you need to understand the key innovation and why it changed everything from an infrastructure perspective.

Why Transformers Replaced RNNs​

Before Transformers, sequence models (RNNs, LSTMs) processed text one token at a time, sequentially. This was like a single-lane road — no matter how fast the car, throughput was limited.

Transformers introduced self-attention, which allows the model to process all tokens in parallel and learn relationships between any two tokens regardless of distance. This is like opening a multi-lane highway.

| Property | RNN/LSTM | Transformer |
| --- | --- | --- |
| Processing | Sequential (token by token) | Parallel (all tokens at once) |
| Long-range dependencies | Degrades with distance | Constant (attention across all positions) |
| Training speed | Slow (cannot parallelize) | Fast (GPU-friendly parallelism) |
| Scaling | Difficult beyond ~1B params | Scales to trillions of parameters |
| GPU utilization | Poor (sequential bottleneck) | Excellent (matrix multiplications) |

Self-Attention Explained Simply​

Self-attention answers the question: "When processing this token, how much should I pay attention to every other token in the sequence?"

Consider the sentence: "The server crashed because it ran out of memory."

When processing the word "it", self-attention computes how relevant every other word is:

| Word | Attention Weight | Why |
| --- | --- | --- |
| server | 0.62 | "it" refers to "server" |
| crashed | 0.15 | Related context |
| memory | 0.12 | Related concept |
| The | 0.02 | Low relevance |
| because | 0.04 | Structural word |
| ran | 0.03 | Some relevance |
| out | 0.01 | Low relevance |
| of | 0.01 | Low relevance |

This is computed using three learned projections of each token called Query (Q), Key (K), and Value (V) — think of it like a database lookup where the Query asks "what am I looking for?", the Keys say "here is what I contain," and the dot product between them determines relevance. The Values carry the actual information that gets passed forward.
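The Q/K/V mechanism above is compact enough to implement directly. A minimal sketch of scaled dot-product attention for a single query, with tiny invented 2-dimensional vectors (real models use hundreds of dimensions and many heads):

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns (weights, output): the attention distribution over the keys
    and the weighted sum of the values.
    """
    d = len(query)
    # Query-Key dot products, scaled by sqrt(d): "how relevant is each token?"
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of Values: the information that gets passed forward.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Invented vectors: the query (for "it") aligns most with the first key ("server").
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.2, 0.1], [0.0, 1.0]]   # "server", "crashed", "memory"
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
w, out = attention(q, K, V)
print([round(x, 2) for x in w])  # highest weight lands on the first key
```

In a real Transformer this runs for every token against every other token, in every head of every layer — which is exactly the matrix-multiplication workload GPUs excel at.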

Encoder vs Decoder vs Encoder-Decoder​

Transformers come in three flavors, each optimized for different tasks:

For infrastructure architects, the key takeaway: Almost every LLM you will serve in production is decoder-only. GPT-4, Claude, Llama, Phi, Mistral — all decoder-only. This matters because decoder-only models generate output one token at a time, which creates the sequential bottleneck that drives inference latency.

Simplified Transformer Flow​

Infrastructure insight: The number of Transformer blocks (layers) is one of the key scaling dimensions. GPT-3 has 96 layers. More layers = more parameters = more VRAM = more compute per token. Each layer performs matrix multiplications, which is why GPUs (optimized for matrix math) are essential.


1.5 Key Generation Parameters — The Control Panel

When you send a prompt to an LLM, you do not just send text — you send generation parameters that control how the model selects tokens from its probability distribution. These parameters are the dials and knobs that determine whether the model's output is creative or deterministic, concise or verbose, focused or exploratory.

Understanding these parameters is critical because they directly affect output quality, latency, cost, and user experience of every AI feature your infrastructure serves.

Temperature​

What it does: Controls the randomness of token selection by scaling the probability distribution before sampling.

Technical detail: Temperature divides the logits (raw model outputs) before the softmax function. Lower temperature = sharper distribution (high-probability tokens dominate). Higher temperature = flatter distribution (lower-probability tokens get a chance).

| Temperature | Behavior | Output Example for "The best cloud provider is..." |
| --- | --- | --- |
| 0.0 | Deterministic — always picks the highest-probability token | "The best cloud provider is Azure for enterprise workloads due to its comprehensive..." (same every time) |
| 0.3 | Low randomness — very focused, slight variation | "The best cloud provider depends on your specific requirements, but Azure offers..." |
| 0.7 | Balanced — default for most use cases | "The best cloud provider really depends on what you need. For hybrid scenarios, Azure shines..." |
| 1.0 | Standard sampling — follows the learned distribution | "Honestly, the best cloud provider is a loaded question! Each has sweet spots..." |
| 1.5 | High randomness — creative, sometimes incoherent | "The best cloud provider is like asking which star in Orion's belt dances most..." |
| 2.0 | Maximum randomness — often nonsensical | Unpredictable, potentially garbled output |

When to use which value:

| Use Case | Recommended Temperature |
| --- | --- |
| Code generation | 0.0 -- 0.2 |
| Data extraction / structured output | 0.0 |
| Factual Q&A / technical documentation | 0.1 -- 0.3 |
| General conversation / chatbots | 0.5 -- 0.7 |
| Creative writing / brainstorming | 0.7 -- 1.0 |
| Experimental / artistic content | 1.0 -- 1.5 |

Top-P (Nucleus Sampling)​

What it does: Instead of considering all tokens, Top-P considers only the smallest set of tokens whose cumulative probability exceeds the threshold P.

Example: If top_p = 0.9, the model sorts tokens by probability and includes tokens until the cumulative probability reaches 90%. All other tokens are excluded before sampling.

Token probabilities (sorted):
  "Paris" = 0.70
  "Lyon" = 0.10 → cumulative = 0.80
  "the" = 0.06 → cumulative = 0.86
  "a" = 0.04 → cumulative = 0.90 ← cutoff at top_p=0.9
  "Berlin" = 0.03 → excluded
  ...all others → excluded
| Top-P Value | Behavior |
| --- | --- |
| 0.1 | Only the very top tokens considered — very focused |
| 0.5 | Moderate filtering — reasonable diversity |
| 0.9 | Mild filtering — most probable tokens included (common default) |
| 1.0 | No filtering — all tokens considered |
**Warning:** Do not adjust both Temperature and Top-P simultaneously. They both control randomness but in different ways. Changing both can produce unpredictable results. Pick one to tune and leave the other at its default.

Top-K​

What it does: Limits the candidate tokens to the K most probable tokens before sampling. Simpler than Top-P but less adaptive.

| Top-K Value | Behavior |
| --- | --- |
| 1 | Greedy decoding — always pick the top token (like temperature=0) |
| 10 | Very focused — only top 10 tokens considered |
| 40 | Moderate diversity (common default) |
| 100 | Broad candidate set |

Top-K vs Top-P: Top-K always considers exactly K tokens regardless of their probabilities. Top-P adapts — if the model is very confident, it might only consider 2 tokens; if uncertain, it might consider 200. In practice, Top-P is preferred for most applications.

Frequency Penalty​

What it does: Reduces the probability of tokens proportionally to how many times they have already appeared in the output. The more a token appears, the more it is penalized.

| Value | Behavior |
| --- | --- |
| -2.0 | Strongly encourage repetition |
| 0.0 | No penalty (default) |
| 0.5 | Mild reduction in repetition |
| 1.0 | Moderate reduction — noticeable reduction in repeated phrases |
| 2.0 | Strong penalty — aggressively avoids repeating any token |

Use case: Set to 0.3--0.8 when the model tends to repeat phrases or get stuck in loops. Common in longer outputs.

Presence Penalty​

What it does: Applies a flat penalty to any token that has appeared in the output at all, regardless of how many times. Unlike frequency penalty, which scales with count, presence penalty is binary — the token either has appeared or it has not.

| Value | Behavior |
| --- | --- |
| -2.0 | Strongly encourage reuse of existing topics |
| 0.0 | No penalty (default) |
| 0.5 | Mildly encourages new topics |
| 1.0 | Moderately encourages the model to explore new territory |
| 2.0 | Strongly pushes the model to cover new topics, avoid revisiting |

Use case: Set to 0.3--1.0 when you want the model to cover diverse topics and not circle back to points it already made. Useful for brainstorming and summarization.

Max Tokens (Max Completion Tokens)​

What it does: Sets the maximum number of tokens the model will generate in its response. The model will stop generating when it reaches this limit (even mid-sentence) or when it produces a natural stop token, whichever comes first.

| Consideration | Detail |
| --- | --- |
| Relationship to context window | input_tokens + max_tokens must not exceed the model's context window |
| Cost control | Setting max_tokens prevents runaway generation that consumes budget |
| Quality | Setting it too low truncates responses mid-thought; too high risks runaway token spend |
| Default | Varies by model — often 4096 or the model maximum |

Infrastructure tip: For cost-sensitive production workloads, always set max_tokens explicitly. A chatbot response rarely needs more than 1,000 tokens. A code generation task might need 4,000. A document summary might need 500. Setting appropriate limits directly controls your token spend.

Stop Sequences​

What it does: A list of strings that, when generated by the model, cause it to stop generating immediately. The stop sequence itself is not included in the response.

Examples:

| Stop Sequence | Use Case |
| --- | --- |
| "\n\n" | Stop after a single paragraph |
| "```" | Stop after a code block |
| "END" | Stop at a specific marker |
| "Human:" | Stop before generating the next turn in a conversation |

The Complete Parameter Reference Table​

| Parameter | Range | Default (typical) | What It Controls | When to Adjust |
| --- | --- | --- | --- | --- |
| Temperature | 0.0 -- 2.0 | 0.7 -- 1.0 | Randomness of token selection | Lower for factual tasks, higher for creative tasks |
| Top-P | 0.0 -- 1.0 | 0.9 -- 1.0 | Cumulative probability threshold | Lower for focused output; do not change with temperature |
| Top-K | 1 -- vocab size | 40 -- 50 | Hard cap on candidate tokens | Lower for deterministic output |
| Frequency Penalty | -2.0 -- 2.0 | 0.0 | Penalty scaling with token frequency | Increase (0.3--0.8) to reduce repetitive phrases |
| Presence Penalty | -2.0 -- 2.0 | 0.0 | Flat penalty for any used token | Increase (0.3--1.0) to encourage topic diversity |
| Max Tokens | 1 -- context limit | Model-dependent | Maximum output length | Always set explicitly in production |
| Stop Sequences | List of strings | None | Points where generation halts | Use to control output format and boundaries |

Practical Parameter Recipes​

| Scenario | Temperature | Top-P | Freq. Penalty | Presence Penalty | Max Tokens |
| --- | --- | --- | --- | --- | --- |
| Structured data extraction | 0.0 | 1.0 | 0.0 | 0.0 | 500 |
| Code generation | 0.0 -- 0.2 | 0.95 | 0.0 | 0.0 | 4096 |
| Customer support chatbot | 0.5 | 0.9 | 0.3 | 0.2 | 800 |
| Technical documentation | 0.2 -- 0.3 | 0.9 | 0.0 | 0.0 | 2000 |
| Creative brainstorming | 0.9 | 0.95 | 0.5 | 0.8 | 2000 |
| Summarization | 0.3 | 0.9 | 0.5 | 0.5 | 500 |
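A recipe only matters once it is wired into a request. Below is a hedged sketch of an OpenAI-style chat-completion payload using the customer-support-chatbot row; the model name and system prompt are illustrative assumptions, and field names should be checked against your provider's API reference before use.

```python
def chatbot_request(user_message: str) -> dict:
    """Build a chat-completion payload using the customer support recipe:
    temperature 0.5, top_p 0.9, penalties 0.3/0.2, max_tokens 800."""
    return {
        "model": "gpt-4o-mini",  # assumption: substitute your deployed model
        "messages": [
            {"role": "system", "content": "You are a concise support assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.5,
        "top_p": 0.9,
        "frequency_penalty": 0.3,
        "presence_penalty": 0.2,
        "max_tokens": 800,        # always set explicitly in production
        "stop": ["Human:"],       # stop before generating the next turn
    }

payload = chatbot_request("My VM will not start.")
print(payload["temperature"], payload["max_tokens"])
```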

1.6 Context Windows — The Memory of AI

The context window is the total number of tokens a model can "see" at once — its working memory for a single request. Everything the model reads (system prompt, conversation history, documents, user question) and everything it writes (the response) must fit within this window.

The Evolution of Context Windows​

| Year | Model | Context Window | Equivalent Text |
| --- | --- | --- | --- |
| 2022 | GPT-3.5 | 4,096 tokens | ~3,000 words (~6 pages) |
| 2023 | GPT-3.5-turbo-16k | 16,384 tokens | ~12,000 words (~24 pages) |
| 2023 | GPT-4 | 8,192 tokens | ~6,000 words (~12 pages) |
| 2023 | GPT-4-32k | 32,768 tokens | ~25,000 words (~50 pages) |
| 2023 | Claude 2.1 | 200,000 tokens | ~150,000 words (~300 pages) |
| 2024 | GPT-4o | 128,000 tokens | ~96,000 words (~192 pages) |
| 2024 | Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words (~300 pages) |
| 2024 | Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words (~1,500 pages) |
| 2025 | Claude Opus 4 | 1,000,000 tokens | ~750,000 words (~1,500 pages) |

Why Bigger Context Is Not Always Better​

Intuitively, a larger context window seems strictly superior. But there are important tradeoffs that infrastructure architects must understand:

The "Lost in the Middle" Problem​

Research has shown that LLMs pay the most attention to information at the beginning and end of the context window, and tend to overlook information in the middle. This is called the "Lost in the Middle" effect.

What this means for architects:

  • Do not assume that stuffing a 200K-token window full of documents will produce accurate answers
  • Place the most important information at the beginning or end of the prompt
  • RAG (Retrieval-Augmented Generation) with targeted retrieval often outperforms brute-force context stuffing
  • Evaluate whether your use case genuinely needs a large context window or if a smaller window with smarter retrieval is more cost-effective

Context Window and Infrastructure Sizing​

The context window has a direct relationship to infrastructure requirements because of the KV cache (Key-Value cache) — a memory structure that stores attention computations for all tokens in the context.

| Context Length | KV Cache per Request (approx., 70B model, FP16) | Max Concurrent Requests on A100 80GB |
| --- | --- | --- |
| 4K tokens | ~0.5 GB | ~80 |
| 32K tokens | ~4 GB | ~15 |
| 128K tokens | ~16 GB | ~3 |
| 1M tokens | ~128 GB | <1 (needs multiple GPUs) |

Values are approximate and depend on model architecture, batch size, and optimization techniques.

Infrastructure takeaway: Context window size should be a key input to your capacity planning. Do not default to the largest available context — right-size it for your workload.


1.7 Inference — Where Infrastructure Meets AI

Inference is the process of running a trained model to generate predictions (responses). This is where your infrastructure directly impacts user experience. Every millisecond of latency, every out-of-memory error, every throttled request — these are inference problems.

What Happens During an Inference Request​

The Two Phases of Inference​

Understanding these two phases is critical for infrastructure optimization:

Phase 1: Prefill (also called "prompt processing")

| Aspect | Detail |
| --- | --- |
| What happens | All input tokens are processed in parallel through the model |
| Compute pattern | Compute-bound — heavy matrix multiplications |
| GPU utilization | High — all cores active |
| Output | KV cache is populated for all input positions |
| Duration | Proportional to input length; processed in parallel, so reasonably fast |

Phase 2: Decode (also called "generation" or "autoregressive decoding")

| Aspect | Detail |
| --- | --- |
| What happens | Tokens are generated one at a time, each attending to all previous tokens via the KV cache |
| Compute pattern | Memory-bound — reading the KV cache dominates; GPU compute is underutilized |
| GPU utilization | Low per-token — most time spent on memory transfers |
| Output | One token per forward pass |
| Duration | Proportional to the number of output tokens, sequential |

Key Performance Metrics​

| Metric | Definition | What Affects It | Typical Values |
| --- | --- | --- | --- |
| TTFT (Time to First Token) | Time from request to first token of response | Input length, model size, GPU speed, queue depth | 200ms -- 5s |
| TPS (Tokens Per Second) | Rate of output token generation | Model size, GPU memory bandwidth, batch size | 30 -- 100 TPS per request |
| Total Latency | TTFT + (output tokens / TPS) | All of the above | 1s -- 60s+ |
| Throughput | Total tokens/second across all concurrent requests | Batch size, GPU count, optimization | 1,000 -- 50,000 TPS per GPU |

Example calculation:

  • TTFT = 500ms
  • TPS = 50 tokens/second
  • Output length = 200 tokens
  • Total latency = 0.5s + (200 / 50) = 0.5s + 4.0s = 4.5 seconds
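The example calculation above generalizes to a one-line latency model, useful for sanity-checking SLA targets before load testing:

```python
def total_latency_s(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total latency = time to first token + sequential decode time."""
    return ttft_s + output_tokens / tps

# The worked example: 500ms TTFT, 200 output tokens at 50 tokens/second.
print(total_latency_s(0.5, 200, 50))  # 4.5 seconds
```

Inverting it is equally useful: with a P95 budget of 5s and 50 TPS, a 0.5s TTFT leaves room for roughly 225 output tokens.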

Streaming vs Non-Streaming Responses​

| Mode | Behavior | Perceived Latency | Use Case |
| --- | --- | --- | --- |
| Non-streaming | Wait for entire response, then return all at once | High (user sees nothing until done) | Background processing, APIs returning structured data |
| Streaming (SSE) | Return each token as it is generated | Low (user sees text appearing immediately) | Chatbots, interactive UIs, any user-facing application |

Infrastructure note: Streaming uses Server-Sent Events (SSE) over HTTP. Your load balancers, API gateways, and reverse proxies must support long-lived connections and chunked transfer encoding. Standard HTTP request timeouts (30s) may be too short for longer generations. Azure API Management, Application Gateway, and Front Door all have specific configurations for SSE support.

The KV Cache β€” Why It Matters for Infrastructure​

The KV (Key-Value) cache stores the attention keys and values for all processed tokens. Without it, every new output token would require reprocessing the entire sequence from scratch. With it, each new token only needs to attend to the cached values.

The tradeoff: The KV cache lives in GPU VRAM and grows linearly with context length. This is often the bottleneck that limits how many concurrent requests a GPU can serve.

KV Cache Size ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element

For a 70B-parameter model with 80 layers, 64 attention heads (full multi-head attention, where every head stores its own K and V), head dimension of 128, and FP16 precision:

KV Cache per token ≈ 2 × 80 × 64 × 128 × 2 bytes ≈ 2.62 MB per token
For 4K context: ~10.5 GB
For 32K context: ~84 GB

In practice, most recent models use grouped-query attention (GQA), which shares each KV head across several query heads. Llama 3.1 70B, for example, keeps only 8 KV heads, cutting the cache to roughly 0.33 MB per token — this is how the much lower per-request figures in Section 1.6 become achievable.

This is why serving models with long context windows requires significantly more VRAM per concurrent request, directly impacting your infrastructure density and cost.
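The KV cache formula above can be captured as a small estimator for capacity planning. A minimal sketch; parameter values are the worked example's (full multi-head attention with 64 heads) plus a GQA variant with 8 KV heads for comparison:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens x bytes,
    returned in decimal GB."""
    size_bytes = 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_element
    return size_bytes / 1e9

# Worked example: 80 layers, 64 heads, head_dim 128, FP16 (2 bytes).
per_token_mb = kv_cache_gb(80, 64, 128, 1) * 1000
print(round(per_token_mb, 2))                     # 2.62 MB per token
print(round(kv_cache_gb(80, 64, 128, 4096), 1))  # 10.7 GB for 4K context (MHA)
print(round(kv_cache_gb(80, 8, 128, 4096), 2))   # 1.34 GB with 8 GQA KV heads
```

Dividing free VRAM (after model weights) by the per-request figure gives a first-order estimate of how many concurrent requests a GPU can hold.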


1.8 Training vs Fine-Tuning vs Inference​

These three stages form the lifecycle of an LLM. As an infrastructure architect, you will primarily deal with inference, occasionally with fine-tuning, and rarely with pre-training — but understanding all three helps you plan resources and have informed conversations with AI teams.

Pre-Training​

Pre-training is where a model learns language from scratch by processing trillions of tokens from the internet, books, code repositories, and other text sources.

| Dimension | Scale |
| --- | --- |
| Data | 1--15 trillion tokens |
| Compute | 10,000 -- 50,000+ GPUs for weeks to months |
| Cost | $10M -- $500M+ |
| Duration | 1 -- 6 months |
| Who does it | OpenAI, Anthropic, Meta, Google, Mistral |
| Output | Base model weights (not yet instruction-following) |
| Your role | None — you consume the result |

Alignment: RLHF and DPO​

After pre-training, the base model can predict tokens but does not know how to follow instructions or be helpful. Alignment techniques make the model useful and safe.

| Technique | Full Name | How It Works | Compute Cost |
| --- | --- | --- | --- |
| SFT | Supervised Fine-Tuning | Train on curated instruction-response pairs | Moderate |
| RLHF | Reinforcement Learning from Human Feedback | Humans rank outputs; model learns to produce preferred responses | High (requires reward model) |
| DPO | Direct Preference Optimization | Learn from preference pairs without an explicit reward model | Lower than RLHF |
| Constitutional AI | — | Model self-critiques based on principles | Moderate |

Fine-Tuning​

Fine-tuning takes a pre-trained (and usually aligned) model and further trains it on your specific data to improve performance on your particular domain or task.

Parameter-Efficient Fine-Tuning (PEFT)​

Full fine-tuning updates all model parameters — expensive and requires as much VRAM as training. PEFT methods update only a small fraction of parameters, dramatically reducing resource needs.

| Method | What It Does | Parameters Updated | VRAM Savings | Quality |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | Updates all weights | 100% | None | Highest |
| LoRA (Low-Rank Adaptation) | Injects small trainable matrices alongside frozen weights | 0.1 -- 1% | 60--80% | Near full fine-tuning |
| QLoRA | LoRA on a quantized (4-bit) base model | 0.1 -- 1% | 80--95% | Slightly below full |
| Adapters | Small neural network modules inserted between layers | 1 -- 5% | 50--70% | Good |
| Prefix Tuning | Prepends trainable virtual tokens to the input | <1% | 70--85% | Good for specific tasks |

LoRA example: Instead of fine-tuning a 70B model (requires ~280 GB VRAM and multi-GPU setup), QLoRA lets you fine-tune it on a single A100 80GB GPU by quantizing the base model to 4-bit and training only ~0.5% of parameters.

Comparison Table: Training vs Fine-Tuning vs Inference​

| Dimension | Pre-Training | Fine-Tuning (QLoRA) | Inference |
| --- | --- | --- | --- |
| Purpose | Learn language from scratch | Specialize for a domain/task | Generate responses |
| Data | Trillions of tokens | Thousands of examples | Single prompt |
| Duration | Months | Hours to days | Milliseconds to seconds |
| GPU Count | 1,000 -- 50,000+ | 1 -- 8 | 1 -- 8 (per model instance) |
| GPU VRAM | Maximum (HBM3) | 24 -- 80 GB per GPU | 16 -- 80 GB per GPU |
| Cost | $10M -- $500M | $50 -- $10,000 | $0.001 -- $0.10 per request |
| Frequency | Once per model version | Periodically (weekly/monthly) | Continuously (every request) |
| Infrastructure pattern | Batch job (scheduled) | Batch job (scheduled) | Real-time service (always-on) |
| Your responsibility | None | Sometimes | Always |

1.9 Model Quantization​

Quantization is the process of reducing the numerical precision of a model's parameters. It is one of the most practical techniques an infrastructure architect should understand because it directly determines how large a model you can serve on a given GPU.

What Is Quantization?​

Neural network parameters are stored as floating-point numbers. The "full precision" format is FP32 (32-bit floating point), but inference works well at lower precision because the model's behavior is robust to small rounding errors.

Impact on Model Size and VRAM​

Using Llama 3.1 70B as an example:

| Precision | Bytes per Parameter | Model Size | GPU VRAM Required | Quality Impact |
| --- | --- | --- | --- | --- |
| FP32 | 4 bytes | ~280 GB | ~280 GB (4x A100 80GB) | Baseline (maximum accuracy) |
| FP16 / BF16 | 2 bytes | ~140 GB | ~140 GB (2x A100 80GB) | Negligible loss — standard for inference |
| INT8 | 1 byte | ~70 GB | ~70 GB (1x A100 80GB) | Minimal loss — widely used in production |
| INT4 | 0.5 bytes | ~35 GB | ~40 GB (with overhead) | Slight degradation — popular for cost savings |
| INT3 | 0.375 bytes | ~26 GB | ~30 GB (with overhead) | Noticeable degradation on complex tasks |

Note: Actual VRAM requirement exceeds raw model size due to KV cache, activation memory, and framework overhead. Add 10--30% buffer.

Quantization Formats​

Different quantization methods use different algorithms to minimize quality loss:

| Format | Full Name | Description | Best For |
| --- | --- | --- | --- |
| GPTQ | GPT Quantized | Post-training quantization using calibration data; GPU-optimized | GPU inference (vLLM, TGI) |
| GGUF | GPT-Generated Unified Format | Optimized for CPU and CPU+GPU hybrid inference | Local deployment, llama.cpp |
| AWQ | Activation-aware Weight Quantization | Preserves important weights based on activation patterns | High-quality INT4 on GPU |
| EETQ | Easy and Efficient Transformer Quantization | INT8 with minimal setup | Quick INT8 deployment |
| BitsAndBytes | — | Integrated into Hugging Face; supports INT8 and INT4 (NF4) | Fine-tuning with QLoRA |

Quantization Decision Guide​

Infrastructure architect's rule of thumb: Start with INT8 for production GPU workloads. Move to INT4 if you need to fit a larger model on fewer GPUs. Use FP16 if quality is paramount and you have the GPU budget. Benchmark quality on your specific use case — quantization impact varies by task.


1.10 Embeddings — The Meaning of Text

Embeddings are numerical representations of text (or images, audio, etc.) in a high-dimensional vector space. While LLMs generate text, embedding models convert text into vectors that capture semantic meaning — and these vectors are the backbone of search, retrieval, and RAG architectures.

What Are Vector Embeddings?​

An embedding model converts text into a fixed-length array of floating-point numbers (a vector). Texts with similar meanings produce vectors that are close together in this high-dimensional space.

"Deploy a Kubernetes cluster" β†’ [0.023, -0.041, 0.089, ..., 0.012]  (1536 dimensions)
"Set up a K8s cluster" β†’ [0.021, -0.039, 0.091, ..., 0.014] (very similar vector!)
"Bake a chocolate cake" β†’ [-0.087, 0.063, -0.012, ..., 0.098] (very different vector)

How Similarity Is Measured​

The most common similarity metric is cosine similarity β€” the cosine of the angle between two vectors. It ranges from -1 (opposite) to 1 (identical).

| Text Pair | Cosine Similarity | Interpretation |
| --- | --- | --- |
| "Deploy a Kubernetes cluster" vs "Set up a K8s cluster" | 0.95 | Very similar meaning |
| "Deploy a Kubernetes cluster" vs "Container orchestration platform" | 0.82 | Related concepts |
| "Deploy a Kubernetes cluster" vs "Azure Virtual Network setup" | 0.58 | Loosely related (both infra) |
| "Deploy a Kubernetes cluster" vs "Bake a chocolate cake" | 0.12 | Unrelated |
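Cosine similarity is a one-liner worth seeing concretely. A minimal sketch with tiny invented 3-dimensional "embeddings" (real models use 1024+ dimensions, and the similarity values here are not the ones in the table above):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors standing in for real embeddings:
k8s     = [0.9, 0.1, 0.0]   # "Deploy a Kubernetes cluster"
k8s_alt = [0.8, 0.2, 0.1]   # "Set up a K8s cluster"
baking  = [0.0, 0.1, 0.9]   # "Bake a chocolate cake"
print(round(cosine_similarity(k8s, k8s_alt), 2))  # close to 1 -> similar meaning
print(round(cosine_similarity(k8s, baking), 2))   # near 0 -> unrelated
```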

Embedding Dimensions​

| Model | Provider | Dimensions | Max Input Tokens | Use Case |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | 8,191 | Cost-effective general purpose |
| text-embedding-3-large | OpenAI | 3072 | 8,191 | Higher quality, more dimensions |
| text-embedding-ada-002 | OpenAI | 1536 | 8,191 | Legacy, widely deployed |
| Cohere embed-v3 | Cohere | 1024 | 512 | Multilingual, search-optimized |
| BGE-large-en | BAAI (open) | 1024 | 512 | Strong open-source option |
| E5-large-v2 | Microsoft (open) | 1024 | 512 | Microsoft's open embedding model |

More dimensions = more nuance captured, but also = more storage, more compute for similarity search, and higher memory usage in your vector database.

Embedding Models vs Generation Models​

| Aspect | Embedding Model | Generation Model (LLM) |
| --- | --- | --- |
| Input | Text (or image, audio) | Text prompt |
| Output | Fixed-length vector (e.g., 1536 floats) | Variable-length text (token by token) |
| Architecture | Usually encoder-only (BERT-family) | Usually decoder-only (GPT-family) |
| Size | Small (100M -- 1B parameters) | Large (7B -- 1T+ parameters) |
| VRAM | 1 -- 4 GB | 16 -- 800+ GB |
| Speed | Very fast (single forward pass) | Slower (sequential token generation) |
| Cost | Very cheap ($0.02 -- $0.13 per 1M tokens) | Expensive ($0.15 -- $75 per 1M tokens) |
| Use case | Search, retrieval, clustering, classification | Conversation, generation, reasoning |

Why Embeddings Are the Backbone of RAG

In a RAG (Retrieval-Augmented Generation) pipeline — covered in depth in Module 5 — embeddings are used to:

  1. Index: Convert your knowledge base documents into vectors and store them in a vector database
  2. Query: Convert the user's question into a vector using the same embedding model
  3. Retrieve: Find the most similar document vectors to the query vector
  4. Generate: Pass the retrieved documents to an LLM as context for generating an answer
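Steps 1--3 can be sketched in a few lines. Here `embed()` is a hypothetical stand-in for a call to your deployed embedding model (a trivial bag-of-characters vector, purely so the example runs), and a Python list stands in for a real vector database:

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model call. Counts letter
    frequencies -- NOT a real embedding, just enough to demo retrieval."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)) or 1.0)

# 1. Index: embed the knowledge base and store (vector, document) pairs
docs = ["How to deploy an AKS cluster",
        "Configuring Azure Virtual Networks",
        "Chocolate cake recipe"]
index = [(embed(d), d) for d in docs]

# 2. Query: embed the question with the SAME embedding model
query_vec = embed("deploy kubernetes on azure")

# 3. Retrieve: rank documents by similarity to the query vector
ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
top_doc = ranked[0][1]

# 4. Generate: top_doc would now be passed to the LLM as context
print(top_doc)
```

The real versions of steps 1 and 3 are exactly what a vector database replaces: persistent storage of the vectors and an approximate-nearest-neighbor index so retrieval stays fast at millions of documents.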

Infrastructure implications of embeddings:

| Concern | Detail |
|---|---|
| Storage | 1M documents with 1536-dim embeddings ≈ 6 GB of vector data |
| Database | Requires a vector database (Azure AI Search, Qdrant, Pinecone, Weaviate, pgvector) |
| Latency | Embedding generation is fast (~5 ms per text); similarity search adds ~10--50 ms |
| Batch processing | Initial indexing of large document sets requires batch embedding (millions of API calls) |
| Model consistency | You must use the same embedding model for indexing and querying |
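The batch-processing row deserves emphasis: embedding APIs typically accept many inputs per request, so initial indexing should be chunked into batches rather than issued as one call per document. A minimal sketch (the `embed_batch` body is a dummy stand-in for a real API call; actual batch size limits vary by provider):

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive batches so one API call embeds many texts at once."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for one embedding API request that accepts a list of
    inputs. Returns dummy 1-dim vectors, one per input text."""
    return [[float(len(t))] for t in texts]

documents = [f"doc {i}" for i in range(2500)]
vectors: list[list[float]] = []
for batch in batched(documents, batch_size=1000):  # 3 API calls instead of 2500
    vectors.extend(embed_batch(batch))

print(len(vectors))  # one vector per document
```

In a real indexing job you would add retry-with-backoff around each call (429s are common during bulk indexing) and persist each batch's vectors before moving on.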

1.11 The Infra Architect's Mental Model

Let us bring everything together into a single, unified view of how LLM inference works from an infrastructure perspective.

The Complete Request Flow

Key Metrics an Infrastructure Architect Should Track

| Category | Metric | Why It Matters | Target Range |
|---|---|---|---|
| Latency | Time to First Token (TTFT) | User-perceived responsiveness | < 1s for interactive |
| Latency | Tokens per Second (TPS) | Speed of response generation | 30 -- 80 TPS per request |
| Latency | End-to-end latency (P50, P95, P99) | SLA compliance | P95 < 5s for chatbots |
| Throughput | Requests per second | Capacity planning | Depends on model and GPU |
| Throughput | Total TPS (all requests) | GPU utilization efficiency | Higher = better GPU ROI |
| Resource | GPU VRAM utilization | Capacity and OOM prevention | 70 -- 90% (leave headroom) |
| Resource | GPU compute utilization | Efficiency of serving | Prefill: 80%+; Decode: 30--60% |
| Resource | KV cache memory usage | Concurrent request capacity | Monitor per-request growth |
| Cost | Cost per 1M tokens | Budget tracking | Varies by model and deployment |
| Cost | Cost per request (average) | Unit economics | Track input + output separately |
| Reliability | Error rate (429s, 500s, timeouts) | Service health | < 0.1% |
| Reliability | Queue depth | Whether capacity is sufficient | Growing queue = scale up |
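The two latency metrics fall straight out of timestamps around a streaming response: TTFT is the gap from request to first token, and TPS is tokens after the first divided by the remaining (decode) time. A sketch with a simulated token stream standing in for your serving endpoint:

```python
import time

def simulated_token_stream(n_tokens: int, delay_s: float):
    """Stand-in for a streaming inference response from a serving endpoint."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"token{i}"

start = time.perf_counter()
ttft = None
count = 0
for token in simulated_token_stream(n_tokens=20, delay_s=0.01):
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start          # Time to First Token
    count += 1
end = time.perf_counter()

# TPS over the decode phase: tokens after the first / time after the first
tps = (count - 1) / (end - start - ttft) if count > 1 else 0.0
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.0f}")
```

Measuring TTFT and decode TPS separately matters because, as noted in the takeaways below, they are driven by different phases (prefill vs decode) and are optimized differently.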

Resource Requirements Checklist

Use this checklist when planning infrastructure for an LLM workload:

| Decision | Questions to Ask | Impact |
|---|---|---|
| Model selection | Which model? How many parameters? Open or proprietary? | Determines compute tier |
| Precision | FP16, INT8, or INT4? | Determines VRAM per instance |
| Context window | What context length do you need? (4K? 32K? 128K?) | Determines KV cache VRAM per request |
| Concurrency | How many simultaneous requests? | Determines number of GPU instances |
| Latency SLA | What TTFT and TPS are acceptable? | Determines GPU tier (A10G vs A100 vs H100) |
| Throughput | How many requests per minute/hour? | Determines horizontal scaling |
| Availability | What uptime is required? | Determines redundancy (multi-region, failover) |
| Data residency | Where must data be processed? | Determines Azure region and compliance |
| Cost model | Pay-as-you-go per token vs provisioned (PTU) vs self-hosted? | Determines deployment type |
| Streaming | Do you need streaming responses? | Impacts gateway, proxy, and load balancer config |
| Fine-tuning | Will you fine-tune? How often? | Determines training infrastructure (periodic) |
| Embedding | Do you need embeddings (for RAG)? | Separate model deployment, vector DB infrastructure |
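Several rows of that checklist (precision, context window, concurrency) feed directly into a VRAM estimate. A rough calculator using the common rules of thumb — weights ≈ parameters x bytes per parameter; KV cache ≈ 2 (K and V) x layers x KV heads x head dimension x context tokens x bytes per value. The model shape below is Llama 3.1 8B-like; exact numbers vary with architecture and serving framework overhead:

```python
def weights_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weights: parameters (in billions) x bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb_per_request(layers: int, kv_heads: int, head_dim: int,
                            context_tokens: int, bytes_per_value: float = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim x context x bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Llama 3.1 8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128
weights = weights_vram_gb(8, 2)                    # FP16 = 2 bytes/param -> ~16 GB
kv_8k = kv_cache_gb_per_request(32, 8, 128, 8192)  # ~1 GB per 8K-token request
print(f"weights ~{weights:.0f} GB, KV cache per 8K request ~{kv_8k:.2f} GB")
```

On a 24 GB A10G, 16 GB of FP16 weights leaves roughly 8 GB of headroom, i.e. only a handful of concurrent full-context requests — which is exactly why context window and concurrency appear in the checklist as first-class sizing inputs.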

Quick GPU Reference for Common Models

| Model | Parameters | Quantization | Min GPU | Recommended GPU | Approx. TPS |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | FP16 | 1x A10G (24GB) | 1x A10G | 60 -- 100 |
| Llama 3.1 8B | 8B | FP16 | 1x A10G (24GB) | 1x A100 (40GB) | 50 -- 80 |
| Llama 3.1 8B | 8B | INT4 | 1x T4 (16GB) | 1x A10G (24GB) | 40 -- 60 |
| Mistral 7B | 7B | FP16 | 1x A10G (24GB) | 1x A100 (40GB) | 50 -- 80 |
| Llama 3.1 70B | 70B | FP16 | 2x A100 (80GB) | 4x A100 (80GB) | 20 -- 40 |
| Llama 3.1 70B | 70B | INT4 | 1x A100 (80GB) | 2x A100 (80GB) | 25 -- 45 |
| Llama 3.1 405B | 405B | FP16 | 16x A100 (80GB, 2 nodes) | 16x H100 (80GB, 2 nodes) | 10 -- 20 |
| Llama 3.1 405B | 405B | INT4 | 4x A100 (80GB) | 4x H100 (80GB) | 15 -- 25 |

TPS values are per-request on a single model instance. Actual performance depends on batch size, context length, serving framework (vLLM, TGI, TensorRT-LLM), and optimization settings.

The Serving Stack — What Sits Between the Model and Your Users

Managed (Azure OpenAI) vs Self-Hosted decision:

| Factor | Azure OpenAI (Managed) | Self-Hosted (AKS + vLLM) |
|---|---|---|
| Setup complexity | Low (deploy in minutes) | High (GPU provisioning, model loading, optimization) |
| Model choice | Limited to catalog (GPT-4o, GPT-4, etc.) | Any open model (Llama, Mistral, Phi, etc.) |
| Scaling | Automatic (with PTU or token limits) | Manual (HPA, node autoscaler) |
| Cost model | Per-token or PTU reservation | GPU VM cost (always-on or spot) |
| Data control | Data stays in Azure, no training on your data | Full control — your cluster, your data |
| Customization | Limited (system prompts, fine-tuning for some models) | Full (any quantization, any serving config) |
| SLA | 99.9% (Azure SLA) | Depends on your implementation |
| Best for | Production apps using supported models | Open models, cost optimization, maximum control |

Key Takeaways

  1. LLMs are next-token predictors — every response is generated one token at a time through probability distributions shaped by generation parameters.

  2. Tokens are the currency — they determine cost (pricing per million tokens), latency (output tokens are sequential), and infrastructure sizing (context length drives VRAM requirements via KV cache).

  3. Transformers changed everything because self-attention enables parallelism. This is why GPUs (built for parallel matrix operations) are essential for AI workloads.

  4. Generation parameters are your control panel — temperature, top-p, frequency penalty, presence penalty, and max tokens give you precise control over output behavior. Always set max_tokens in production.

  5. Context windows are not free — larger context means more VRAM per request, fewer concurrent users per GPU, and potential attention dilution. Right-size your context for the workload.

  6. Inference has two phases — the compute-bound prefill phase and the memory-bound decode phase. TTFT is driven by prefill; TPS is driven by decode. Optimize them differently.

  7. Quantization is your best friend for infrastructure efficiency — INT8 and INT4 can reduce VRAM requirements by 2--4x with acceptable quality loss, enabling larger models on fewer GPUs.

  8. Embeddings are different from generation — they are small, fast, cheap encoder models that convert text to vectors. They power the retrieval half of RAG pipelines.

  9. You are already running AI infrastructure — whether through Azure OpenAI, Copilot integrations, or self-hosted models. Understanding these foundations lets you architect it deliberately.

  10. Capacity planning now includes VRAM — alongside CPU, RAM, disk, and network, GPU memory is a first-class resource that determines what models you can serve, at what concurrency, with what latency.


Next: Module 2: LLM Landscape — Understand the major model families (GPT, Claude, Llama, Gemini, Phi), how to compare them with benchmarks, and how to choose the right model for your workload.