
Module 9: AI Infrastructure for Architects — GPU, Scaling & Production

Duration: 60-90 minutes | Level: Strategic | Audience: Cloud Architects, Platform Engineers, Infrastructure Engineers | Last Updated: March 2026


9.1 The AI Compute Stack

Everything you have learned in Modules 1 through 8 — embeddings, vector search, RAG pipelines, fine-tuning, prompt engineering — eventually runs on infrastructure. This module connects the dots between what AI does and what AI needs from a platform perspective.

Why AI Needs Different Infrastructure

Traditional enterprise applications and AI workloads have fundamentally different compute profiles. Understanding this gap is the first step to making sound infrastructure decisions.

| Dimension | Traditional App (e.g., Web API) | AI Inference Workload | AI Training Workload |
|---|---|---|---|
| Compute type | CPU-bound (sequential logic) | GPU-bound (parallel matrix ops) | GPU-bound (massive parallelism) |
| Memory profile | Low-moderate (MB-GB) | High (GB of VRAM per model) | Very high (hundreds of GB VRAM) |
| Scaling pattern | Horizontal scale-out, stateless | Complex (model in memory, stateful) | Distributed multi-node |
| Latency profile | Sub-100ms typical | 200ms-30s (depends on tokens) | Hours to weeks |
| Network needs | Standard throughput | Moderate (API calls, model loading) | InfiniBand / RDMA for multi-node |
| Storage needs | Database-centric | Model weight files (GB-TB) | Training datasets (TB-PB) |
| Cost driver | CPU hours, memory | GPU hours, VRAM, tokens consumed | GPU hours at massive scale |
| Failure mode | Request fails, retry | Model hallucination, timeout, OOM | Training divergence, checkpoint loss |

Key Insight

The single biggest mistake architects make is treating AI workloads like traditional web services. AI workloads are memory-bound, not CPU-bound. Scaling decisions revolve around GPU memory (VRAM), not CPU cores.

CPU vs GPU vs TPU vs NPU

| Processor | Architecture | Strengths | AI Use Case | Azure Availability |
|---|---|---|---|---|
| CPU | Few powerful cores, sequential | General-purpose logic, low-parallelism tasks | Small model inference, preprocessing, orchestration | Every Azure VM |
| GPU | Thousands of small cores, massively parallel | Matrix multiplication, parallel processing | Model training, inference for medium-large models | N-series VMs, AKS GPU pools |
| TPU | Google custom ASIC for tensor operations | Optimized for TensorFlow workloads | Training at scale (Google Cloud only) | Not available on Azure |
| NPU | On-device neural processing unit | Low-power, edge inference | On-device AI (phones, laptops, IoT) | Azure Percept (retired), edge devices |
| FPGA | Field-programmable gate array | Custom low-latency inference | Specialized inference acceleration | Azure FPGA (limited availability) |

What GPUs Actually Do for AI

At its core, every neural network — from a simple classifier to GPT-4o — performs the same fundamental operation billions of times: matrix multiplication.

Output = Input × Weights + Bias

A single transformer layer in a large language model performs multiple matrix multiplications:

  1. Query, Key, Value projections — three separate matrix multiplications
  2. Attention score computation — Q × Kᵀ
  3. Attention-weighted sum — scores × V
  4. Feed-forward network — two more large matrix multiplications

A model like GPT-4o with ~200 billion parameters (estimated) performs these operations across hundreds of layers, for every token generated. A CPU processes these sequentially. A GPU like the NVIDIA H100 has 16,896 CUDA cores that execute these matrix operations in parallel, delivering 100-1000x speedup over CPUs for this type of workload.
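As a rough illustration, the per-token cost of these matrix multiplications can be sketched in a few lines. Since GPT-4o's internals are not public, the sketch uses Llama 3.1 70B's published shapes (d_model=8192, d_ff=28672, 80 layers) as a stand-in, and it ignores the attention score/weighted-sum terms, so treat it as a back-of-envelope estimate only:

```python
# Back-of-envelope FLOPs per generated token for the matmuls listed above,
# counting ~2 FLOPs per multiply-accumulate. Attention-score terms omitted.

def layer_flops_per_token(d_model: int, d_ff: int) -> int:
    qkv = 3 * 2 * d_model * d_model   # Q, K, V projections
    attn_out = 2 * d_model * d_model  # attention output projection
    ffn = 2 * 2 * d_model * d_ff      # two feed-forward matmuls
    return qkv + attn_out + ffn

flops = 80 * layer_flops_per_token(8192, 28672)
print(f"~{flops / 1e9:.0f} GFLOPs per token")  # ~118 GFLOPs
```

Roughly 2 × parameter-count FLOPs per token, repeated for every token generated — which is exactly the workload a GPU's parallel cores absorb.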

GPU Programming Frameworks

As an architect, you will not write CUDA code — but you will hear these terms in vendor discussions and capacity planning conversations.

| Framework | Vendor | What It Does | Why You Care |
|---|---|---|---|
| CUDA | NVIDIA | GPU programming platform, the de facto standard | 95%+ of AI frameworks depend on it; locks you to NVIDIA GPUs |
| cuDNN | NVIDIA | Optimized deep learning primitives on CUDA | Accelerates training and inference on NVIDIA hardware |
| TensorRT | NVIDIA | Model optimization and inference engine | Converts models to optimized formats for production inference |
| ROCm | AMD | Open-source GPU compute platform | Alternative to CUDA for AMD GPUs (MI300X); growing ecosystem |
| oneAPI | Intel | Unified programming model across CPU/GPU/FPGA | Intel GPU support, less mature for AI than CUDA |
| ONNX Runtime | Microsoft | Cross-platform model inference engine | Hardware-agnostic inference; integrates with Azure ML |

Architect Takeaway

NVIDIA dominates AI infrastructure because of the CUDA ecosystem, not just hardware specs. When evaluating AMD alternatives (which can be cheaper), verify that your model serving stack supports ROCm. Most production inference servers (vLLM, TensorRT-LLM) have strong CUDA support; ROCm support varies.


9.2 Azure GPU Compute Options

Azure offers GPU compute across a spectrum from fully self-managed VMs to fully managed services. Your choice depends on control requirements, operational maturity, and cost sensitivity.

N-Series VMs (IaaS — Full Control)

These are standard Azure VMs with attached GPUs. You manage the OS, drivers, CUDA toolkit, and model serving software.

NC-Series — Training and Inference

| VM SKU | GPU | GPU Count | VRAM per GPU | Total VRAM | vCPUs | RAM | Use Case | Est. Price/hr (Pay-As-You-Go) |
|---|---|---|---|---|---|---|---|---|
| NC6s_v3 | NVIDIA V100 | 1 | 16 GB | 16 GB | 6 | 112 GB | Dev/test inference | ~$3.06 |
| NC24s_v3 | NVIDIA V100 | 4 | 16 GB | 64 GB | 24 | 448 GB | Multi-GPU training | ~$12.24 |
| NC24ads_A100_v4 | NVIDIA A100 | 1 | 80 GB | 80 GB | 24 | 220 GB | Large model inference | ~$3.67 |
| NC48ads_A100_v4 | NVIDIA A100 | 2 | 80 GB | 160 GB | 48 | 440 GB | Training + multi-GPU inference | ~$7.35 |
| NC96ads_A100_v4 | NVIDIA A100 | 4 | 80 GB | 320 GB | 96 | 880 GB | Large-scale training | ~$14.69 |

ND-Series — Large-Scale Training

| VM SKU | GPU | GPU Count | VRAM per GPU | Total VRAM | Interconnect | Use Case | Est. Price/hr |
|---|---|---|---|---|---|---|---|
| ND96asr_v4 | NVIDIA A100 | 8 | 40 GB | 320 GB | InfiniBand | Distributed training | ~$27.20 |
| ND96amsr_A100_v4 | NVIDIA A100 | 8 | 80 GB | 640 GB | InfiniBand | Large model training | ~$32.77 |
| ND96isr_H100_v5 | NVIDIA H100 | 8 | 80 GB | 640 GB | InfiniBand 400Gb/s | Frontier model training | ~$98.32 |
| ND96isr_H200_v5 | NVIDIA H200 | 8 | 141 GB | 1,128 GB | InfiniBand 400Gb/s | Largest model training | Contact Microsoft |
| ND96is_MI300X_v5 | AMD MI300X | 8 | 192 GB | 1,536 GB | InfiniBand | AMD-based training | ~$75.00 |

NV-Series — Visualization

| VM SKU | GPU | Use Case |
|---|---|---|
| NV-series v3 | NVIDIA Tesla M60 | Remote visualization, VDI |
| NVads A10 v5 | NVIDIA A10 | Graphics + light inference |

Regional Availability

GPU VMs are NOT available in every Azure region. NC A100 and ND H100 SKUs are concentrated in specific regions (East US, West US 2/3, South Central US, West Europe, Sweden Central). Always verify availability before designing your architecture. Use az vm list-skus --location <region> --resource-type virtualMachines --query "[?contains(name,'NC')]" to check.

Azure Managed GPU Compute

| Service | GPU Management | Scaling | Best For |
|---|---|---|---|
| Azure OpenAI Service | Fully managed (no GPU visibility) | PTU or PAYG rate limits | GPT-4o, GPT-4.1, o3, o4-mini inference |
| Azure AI Foundry Serverless | Fully managed | Auto-scaled | Llama, Mistral, Phi, Cohere models |
| Azure ML Managed Endpoints | Semi-managed (pick VM SKU) | Manual or auto-scale | Custom models, BYOM |
| Azure ML Compute Clusters | Semi-managed (pick VM SKU) | Auto-scale 0 to N nodes | Training jobs, batch inference |

AKS with GPU Node Pools

For teams with Kubernetes expertise, AKS GPU node pools offer the best balance of control and operational efficiency.

Architecture:

Key configuration for GPU node pools:

  • Install the NVIDIA device plugin as a DaemonSet — it exposes GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
  • Use nodeSelector or nodeAffinity to pin GPU workloads to GPU nodes
  • Set resource requests: nvidia.com/gpu: 1 (or more) per pod
  • Enable cluster autoscaler on GPU node pools (scale from 0 to save costs)

GPU Sharing Strategies on AKS:

| Strategy | How It Works | Use Case | Trade-off |
|---|---|---|---|
| Dedicated GPU | 1 pod = 1 full GPU | Large models, predictable latency | Expensive, low utilization |
| MIG (Multi-Instance GPU) | Partition an A100/H100 into isolated slices | Multiple small models on one GPU | Hardware isolation, fixed partitions |
| Time-Slicing | Multiple pods share one GPU via time-division | Dev/test, light inference workloads | No memory isolation, unpredictable latency |
| MPS (Multi-Process Service) | CUDA-level multi-process sharing | Multiple inference processes | Better than time-slicing, still shared |

9.3 GPU Memory (VRAM) — The Real Bottleneck

VRAM — Video Random Access Memory on the GPU — is the single most important resource for AI workloads. It determines which models you can run, how many concurrent requests you can serve, and how fast inference completes.

Why VRAM Matters More Than Compute

When you load a model for inference, the entire model weights must reside in VRAM. If the model does not fit, it simply will not run. There is no graceful degradation — the process crashes with an out-of-memory (OOM) error.

Model Size to VRAM Requirements

The VRAM required depends on the model parameter count and the numerical precision (quantization level) used.

| Model | Parameters | FP32 (full) | FP16 / BF16 | INT8 | INT4 | Minimum GPU |
|---|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 15.2 GB | 7.6 GB | 3.8 GB | 1.9 GB | 1x A10 (INT8) |
| Llama 3.1 8B | 8B | 32 GB | 16 GB | 8 GB | 4 GB | 1x A100 40GB (FP16) |
| Mistral 7B | 7.3B | 29.2 GB | 14.6 GB | 7.3 GB | 3.7 GB | 1x A10 24GB (INT4) |
| Llama 3.1 70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB | 2x A100 80GB (FP16) |
| Llama 3.1 405B | 405B | 1,620 GB | 810 GB | 405 GB | 203 GB | 8x H100 80GB (INT4) |
| GPT-4 (estimated) | ~1.8T (MoE) | N/A | ~900 GB (active) | ~450 GB | N/A | Multi-node H100 cluster |

The Rule of Thumb

FP16 VRAM (GB) = Parameters (B) × 2. A 7B parameter model needs approximately 14 GB of VRAM in FP16 precision. Each quantization step roughly halves the requirement: INT8 = Parameters × 1, INT4 = Parameters × 0.5.
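The rule of thumb translates directly into a small helper. This is a sketch of the arithmetic above, not an allocator-accurate sizing tool:

```python
# VRAM rule of thumb as code: bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate VRAM (GB) for model weights alone -- no KV cache, no overhead."""
    return params_billion * BYTES_PER_PARAM[precision]

for p in ("fp32", "fp16", "int8", "int4"):
    # matches the Llama 3.1 70B row in the table: 280 / 140 / 70 / 35 GB
    print(f"Llama 3.1 70B @ {p}: {weight_vram_gb(70, p):g} GB")
```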

But Wait — You Need More Than Just Model Weights

VRAM usage during inference includes:

| Component | VRAM Usage | Notes |
|---|---|---|
| Model weights | Fixed (see table above) | Loaded once, stays in memory |
| KV cache | Variable, grows with context length and batch size | The hidden VRAM consumer |
| Activation memory | Small during inference | Larger during training |
| CUDA/framework overhead | 500 MB - 2 GB | PyTorch, CUDA context |

The KV cache is particularly important. For each token in the context window, the model stores key-value pairs across all attention layers. For a 70B model with 80 layers and a 128K context window, the KV cache alone can consume 40+ GB of VRAM per concurrent request with a long context.
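The KV-cache arithmetic behind that claim can be sketched as follows, using Llama 3.1 70B's published geometry (80 layers, 8 KV heads of dimension 128 under grouped-query attention) at FP16:

```python
# FP16 KV-cache estimate: two tensors (K and V) stored per layer, per token.
def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2, batch: int = 1) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch * context_len * per_token / 1024**3

print(f"{kv_cache_gb(128 * 1024):.0f} GB for one full 128K-token request")  # 40 GB
```

The per-request figure scales linearly with batch size, which is why concurrency, not just model size, drives VRAM sizing.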

Multi-GPU Setups​

When a model does not fit on a single GPU, you must split it across multiple GPUs.

| Strategy | How It Works | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Split individual layers across GPUs; each GPU holds a slice of every layer | Single-node multi-GPU (fast NVLink/NVSwitch interconnect required) |
| Pipeline Parallelism (PP) | Assign different layers to different GPUs; data flows through sequentially | Multi-node setups (tolerates slower interconnect) |
| TP + PP combined | TP within a node, PP across nodes | Very large models across multiple nodes |
| Expert Parallelism | For MoE models, place different experts on different GPUs | Mixture-of-Experts architectures |

9.4 Model Serving Infrastructure

Choosing the right inference server is critical for production AI workloads. The server determines throughput, latency, and cost efficiency.

Inference Servers Compared​

| Server | Vendor | Key Feature | Best For | Azure Integration |
|---|---|---|---|---|
| vLLM | UC Berkeley / Community | PagedAttention, continuous batching | High-throughput LLM serving | AKS, Azure ML |
| TensorRT-LLM | NVIDIA | Maximum NVIDIA GPU utilization | Lowest latency on NVIDIA hardware | AKS, Azure ML |
| Triton Inference Server | NVIDIA | Multi-model, multi-framework serving | Serving multiple model types together | AKS, Azure ML |
| Ollama | Community | Simple local model running | Dev/test, single-user | Local dev, small VMs |
| SGLang | Stanford / Community | Structured generation, RadixAttention | Structured output workloads | AKS |
| Azure ML Online Endpoints | Microsoft | Managed deployment + scaling | Production with minimal ops | Native Azure |

Batching Strategies

Batching is how inference servers process multiple requests simultaneously to maximize GPU utilization.

| Strategy | How It Works | Throughput | Latency | Implementation |
|---|---|---|---|---|
| No batching | One request at a time | Very low (GPU idle 80%+) | Lowest per-request | Naive implementation |
| Static batching | Collect N requests, process together | Moderate | High (wait for batch to fill) | Basic servers |
| Dynamic batching | Batch requests within a time window | Good | Moderate (configurable wait) | Triton |
| Continuous batching | Insert new requests into running batch as slots free | Excellent | Low (no waiting) | vLLM, TensorRT-LLM |

PagedAttention — Why It Matters

Traditional inference servers allocate a contiguous block of VRAM for each request's KV cache based on the maximum possible sequence length. This wastes enormous amounts of memory because most requests use far less than the maximum context.

PagedAttention (introduced by vLLM) borrows the concept of virtual memory paging from operating systems:

  • KV cache is stored in non-contiguous pages (blocks)
  • Pages are allocated on demand as the sequence grows
  • Freed pages are returned to a shared pool
  • Result: 2-4x more concurrent requests on the same hardware
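The paging idea can be illustrated with a toy allocator. This sketch shows only on-demand page allocation and reuse from a shared pool; real vLLM additionally maintains logical-to-physical block tables consumed by its attention kernels:

```python
# Toy paged KV-cache allocator: fixed-size pages, allocated only when a
# sequence actually grows into them, returned to a shared pool on release.
class PagePool:
    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # physical page ids
        self.pages: dict[str, list[int]] = {}   # seq_id -> allocated pages
        self.tokens: dict[str, int] = {}         # seq_id -> token count

    def append_token(self, seq_id: str) -> None:
        n = self.tokens.get(seq_id, 0)
        if n % self.page_size == 0:          # last page full (or first token)
            if not self.free:
                raise MemoryError("page pool exhausted")
            self.pages.setdefault(seq_id, []).append(self.free.pop())
        self.tokens[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        self.free.extend(self.pages.pop(seq_id, []))   # pages return to pool
        self.tokens.pop(seq_id, None)

pool = PagePool(num_pages=4)
for _ in range(20):          # 20 tokens with 16-token pages -> only 2 pages
    pool.append_token("req-a")
pool.release("req-a")
print(len(pool.free))        # prints 4: all pages back in the shared pool
```

Because no request reserves its maximum possible context up front, freed pages immediately back other requests, which is where the 2-4x concurrency gain comes from.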

Speculative Decoding

A technique where a smaller, faster "draft" model generates several candidate tokens, and the larger "target" model verifies them in a single forward pass. Because the target model checks a batch of tokens instead of generating one at a time, this can speed up generation by 2-3x with minimal quality loss.

Model Serving on AKS — Architecture


9.5 Azure AI Infrastructure Services

Azure provides multiple levels of abstraction for deploying AI models. The right choice depends on your team's operational maturity, performance needs, and cost constraints.

Service Comparison

| Service | What You Manage | What Azure Manages | Models Available | Scaling | Cost Model |
|---|---|---|---|---|---|
| Azure OpenAI | Nothing (API calls only) | Everything (infra, model, scaling) | GPT-4o, GPT-4.1, o3, o4-mini, DALL-E, Whisper | PTU or PAYG rate limits | Per-token or per-PTU/hr |
| AI Foundry Serverless | Nothing (API calls only) | Everything | Llama, Mistral, Phi, Cohere, Jamba | Auto-scaled | Per-token |
| AI Foundry Managed Compute | Model selection, config | Infrastructure, deployment | Any supported model | Manual + auto | Per-VM-hour |
| Azure ML Online Endpoints | Model, container, config | VM provisioning, load balancing | Any model (BYOM) | Manual + auto | Per-VM-hour |
| Azure Container Apps (GPU) | Container, model serving stack | Infrastructure, scaling | Any (containerized) | Auto (KEDA) | Per-second GPU usage |
| AKS with GPU | Everything (K8s + model stack) | VM provisioning | Any | Full control | Per-VM-hour |

Decision Tree: When to Use Which

Architect Recommendation

For most enterprise scenarios, start with Azure OpenAI Service for GPT-family models and AI Foundry Serverless for open-source models. Move to self-managed infrastructure (AKS + vLLM) only when you need: (1) cost optimization at very high volume, (2) specific model versions/configurations not available as managed services, or (3) data residency controls beyond what managed services offer.


9.6 Networking for AI Workloads

AI workloads introduce networking requirements that many traditional architects have not encountered. From InfiniBand for distributed training to private endpoints for inference services, networking decisions can make or break your AI platform.

Bandwidth Requirements

| Operation | Data Volume | Latency Sensitivity | Network Requirement |
|---|---|---|---|
| Model loading (cold start) | 1 GB - 800 GB | Tolerant (startup only) | High throughput (Azure Blob → VM) |
| Real-time inference | KB per request | Very sensitive (<100ms network) | Low latency, private endpoint |
| Batch inference | GB of input data | Tolerant | High throughput |
| Distributed training (gradient sync) | GB per step, continuously | Extremely sensitive | InfiniBand / RDMA required |
| RAG retrieval (search → LLM) | KB per query | Sensitive (adds to TTFT) | Low latency between services |

InfiniBand for Distributed Training

When training models across multiple GPU nodes, each node must synchronize gradients after every training step. Standard Ethernet (even 100 Gbps) introduces unacceptable latency. Azure ND-series VMs include InfiniBand networking:

  • ND A100 v4: 200 Gbps HDR InfiniBand
  • ND H100 v5: 400 Gbps NDR InfiniBand (3.2 Tbps bisection bandwidth across 8 GPUs)
  • Enables RDMA (Remote Direct Memory Access) — GPU-to-GPU communication bypassing the CPU and OS kernel entirely
  • Required for efficient multi-node training with frameworks like DeepSpeed, Megatron-LM

Private Endpoints for AI Services

Every AI service on Azure supports private endpoints. For enterprise workloads, always deploy AI services on private endpoints.

| Service | Private Endpoint Support | DNS Zone |
|---|---|---|
| Azure OpenAI | Yes | privatelink.openai.azure.com |
| Azure AI Search | Yes | privatelink.search.windows.net |
| Azure AI Foundry | Yes | privatelink.api.azureml.ms |
| Azure ML Workspace | Yes | privatelink.api.azureml.ms |
| Azure Cosmos DB | Yes | privatelink.documents.azure.com |
| Azure Blob Storage | Yes | privatelink.blob.core.windows.net |

Latency Chain for RAG Applications

In a RAG-based application, the end-to-end latency is the sum of every hop in the chain. Network latency adds up quickly.

Latency optimization strategies:

  • Co-locate services in the same Azure region (eliminates cross-region latency)
  • Use private endpoints (bypasses public internet routing)
  • Enable streaming responses (user sees first token faster)
  • Use APIM connection pooling (reduces TLS handshake overhead)
  • Consider Azure Front Door for global user distribution

VNET Integration for AI Services

| Service | VNET Integration Method | Outbound Control |
|---|---|---|
| Azure OpenAI | Private endpoint + disable public access | NSG on subnet |
| Azure AI Search | Private endpoint + shared private link | Service-managed |
| App Service / Functions | VNET integration (outbound) + private endpoint (inbound) | NSG + Route Table |
| AKS | CNI networking, private cluster | NSG + Azure Firewall |
| Azure ML | Managed VNET (workspace-level) or custom VNET | NSG + UDR |

9.7 Storage for AI Workloads

AI workloads interact with storage differently than traditional applications. Model weights, training datasets, vector indexes, and caching layers each have distinct requirements.

Storage Tiers for AI

| Data Type | Volume | Access Pattern | Recommended Storage | Throughput Need |
|---|---|---|---|---|
| Model weights | 1 GB - 800 GB per model | Read-heavy, loaded at cold start | Azure Blob (Hot tier) | High (fast cold starts) |
| Training datasets | GB to PB | Sequential read during training | ADLS Gen2 | Very high (parallel reads) |
| Vector indexes | GB to TB | Random read, low-latency queries | Azure AI Search / Cosmos DB | Low latency (<10ms) |
| Prompt/response cache | MB to GB | High-frequency read/write | Azure Cache for Redis | Sub-millisecond |
| Conversation history | MB to GB per user | Read/write per session | Cosmos DB | Low latency |
| Fine-tuning data | MB to GB (JSONL files) | Read once during training | Azure Blob | Moderate |
| Document ingestion | GB to TB (PDFs, docs) | Write-once, read during indexing | Azure Blob + AI Document Intelligence | Moderate |

Model Weight Storage Best Practices

  • Store model weights in Azure Blob Storage (Hot tier) with LRS or ZRS redundancy
  • Use managed identity for access — never embed storage keys in containers
  • For AKS workloads, consider Azure Blob CSI driver to mount weights as a volume
  • For Azure ML, the model registry handles storage automatically
  • Pre-download weights into the container image for fastest cold starts (tradeoff: larger image size)

Caching for AI Workloads

| Cache Type | What It Caches | Tool | Savings |
|---|---|---|---|
| Semantic cache | Similar prompts → cached LLM responses | Azure Cache for Redis + embedding similarity | 50-90% cost reduction for repeated queries |
| Retrieval cache | Search query → cached search results | Azure Cache for Redis | Reduced AI Search costs, lower latency |
| Embedding cache | Text → cached embedding vector | Azure Cache for Redis | Avoid re-embedding identical text |
| KV cache (model-level) | Prefix tokens → cached internal state | vLLM prefix caching | 2-5x throughput for shared-prefix workloads |

9.8 Scaling AI Workloads

Scaling AI workloads is fundamentally different from scaling traditional web applications. You cannot simply "add more instances" without understanding the memory, compute, and cost implications.

Vertical vs Horizontal Scaling

| Approach | How | When | Limitation |
|---|---|---|---|
| Vertical | Bigger GPU VM (A10 → A100 → H100) | Model needs more VRAM, need lower latency | VM SKU ceiling, single point of failure |
| Horizontal | More instances of same VM | Need more throughput (requests/sec) | Model must fit on each instance, load balancing complexity |
| Distributed | Multi-GPU across nodes | Model too large for single node | InfiniBand required, complex orchestration |

Azure OpenAI Scaling

Azure OpenAI uses a unique scaling model based on rate limits (PAYG) or provisioned throughput units (PTU).

| Scaling Model | How It Works | When to Use | Cost |
|---|---|---|---|
| PAYG (Pay-As-You-Go) | Per-token billing, shared capacity, rate-limited (TPM/RPM) | Development, variable/unpredictable workloads | ~$2.50/1M input tokens (GPT-4o) |
| PTU (Provisioned Throughput) | Reserved capacity, guaranteed throughput, monthly commitment | Production with predictable, high-volume workloads | ~$2/PTU/hr (model-dependent) |
| PTU-M (Managed) | Azure-managed PTU with auto-scaling | Production, want managed experience | Premium over standard PTU |

PTU Sizing Rule of Thumb: 1 PTU corresponds to roughly 6 requests per minute for GPT-4o with ~500 input + ~200 output tokens. Actual capacity depends heavily on prompt/completion lengths, so always use the Azure OpenAI capacity calculator for accurate sizing.

AI Search Scaling

| Dimension | What It Scales | How | Impact |
|---|---|---|---|
| Replicas | Query throughput (QPS) and availability | Add replicas (1-12, or more at higher tiers) | Each replica is a full copy of all indexes |
| Partitions | Data capacity (index size) | Add partitions (1, 2, 3, 4, 6, or 12) | Data is sharded across partitions |
| Tier | Overall limits (index count, size, features) | Upgrade SKU (Free → Basic → S1 → S2 → S3 → L1 → L2) | Higher tiers unlock larger indexes and more features |

Search SLA

Azure AI Search requires 2+ replicas for read SLA (99.9%) and 3+ replicas for read/write SLA (99.9%). The default single-replica deployment has no SLA.

Auto-Scaling Patterns for Inference

Queue-Based Scaling for Async Inference

For workloads that can tolerate latency (document processing, batch analysis, content generation), use a queue-based pattern to decouple request ingestion from processing.

| Component | Role | Azure Service |
|---|---|---|
| Queue | Buffer incoming requests | Azure Service Bus / Storage Queue |
| Workers | Pull from queue, call model, store results | Azure Functions / AKS pods |
| Results store | Store completed inferences | Cosmos DB / Blob Storage |
| Scaler | Scale workers based on queue depth | KEDA (AKS) / Azure Functions auto-scale |
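With KEDA on AKS, the scaler component above is typically a ScaledObject keyed to queue depth. A hedged sketch follows; the resource names and the workload-identity wiring (the TriggerAuthentication) are illustrative, not prescribed:

```yaml
# KEDA ScaledObject sketch: scale GPU worker pods on Service Bus queue
# depth, down to zero when the queue is empty (no idle GPU cost).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-workers
spec:
  scaleTargetRef:
    name: inference-worker      # the worker Deployment
  minReplicaCount: 0            # scale to zero between bursts
  maxReplicaCount: 8
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: inference-requests
        messageCount: "20"      # target backlog per replica
      authenticationRef:
        name: servicebus-auth   # TriggerAuthentication (workload identity)
```

Remember the trade-off noted in the cost section: scaling GPU nodes from zero adds minutes of cold-start latency, which is acceptable here precisely because the queue absorbs the wait.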

Load Balancing Across Model Endpoints with APIM

APIM is the recommended AI Gateway for load balancing across multiple Azure OpenAI or model endpoints. This connects directly to Module 5 (Azure OpenAI) and is a critical architectural pattern.


9.9 High Availability for AI

AI services require the same HA patterns as any critical system, with additional considerations around model-specific behavior and capacity allocation.

Multi-Region Azure OpenAI Deployment

| Pattern | Description | Complexity | Availability |
|---|---|---|---|
| Single region, PAYG | One deployment in one region | Low | Subject to throttling (429) |
| Single region, PTU | Reserved capacity in one region | Low-Medium | No throttling, single-region risk |
| Multi-region, APIM load balanced | PTU in 2+ regions, APIM routes traffic | Medium | High — survives region outage |
| Multi-region, PTU + PAYG overflow | PTU primary, PAYG in other region as overflow | Medium | Very high — handles spikes + outages |
| Multi-region, multi-model fallback | GPT-4o primary → GPT-4.1-mini fallback → open-source fallback | High | Maximum — survives service-level issues |

APIM as AI Gateway — HA Configuration

APIM provides built-in capabilities that make it ideal as an AI Gateway:

  • Backend pool: Define multiple Azure OpenAI endpoints as a backend pool
  • Load balancing: Round-robin, weighted, or priority-based routing
  • Circuit breaker: Automatically stop routing to unhealthy backends
  • Retry policy: On 429 (rate limit) or 503, retry to next backend
  • Rate limiting: Enforce per-subscription or per-application token limits
  • Caching: Cache responses for identical prompts
  • Monitoring: Track per-backend latency, error rates, token consumption
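As a sketch, a fragment of such an APIM policy might look like the following. The backend ids are illustrative, and production configurations usually prefer APIM's backend-pool load balancing over hand-rolled failover like this:

```xml
<!-- APIM policy sketch: route to a primary Azure OpenAI backend, and on
     429/503 retry the request against a fallback backend. -->
<policies>
  <inbound>
    <base />
    <set-backend-service backend-id="aoai-eastus-ptu" />
  </inbound>
  <backend>
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode == 503)"
           count="2" interval="1" first-fast-retry="true">
      <choose>
        <!-- after a failed attempt, switch to the fallback backend -->
        <when condition="@(context.Response != null
                           && (context.Response.StatusCode == 429
                               || context.Response.StatusCode == 503))">
          <set-backend-service backend-id="aoai-westus-payg" />
        </when>
      </choose>
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
  <outbound>
    <base />
  </outbound>
</policies>
```

The `buffer-request-body` attribute matters here: the prompt body must be replayable for the retry to reach the second backend intact.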

Fallback Model Strategies

| Priority | Model | Endpoint | Strategy |
|---|---|---|---|
| Primary | GPT-4.1 | Azure OpenAI (PTU, East US) | All traffic goes here first |
| Secondary | GPT-4.1 | Azure OpenAI (PTU, West US) | On 429/503 from primary |
| Tertiary | GPT-4o-mini | Azure OpenAI (PAYG, Europe) | Cheaper model, acceptable quality |
| Emergency | Llama 3.1 70B | AKS self-hosted (any region) | Full control, no external dependency |

RPO/RTO for AI Services

| Service | RPO | RTO | HA Mechanism |
|---|---|---|---|
| Azure OpenAI | 0 (stateless service) | Minutes (APIM failover) | Multi-region deployment |
| Azure AI Search | Near-0 (replicated indexes) | Minutes (replica failover) | Multi-replica, or multi-region with index sync |
| Cosmos DB (vector store) | Seconds (multi-region writes) | Automatic | Multi-region, automatic failover |
| Model weights (Blob) | 0 (GRS/GZRS) | Minutes | Geo-redundant storage |
| Conversation history | Depends on store | Depends on store | Cosmos DB multi-region |

9.10 Cost Optimization for AI Infrastructure

AI infrastructure costs can escalate rapidly. A single H100 node costs ~$98/hour. A 10-node training cluster runs ~$24,000/day. Production inference can consume $10,000-$100,000+/month in Azure OpenAI tokens alone. Cost optimization is not optional — it is a core architectural responsibility.

Azure OpenAI: PTU vs PAYG Break-Even

| Dimension | PAYG | PTU |
|---|---|---|
| Billing | Per token consumed | Per unit per hour (monthly commitment) |
| Best for | Variable, unpredictable workloads | Steady, high-volume production |
| Availability | Shared capacity (may get 429s) | Guaranteed, reserved capacity |
| Cost efficiency | Better at low volume | Better at high, consistent volume |

Break-even analysis (approximate, GPT-4o):

  • 1 PTU costs approximately $2/hour = ~$1,460/month
  • 1 PTU supports approximately 6 RPM (with ~700 total tokens per request)
  • At consistent usage above ~60% utilization, PTU becomes cheaper than PAYG
  • Below ~40% utilization, PAYG is more cost-effective
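The first step of that analysis is simply modeling your PAYG spend from observed traffic. A sketch follows, using the approximate input price quoted above plus an assumed output-token price (an illustration only; check current list prices before deciding):

```python
# Monthly PAYG cost model from observed traffic. PRICE_OUT_PER_M is an
# assumed output-token rate for illustration, not a quoted Azure price.
HOURS_PER_MONTH = 730
PRICE_IN_PER_M = 2.50      # ~$ per 1M input tokens (GPT-4o, from this section)
PRICE_OUT_PER_M = 10.00    # assumed $ per 1M output tokens

def monthly_payg_cost(avg_rpm: float, in_tok: int = 500, out_tok: int = 200) -> float:
    """Dollars per month at a steady average request rate."""
    requests = avg_rpm * 60 * HOURS_PER_MONTH
    return (requests * in_tok * PRICE_IN_PER_M
            + requests * out_tok * PRICE_OUT_PER_M) / 1e6

# Compare this figure against a PTU quote (~$2/PTU/hr => ~$1,460/PTU/month)
# at your measured utilization before committing.
print(f"${monthly_payg_cost(6):,.2f}/month at a steady 6 RPM")
```

Run it with your measured RPM and token averages from the 2-4 weeks of production data; the comparison against the PTU quote falls out directly.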
Always Measure First

Deploy on PAYG first to understand your actual usage patterns (RPM, TPM, peak vs average). Use 2-4 weeks of production data to model PTU sizing. Premature PTU commitment is a common cost mistake — you pay whether you use it or not.

Cost Optimization Levers

| Lever | Savings Potential | Effort | Trade-off |
|---|---|---|---|
| Model routing (use cheapest capable model) | 30-80% | Medium | Need to validate quality per model |
| Semantic caching (cache similar prompts) | 50-90% for repeated queries | Medium | Stale responses for dynamic data |
| Prompt compression (shorter prompts) | 10-40% | Low | May reduce quality if over-compressed |
| Output token limits (max_tokens) | 10-30% | Low | May truncate needed output |
| Batch inference (async, Azure Batch API) | 50% | Low | 24-hour SLA, not real-time |
| Quantized models (INT8/INT4 for self-hosted) | 50-75% GPU cost | Medium | Minor quality degradation |
| Spot instances (for training / batch) | 60-90% | Medium | Can be evicted mid-job |
| Reserved VM instances (1-year or 3-year) | 30-60% | Low | Commitment, less flexibility |
| Auto-scaling from zero (AKS GPU nodepool) | Variable (no idle cost) | Medium | Cold start latency (2-10 min for GPU nodes) |
| Streaming (stop early if user cancels) | Variable | Low | Engineering for cancellation handling |

Model Routing for Cost Optimization

Not every query requires GPT-4.1. A smart routing layer can classify queries and route to the cheapest model that can handle them.
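A minimal router reduces to "classify, then pick the cheapest capable tier." In this sketch the tier names, prices, and keyword classifier are all illustrative placeholders; real routers use a small classifier model or rules validated by offline evals:

```python
# Toy model router: classify a request, pick the cheapest capable model.
ROUTES = [
    # (tier, model, ~$ per 1M input tokens -- illustrative figures)
    ("simple",   "gpt-4o-mini", 0.15),
    ("standard", "gpt-4o",      2.50),
    ("complex",  "gpt-4.1",     5.00),
]

def classify(prompt: str) -> str:
    """Stand-in classifier: keyword and length heuristics only."""
    if any(k in prompt.lower() for k in ("prove", "multi-step", "analyze")):
        return "complex"
    if len(prompt) > 500:
        return "standard"
    return "simple"

def route(prompt: str) -> str:
    tier = classify(prompt)
    return next(model for t, model, _ in ROUTES if t == tier)

print(route("What are your opening hours?"))       # -> gpt-4o-mini
print(route("Analyze this contract for risk"))     # -> gpt-4.1
```

The savings come from the price spread between tiers; the engineering cost is validating, per tier, that the cheaper model's quality is actually acceptable.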


9.11 Observability & Monitoring

You cannot manage what you cannot measure. AI workloads introduce new metrics that traditional monitoring does not cover: token consumption, time-to-first-token, model quality, and hallucination rates.

Key Metrics for AI Workloads

| Metric | What It Measures | Target | Tool |
|---|---|---|---|
| TTFT (Time to First Token) | Latency before first token appears | <1 second | Application Insights |
| TPS (Tokens Per Second) | Generation speed | 30-100 TPS | Azure OpenAI diagnostics |
| E2E Latency | Total time from request to complete response | <5s for interactive | Application Insights |
| Token consumption (input + output) | Cost driver | Within budget | Azure Monitor |
| 429 rate | Rate limit rejections | <1% in production | Azure Monitor |
| Error rate (4xx, 5xx) | Service reliability | <0.1% | Azure Monitor |
| GPU utilization | Compute efficiency (self-hosted) | 60-85% sustained | NVIDIA DCGM / Azure Monitor |
| VRAM utilization | Memory pressure (self-hosted) | <90% | NVIDIA DCGM |
| Queue depth | Backlog of pending requests | Near zero for real-time | Custom metric |
| Groundedness score | Factual accuracy of responses | >4.0/5.0 | Azure AI Foundry evaluation |

Azure Monitor for Azure OpenAI

Azure OpenAI emits diagnostic logs and metrics that should be sent to a Log Analytics workspace.

Enable diagnostic settings:

  • Audit logs: Track who called the API, when, from where
  • Request/Response logs: Log prompts and completions (be cautious with PII)
  • Metrics: Token usage, latency, HTTP status codes

KQL Queries for Azure OpenAI Monitoring

Token consumption by deployment (last 24 hours):

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| where TimeGenerated > ago(24h)
| extend model = tostring(properties_s)
| summarize
    TotalRequests = count(),
    TotalPromptTokens = sum(toint(properties_promptTokens_d)),
    TotalCompletionTokens = sum(toint(properties_completionTokens_d))
  by bin(TimeGenerated, 1h), model
| order by TimeGenerated desc

429 Rate Limit Tracking:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature == "429"
| summarize ThrottledRequests = count() by bin(TimeGenerated, 5m)
| render timechart

P95 Latency by Model:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| summarize
    P50 = percentile(DurationMs, 50),
    P95 = percentile(DurationMs, 95),
    P99 = percentile(DurationMs, 99)
  by bin(TimeGenerated, 1h)
| render timechart

Application Insights Integration

For end-to-end tracing of AI application requests, use Application Insights with the Azure Monitor OpenTelemetry SDK.

| What to Trace | How | Correlation |
|---|---|---|
| User request → App Service | Application Insights auto-instrumentation | Operation ID |
| App Service → Azure AI Search | Dependency tracking | Same Operation ID |
| App Service → Azure OpenAI | Dependency tracking | Same Operation ID |
| Token usage per request | Custom metrics via TelemetryClient | Same Operation ID |
| Model quality scores | Custom events | Per-request evaluation |

This gives you a single correlated trace from user request through search retrieval to LLM inference and back, enabling root-cause analysis of slow or poor-quality responses.


9.12 Security for AI Infrastructure

AI infrastructure introduces new attack surfaces: prompt injection, model theft, data exfiltration through model responses, and abuse of expensive GPU resources. Security must be layered across identity, network, data, and application tiers.

Identity and Access Control

| Principle | Implementation | Azure Service |
|---|---|---|
| Managed Identity for service-to-service auth | App Service / AKS → Azure OpenAI via MI | Managed Identity + RBAC |
| No API keys in code | Store keys in Key Vault, prefer MI over keys | Azure Key Vault |
| Least-privilege RBAC | Cognitive Services OpenAI User for inference, Contributor for management | Azure RBAC |
| Per-application identity | Each app gets its own MI, audited separately | User-Assigned Managed Identity |
| Human access control | JIT access for production AI resources | PIM (Privileged Identity Management) |

Network Security

| Layer | Control | Configuration |
|---|---|---|
| AI service exposure | Private endpoints, disable public access | PE + publicNetworkAccess: disabled |
| Subnet isolation | Dedicated subnets for AI workloads | NSG rules: deny all inbound except from app subnet |
| DNS resolution | Private DNS zones for PE resolution | Azure Private DNS Zones linked to VNET |
| Egress control | Control what AI services can reach outbound | Azure Firewall / NSG outbound rules |
| Kubernetes network | Network policies isolating GPU pods | Calico / Azure Network Policy |
| DDoS protection | Protect public-facing AI endpoints | Azure DDoS Protection |

Data Protection

| Data State | Protection | Implementation |
|---|---|---|
| At rest | AES-256 encryption | Azure-managed keys (default) or CMK via Key Vault |
| In transit | TLS 1.2+ | Enforced by Azure services |
| In use (training data) | Access control, data classification | Purview + RBAC |
| Prompts & responses | Content filtering, logging controls | Azure OpenAI content safety filters |
| Model weights | Access control (prevent model theft) | RBAC on storage, private endpoints |

### Content Safety

Azure OpenAI includes built-in content safety filters that operate on both prompts and completions:

| Filter Category | What It Detects | Severity Levels |
|---|---|---|
| Hate | Content targeting identity groups | Low, Medium, High |
| Sexual | Sexually explicit content | Low, Medium, High |
| Violence | Violent content or threats | Low, Medium, High |
| Self-harm | Content related to self-harm | Low, Medium, High |
| Jailbreak detection | Attempts to bypass system instructions | Binary (detected / not detected) |
| Protected material | Copyrighted text or code reproduction | Binary (detected / not detected) |

Configure filter thresholds through Azure OpenAI content filtering policies. For enterprise deployments, start from the medium severity threshold as a baseline and tighten or relax it per use case.
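As a sketch of how such a threshold behaves, the helper below ranks the severity labels that Azure OpenAI content filter annotations report (safe, low, medium, high) and blocks when any category meets the configured baseline. The function and the sample annotations are hypothetical; in practice the severities arrive in the API response's filter-results payload:

```python
# Severity ordering used by Azure OpenAI content filter annotations.
SEVERITY_RANK = {"safe": 0, "low": 1, "medium": 2, "high": 3}

def should_block(annotations: dict, threshold: str = "medium") -> bool:
    """Block if any category meets or exceeds the configured severity threshold.

    `annotations` maps category -> reported severity,
    e.g. {"hate": "low", "violence": "safe"}.
    """
    limit = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK[sev] >= limit for sev in annotations.values())

print(should_block({"hate": "low", "violence": "safe"}))     # False
print(should_block({"hate": "low", "self_harm": "medium"}))  # True
```

Raising the threshold to `"high"` loosens filtering for use cases such as content moderation tooling, where lower-severity material must pass through for human review.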


## 9.13 Reference Architecture — Enterprise AI Platform

This section brings together all the infrastructure components covered throughout this module into a cohesive enterprise reference architecture.

### Complete Architecture Diagram

*(Architecture diagram omitted; the table below walks through its layers.)*

### Architecture Layers Explained

| Layer | Components | Purpose | Key Decisions |
|---|---|---|---|
| Edge | Azure Front Door, WAF | Global distribution, DDoS protection, SSL termination | Enable WAF rules for OWASP + custom AI-specific rules |
| Application | App Service, Functions | Business logic, orchestration, UI | VNET-integrated, Managed Identity enabled |
| AI Gateway | APIM | Centralized AI endpoint management | Backend pools for multi-region, retry policies, token tracking |
| AI Services | Azure OpenAI, AI Search | Model inference, knowledge retrieval | Multi-region PTU + PAYG overflow, 3+ search replicas |
| Data | Blob Storage, Cosmos DB, Redis | Document storage, state, caching | Private endpoints on all, semantic cache in Redis |
| Document Pipeline | Event Grid, Functions, Document Intelligence | Automated document ingestion | Event-driven, scalable, idempotent |
| Observability | Azure Monitor, App Insights | Metrics, logs, traces, dashboards | Correlated tracing from user to LLM, KQL dashboards |
| Security | Entra ID, Managed Identity, Key Vault | AuthN, AuthZ, secrets management | Zero API keys in code, RBAC everywhere, private endpoints |

### Networking Architecture

All services communicate over private endpoints within a hub-and-spoke VNET topology:

| Subnet | CIDR (example) | Contains |
|---|---|---|
| snet-appservice | 10.1.1.0/24 | App Service VNET integration |
| snet-apim | 10.1.2.0/24 | APIM (internal mode) |
| snet-privateendpoints | 10.1.3.0/24 | Private endpoints for all PaaS services |
| snet-functions | 10.1.5.0/24 | Azure Functions VNET integration |
| snet-aks-gpu | 10.1.8.0/22 | AKS GPU node pool (if self-hosting models) |
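A subnet plan like this is easy to validate in CI with Python's standard `ipaddress` module. The sketch below checks that every subnet fits inside an assumed 10.1.0.0/16 address space and that no two subnets overlap; note the GPU pool's /22 is placed at 10.1.8.0/22 here so its 10.1.8.0-10.1.11.255 range stays clear of the /24 subnets:

```python
from itertools import combinations
from ipaddress import ip_network

HUB_VNET = ip_network("10.1.0.0/16")  # assumed address space for the spoke

SUBNETS = {
    "snet-appservice":       "10.1.1.0/24",
    "snet-apim":             "10.1.2.0/24",
    "snet-privateendpoints": "10.1.3.0/24",
    "snet-functions":        "10.1.5.0/24",
    "snet-aks-gpu":          "10.1.8.0/22",  # /22 sized for GPU node scale-out
}

def validate(plan: dict) -> list[str]:
    """Return a list of planning errors; an empty list means the plan is sound."""
    nets = {name: ip_network(cidr) for name, cidr in plan.items()}
    errors = [f"{name} lies outside the VNET address space"
              for name, net in nets.items() if not net.subnet_of(HUB_VNET)]
    errors += [f"{a} overlaps {b}"
               for (a, na), (b, nb) in combinations(nets.items(), 2)
               if na.overlaps(nb)]
    return errors

print(validate(SUBNETS))  # [] -> the plan is consistent
```

Running this as a pre-deployment gate catches the classic mistake of dropping a wide /22 on top of existing /24s before Terraform or Bicep ever applies it.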

### RBAC Assignments

| Identity | Role | Scope | Purpose |
|---|---|---|---|
| App Service MI | Cognitive Services OpenAI User | Azure OpenAI resource | Invoke models |
| App Service MI | Search Index Data Reader | AI Search resource | Query search indexes |
| App Service MI | Storage Blob Data Reader | Storage account | Read documents |
| Functions MI | Cognitive Services OpenAI User | Azure OpenAI resource | Generate embeddings |
| Functions MI | Search Index Data Contributor | AI Search resource | Write to search indexes |
| APIM MI | Cognitive Services OpenAI User | Azure OpenAI resource(s) | Route requests to Azure OpenAI |
| DevOps SP | Contributor | Resource group | IaC deployments |
| AI Platform Team | Cognitive Services OpenAI Contributor | Azure OpenAI resource | Manage deployments and models |
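Assignments like these can be encoded as data and checked with a least-privilege guardrail, for example in a CI step that compares against `az role assignment list` output. A minimal sketch with hypothetical identity names; the rule enforced here is that no workload (managed) identity ever holds a management role:

```python
# RBAC assignments from the table above, as (identity, role, scope) tuples.
# Identity names are illustrative placeholders.
ASSIGNMENTS = [
    ("app-service-mi", "Cognitive Services OpenAI User", "aoai"),
    ("app-service-mi", "Search Index Data Reader", "search"),
    ("app-service-mi", "Storage Blob Data Reader", "storage"),
    ("functions-mi", "Cognitive Services OpenAI User", "aoai"),
    ("functions-mi", "Search Index Data Contributor", "search"),
    ("apim-mi", "Cognitive Services OpenAI User", "aoai"),
    ("devops-sp", "Contributor", "rg"),
]

# Broad management roles that workload identities must never hold.
MANAGEMENT_ROLES = {"Owner", "Contributor", "Cognitive Services OpenAI Contributor"}

def violations(assignments):
    """Flag any managed identity (suffix '-mi') holding a management role."""
    return [(ident, role) for ident, role, _scope in assignments
            if ident.endswith("-mi") and role in MANAGEMENT_ROLES]

print(violations(ASSIGNMENTS))  # [] -> no workload identity is over-privileged
```

The DevOps service principal legitimately holds Contributor for IaC, so the guardrail scopes its check to managed identities only.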

## Key Takeaways

  1. AI workloads are memory-bound, not CPU-bound. VRAM is the primary constraint for model inference. Always size infrastructure based on model VRAM requirements first, then consider compute throughput.

  2. Start managed, go self-hosted only when justified. Azure OpenAI and AI Foundry Serverless handle infrastructure complexity. Move to AKS + vLLM only for cost optimization at scale, specific model requirements, or data sovereignty.

  3. APIM is your AI Gateway. It solves load balancing, failover, rate limiting, token tracking, and retry logic in a single service. Every enterprise AI deployment should route through APIM.

  4. Multi-region is not optional for production. Deploy Azure OpenAI in at least two regions with APIM-based failover. PTU provides guaranteed capacity; PAYG in a secondary region handles overflow and outages.

  5. Cost optimization has massive ROI. Model routing (use the cheapest capable model), semantic caching, and batch inference can reduce AI infrastructure costs by 50-80%. Measure first with PAYG, then commit to PTU.

  6. Observability must include AI-specific metrics. Token consumption, TTFT, TPS, 429 rates, and model quality scores are as important as traditional metrics like CPU and memory.

  7. Security is layered. Managed Identity over API keys, private endpoints on every service, content safety filters, RBAC with least privilege, and VNET isolation form the security baseline for enterprise AI.

  8. The reference architecture is a starting point. Adapt it based on your organization's scale, compliance requirements, and operational maturity. Not every organization needs multi-region from Day 1, but every organization should design for it.
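To make takeaway 5 concrete, the arithmetic below blends a semantic-cache hit rate with cheap-model routing to estimate savings. Every price and ratio is assumed purely for illustration, not real Azure pricing:

```python
# Illustrative cost model only: all rates below are assumptions, not Azure prices.
def blended_cost_per_1k_requests(premium_cost: float,
                                 cheap_model_ratio: float,
                                 cheap_model_discount: float,
                                 cache_hit_rate: float) -> float:
    """Cost per 1k requests after model routing + semantic caching.

    Cache hits are treated as ~free; cache misses split between a premium
    model and a cheaper model discounted relative to the premium price.
    """
    misses = 1.0 - cache_hit_rate
    premium_share = misses * (1.0 - cheap_model_ratio)
    cheap_share = misses * cheap_model_ratio * (1.0 - cheap_model_discount)
    return premium_cost * (premium_share + cheap_share)

baseline = blended_cost_per_1k_requests(10.0, 0.0, 0.0, 0.0)   # everything premium
optimized = blended_cost_per_1k_requests(10.0, 0.6, 0.9, 0.3)  # routing + caching
print(f"savings: {1 - optimized / baseline:.0%}")  # prints "savings: 68%"
```

Under these assumed rates (30% cache hits, 60% of misses routed to a model priced at a tenth of the premium one), the blend lands inside the 50-80% range cited above; your actual numbers depend on traffic shape and model mix.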


## What's Next

In Module 10: Responsible AI — Ethics, Safety & Governance, we shift from infrastructure to responsibility. You will learn about AI fairness, transparency, content safety, and governance frameworks — the principles that ensure your AI infrastructure serves users ethically and complies with regulations. Infrastructure without governance is a liability; governance without infrastructure is theory.


Module 9 Complete. You now understand the full AI infrastructure stack — from GPU silicon to production reference architectures. Combined with the knowledge from Modules 1-8, you can design, deploy, and operate enterprise AI systems on Azure with confidence.