
Module 2: The LLM Landscape β€” Models, Families & Selection Guide

Duration: 45-60 minutes | Level: Foundation | Audience: Cloud Architects, Platform Engineers, CSAs | Last Updated: March 2026


You do not need to train models. You do not need to understand backpropagation. But you absolutely need to know which models exist, what they are good at, how they are licensed, and how to deploy them on Azure. This module gives you the full map.

By the end of this module you will be able to:

  • Distinguish proprietary, open-weight, and open-source models and explain why it matters for infrastructure
  • Name every major model family and its flagship models
  • Understand reasoning models, small language models, and multimodal models as distinct categories
  • Read benchmarks critically and know their limitations
  • Select the right Azure OpenAI deployment type for a given workload
  • Apply a decision framework to choose the right model for any production scenario

2.1 The Model Universe​

Not all models are created equal and not all models are licensed the same way. Before diving into specific families, you need a clear taxonomy.

Three Categories of Model Openness​

| Category | Definition | Examples | Can Self-Host? | Can Fine-Tune? | Can See Weights? |
| --- | --- | --- | --- | --- | --- |
| Proprietary | Model weights are never released. Access only via API. Provider controls everything. | GPT-4o, Claude Opus 4, Gemini 2.5 Pro | No | Limited (provider-hosted) | No |
| Open-Weight | Model weights are publicly downloadable. License may restrict commercial use or modification. Training data/code often not shared. | Llama 4, Mistral Large, Phi-4 | Yes | Yes | Weights only |
| Open-Source | Weights, training code, training data, and methodology are all public. Permissive license. | OLMo, BLOOM, Pythia | Yes | Yes | Everything |
Common Misconception

"Open-weight" and "open-source" are not the same thing. Meta's Llama models are open-weight (you can download and run them), but they are not open-source by the OSI definition because Meta's license includes restrictions on commercial use above 700M monthly active users and does not release training data. This distinction matters for legal and procurement teams.

Why This Classification Matters for Infrastructure​

| Decision Factor | Proprietary (API) | Open-Weight (Self-Hosted) |
| --- | --- | --- |
| Infrastructure | None (managed) | GPU VMs, networking, storage |
| Data residency | Provider's regions | Your regions, your control |
| Cost model | Pay-per-token (variable) | Fixed compute (GPU hours) |
| Latency control | Limited | Full control |
| Scaling | Automatic (with limits) | Manual (AKS, VMSS) |
| Compliance | Depends on provider DPA | Full control |
| Vendor lock-in | High | Low |
| Operational burden | Minimal | Significant |

Licensing Implications​

For architects who need to advise procurement and legal:

| License | Models | Commercial Use | Fine-Tuning | Redistribution | Key Restriction |
| --- | --- | --- | --- | --- | --- |
| Proprietary ToS | GPT-4o, Claude, Gemini | Via API only | Provider-managed only | No | Bound to provider platform |
| Llama Community License | Llama 3.1, 3.2, 4 | Yes (with conditions) | Yes | Yes | 700M MAU threshold requires Meta license |
| Permissive (MIT, Apache 2.0) | Phi-4 (MIT), Mistral Nemo (Apache 2.0) | Yes | Yes | Yes | No significant restrictions |
| Mistral Research License | Mistral Large (some versions) | Check version | Yes | Limited | May restrict commercial use |

2.2 Major Model Families​

OpenAI / Azure OpenAI​

The largest and most widely deployed model family in enterprise. Azure OpenAI Service provides the same models as OpenAI's API, but running on Microsoft's Azure infrastructure with enterprise security, compliance, and network integration.

GPT-4 Class Models​

| Model | Context Window | Key Capabilities | Best For | Azure Deployment |
| --- | --- | --- | --- | --- |
| GPT-4o | 128K tokens | Text, image, audio input. Fast. Flagship multimodal. | General-purpose production workloads, vision tasks | All deployment types |
| GPT-4o-mini | 128K tokens | Smaller, faster, cheaper. 80% of GPT-4o quality. | High-volume, cost-sensitive workloads | All deployment types |
| GPT-4.1 | 1M tokens | Evolution of GPT-4o. Better coding, instruction following. | Complex coding, long-document analysis | Standard, Global Standard |
| GPT-4.1-mini | 1M tokens | Smaller GPT-4.1. Excellent cost/performance. | Production at scale where 4.1 quality is needed cheaply | Standard, Global Standard |
| GPT-4.1-nano | 1M tokens | Smallest GPT-4.1. Ultra-low cost. | Classification, extraction, routing, simple tasks | Standard, Global Standard |

Reasoning Models (o-series)​

| Model | Context Window | Thinking Tokens | Key Capabilities | Best For |
| --- | --- | --- | --- | --- |
| o1 | 200K tokens | Yes (hidden, billed) | Deep reasoning, math, science, logic | Complex multi-step analysis |
| o3 | 200K tokens | Yes (hidden, billed) | Stronger reasoning, better at STEM | Research, scientific reasoning |
| o4-mini | 200K tokens | Yes (hidden, billed) | Efficient reasoning at lower cost | Reasoning tasks on a budget |

Specialized Models​

| Model | Type | Capability | Best For |
| --- | --- | --- | --- |
| DALL-E 3 | Image generation | Text-to-image, high quality | Marketing content, design prototyping |
| Whisper | Speech-to-text | Multilingual transcription | Meeting transcription, voice interfaces |
| Text Embedding (3-small, 3-large) | Embedding | Convert text to vectors | RAG pipelines, semantic search |
For Architects

GPT-4o-mini and GPT-4.1-nano are your workhorses. At least 80% of production workloads (classification, extraction, summarization, simple Q&A) do NOT need a full GPT-4o or GPT-4.1. Start with the smallest model that meets quality requirements. Scale up only when you have evidence that the smaller model fails.

Azure OpenAI Specifics​

Azure OpenAI is not just "OpenAI on Azure" -- it adds enterprise-grade features:

  • Azure AD / Entra ID authentication (no API keys in production)
  • Virtual Network integration via Private Endpoints
  • Content filtering enabled by default (configurable)
  • Managed Identity support for zero-secret deployments
  • Data processing guarantees (your prompts and completions are NOT used for model training)
  • Regional deployment for data residency compliance (EU, US, etc.)
  • Provisioned throughput (PTU) for guaranteed capacity

Anthropic Claude​

Claude is Anthropic's model family, built with a focus on safety, helpfulness, and long-context understanding. Available via Anthropic's API and also accessible through Azure AI Foundry's model catalog (as a third-party model).

| Model | Context Window | Key Capabilities | Best For |
| --- | --- | --- | --- |
| Claude Opus 4 | 200K tokens | Most capable Claude. Deep analysis, complex reasoning, extended thinking, tool use. | Complex research, long-document analysis, agentic workflows |
| Claude Sonnet 4 | 200K tokens | Balanced performance and speed. Strong coding. Extended thinking available. | Production workloads needing quality + speed balance |
| Claude Haiku 3.5 | 200K tokens | Fastest, cheapest Claude. Surprisingly capable. | High-volume, cost-sensitive, real-time applications |

Key Differentiators vs GPT​

| Dimension | Claude | GPT-4o / GPT-4.1 |
| --- | --- | --- |
| Context window | 200K tokens standard | 128K (GPT-4o), 1M (GPT-4.1) |
| Extended thinking | Explicit thinking mode where model reasons step-by-step (visible/hidden thinking tokens) | o-series models have built-in reasoning |
| System prompt adherence | Exceptionally strong -- tends to follow long, complex system prompts very closely | Strong, but can drift on very complex instructions |
| Safety approach | Constitutional AI (trained against a constitution of rules) | RLHF + content filtering layer |
| Coding | Excellent, especially for refactoring and explaining | Excellent, especially for generation |
| Azure availability | Via Azure AI Foundry model catalog (serverless API) | Native Azure OpenAI Service |
| Data residency | Anthropic's US/EU regions | Azure regions worldwide |

Meta Llama​

Meta's Llama family is the most significant open-weight model family. It democratized access to large-scale LLMs and created an entire ecosystem of fine-tuned derivatives.

| Model | Parameters | Context Window | Key Capabilities | Best For |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8 billion | 128K tokens | Lightweight, fast, fine-tunable | Edge deployment, task-specific fine-tuning |
| Llama 3.1 70B | 70 billion | 128K tokens | Strong general performance | Self-hosted production workloads |
| Llama 3.1 405B | 405 billion | 128K tokens | Largest dense open-weight model. Near-GPT-4 quality. | Maximum quality in self-hosted scenarios |
| Llama 3.2 1B | 1 billion | 128K tokens | Ultra-small, on-device capable | Mobile, IoT, edge inference |
| Llama 3.2 3B | 3 billion | 128K tokens | Small but capable | Edge with limited GPU |
| Llama 3.2 11B Vision | 11 billion | 128K tokens | Multimodal (text + image) | Visual understanding on-prem |
| Llama 3.2 90B Vision | 90 billion | 128K tokens | Large multimodal | High-quality vision tasks, self-hosted |
| Llama 4 Scout | 17B active (109B total) | 10M tokens | Mixture-of-Experts (MoE). 16 experts. Extreme context. | Long-context tasks with efficient compute |
| Llama 4 Maverick | 17B active (400B total) | 1M tokens | MoE with 128 experts. Strongest open Llama. | Best open-weight quality, multi-turn, multilingual |

What "Open-Weight" Means for Self-Hosting​

When you download Llama, you get the trained model weights (the numbers that define the model's knowledge). You can:

  1. Run inference locally on your own GPU infrastructure
  2. Fine-tune the model on your domain-specific data
  3. Quantize the model (reduce precision) to run on smaller hardware
  4. Deploy via frameworks like vLLM, TGI, or Ollama

Infrastructure implications:

| Model Size | Minimum VRAM (FP16) | Recommended GPU | Monthly Azure VM Cost (est.) |
| --- | --- | --- | --- |
| 1B-3B | 4-8 GB | Single T4 or A10 | ~$300-600 |
| 8B | 16 GB | Single A10 or A100 | ~$600-2,000 |
| 70B | 140 GB | 2x A100 80GB | ~$6,000-10,000 |
| 405B | 810 GB | 8-16x A100 80GB | ~$30,000-60,000 |
Availability on Azure

All Llama models are available on Azure AI Foundry via the Model Catalog. You can deploy them as Serverless API (pay-per-token, similar to Azure OpenAI) or as Managed Compute (dedicated VM with the model). No need to download weights manually.


Google Gemini​

Google's Gemini models are natively multimodal -- trained from the ground up on text, images, audio, and video, rather than having vision bolted on after text training.

| Model | Context Window | Key Capabilities | Best For |
| --- | --- | --- | --- |
| Gemini 2.0 Flash | 1M tokens | Fast, multimodal, tool use, agentic | Low-latency multimodal tasks |
| Gemini 2.5 Pro | 1M tokens | Most capable Gemini. Thinking mode built-in. | Complex reasoning, long-context, multimodal analysis |
| Gemini 2.5 Flash | 1M tokens | Fast + thinking. Balanced cost/quality. | Production multimodal at scale |

Azure Availability​

Gemini models are not natively available on Azure AI Foundry as of March 2026. They are accessed via Google's Vertex AI or Google AI Studio. However, Azure AI Foundry's Model Catalog does include some Google models:

  • Gemma (Google's open-weight small models) -- available on Azure AI Foundry
  • Gemma 2 (2B, 9B, 27B) -- released under Google's permissive Gemma license (not Apache 2.0), deployable on Azure

For multi-cloud architectures, teams sometimes use Gemini via Google's API alongside Azure OpenAI, but this adds complexity around data residency, networking, and billing management.


Microsoft Phi (Small Language Models)​

Microsoft's Phi family proves that smaller models, trained on high-quality data, can rival models 10-100x their size on specific tasks. These are Small Language Models (SLMs) -- purpose-built to be efficient.

| Model | Parameters | Context Window | Key Capabilities | Best For |
| --- | --- | --- | --- | --- |
| Phi-3-mini | 3.8B | 128K tokens | Strong reasoning for its size | Edge, mobile, on-device |
| Phi-3-small | 7B | 128K tokens | Multilingual, strong on benchmarks | Cost-efficient general tasks |
| Phi-3-medium | 14B | 128K tokens | Best Phi-3 quality | Production tasks needing local deployment |
| Phi-3.5-mini | 3.8B | 128K tokens | Improved over Phi-3-mini, multilingual | Edge with multilingual needs |
| Phi-3.5-MoE | 6.6B active (42B total) | 128K tokens | Mixture-of-Experts architecture | High quality at efficient compute |
| Phi-3.5-vision | 4.2B | 128K tokens | Multimodal (text + image) | Visual tasks on constrained hardware |
| Phi-4 | 14B | 16K tokens | Strongest Phi. STEM reasoning leader. | Math, science, structured reasoning |
| Phi-4-mini | 3.8B | 128K tokens | Compact, strong reasoning | Lightweight deployment with good quality |
| Phi-4-multimodal | 5.6B | 128K tokens | Text + image + audio | Multimodal on-device scenarios |

Why SLMs Matter for Architects​

  1. Run on CPUs -- Models under 4B parameters can run (slowly) on CPUs. No GPU required.
  2. Edge deployment -- Azure IoT Edge, Windows devices, even mobile phones.
  3. Cost -- Orders of magnitude cheaper than large model API calls at high volume.
  4. Latency -- Smaller models respond faster. Critical for real-time scenarios.
  5. Data sovereignty -- Run entirely on-premises, air-gapped, zero data leaving your network.
Available on Azure

All Phi models are available on Azure AI Foundry via the Model Catalog with Serverless API and Managed Compute deployment options. Phi-4 and Phi-3.5 models are also optimized for ONNX Runtime and can run on Windows devices via Windows Copilot Runtime.


Mistral​

Mistral AI is a French AI company that has quickly become a major force, particularly strong in European markets and multilingual workloads.

| Model | Parameters | Context Window | Key Capabilities | Best For |
| --- | --- | --- | --- | --- |
| Mistral Large | ~123B | 128K tokens | Flagship. Strong reasoning, multilingual. | Complex enterprise tasks, European compliance |
| Mistral Small | ~22B | 128K tokens | Efficient, fast | Cost-sensitive production |
| Mistral Nemo | 12B | 128K tokens | Apache 2.0 license. Very capable for size. | Self-hosted, fine-tuning friendly |
| Codestral | 22B | 32K tokens | Code-specialized | Code generation, completion, review |
| Pixtral | 12B | 128K tokens | Multimodal (text + image) | Visual understanding at low cost |

Key Strengths​

  • Multilingual excellence -- particularly strong in European languages (French, German, Spanish, Italian)
  • European data sovereignty -- Mistral offers EU-hosted endpoints, which simplifies EU data-residency requirements and EU AI Act compliance
  • Apache 2.0 licensing on several models -- true permissive open-source
  • Available on Azure AI Foundry -- both Serverless API and Managed Compute options

Comprehensive Model Comparison Table​

| Model | Provider | Parameters | Context Window | Strengths | Azure Available | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | Undisclosed | 128K | Multimodal, fast, versatile | Azure OpenAI (native) | General-purpose production |
| GPT-4o-mini | OpenAI | Undisclosed | 128K | Cost-efficient, 80% of 4o quality | Azure OpenAI (native) | High-volume workloads |
| GPT-4.1 | OpenAI | Undisclosed | 1M | Long context, strong coding | Azure OpenAI (native) | Code, long documents |
| GPT-4.1-mini | OpenAI | Undisclosed | 1M | Great cost/quality ratio | Azure OpenAI (native) | Scaled production |
| GPT-4.1-nano | OpenAI | Undisclosed | 1M | Ultra-cheap, fast | Azure OpenAI (native) | Classification, routing |
| o3 | OpenAI | Undisclosed | 200K | Deep reasoning, STEM | Azure OpenAI (native) | Complex analysis, math |
| o4-mini | OpenAI | Undisclosed | 200K | Efficient reasoning | Azure OpenAI (native) | Reasoning on a budget |
| Claude Opus 4 | Anthropic | Undisclosed | 200K | Strongest Claude, extended thinking | AI Foundry (serverless) | Complex research, agents |
| Claude Sonnet 4 | Anthropic | Undisclosed | 200K | Balanced quality/speed | AI Foundry (serverless) | Production coding, analysis |
| Claude Haiku 3.5 | Anthropic | Undisclosed | 200K | Fastest Claude, very cheap | AI Foundry (serverless) | High-volume, real-time |
| Llama 4 Maverick | Meta | 17B active / 400B | 1M | MoE, strong open-weight | AI Foundry (serverless/managed) | Best open-weight quality |
| Llama 4 Scout | Meta | 17B active / 109B | 10M | Extreme context length | AI Foundry (serverless/managed) | Very long context tasks |
| Llama 3.1 405B | Meta | 405B | 128K | Largest dense open model | AI Foundry (managed) | Maximum open self-hosted quality |
| Llama 3.1 70B | Meta | 70B | 128K | Solid open-weight workhorse | AI Foundry (both) | Self-hosted production |
| Gemini 2.5 Pro | Google | Undisclosed | 1M | Native multimodal, thinking | Not on Azure (Vertex AI) | Multimodal analysis |
| Gemma 2 27B | Google | 27B | 8K | Open-weight, permissive Gemma license | AI Foundry (managed) | Permissive self-hosting |
| Phi-4 | Microsoft | 14B | 16K | STEM reasoning, small | AI Foundry (both) | Edge, cost-efficient reasoning |
| Phi-4-multimodal | Microsoft | 5.6B | 128K | Text + image + audio, tiny | AI Foundry (both) | On-device multimodal |
| Mistral Large | Mistral | ~123B | 128K | Multilingual, European | AI Foundry (serverless) | EU compliance workloads |
| Mistral Nemo | Mistral | 12B | 128K | Apache 2.0, fine-tunable | AI Foundry (both) | Custom fine-tuning |

2.3 Reasoning Models β€” A New Category​

In late 2024, beginning with OpenAI's o1, a fundamentally new type of model emerged: reasoning models. These are not just larger or better-trained versions of existing models. They represent a different approach to how LLMs solve problems.

Standard Models vs Reasoning Models​

How Reasoning Models Work​

  1. Thinking tokens -- Before producing a visible answer, the model generates internal reasoning tokens (a "chain-of-thought") that the user may or may not see. These tokens are processed and billed but represent the model "working through the problem."

  2. Variable compute -- A standard model spends roughly the same compute on "What is 2+2?" and "Prove Fermat's Last Theorem." A reasoning model automatically allocates more thinking tokens to harder problems.

  3. Self-correction -- During the thinking phase, the model can identify mistakes in its own reasoning, backtrack, and try a different approach. Standard models rarely do this reliably: once they commit to a wrong path mid-generation, they tend to follow it through.

Reasoning Model Landscape​

| Model | Provider | Approach | Thinking Visibility | Cost vs Base Model |
| --- | --- | --- | --- | --- |
| o1 | OpenAI | Built-in CoT reasoning | Hidden (summary only) | ~3-6x GPT-4o |
| o3 | OpenAI | Advanced reasoning, tool use | Hidden (summary only) | ~4-8x GPT-4o |
| o4-mini | OpenAI | Efficient reasoning | Hidden (summary only) | ~2-3x GPT-4o-mini |
| Claude with Extended Thinking | Anthropic | Explicit thinking mode | Visible thinking block | Varies by thinking budget |
| Gemini 2.5 Pro (thinking) | Google | Built-in thinking mode | Configurable | Included in standard pricing |

When to Use Reasoning vs Standard Models​

| Use Case | Standard Model | Reasoning Model |
| --- | --- | --- |
| Simple Q&A, summarization | Preferred (faster, cheaper) | Overkill |
| Classification, extraction | Preferred | Overkill |
| Multi-step math problems | May struggle | Excels |
| Complex code generation | Good | Better, but slower |
| Scientific analysis | Adequate | Significantly better |
| Planning and strategy | Adequate | Significantly better |
| Ambiguous/nuanced problems | May miss nuance | Handles nuance well |
| Real-time chat | Required (low latency) | Too slow |
| Batch analysis of complex data | Possible | Ideal |

Cost Implications of Thinking Tokens​

This is critical for architects planning capacity:

Standard model:  Input tokens + Output tokens = Total billed tokens
Reasoning model: Input tokens + Thinking tokens + Output tokens = Total billed tokens

A reasoning model answering a hard question might generate 10,000-50,000 thinking tokens before producing a 500-token answer. You are billed for all of them. For simple questions, the model may generate very few thinking tokens -- variable compute means variable cost.
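The billing difference is easy to make concrete. The sketch below uses hypothetical per-token prices (placeholders, not the current Azure OpenAI price sheet) and assumes thinking tokens are billed at the output rate, which is how the o-series bills them:

```python
# Illustrative cost model for reasoning vs standard calls.
# The per-1K-token prices are HYPOTHETICAL placeholders -- substitute
# the current price sheet for your model and deployment type.

def billed_cost(input_tokens: int, output_tokens: int,
                thinking_tokens: int = 0,
                price_in_per_1k: float = 0.005,
                price_out_per_1k: float = 0.015) -> float:
    """Thinking tokens are billed at the output-token rate."""
    out_total = output_tokens + thinking_tokens
    return (input_tokens / 1000) * price_in_per_1k + (out_total / 1000) * price_out_per_1k

# Same 500-token answer, with and without a 30K-token thinking phase:
standard = billed_cost(2_000, 500)                          # 0.0175
reasoning = billed_cost(2_000, 500, thinking_tokens=30_000)  # 0.4675
```

With these assumed prices, the identical visible answer costs roughly 27x more when the model spends 30K tokens thinking, which is why thinking-token monitoring belongs in your cost dashboards from day one.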

Budget Planning

Reasoning models can be unpredictably expensive because thinking token count varies dramatically by question difficulty. For budgeting purposes, assume 5-10x the cost of an equivalent standard model when using reasoning models. Monitor thinking token usage closely during initial deployment.


2.4 Small Language Models (SLMs)​

Why Bigger Is Not Always Better​

The industry narrative of "bigger = better" is misleading for production systems. A 14B-parameter Phi-4 can outperform far larger models -- including GPT-3.5-class models -- on math and reasoning benchmarks. The key insight: data quality and training methodology matter more than raw parameter count.

The SLM Landscape​

| Model | Parameters | License | Runs on CPU? | Key Strength |
| --- | --- | --- | --- | --- |
| Phi-4 | 14B | MIT | Slowly | STEM reasoning champion |
| Phi-4-mini | 3.8B | MIT | Yes | Compact with strong reasoning |
| Phi-3.5-mini | 3.8B | MIT | Yes | Multilingual, long context |
| Llama 3.2 1B | 1B | Llama License | Yes | Ultra-small, on-device |
| Llama 3.2 3B | 3B | Llama License | Yes | Slightly larger, better quality |
| Gemma 2 2B | 2B | Gemma license | Yes | Permissive license, fine-tunable |
| Gemma 2 9B | 9B | Gemma license | Slowly | Strong for size, permissive |
| Mistral Nemo | 12B | Apache 2.0 | Slowly | Excellent multilingual |
| Qwen 2.5 3B | 3B | Apache 2.0 | Yes | Strong on benchmarks for size |

Edge / Device Deployment Scenarios​

| Scenario | Model | Hardware | Latency | Use Case |
| --- | --- | --- | --- | --- |
| Factory floor | Phi-4-mini (quantized INT4) | Intel NUC / Jetson | ~200ms | Equipment log analysis, anomaly classification |
| Retail POS | Llama 3.2 1B | Windows tablet CPU | ~500ms | Product description generation, receipt parsing |
| Vehicle telematics | Phi-4-mini ONNX | ARM Cortex | ~1s | Driving pattern classification |
| Healthcare kiosk | Phi-3.5-mini | GPU-equipped kiosk | ~150ms | Symptom triage (air-gapped, HIPAA) |
| Developer laptop | Phi-4 / Mistral Nemo | Apple M-series GPU | ~100ms | Local code completion, offline IDE assistant |

When SLMs Beat LLMs​

| Scenario | SLM Advantage |
| --- | --- |
| Air-gapped environments | No network required. LLMs need API connectivity. |
| Sub-100ms latency | SLMs on local GPU respond in milliseconds. API calls add network round-trip. |
| High-volume classification | At 1M+ requests/day, API costs for GPT-4o dwarf the cost of a single GPU running Phi-4. |
| Regulatory constraints | Data never leaves the device/network. No third-party processing. |
| Predictable costs | Fixed GPU cost vs variable per-token API cost. |
| Fine-tuned specialists | A fine-tuned 3B model on your domain data can beat a general 100B+ model on your specific task. |
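The high-volume claim survives a quick back-of-envelope check. Both prices below are assumptions for illustration (a rough GPT-4o-class blended token rate and a rough single-GPU VM cost), not quoted Azure figures:

```python
# Back-of-envelope comparison for high-volume classification.
# Both prices are ASSUMED placeholders -- check current Azure pricing.

requests_per_day = 1_000_000
tokens_per_request = 1_000      # blended input + output per request
api_price_per_1k = 0.01         # assumed GPT-4o-class blended $/1K tokens
gpu_vm_monthly = 3_000          # assumed A100-class VM running a Phi-4-size SLM

api_monthly = requests_per_day * 30 * (tokens_per_request / 1000) * api_price_per_1k
# Roughly $300K/month via the API vs ~$3K/month for the GPU VM.
```

Even if every assumed number is off by 2-3x, the two-orders-of-magnitude gap is why SLMs win at this volume.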

2.5 Multimodal Models​

Beyond Text β€” The Multimodal Revolution​

Modern LLMs are no longer text-only. Multimodal models can process and generate across multiple modalities: text, images, audio, and increasingly video.

| Modality | Input | Output | Example Models |
| --- | --- | --- | --- |
| Text | Prompts, conversations | Text responses | All LLMs |
| Image (Vision) | Photos, diagrams, screenshots | Text descriptions, analysis | GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Llama 3.2 Vision |
| Image Generation | Text prompts | Generated images | DALL-E 3, Stable Diffusion |
| Audio (Speech) | Voice recordings, audio files | Transcriptions, text analysis | Whisper, Gemini 2.5 |
| Audio Generation | Text prompts | Spoken audio | OpenAI TTS, ElevenLabs |
| Video | Video files, streams | Text analysis of video content | Gemini 2.5 Pro (native), GPT-4o (frame extraction) |

Vision Capabilities Comparison​

| Capability | GPT-4o | Claude Opus 4 / Sonnet 4 | Gemini 2.5 Pro | Llama 3.2 Vision |
| --- | --- | --- | --- | --- |
| Document analysis | Excellent | Excellent | Excellent | Good |
| Chart/graph reading | Excellent | Excellent | Excellent | Good |
| UI screenshot analysis | Excellent | Excellent | Excellent | Moderate |
| Medical imaging | Good (general) | Good (general) | Good (general) | Limited |
| Multi-image comparison | Yes | Yes | Yes (native) | Limited |
| PDF processing | Via text extraction | Via text extraction | Native PDF input | No |
| Video understanding | Frame-by-frame | Frame-by-frame | Native video input | No |
| Azure availability | Azure OpenAI | AI Foundry | Not on Azure | AI Foundry |

Architecture Implications of Multimodal​

Multimodal inputs change your infrastructure requirements significantly:

| Infrastructure Factor | Text-Only | Text + Vision | Full Multimodal |
| --- | --- | --- | --- |
| Request payload size | KBs | MBs | Tens of MBs |
| Blob storage needed | Minimal | Yes (image store) | Significant (media store) |
| Network bandwidth | Standard | Elevated | High |
| Processing latency | 0.5-5s | 2-15s | 5-60s |
| Token cost per request | Low | 3-5x higher (image tokens) | 5-20x higher |
| Preprocessing pipeline | None | Image resize, format conversion | Transcoding, chunking, frame extraction |

2.6 Model Benchmarks β€” What They Mean​

Major Benchmarks Explained​

When model providers announce new releases, they cite benchmark scores. Understanding what each benchmark measures (and does not measure) is essential for making informed decisions.

| Benchmark | Full Name | What It Measures | Score Format | Limitations |
| --- | --- | --- | --- | --- |
| MMLU | Massive Multitask Language Understanding | Broad knowledge across 57 subjects (STEM, humanities, social sciences) | % accuracy | Multiple choice only; does not test generation quality |
| MMLU-Pro | MMLU Professional | Harder version of MMLU with 10 choices and reasoning-heavy questions | % accuracy | Still multiple choice |
| HumanEval | -- | Python code generation correctness (164 problems) | pass@1 (%) | Python-only; simple function-level problems |
| MBPP | Mostly Basic Python Problems | Broader Python coding (974 problems) | % accuracy | Still basic; does not reflect production code complexity |
| GSM8K | Grade School Math 8K | Grade-school word math problems (8.5K problems) | % accuracy | Simple for frontier models (many score 95%+) |
| MATH | -- | Competition-level mathematics | % accuracy | Very hard; meaningful differentiation between models |
| ARC-Challenge | AI2 Reasoning Challenge | Grade-school science reasoning | % accuracy | Multiple choice; may be too easy for frontier models |
| HellaSwag | -- | Commonsense reasoning (sentence completion) | % accuracy | Saturated -- most frontier models score 95%+ |
| GPQA | Graduate-Level Google-Proof Q&A | PhD-level science questions (experts struggle too) | % accuracy | Small dataset; high variance |
| Arena ELO | Chatbot Arena | Human preference ranking from blind comparisons | ELO rating | Biased toward chat-style tasks; may not reflect enterprise use |
| SWE-Bench | Software Engineering Bench | Real GitHub issue resolution (full repo context) | % resolved | Very hard; reflects real-world coding ability |
| LiveBench | -- | Continuously updated benchmark to prevent contamination | Composite score | Newer, less established |

Why Benchmarks Do Not Tell the Whole Story​

Key problems with relying on benchmarks alone:

  1. Benchmark contamination -- Models may have been trained on benchmark data, inflating scores.
  2. Multiple choice bias -- Many benchmarks are multiple choice, which does not test generation quality.
  3. Task mismatch -- Your production tasks (summarizing customer tickets, extracting invoice data) are nothing like MMLU questions.
  4. Cherry-picking -- Providers highlight benchmarks where they perform best and ignore the rest.
  5. Saturation -- Many benchmarks (HellaSwag, ARC) are effectively "solved" by frontier models, providing zero differentiation.
The Architect's Approach to Model Evaluation

Do not rely solely on benchmarks. Instead:

  1. Define 50-100 representative examples from YOUR actual production data.
  2. Run them through 2-3 candidate models.
  3. Have domain experts blind-evaluate the outputs (which response is better, without knowing which model produced it).
  4. Measure latency, cost, and throughput on YOUR workload pattern.
  5. Only then make a deployment decision.

This process is called "eval-driven development" and it is the gold standard for model selection.
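The blind-evaluation step above (step 3) is mostly bookkeeping: shuffle which model appears as "A" or "B" per example, collect votes, and only unmask at tally time. A minimal sketch, assuming each model's outputs are already collected as a list aligned with the eval set:

```python
import random

# Minimal blind pairwise evaluation harness. Reviewers only ever see the
# anonymized "A"/"B" sides; the model identity is kept in a hidden key.

def make_blind_pairs(eval_set, outputs_a, outputs_b, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i, example in enumerate(eval_set):
        candidates = [("model_a", outputs_a[i]), ("model_b", outputs_b[i])]
        rng.shuffle(candidates)  # randomize which model lands on side A
        pairs.append({
            "input": example,
            "A": candidates[0][1],
            "B": candidates[1][1],
            "_key": {"A": candidates[0][0], "B": candidates[1][0]},  # hidden
        })
    return pairs

def tally(pairs, votes):
    """votes[i] is 'A' or 'B' from a reviewer; returns win counts per model."""
    wins = {"model_a": 0, "model_b": 0}
    for pair, vote in zip(pairs, votes):
        wins[pair["_key"][vote]] += 1
    return wins
```

In practice you would persist the pairs (minus `_key`) to your labeling tool and join the key back in only when computing win rates.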

Arena / ELO Ratings​

The Chatbot Arena (run by LMSys at UC Berkeley) is a live platform where users submit prompts and blindly compare two model responses. Over millions of comparisons, an ELO rating (like chess) is computed. As of early 2026, typical rankings place Gemini 2.5 Pro, Claude Opus 4, GPT-4.1, and o3 in the top tier, often within a few ELO points of each other. The key insight: frontier models are converging in general capability, making factors like cost, latency, data residency, and ecosystem integration more important differentiators than raw quality.


2.7 Azure OpenAI Deployment Types​

This is one of the most critical sections for Azure architects. How you deploy an Azure OpenAI model determines your cost, latency, throughput, and data residency guarantees.

Deployment Types at a Glance​

Detailed Comparison​

| Deployment Type | Billing Model | Data Processing | Routing | Latency | Throughput Guarantee | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Standard | Pay-per-token | Single region you choose | Your region only | Low, but shared capacity | No (best-effort) | Region-specific compliance, predictable geo |
| Global Standard | Pay-per-token (cheapest PAYG) | Microsoft chooses region dynamically | Global (Microsoft routes) | Low (optimized routing) | No (best-effort, but higher ceiling) | Cost optimization, no strict data residency |
| Data Zone Standard | Pay-per-token | Within a data zone (e.g., US, EU) | Within data zone | Low | No (best-effort) | EU/US data residency compliance |
| Provisioned (PTU) | $/PTU/month (reserved) | Single region you choose | Your region only | Consistent, low | Yes (guaranteed throughput) | Production workloads needing guaranteed SLA |
| Global Provisioned | $/PTU/month (reserved) | Microsoft chooses region | Global | Consistent | Yes | High-throughput global apps |
| Data Zone Provisioned | $/PTU/month (reserved) | Within a data zone | Within data zone | Consistent | Yes | Data-residency + guaranteed throughput |
| Global Batch | Pay-per-token (50% discount) | Microsoft chooses region | Global | High (24h turnaround) | No | Large-scale async processing |
| Data Zone Batch | Pay-per-token (50% discount) | Within a data zone | Within data zone | High (24h turnaround) | No | Batch + data residency |

Understanding Provisioned Throughput Units (PTU)​

PTU is Azure OpenAI's reserved capacity model. Instead of paying per token, you reserve a fixed amount of compute capacity.

Key concepts:

  • 1 PTU = a unit of model throughput capacity (not a fixed number of tokens -- it depends on the model)
  • Minimum commitment = typically 50-100 PTUs depending on model
  • Monthly commitment = PTUs are billed monthly whether you use them or not
  • Guaranteed throughput = no throttling, no 429 errors, consistent latency
  • Right-sizing = Azure provides a capacity calculator to estimate PTUs needed
| Factor | Pay-As-You-Go (PAYG) | Provisioned (PTU) |
| --- | --- | --- |
| Cost at low volume | Cheaper | Expensive (paying for unused capacity) |
| Cost at high volume | Can become very expensive | More predictable, often cheaper |
| Break-even point | -- | Typically ~60-70% utilization of reserved capacity |
| Latency consistency | Variable (shared infra) | Consistent (dedicated capacity) |
| Throttling risk | Yes (429 errors under load) | No (capacity is reserved) |
| Commitment | None | Monthly or yearly |
| Scaling | Automatic (with limits) | Manual (add more PTUs) |
PTU Decision Rule

Use PAYG for development, testing, and low-volume production. Use PTU when you have predictable, steady-state production workloads exceeding ~$5,000/month in token costs, or when you cannot tolerate 429 throttling errors.
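The PAYG-vs-PTU decision reduces to comparing a variable token cost against a fixed reservation. The sketch below uses hypothetical numbers (assumed PAYG rate, assumed PTU reservation cost and capacity, not Azure's actual prices) to show the shape of the calculation:

```python
# PAYG vs PTU comparison with HYPOTHETICAL prices and capacity.
# Substitute figures from the Azure pricing page and capacity calculator.

def cheaper_option(monthly_tokens: float,
                   paygo_price_per_1k: float = 0.01,        # assumed $/1K tokens
                   ptu_monthly_cost: float = 20_000,        # assumed reservation
                   ptu_capacity_tokens: float = 3_000_000_000) -> str:
    paygo_cost = (monthly_tokens / 1000) * paygo_price_per_1k
    if monthly_tokens > ptu_capacity_tokens:
        return "PTU (volume exceeds one reservation; size up with the calculator)"
    return "PTU" if ptu_monthly_cost < paygo_cost else "PAYG"
```

With these placeholder numbers, PTU wins once monthly volume passes 2B tokens, i.e. ~67% utilization of the 3B-token reservation, which is consistent with the ~60-70% break-even rule in the table above.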

Data Residency Decision Guide​

| Requirement | Deployment Type | Data Stays In |
| --- | --- | --- |
| No data residency requirement | Global Standard | Anywhere Microsoft operates |
| Data must stay in US | Data Zone Standard (US) | US Azure regions |
| Data must stay in EU | Data Zone Standard (EU) | EU Azure regions |
| Data must stay in specific region | Standard | The one region you specify |
| Strictest compliance (e.g., government) | Standard + Private Endpoint + CMK | Single region, encrypted, network-isolated |

2.8 Model Selection Framework​

The Decision Tree​

Choosing the right model is one of the most impactful decisions an architect makes. This framework guides you through the key decision points.

Selection Factors Matrix​

| Factor | Favors Small/Cheap Models | Favors Large/Premium Models |
| --- | --- | --- |
| Task complexity | Simple, well-defined tasks | Ambiguous, multi-step reasoning |
| Volume | High (millions of requests) | Low (thousands of requests) |
| Latency requirement | Strict (< 1s) | Relaxed (> 5s acceptable) |
| Budget | Tight | Flexible |
| Quality bar | "Good enough" is acceptable | Errors have high cost |
| Data sensitivity | Not sensitive (Global OK) | Highly sensitive (on-prem needed) |
| Context length needed | Short (< 4K tokens) | Long (> 32K tokens) |

The 80/20 Principle for Model Selection​

This is the single most important guideline:

80% of production workloads can use a mini/small model. GPT-4o-mini, GPT-4.1-nano, Claude Haiku, or Phi-4-mini will handle classification, extraction, summarization, reformatting, simple Q&A, and routing. Reserve large models (GPT-4o, Claude Opus 4, o3) for the 20% of tasks that genuinely need them.

Practical Model Pairing Patterns​

Enterprise architectures rarely use a single model. Smart architectures use model routing to send each request to the most cost-effective model.

| Pattern | How It Works | Cost Savings |
| --- | --- | --- |
| Classifier + Worker | A tiny model (nano) classifies the request type, then routes to the appropriate model | 40-60% |
| Small-then-Large | Try the small model first. If confidence is low, escalate to the large model. | 50-70% |
| Reasoning on Demand | Use standard model by default. Switch to reasoning model (o3) only for flagged complex queries. | 60-80% |
| Edge + Cloud | SLM handles simple queries locally. Complex queries go to cloud API. | 30-50% on API costs |
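The control flow behind the Small-then-Large pattern fits in a few lines. This is a sketch with stub models; in production the two callables would wrap your actual small and large deployments (e.g. a mini model and a flagship model), and "confidence" might come from log-probabilities or a self-reported score:

```python
# Small-then-Large routing with stub models. The lambdas and the
# confidence heuristic are illustrative placeholders, not a real API.

def route(request, small_model, large_model, confidence_threshold=0.8):
    answer, confidence = small_model(request)
    if confidence >= confidence_threshold:
        return answer, "small"          # cheap path: small model was confident
    answer, _ = large_model(request)    # escalate only on low confidence
    return answer, "large"

# Stub models: the small one is confident only on short requests.
small = lambda req: (f"small:{req}", 0.9 if len(req) < 20 else 0.4)
large = lambda req: (f"large:{req}", 0.99)
```

The savings come from the escalation rate: if 80% of traffic stays on the small path, the blended per-request cost approaches the small model's price.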

2.9 The Infrastructure Impact​

Model Size vs Hardware Requirements​

Understanding the relationship between model parameters and infrastructure is fundamental for capacity planning.

| Model Size | Parameters | FP16 VRAM | INT8 VRAM | INT4 VRAM | Recommended Azure VM | GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| Tiny | 1-3B | 2-6 GB | 1-3 GB | 0.5-1.5 GB | Standard_NC4as_T4_v3 | 1x T4 |
| Small | 7-14B | 14-28 GB | 7-14 GB | 3.5-7 GB | Standard_NC24ads_A100_v4 | 1x A100 |
| Medium | 30-70B | 60-140 GB | 30-70 GB | 15-35 GB | Standard_NC48ads_A100_v4 | 2x A100 80GB |
| Large | 70-120B | 140-240 GB | 70-120 GB | 35-60 GB | Standard_NC96ads_A100_v4 | 4x A100 80GB |
| XL | 400B+ | 800+ GB | 400+ GB | 200+ GB | Standard_ND96isr_H100_v5 | 8x H100 |
Quantization Explained

FP16 = Full 16-bit floating point precision. Best quality but most VRAM. INT8 = 8-bit integer quantization. ~95-99% quality, half the VRAM. INT4 = 4-bit integer quantization. ~90-95% quality, quarter the VRAM. Quantization is how you fit a 70B model on 2 GPUs instead of 4. The quality trade-off is often negligible for production inference.
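The VRAM columns above follow directly from bytes-per-parameter arithmetic: weights need parameters x bytes per parameter, plus headroom for the KV cache and activations. The ~20% overhead factor below is a rough, workload-dependent assumption, not a fixed rule:

```python
# Rule-of-thumb VRAM sizing behind the table above. The 20% overhead
# for KV cache / activations is an ASSUMED figure; real headroom depends
# on batch size, context length, and serving framework.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billions: float, precision: str = "fp16",
                    overhead: float = 1.2) -> float:
    bytes_needed = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return bytes_needed * overhead / 1e9

# Weights only (overhead=1.0): 70B at FP16 -> 140 GB, at INT4 -> 35 GB,
# matching the Medium row. With overhead, plan for ~20% more.
```

This is why INT4 quantization moves a 70B model from 2x A100 80GB down to a single 80GB card, at the modest quality cost described above.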

Self-Hosting vs Managed API: Cost Comparison​

| Factor | Self-Hosted (Azure VM + GPU) | Managed API (Azure OpenAI) |
| --- | --- | --- |
| Low volume (1K req/day) | Expensive ($3,000-10,000/mo for GPU VM running 24/7) | Cheap ($50-500/mo in tokens) |
| Medium volume (50K req/day) | Moderate ($5,000-15,000/mo) | Moderate ($2,000-10,000/mo) |
| High volume (1M req/day) | Cost-effective ($10,000-30,000/mo, amortized) | Expensive ($20,000-100,000+/mo) |
| Operational burden | High (patching, scaling, monitoring, model updates) | Low (Microsoft manages everything) |
| Latency control | Full (tune batch size, concurrency, quantization) | Limited (shared infrastructure) |
| Model flexibility | Any open-weight model, any version | Only models offered by Azure OpenAI |
| Compliance | Maximum control | Depends on Azure OpenAI DPA |
| Scale-to-zero | Not easily (GPU VMs take minutes to start) | Yes (serverless endpoints) |

Break-even rule of thumb: Self-hosting becomes cost-effective when you are consistently spending more than $10,000-15,000/month on managed API tokens for a single model, AND you have the engineering team to manage GPU infrastructure.
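The rule of thumb can be expressed as a simple comparison, as long as you remember to count the engineering overhead self-hosting adds. Both cost figures below are hypothetical placeholders chosen to land in the $10,000-15,000/month range above:

```python
# Self-hosting break-even heuristic with HYPOTHETICAL costs.
# gpu_fleet_monthly and ops_overhead_monthly are assumptions --
# substitute your actual VM pricing and team cost allocation.

def self_hosting_worthwhile(monthly_api_spend: float,
                            gpu_fleet_monthly: float = 8_000,
                            ops_overhead_monthly: float = 5_000) -> bool:
    """True when steady API spend exceeds the fully loaded self-host cost."""
    return monthly_api_spend > gpu_fleet_monthly + ops_overhead_monthly

# $6K/month API spend: stay on the managed API.
# $25K/month: self-hosting is worth a serious evaluation.
```

The ops-overhead term is the one teams most often forget; without an engineering team to carry it, the break-even point moves much higher than raw GPU pricing suggests.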

Scaling Characteristics​

| Scaling Dimension | Small Models (< 14B) | Large Models (70B+) | Managed API |
| --- | --- | --- | --- |
| Scale-up | Add more powerful GPU | Add more GPUs (tensor parallelism) | Increase rate limits / add PTUs |
| Scale-out | More replicas (easy) | More replicas (expensive) | Automatic (PAYG) or manual (PTU) |
| Cold start | Seconds | Minutes (model loading) | None (always warm) |
| Minimum footprint | 1 GPU | 2-8 GPUs | 0 (pay-per-token) |
| Auto-scaling | HPA on GPU utilization | Complex (GPU memory/utilization) | Built-in (with rate limits) |
| Cost of idle | 1 GPU VM running | 2-8 GPU VMs running | $0 |

Network Bandwidth for Model Serving​

An often-overlooked factor: models generate tokens sequentially, so the bottleneck is usually GPU compute, not network. However, multimodal inputs and model downloading can strain bandwidth.

| Operation | Bandwidth Requirement | Notes |
| --- | --- | --- |
| Text inference | < 1 Mbps per concurrent user | Token-by-token streaming; minimal bandwidth |
| Image input | 5-50 Mbps burst | Uploading images for vision analysis |
| Model download | 10+ Gbps preferred | Downloading a 70B model is ~140 GB; takes ~2 min at 10 Gbps, ~20 min at 1 Gbps |
| Multi-GPU communication | NVLink / InfiniBand | For tensor-parallel inference across GPUs; standard Ethernet adds latency |
| Streaming responses | < 1 Mbps per user | Server-Sent Events (SSE) for token streaming |
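The download times in the table come from the basic relation: transfer time = size in bits / link speed in bits per second, ignoring protocol overhead and storage write speed:

```python
# Sanity check on the model-download row above. Ignores TCP/TLS overhead
# and disk write throughput, which add real-world slack.

def transfer_minutes(size_gb: float, link_gbps: float) -> float:
    return (size_gb * 8) / link_gbps / 60

# A ~140 GB 70B-model download: ~1.9 min at 10 Gbps, ~18.7 min at 1 Gbps
```

The practical implication: on a 1 Gbps link, every node that pulls the model at startup adds ~20 minutes of cold-start time, which is why shared model caches (or baking weights into the VM image) matter for large self-hosted fleets.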

Key Takeaways​

  1. Know the taxonomy. Proprietary, open-weight, and open-source models have fundamentally different infrastructure, cost, and compliance implications. Choose based on your constraints, not hype.

  2. Model families are converging in quality. GPT-4o, Claude Opus 4, Gemini 2.5 Pro, and Llama 4 Maverick are all excellent. Differentiation increasingly comes from cost, latency, data residency, and ecosystem integration rather than raw capability.

  3. Reasoning models are a separate category. The o-series and extended thinking modes spend variable compute on thinking tokens. They are powerful but expensive and slow. Use them deliberately for tasks that require deep reasoning.

  4. Start small. GPT-4o-mini, GPT-4.1-nano, Phi-4, and Claude Haiku can handle 80% of production workloads. Only scale up with evidence.

  5. Azure OpenAI deployment types matter. The difference between Standard, Global Standard, Data Zone, Provisioned, and Batch can mean 2-5x cost differences and completely different compliance postures. Choose intentionally.

  6. Benchmarks are directional, not definitive. Build your own evaluation set from real production data. No benchmark can tell you how well a model will perform on YOUR specific task.

  7. Self-hosting makes sense at scale. Below ~$10K/month in API spend, managed APIs are almost always the right choice. Above that threshold, and especially with strict data residency requirements, self-hosting open-weight models becomes compelling.


What's Next​

You now have the map of the model landscape. In Module 3: Azure AI Foundry, you will learn how to actually deploy, manage, and operationalize these models on Azure -- the Model Catalog, endpoint types, prompt flow, evaluation, and fine-tuning workflows.