Module 2: The LLM Landscape β Models, Families & Selection Guide
Duration: 45-60 minutes | Level: Foundation Audience: Cloud Architects, Platform Engineers, CSAs Last Updated: March 2026
You do not need to train models. You do not need to understand backpropagation. But you absolutely need to know which models exist, what they are good at, how they are licensed, and how to deploy them on Azure. This module gives you the full map.
By the end of this module you will be able to:
- Distinguish proprietary, open-weight, and open-source models and explain why it matters for infrastructure
- Name every major model family and its flagship models
- Understand reasoning models, small language models, and multimodal models as distinct categories
- Read benchmarks critically and know their limitations
- Select the right Azure OpenAI deployment type for a given workload
- Apply a decision framework to choose the right model for any production scenario
2.1 The Model Universeβ
Not all models are created equal and not all models are licensed the same way. Before diving into specific families, you need a clear taxonomy.
Three Categories of Model Opennessβ
| Category | Definition | Examples | Can Self-Host? | Can Fine-Tune? | Can See Weights? |
|---|---|---|---|---|---|
| Proprietary | Model weights are never released. Access only via API. Provider controls everything. | GPT-4o, Claude Opus 4, Gemini 2.5 Pro | No | Limited (provider-hosted) | No |
| Open-Weight | Model weights are publicly downloadable. License may restrict commercial use or modification. Training data/code often not shared. | Llama 4, Mistral Large, Phi-4 | Yes | Yes | Weights only |
| Open-Source | Weights, training code, training data, and methodology are all public. Permissive license. | OLMo, BLOOM, Pythia | Yes | Yes | Everything |
"Open-weight" and "open-source" are not the same thing. Meta's Llama models are open-weight (you can download and run them), but they are not open-source by the OSI definition because Meta's license includes restrictions on commercial use above 700M monthly active users and does not release training data. This distinction matters for legal and procurement teams.
Why This Classification Matters for Infrastructureβ
| Decision Factor | Proprietary (API) | Open-Weight (Self-Hosted) |
|---|---|---|
| Infrastructure | None (managed) | GPU VMs, networking, storage |
| Data residency | Provider's regions | Your regions, your control |
| Cost model | Pay-per-token (variable) | Fixed compute (GPU hours) |
| Latency control | Limited | Full control |
| Scaling | Automatic (with limits) | Manual (AKS, VMSS) |
| Compliance | Depends on provider DPA | Full control |
| Vendor lock-in | High | Low |
| Operational burden | Minimal | Significant |
Licensing Implicationsβ
For architects who need to advise procurement and legal:
| License | Models | Commercial Use | Fine-Tuning | Redistribution | Key Restriction |
|---|---|---|---|---|---|
| Proprietary ToS | GPT-4o, Claude, Gemini | Via API only | Provider-managed only | No | Bound to provider platform |
| Llama Community License | Llama 3.1, 3.2, 4 | Yes (with conditions) | Yes | Yes | 700M MAU threshold requires Meta license |
| Apache 2.0 | Phi-4, Mistral (some), Gemma | Yes | Yes | Yes | No significant restrictions |
| Mistral Research License | Mistral Large (some versions) | Check version | Yes | Limited | May restrict commercial use |
2.2 Major Model Familiesβ
OpenAI / Azure OpenAIβ
The largest and most widely deployed model family in enterprise. Azure OpenAI Service provides the same models as OpenAI's API, but running on Microsoft's Azure infrastructure with enterprise security, compliance, and network integration.
GPT-4 Class Modelsβ
| Model | Context Window | Key Capabilities | Best For | Azure Deployment |
|---|---|---|---|---|
| GPT-4o | 128K tokens | Text, image, audio input. Fast. Flagship multimodal. | General-purpose production workloads, vision tasks | All deployment types |
| GPT-4o-mini | 128K tokens | Smaller, faster, cheaper. 80% of GPT-4o quality. | High-volume, cost-sensitive workloads | All deployment types |
| GPT-4.1 | 1M tokens | Evolution of GPT-4o. Better coding, instruction following. | Complex coding, long-document analysis | Standard, Global Standard |
| GPT-4.1-mini | 1M tokens | Smaller GPT-4.1. Excellent cost/performance. | Production at scale where 4.1 quality is needed cheaply | Standard, Global Standard |
| GPT-4.1-nano | 1M tokens | Smallest GPT-4.1. Ultra-low cost. | Classification, extraction, routing, simple tasks | Standard, Global Standard |
Reasoning Models (o-series)β
| Model | Context Window | Thinking Tokens | Key Capabilities | Best For |
|---|---|---|---|---|
| o1 | 200K tokens | Yes (hidden, billed) | Deep reasoning, math, science, logic | Complex multi-step analysis |
| o3 | 200K tokens | Yes (hidden, billed) | Stronger reasoning, better at STEM | Research, scientific reasoning |
| o4-mini | 200K tokens | Yes (hidden, billed) | Efficient reasoning at lower cost | Reasoning tasks on a budget |
Specialized Modelsβ
| Model | Type | Capability | Best For |
|---|---|---|---|
| DALL-E 3 | Image generation | Text-to-image, high quality | Marketing content, design prototyping |
| Whisper | Speech-to-text | Multilingual transcription | Meeting transcription, voice interfaces |
| Text Embedding (3-small, 3-large) | Embedding | Convert text to vectors | RAG pipelines, semantic search |
GPT-4o-mini and GPT-4.1-nano are your workhorses. At least 80% of production workloads (classification, extraction, summarization, simple Q&A) do NOT need a full GPT-4o or GPT-4.1. Start with the smallest model that meets quality requirements. Scale up only when you have evidence that the smaller model fails.
Azure OpenAI Specificsβ
Azure OpenAI is not just "OpenAI on Azure" -- it adds enterprise-grade features:
- Azure AD / Entra ID authentication (no API keys in production)
- Virtual Network integration via Private Endpoints
- Content filtering enabled by default (configurable)
- Managed Identity support for zero-secret deployments
- Data processing guarantees (your prompts and completions are NOT used for model training)
- Regional deployment for data residency compliance (EU, US, etc.)
- Provisioned throughput (PTU) for guaranteed capacity
Anthropic Claudeβ
Claude is Anthropic's model family, built with a focus on safety, helpfulness, and long-context understanding. Available via Anthropic's API and also accessible through Azure AI Foundry's model catalog (as a third-party model).
| Model | Context Window | Key Capabilities | Best For |
|---|---|---|---|
| Claude Opus 4 | 200K tokens | Most capable Claude. Deep analysis, complex reasoning, extended thinking, tool use. | Complex research, long-document analysis, agentic workflows |
| Claude Sonnet 4 | 200K tokens | Balanced performance and speed. Strong coding. Extended thinking available. | Production workloads needing quality + speed balance |
| Claude Haiku 3.5 | 200K tokens | Fastest, cheapest Claude. Surprisingly capable. | High-volume, cost-sensitive, real-time applications |
Key Differentiators vs GPTβ
| Dimension | Claude | GPT-4o / GPT-4.1 |
|---|---|---|
| Context window | 200K tokens standard | 128K (GPT-4o), 1M (GPT-4.1) |
| Extended thinking | Explicit thinking mode where model reasons step-by-step (visible/hidden thinking tokens) | o-series models have built-in reasoning |
| System prompt adherence | Exceptionally strong -- tends to follow long, complex system prompts very closely | Strong, but can drift on very complex instructions |
| Safety approach | Constitutional AI (trained against a constitution of rules) | RLHF + content filtering layer |
| Coding | Excellent, especially for refactoring and explaining | Excellent, especially for generation |
| Azure availability | Via Azure AI Foundry model catalog (serverless API) | Native Azure OpenAI Service |
| Data residency | Anthropic's US/EU regions | Azure regions worldwide |
Meta Llamaβ
Meta's Llama family is the most significant open-weight model family. It democratized access to large-scale LLMs and created an entire ecosystem of fine-tuned derivatives.
| Model | Parameters | Context Window | Key Capabilities | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 128K tokens | Lightweight, fast, fine-tunable | Edge deployment, task-specific fine-tuning |
| Llama 3.1 70B | 70 billion | 128K tokens | Strong general performance | Self-hosted production workloads |
| Llama 3.1 405B | 405 billion | 128K tokens | Largest open-weight model. Near-GPT-4 quality. | Maximum quality in self-hosted scenarios |
| Llama 3.2 1B | 1 billion | 128K tokens | Ultra-small, on-device capable | Mobile, IoT, edge inference |
| Llama 3.2 3B | 3 billion | 128K tokens | Small but capable | Edge with limited GPU |
| Llama 3.2 11B Vision | 11 billion | 128K tokens | Multimodal (text + image) | Visual understanding on-prem |
| Llama 3.2 90B Vision | 90 billion | 128K tokens | Large multimodal | High-quality vision tasks, self-hosted |
| Llama 4 Scout | 17B active (109B total) | 10M tokens | Mixture-of-Experts (MoE). 16 experts. Extreme context. | Long-context tasks with efficient compute |
| Llama 4 Maverick | 17B active (400B total) | 1M tokens | MoE with 128 experts. Strongest open Llama. | Best open-weight quality, multi-turn, multilingual |
What "Open-Weight" Means for Self-Hostingβ
When you download Llama, you get the trained model weights (the numbers that define the model's knowledge). You can:
- Run inference locally on your own GPU infrastructure
- Fine-tune the model on your domain-specific data
- Quantize the model (reduce precision) to run on smaller hardware
- Deploy via frameworks like vLLM, TGI, or Ollama
Infrastructure implications:
| Model Size | Minimum VRAM (FP16) | Recommended GPU | Monthly Azure VM Cost (est.) |
|---|---|---|---|
| 1B-3B | 4-8 GB | Single T4 or A10 | ~$300-600 |
| 8B | 16 GB | Single A10 or A100 | ~$600-2,000 |
| 70B | 140 GB | 2x A100 80GB | ~$6,000-10,000 |
| 405B | 810 GB | 8-16x A100 80GB | ~$30,000-60,000 |
All Llama models are available on Azure AI Foundry via the Model Catalog. You can deploy them as Serverless API (pay-per-token, similar to Azure OpenAI) or as Managed Compute (dedicated VM with the model). No need to download weights manually.
Google Geminiβ
Google's Gemini models are natively multimodal -- trained from the ground up on text, images, audio, and video, rather than having vision bolted on after text training.
| Model | Context Window | Key Capabilities | Best For |
|---|---|---|---|
| Gemini 2.0 Flash | 1M tokens | Fast, multimodal, tool use, agentic | Low-latency multimodal tasks |
| Gemini 2.5 Pro | 1M tokens | Most capable Gemini. Thinking mode built-in. | Complex reasoning, long-context, multimodal analysis |
| Gemini 2.5 Flash | 1M tokens | Fast + thinking. Balanced cost/quality. | Production multimodal at scale |
Azure Availabilityβ
Gemini models are not natively available on Azure AI Foundry as of March 2026. They are accessed via Google's Vertex AI or Google AI Studio. However, Azure AI Foundry's Model Catalog does include some Google models:
- Gemma (Google's open-weight small models) -- available on Azure AI Foundry
- Gemma 2 (2B, 9B, 27B) -- Apache 2.0 licensed, deployable on Azure
For multi-cloud architectures, teams sometimes use Gemini via Google's API alongside Azure OpenAI, but this adds complexity around data residency, networking, and billing management.
Microsoft Phi (Small Language Models)β
Microsoft's Phi family proves that smaller models, trained on high-quality data, can rival models 10-100x their size on specific tasks. These are Small Language Models (SLMs) -- purpose-built to be efficient.
| Model | Parameters | Context Window | Key Capabilities | Best For |
|---|---|---|---|---|
| Phi-3-mini | 3.8B | 128K tokens | Strong reasoning for its size | Edge, mobile, on-device |
| Phi-3-small | 7B | 128K tokens | Multilingual, strong on benchmarks | Cost-efficient general tasks |
| Phi-3-medium | 14B | 128K tokens | Best Phi-3 quality | Production tasks needing local deployment |
| Phi-3.5-mini | 3.8B | 128K tokens | Improved over Phi-3-mini, multilingual | Edge with multilingual needs |
| Phi-3.5-MoE | 6.6B active (42B total) | 128K tokens | Mixture-of-Experts architecture | High quality at efficient compute |
| Phi-3.5-vision | 4.2B | 128K tokens | Multimodal (text + image) | Visual tasks on constrained hardware |
| Phi-4 | 14B | 16K tokens | Strongest Phi. STEM reasoning leader. | Math, science, structured reasoning |
| Phi-4-mini | 3.8B | 128K tokens | Compact, strong reasoning | Lightweight deployment with good quality |
| Phi-4-multimodal | 5.6B | 128K tokens | Text + image + audio | Multimodal on-device scenarios |
Why SLMs Matter for Architectsβ
- Run on CPUs -- Models under 4B parameters can run (slowly) on CPUs. No GPU required.
- Edge deployment -- Azure IoT Edge, Windows devices, even mobile phones.
- Cost -- Orders of magnitude cheaper than large model API calls at high volume.
- Latency -- Smaller models respond faster. Critical for real-time scenarios.
- Data sovereignty -- Run entirely on-premises, air-gapped, zero data leaving your network.
All Phi models are available on Azure AI Foundry via the Model Catalog with Serverless API and Managed Compute deployment options. Phi-4 and Phi-3.5 models are also optimized for ONNX Runtime and can run on Windows devices via Windows Copilot Runtime.
Mistralβ
Mistral AI is a French AI company that has quickly become a major force, particularly strong in European markets and multilingual workloads.
| Model | Parameters | Context Window | Key Capabilities | Best For |
|---|---|---|---|---|
| Mistral Large | ~123B | 128K tokens | Flagship. Strong reasoning, multilingual. | Complex enterprise tasks, European compliance |
| Mistral Small | ~22B | 128K tokens | Efficient, fast | Cost-sensitive production |
| Mistral Nemo | 12B | 128K tokens | Apache 2.0 license. Very capable for size. | Self-hosted, fine-tuning friendly |
| Codestral | 22B | 32K tokens | Code-specialized | Code generation, completion, review |
| Pixtral | 12B | 128K tokens | Multimodal (text + image) | Visual understanding at low cost |
Key Strengthsβ
- Multilingual excellence -- particularly strong in European languages (French, German, Spanish, Italian)
- European data sovereignty -- Mistral offers EU-hosted endpoints compliant with EU AI Act
- Apache 2.0 licensing on several models -- true permissive open-source
- Available on Azure AI Foundry -- both Serverless API and Managed Compute options
Comprehensive Model Comparison Tableβ
| Model | Provider | Parameters | Context Window | Strengths | Azure Available | Best For |
|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | Undisclosed | 128K | Multimodal, fast, versatile | Azure OpenAI (native) | General-purpose production |
| GPT-4o-mini | OpenAI | Undisclosed | 128K | Cost-efficient, 80% of 4o quality | Azure OpenAI (native) | High-volume workloads |
| GPT-4.1 | OpenAI | Undisclosed | 1M | Long context, strong coding | Azure OpenAI (native) | Code, long documents |
| GPT-4.1-mini | OpenAI | Undisclosed | 1M | Great cost/quality ratio | Azure OpenAI (native) | Scaled production |
| GPT-4.1-nano | OpenAI | Undisclosed | 1M | Ultra-cheap, fast | Azure OpenAI (native) | Classification, routing |
| o3 | OpenAI | Undisclosed | 200K | Deep reasoning, STEM | Azure OpenAI (native) | Complex analysis, math |
| o4-mini | OpenAI | Undisclosed | 200K | Efficient reasoning | Azure OpenAI (native) | Reasoning on a budget |
| Claude Opus 4 | Anthropic | Undisclosed | 200K | Strongest Claude, extended thinking | AI Foundry (serverless) | Complex research, agents |
| Claude Sonnet 4 | Anthropic | Undisclosed | 200K | Balanced quality/speed | AI Foundry (serverless) | Production coding, analysis |
| Claude Haiku 3.5 | Anthropic | Undisclosed | 200K | Fastest Claude, very cheap | AI Foundry (serverless) | High-volume, real-time |
| Llama 4 Maverick | Meta | 17B active / 400B | 1M | MoE, strong open-weight | AI Foundry (serverless/managed) | Best open-weight quality |
| Llama 4 Scout | Meta | 17B active / 109B | 10M | Extreme context length | AI Foundry (serverless/managed) | Very long context tasks |
| Llama 3.1 405B | Meta | 405B | 128K | Largest dense open model | AI Foundry (managed) | Maximum open self-hosted quality |
| Llama 3.1 70B | Meta | 70B | 128K | Solid open-weight workhorse | AI Foundry (both) | Self-hosted production |
| Gemini 2.5 Pro | Undisclosed | 1M | Native multimodal, thinking | Not on Azure (Vertex AI) | Multimodal analysis | |
| Gemma 2 27B | 27B | 8K | Apache 2.0, open | AI Foundry (managed) | Permissive self-hosting | |
| Phi-4 | Microsoft | 14B | 16K | STEM reasoning, small | AI Foundry (both) | Edge, cost-efficient reasoning |
| Phi-4-multimodal | Microsoft | 5.6B | 128K | Text + image + audio, tiny | AI Foundry (both) | On-device multimodal |
| Mistral Large | Mistral | ~123B | 128K | Multilingual, European | AI Foundry (serverless) | EU compliance workloads |
| Mistral Nemo | Mistral | 12B | 128K | Apache 2.0, fine-tunable | AI Foundry (both) | Custom fine-tuning |
2.3 Reasoning Models β A New Categoryβ
In early 2024, a fundamentally new type of model emerged: reasoning models. These are not just larger or better-trained versions of existing models. They represent a different approach to how LLMs solve problems.
Standard Models vs Reasoning Modelsβ
How Reasoning Models Workβ
-
Thinking tokens -- Before producing a visible answer, the model generates internal reasoning tokens (a "chain-of-thought") that the user may or may not see. These tokens are processed and billed but represent the model "working through the problem."
-
Variable compute -- A standard model spends roughly the same compute on "What is 2+2?" and "Prove Fermat's Last Theorem." A reasoning model automatically allocates more thinking tokens to harder problems.
-
Self-correction -- During the thinking phase, the model can identify mistakes in its own reasoning, backtrack, and try a different approach. Standard models cannot do this.
Reasoning Model Landscapeβ
| Model | Provider | Approach | Thinking Visibility | Cost vs Base Model |
|---|---|---|---|---|
| o1 | OpenAI | Built-in CoT reasoning | Hidden (summary only) | ~3-6x GPT-4o |
| o3 | OpenAI | Advanced reasoning, tool use | Hidden (summary only) | ~4-8x GPT-4o |
| o4-mini | OpenAI | Efficient reasoning | Hidden (summary only) | ~2-3x GPT-4o-mini |
| Claude with Extended Thinking | Anthropic | Explicit thinking mode | Visible thinking block | Varies by thinking budget |
| Gemini 2.5 Pro (thinking) | Built-in thinking mode | Configurable | Included in standard pricing |
When to Use Reasoning vs Standard Modelsβ
| Use Case | Standard Model | Reasoning Model |
|---|---|---|
| Simple Q&A, summarization | Preferred (faster, cheaper) | Overkill |
| Classification, extraction | Preferred | Overkill |
| Multi-step math problems | May struggle | Excels |
| Complex code generation | Good | Better, but slower |
| Scientific analysis | Adequate | Significantly better |
| Planning and strategy | Adequate | Significantly better |
| Ambiguous/nuanced problems | May miss nuance | Handles nuance well |
| Real-time chat | Required (low latency) | Too slow |
| Batch analysis of complex data | Possible | Ideal |
Cost Implications of Thinking Tokensβ
This is critical for architects planning capacity:
Standard model: Input tokens + Output tokens = Total billed tokens
Reasoning model: Input tokens + Thinking tokens + Output tokens = Total billed tokens
A reasoning model answering a hard question might generate 10,000-50,000 thinking tokens before producing a 500-token answer. You are billed for all of them. For simple questions, the model may generate very few thinking tokens -- variable compute means variable cost.
Reasoning models can be unpredictably expensive because thinking token count varies dramatically by question difficulty. For budgeting purposes, assume 5-10x the cost of an equivalent standard model when using reasoning models. Monitor thinking token usage closely during initial deployment.
2.4 Small Language Models (SLMs)β
Why Bigger Is Not Always Betterβ
The industry narrative of "bigger = better" is misleading for production systems. A 14B-parameter Phi-4 can outperform GPT-3.5-Turbo (175B parameters) on math and reasoning benchmarks. The key insight: data quality and training methodology matter more than raw parameter count.
The SLM Landscapeβ
| Model | Parameters | License | Runs on CPU? | Key Strength |
|---|---|---|---|---|
| Phi-4 | 14B | MIT | Slowly | STEM reasoning champion |
| Phi-4-mini | 3.8B | MIT | Yes | Compact with strong reasoning |
| Phi-3.5-mini | 3.8B | MIT | Yes | Multilingual, long context |
| Llama 3.2 1B | 1B | Llama License | Yes | Ultra-small, on-device |
| Llama 3.2 3B | 3B | Llama License | Yes | Slightly larger, better quality |
| Gemma 2 2B | 2B | Apache 2.0 | Yes | Permissive license, fine-tunable |
| Gemma 2 9B | 9B | Apache 2.0 | Slowly | Strong for size, permissive |
| Mistral Nemo | 12B | Apache 2.0 | Slowly | Excellent multilingual |
| Qwen 2.5 3B | 3B | Apache 2.0 | Yes | Strong on benchmarks for size |
Edge / Device Deployment Scenariosβ
| Scenario | Model | Hardware | Latency | Use Case |
|---|---|---|---|---|
| Factory floor | Phi-4-mini (quantized INT4) | Intel NUC / Jetson | ~200ms | Equipment log analysis, anomaly classification |
| Retail POS | Llama 3.2 1B | Windows tablet CPU | ~500ms | Product description generation, receipt parsing |
| Vehicle telematics | Phi-4-mini ONNX | ARM Cortex | ~1s | Driving pattern classification |
| Healthcare kiosk | Phi-3.5-mini | GPU-equipped kiosk | ~150ms | Symptom triage (air-gapped, HIPAA) |
| Developer laptop | Phi-4 / Mistral Nemo | Apple M-series GPU | ~100ms | Local code completion, offline IDE assistant |
When SLMs Beat LLMsβ
| Scenario | SLM Advantage |
|---|---|
| Air-gapped environments | No network required. LLMs need API connectivity. |
| Sub-100ms latency | SLMs on local GPU respond in milliseconds. API calls add network round-trip. |
| High-volume classification | At 1M+ requests/day, API costs for GPT-4o dwarf the cost of a single GPU running Phi-4. |
| Regulatory constraints | Data never leaves the device/network. No third-party processing. |
| Predictable costs | Fixed GPU cost vs variable per-token API cost. |
| Fine-tuned specialists | A fine-tuned 3B model on your domain data can beat a general 100B+ model on your specific task. |
2.5 Multimodal Modelsβ
Beyond Text β The Multimodal Revolutionβ
Modern LLMs are no longer text-only. Multimodal models can process and generate across multiple modalities: text, images, audio, and increasingly video.
| Modality | Input | Output | Example Models |
|---|---|---|---|
| Text | Prompts, conversations | Text responses | All LLMs |
| Image (Vision) | Photos, diagrams, screenshots | Text descriptions, analysis | GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Llama 3.2 Vision |
| Image Generation | Text prompts | Generated images | DALL-E 3, Stable Diffusion |
| Audio (Speech) | Voice recordings, audio files | Transcriptions, text analysis | Whisper, Gemini 2.5 |
| Audio Generation | Text prompts | Spoken audio | OpenAI TTS, ElevenLabs |
| Video | Video files, streams | Text analysis of video content | Gemini 2.5 Pro (native), GPT-4o (frame extraction) |
Vision Capabilities Comparisonβ
| Capability | GPT-4o | Claude Opus 4 / Sonnet 4 | Gemini 2.5 Pro | Llama 3.2 Vision |
|---|---|---|---|---|
| Document analysis | Excellent | Excellent | Excellent | Good |
| Chart/graph reading | Excellent | Excellent | Excellent | Good |
| UI screenshot analysis | Excellent | Excellent | Excellent | Moderate |
| Medical imaging | Good (general) | Good (general) | Good (general) | Limited |
| Multi-image comparison | Yes | Yes | Yes (native) | Limited |
| PDF processing | Via text extraction | Via text extraction | Native PDF input | No |
| Video understanding | Frame-by-frame | Frame-by-frame | Native video input | No |
| Azure availability | Azure OpenAI | AI Foundry | Not on Azure | AI Foundry |
Architecture Implications of Multimodalβ
Multimodal inputs change your infrastructure requirements significantly:
| Infrastructure Factor | Text-Only | Text + Vision | Full Multimodal |
|---|---|---|---|
| Request payload size | KBs | MBs | Tens of MBs |
| Blob storage needed | Minimal | Yes (image store) | Significant (media store) |
| Network bandwidth | Standard | Elevated | High |
| Processing latency | 0.5-5s | 2-15s | 5-60s |
| Token cost per request | Low | 3-5x higher (image tokens) | 5-20x higher |
| Preprocessing pipeline | None | Image resize, format conversion | Transcoding, chunking, frame extraction |
2.6 Model Benchmarks β What They Meanβ
Major Benchmarks Explainedβ
When model providers announce new releases, they cite benchmark scores. Understanding what each benchmark measures (and does not measure) is essential for making informed decisions.
| Benchmark | Full Name | What It Measures | Score Format | Limitations |
|---|---|---|---|---|
| MMLU | Massive Multitask Language Understanding | Broad knowledge across 57 subjects (STEM, humanities, social sciences) | % accuracy | Multiple choice only; does not test generation quality |
| MMLU-Pro | MMLU Professional | Harder version of MMLU with 10 choices and reasoning-heavy questions | % accuracy | Still multiple choice |
| HumanEval | -- | Python code generation correctness (164 problems) | pass@1 (%) | Python-only; simple function-level problems |
| MBPP | Mostly Basic Python Problems | Broader Python coding (974 problems) | % accuracy | Still basic; does not reflect production code complexity |
| GSM8K | Grade School Math 8K | Grade-school word math problems (8.5K problems) | % accuracy | Simple for frontier models (many score 95%+) |
| MATH | -- | Competition-level mathematics | % accuracy | Very hard; meaningful differentiation between models |
| ARC-Challenge | AI2 Reasoning Challenge | Grade-school science reasoning | % accuracy | Multiple choice; may be too easy for frontier models |
| HellaSwag | -- | Commonsense reasoning (sentence completion) | % accuracy | Saturated -- most frontier models score 95%+ |
| GPQA | Graduate-Level Google-Proof Q&A | PhD-level science questions (experts struggle too) | % accuracy | Small dataset; high variance |
| Arena ELO | Chatbot Arena | Human preference ranking from blind comparisons | ELO rating | Biased toward chat-style tasks; may not reflect enterprise use |
| SWE-Bench | Software Engineering Bench | Real GitHub issue resolution (full repo context) | % resolved | Very hard; reflects real-world coding ability |
| LiveBench | -- | Continuously updated benchmark to prevent contamination | Composite score | Newer, less established |
Why Benchmarks Do Not Tell the Whole Storyβ
Key problems with relying on benchmarks alone:
- Benchmark contamination -- Models may have been trained on benchmark data, inflating scores.
- Multiple choice bias -- Many benchmarks are multiple choice, which does not test generation quality.
- Task mismatch -- Your production tasks (summarizing customer tickets, extracting invoice data) are nothing like MMLU questions.
- Cherry-picking -- Providers highlight benchmarks where they perform best and ignore the rest.
- Saturation -- Many benchmarks (HellaSwag, ARC) are effectively "solved" by frontier models, providing zero differentiation.
Do not rely solely on benchmarks. Instead:
- Define 50-100 representative examples from YOUR actual production data.
- Run them through 2-3 candidate models.
- Have domain experts blind-evaluate the outputs (which response is better, without knowing which model produced it).
- Measure latency, cost, and throughput on YOUR workload pattern.
- Only then make a deployment decision.
This process is called "eval-driven development" and it is the gold standard for model selection.
Arena / ELO Ratingsβ
The Chatbot Arena (run by LMSys at UC Berkeley) is a live platform where users submit prompts and blindly compare two model responses. Over millions of comparisons, an ELO rating (like chess) is computed. As of early 2026, typical rankings place Gemini 2.5 Pro, Claude Opus 4, GPT-4.1, and o3 in the top tier, often within a few ELO points of each other. The key insight: frontier models are converging in general capability, making factors like cost, latency, data residency, and ecosystem integration more important differentiators than raw quality.
2.7 Azure OpenAI Deployment Typesβ
This is one of the most critical sections for Azure architects. How you deploy an Azure OpenAI model determines your cost, latency, throughput, and data residency guarantees.
Deployment Types at a Glanceβ
Detailed Comparisonβ
| Deployment Type | Billing Model | Data Processing | Routing | Latency | Throughput Guarantee | Best For |
|---|---|---|---|---|---|---|
| Standard | Pay-per-token | Single region you choose | Your region only | Low, but shared capacity | No (best-effort) | Region-specific compliance, predictable geo |
| Global Standard | Pay-per-token (cheapest PAYG) | Microsoft chooses region dynamically | Global (Microsoft routes) | Low (optimized routing) | No (best-effort, but higher ceiling) | Cost optimization, no strict data residency |
| Data Zone Standard | Pay-per-token | Within a data zone (e.g., US, EU) | Within data zone | Low | No (best-effort) | EU/US data residency compliance |
| Provisioned (PTU) | $/PTU/month (reserved) | Single region you choose | Your region only | Consistent, low | Yes (guaranteed throughput) | Production workloads needing guaranteed SLA |
| Global Provisioned | $/PTU/month (reserved) | Microsoft chooses region | Global | Consistent | Yes | High-throughput global apps |
| Data Zone Provisioned | $/PTU/month (reserved) | Within a data zone | Within data zone | Consistent | Yes | Data-residency + guaranteed throughput |
| Global Batch | Pay-per-token (50% discount) | Microsoft chooses region | Global | High (24h turnaround) | No | Large-scale async processing |
| Data Zone Batch | Pay-per-token (50% discount) | Within a data zone | Within data zone | High (24h turnaround) | No | Batch + data residency |
Understanding Provisioned Throughput Units (PTU)β
PTU is Azure OpenAI's reserved capacity model. Instead of paying per token, you reserve a fixed amount of compute capacity.
Key concepts:
- 1 PTU = a unit of model throughput capacity (not a fixed number of tokens -- it depends on the model)
- Minimum commitment = typically 50-100 PTUs depending on model
- Monthly commitment = PTUs are billed monthly whether you use them or not
- Guaranteed throughput = no throttling, no 429 errors, consistent latency
- Right-sizing = Azure provides a capacity calculator to estimate PTUs needed
| Factor | Pay-As-You-Go (PAYG) | Provisioned (PTU) |
|---|---|---|
| Cost at low volume | Cheaper | Expensive (paying for unused capacity) |
| Cost at high volume | Can become very expensive | More predictable, often cheaper |
| Break-even point | -- | Typically ~60-70% utilization of reserved capacity |
| Latency consistency | Variable (shared infra) | Consistent (dedicated capacity) |
| Throttling risk | Yes (429 errors under load) | No (capacity is reserved) |
| Commitment | None | Monthly or yearly |
| Scaling | Automatic (with limits) | Manual (add more PTUs) |
Use PAYG for development, testing, and low-volume production. Use PTU when you have predictable, steady-state production workloads exceeding ~$5,000/month in token costs, or when you cannot tolerate 429 throttling errors.
Data Residency Decision Guideβ
| Requirement | Deployment Type | Data Stays In |
|---|---|---|
| No data residency requirement | Global Standard | Anywhere Microsoft operates |
| Data must stay in US | Data Zone Standard (US) | US Azure regions |
| Data must stay in EU | Data Zone Standard (EU) | EU Azure regions |
| Data must stay in specific region | Standard | The one region you specify |
| Strictest compliance (e.g., government) | Standard + Private Endpoint + CMK | Single region, encrypted, network-isolated |
2.8 Model Selection Frameworkβ
The Decision Treeβ
Choosing the right model is one of the most impactful decisions an architect makes. This framework guides you through the key decision points.
Selection Factors Matrixβ
| Factor | Favors Small/Cheap Models | Favors Large/Premium Models |
|---|---|---|
| Task complexity | Simple, well-defined tasks | Ambiguous, multi-step reasoning |
| Volume | High (millions of requests) | Low (thousands of requests) |
| Latency requirement | Strict (< 1s) | Relaxed (> 5s acceptable) |
| Budget | Tight | Flexible |
| Quality bar | "Good enough" is acceptable | Errors have high cost |
| Data sensitivity | Not sensitive (Global OK) | Highly sensitive (on-prem needed) |
| Context length needed | Short (< 4K tokens) | Long (> 32K tokens) |
The 80/20 Principle for Model Selectionβ
This is the single most important guideline:
80% of production workloads can use a mini/small model. GPT-4o-mini, GPT-4.1-nano, Claude Haiku, or Phi-4-mini will handle classification, extraction, summarization, reformatting, simple Q&A, and routing. Reserve large models (GPT-4o, Claude Opus 4, o3) for the 20% of tasks that genuinely need them.
Practical Model Pairing Patternsβ
Enterprise architectures rarely use a single model. Smart architectures use model routing to send each request to the most cost-effective model.
| Pattern | How It Works | Cost Savings |
|---|---|---|
| Classifier + Worker | A tiny model (nano) classifies the request type, then routes to the appropriate model | 40-60% |
| Small-then-Large | Try the small model first. If confidence is low, escalate to the large model. | 50-70% |
| Reasoning on Demand | Use standard model by default. Switch to reasoning model (o3) only for flagged complex queries. | 60-80% |
| Edge + Cloud | SLM handles simple queries locally. Complex queries go to cloud API. | 30-50% on API costs |
2.9 The Infrastructure Impactβ
Model Size vs Hardware Requirementsβ
Understanding the relationship between model parameters and infrastructure is fundamental for capacity planning.
| Model Size | Parameters | FP16 VRAM | INT8 VRAM | INT4 VRAM | Recommended Azure VM | GPUs |
|---|---|---|---|---|---|---|
| Tiny | 1-3B | 2-6 GB | 1-3 GB | 0.5-1.5 GB | Standard_NC4as_T4_v3 | 1x T4 |
| Small | 7-14B | 14-28 GB | 7-14 GB | 3.5-7 GB | Standard_NC24ads_A100_v4 | 1x A100 |
| Medium | 30-70B | 60-140 GB | 30-70 GB | 15-35 GB | Standard_NC48ads_A100_v4 | 2x A100 80GB |
| Large | 70-120B | 140-240 GB | 70-120 GB | 35-60 GB | Standard_NC96ads_A100_v4 | 4x A100 80GB |
| XL | 400B+ | 800+ GB | 400+ GB | 200+ GB | Standard_ND96isr_H100_v5 | 8x H100 |
FP16 = Full 16-bit floating point precision. Best quality but most VRAM. INT8 = 8-bit integer quantization. ~95-99% quality, half the VRAM. INT4 = 4-bit integer quantization. ~90-95% quality, quarter the VRAM. Quantization is how you fit a 70B model on 2 GPUs instead of 4. The quality trade-off is often negligible for production inference.
Self-Hosting vs Managed API: Cost Comparisonβ
| Factor | Self-Hosted (Azure VM + GPU) | Managed API (Azure OpenAI) |
|---|---|---|
| Low volume (1K req/day) | Expensive ($3,000-10,000/mo for GPU VM running 24/7) | Cheap ($50-500/mo in tokens) |
| Medium volume (50K req/day) | Moderate ($5,000-15,000/mo) | Moderate ($2,000-10,000/mo) |
| High volume (1M req/day) | Cost-effective ($10,000-30,000/mo, amortized) | Expensive ($20,000-100,000+/mo) |
| Operational burden | High (patching, scaling, monitoring, model updates) | Low (Microsoft manages everything) |
| Latency control | Full (tune batch size, concurrency, quantization) | Limited (shared infrastructure) |
| Model flexibility | Any open-weight model, any version | Only models offered by Azure OpenAI |
| Compliance | Maximum control | Depends on Azure OpenAI DPA |
| Scale-to-zero | Not easily (GPU VMs take minutes to start) | Yes (serverless endpoints) |
Break-even rule of thumb: Self-hosting becomes cost-effective when you are consistently spending more than $10,000-15,000/month on managed API tokens for a single model, AND you have the engineering team to manage GPU infrastructure.
Scaling Characteristicsβ
| Scaling Dimension | Small Models (< 14B) | Large Models (70B+) | Managed API |
|---|---|---|---|
| Scale-up | Add more powerful GPU | Add more GPUs (tensor parallelism) | Increase rate limits / add PTUs |
| Scale-out | More replicas (easy) | More replicas (expensive) | Automatic (PAYG) or manual (PTU) |
| Cold start | Seconds | Minutes (model loading) | None (always warm) |
| Minimum footprint | 1 GPU | 2-8 GPUs | 0 (pay-per-token) |
| Auto-scaling | HPA on GPU utilization | Complex (GPU memory/utilization) | Built-in (with rate limits) |
| Cost of idle | 1 GPU VM running | 2-8 GPU VMs running | $0 |
Network Bandwidth for Model Servingβ
An often-overlooked factor: models generate tokens sequentially, so the bottleneck is usually GPU compute, not network. However, multimodal inputs and model downloading can strain bandwidth.
| Operation | Bandwidth Requirement | Notes |
|---|---|---|
| Text inference | < 1 Mbps per concurrent user | Token-by-token streaming; minimal bandwidth |
| Image input | 5-50 Mbps burst | Uploading images for vision analysis |
| Model download | 10+ Gbps preferred | Downloading a 70B model is ~140 GB; takes 2 min at 10 Gbps, ~20 min at 1 Gbps |
| Multi-GPU communication | NVLink / InfiniBand | For tensor-parallel inference across GPUs; standard Ethernet adds latency |
| Streaming responses | < 1 Mbps per user | Server-Sent Events (SSE) for token streaming |
Key Takeawaysβ
-
Know the taxonomy. Proprietary, open-weight, and open-source models have fundamentally different infrastructure, cost, and compliance implications. Choose based on your constraints, not hype.
-
Model families are converging in quality. GPT-4o, Claude Opus 4, Gemini 2.5 Pro, and Llama 4 Maverick are all excellent. Differentiation increasingly comes from cost, latency, data residency, and ecosystem integration rather than raw capability.
-
Reasoning models are a separate category. The o-series and extended thinking modes spend variable compute on thinking tokens. They are powerful but expensive and slow. Use them deliberately for tasks that require deep reasoning.
-
Start small. GPT-4o-mini, GPT-4.1-nano, Phi-4, and Claude Haiku can handle 80% of production workloads. Only scale up with evidence.
-
Azure OpenAI deployment types matter. The difference between Standard, Global Standard, Data Zone, Provisioned, and Batch can mean 2-5x cost differences and completely different compliance postures. Choose intentionally.
-
Benchmarks are directional, not definitive. Build your own evaluation set from real production data. No benchmark can tell you how well a model will perform on YOUR specific task.
-
Self-hosting makes sense at scale. Below ~$10K/month in API spend, managed APIs are almost always the right choice. Above that threshold, and especially with strict data residency requirements, self-hosting open-weight models becomes compelling.
What's Nextβ
You now have the map of the model landscape. In Module 3: Azure AI Foundry, you will learn how to actually deploy, manage, and operationalize these models on Azure -- the Model Catalog, endpoint types, prompt flow, evaluation, and fine-tuning workflows.
Previous: Module 1: GenAI Foundations | Next: Module 3: Azure AI Foundry | Home: AI Nexus Index