Module 3: Azure AI Foundry -- Microsoft's AI Platform Deep Dive
Duration: 60-90 minutes | Level: Platform | Audience: Cloud Architects, Platform Engineers, CSAs | Last Updated: March 2026
3.1 What is Azure AI Foundry?
Azure AI Foundry is Microsoft's unified platform for building, evaluating, and deploying generative AI applications at enterprise scale. If you are an infrastructure or platform architect, think of it as the control plane for everything AI inside Azure -- model catalog, deployments, evaluation pipelines, prompt orchestration, fine-tuning, and integrated AI services all accessible through a single portal, SDK, and CLI.
The Evolution
The branding journey matters because customers encounter all three names in documentation, blog posts, and portal URLs.
| Era | Name | What It Was | Key Limitation |
|---|---|---|---|
| 2016-2022 | Azure Machine Learning Studio | Drag-and-drop ML training and deployment | Focused on classical ML; poor LLM support |
| 2023-2024 | Azure AI Studio | Preview portal for generative AI projects | Separate from Azure ML; fragmented experience |
| Late 2024+ | Azure AI Foundry | Unified platform merging Azure ML + AI Studio | Current GA platform -- this module's focus |
Azure AI Foundry is not a separate resource type that replaces Azure ML. Under the hood, the Azure ML workspace resource (Microsoft.MachineLearningServices/workspaces) is still the ARM building block. Foundry is a unified experience layer that consolidates model catalog, prompt flow, evaluation, deployments, and AI services into a single portal and SDK.
Three Interfaces, One Platform
| Interface | Best For | Example Use Case |
|---|---|---|
| Azure AI Foundry portal (ai.azure.com) | Exploration, visual prompt flow building, model comparison | A CSA demonstrating RAG to a customer |
| Azure AI Foundry SDK (Python) | Programmatic model deployment, evaluation pipelines, CI/CD | A platform team automating model rollouts |
| Azure CLI (az ml) | Infrastructure provisioning, DevOps integration | An IaC pipeline deploying Hubs and Projects |
Think of Azure AI Foundry as "Azure Resource Manager for AI workloads." Just as ARM gives you a control plane for VMs, networking, and storage, Foundry gives you a control plane for models, endpoints, evaluations, and AI services -- with the same RBAC, networking, and compliance story you already know.
3.2 Architecture & Resource Model
This is the section that matters most to platform architects. Azure AI Foundry introduces a two-tier workspace hierarchy: Hubs and Projects.
Hub and Project Model
Hub (Parent Resource)
The AI Foundry Hub is the shared administrative boundary. It owns:
- Connections -- credentials and endpoints for Azure OpenAI, Azure AI Search, Storage, Key Vault, and external services (e.g., a Snowflake database, a custom API)
- Compute resources -- shared compute instances and clusters that Projects can use
- Networking configuration -- public access, private endpoints, managed VNet
- Security policies -- RBAC role assignments, managed identity, customer-managed keys
- Container Registry -- shared ACR for custom model images
A Hub maps to the ARM resource type Microsoft.MachineLearningServices/workspaces with kind: hub.
Project (Child Resource)
A Project is an isolated workspace scoped to a single AI application or workload. It inherits connections and compute from the parent Hub but maintains its own:
- Model deployments and endpoints
- Prompt flow definitions
- Evaluation runs and datasets
- Fine-tuning jobs
- Artifacts and logs
A Project maps to the ARM resource type Microsoft.MachineLearningServices/workspaces with kind: project and a hubResourceId pointing to its parent Hub.
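The parent/child relationship can be sketched as two ARM-style resource bodies that share a resource type and differ by `kind`. This is an illustration only: the helper functions, subscription ID, resource group, and workspace names are hypothetical placeholders, placing `hubResourceId` under `properties` follows the usual ARM layout, and a real template needs further properties (API version, storage, Key Vault, and so on) that are left out here.

```python
"""Sketch: a Hub and a Project as two workspaces that differ only by `kind`."""

RESOURCE_TYPE = "Microsoft.MachineLearningServices/workspaces"


def workspace_id(subscription: str, resource_group: str, name: str) -> str:
    """Build the ARM resource ID for a workspace (Hub or Project)."""
    return (
        f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
        f"/providers/{RESOURCE_TYPE}/{name}"
    )


def hub_body(name: str, location: str) -> dict:
    """Minimal Hub body: same resource type as a Project, kind='hub'."""
    return {
        "type": RESOURCE_TYPE,
        "name": name,
        "location": location,
        "kind": "hub",
        "properties": {},
    }


def project_body(name: str, location: str, hub_resource_id: str) -> dict:
    """Minimal Project body: kind='project' plus a pointer to its parent Hub."""
    return {
        "type": RESOURCE_TYPE,
        "name": name,
        "location": location,
        "kind": "project",
        "properties": {"hubResourceId": hub_resource_id},
    }


# Hypothetical placeholder values, for illustration only.
hub_id = workspace_id("00000000-0000-0000-0000-000000000000", "rg-ai", "hub-contoso")
hub = hub_body("hub-contoso", "eastus2")
project = project_body("proj-rag", "eastus2", hub_id)
```

Because both resources share one type, existing policy and tagging rules written for Azure ML workspaces also match Hubs and Projects.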
Resource Relationship Summary
| Resource | Scope | Cardinality | Key Responsibility |
|---|---|---|---|
| Hub | Organization/Team level | 1 per team or business unit | Shared config, connections, networking, policies |
| Project | Application level | Many per Hub | Isolated workspace for a specific AI app |
| Azure OpenAI | Connected resource | 1 or more per Hub | LLM API access (GPT-4o, GPT-4.1, o-series) |
| Azure AI Search | Connected resource | 0 or more per Hub | Vector search for RAG workloads |
| Storage Account | Connected resource | 1 per Hub | Data, artifacts, prompt flow files, logs |
| Key Vault | Connected resource | 1 per Hub | Secrets, connection strings, API keys |
| Container Registry | Connected resource | 0 or 1 per Hub | Custom model images, prompt flow images |
RBAC Model
Azure AI Foundry uses Azure RBAC with purpose-built roles.
| Role | Scope | Permissions |
|---|---|---|
| Azure AI Developer | Project | Deploy models, run evaluations, create prompt flows, manage endpoints |
| Azure AI Inference Deployment Operator | Project | Deploy and manage inference endpoints only |
| Azure ML Data Scientist | Project | Full access to experiments, compute, data assets |
| Contributor | Hub or Project | Full resource management (create/delete projects, manage connections) |
| Reader | Hub or Project | View resources and configurations, no modifications |
| Owner | Hub | Full control including RBAC assignment |
Assign roles at the Project level, not the Hub level. This follows the principle of least privilege -- a developer working on the RAG application should not have access to the internal Copilot project's data and deployments. Use the Hub-level Contributor role only for platform administrators who manage shared infrastructure.
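One way to keep this rule honest is to audit exported role assignments and flag anything granted at Hub scope to someone who is not a platform administrator. The sketch below works on plain Python records. The `RoleAssignment` shape and the `is_platform_admin` flag are hypothetical; in practice they would be derived from whatever your role-assignment export and admin group membership look like.

```python
"""Sketch: flag Hub-scope role assignments held by non-administrators."""
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    principal: str
    role: str
    scope_kind: str          # "hub" or "project"
    scope_name: str
    is_platform_admin: bool = False  # hypothetical flag, e.g. from group membership


def hub_scope_violations(assignments: list[RoleAssignment]) -> list[RoleAssignment]:
    """Return assignments made at Hub scope to principals who are not platform admins."""
    return [
        a for a in assignments
        if a.scope_kind == "hub" and not a.is_platform_admin
    ]


assignments = [
    RoleAssignment("platform-team", "Contributor", "hub", "hub-contoso", is_platform_admin=True),
    RoleAssignment("rag-dev", "Azure AI Developer", "project", "proj-rag"),
    RoleAssignment("copilot-dev", "Contributor", "hub", "hub-contoso"),
]

violations = hub_scope_violations(assignments)
```

Here only `copilot-dev` is flagged: the role itself is legitimate, but the scope is wider than the project that developer works on.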
Networking Options
| Mode | Description | Use Case |
|---|---|---|
| Public | Hub and Projects are accessible over the public internet with Microsoft Entra ID (formerly Azure AD) authentication | Development, PoCs, non-sensitive workloads |
| Private Endpoints | Hub, connected resources (AOAI, Search, Storage, KV) exposed only via private endpoints in your VNet | Production enterprise workloads |
| Managed VNet | Foundry manages a VNet on your behalf; you control outbound rules. Compute runs inside this managed VNet | Simplified private networking without BYO VNet complexity |
| Managed VNet + Data Exfiltration Protection | Managed VNet with outbound restricted to approved destinations only | Highly regulated industries (financial services, healthcare) |
3.3 Model Catalog
The Model Catalog is the front door of Azure AI Foundry. It is a curated marketplace of 1,800+ models from Microsoft and the open-source ecosystem, ready to deploy with a few clicks or a single SDK call.
Model Providers
| Provider | Example Models | License Type |
|---|---|---|
| Azure OpenAI (Microsoft) | GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o3, o4-mini | Proprietary (Microsoft-hosted) |
| Microsoft Research | Phi-4, Phi-4-mini, Phi-4-multimodal, MAI-1 | Open-weight (MIT license) |
| Meta | Llama 4 Scout, Llama 4 Maverick, Llama 3.3 70B | Open-weight (Llama license) |
| Mistral | Mistral Large 2, Mistral Small, Codestral, Pixtral | Open-weight / Commercial |
| Cohere | Command R+, Command R, Embed v3 | Commercial |
| AI21 Labs | Jamba 1.5 Large, Jamba 1.5 Mini | Commercial |
| Hugging Face | Hundreds of community models (BERT, T5, Whisper variants) | Various open-source |
| NVIDIA | Nemotron, NV-Embed | Open-weight |
Two Deployment Paradigms
The catalog offers two fundamentally different ways to run a model. Understanding this distinction is critical for cost planning and architecture.
| Dimension | Models as a Service (MaaS) | Managed Compute |
|---|---|---|
| What it is | Serverless API -- Microsoft hosts the model | You deploy the model onto dedicated VMs |
| Billing | Pay-per-token (input + output tokens) | Pay-per-hour for the VM SKU |
| Compute management | None -- fully managed | You choose VM size, instance count, scaling rules |
| Cold start | None (always warm) | Possible if scaled to zero |
| Customization | System prompts and parameters only | Full control -- custom containers, fine-tuned weights |
| Available models | Azure OpenAI models, select partner models (Llama, Mistral, Cohere) | Any model from the catalog or a custom model |
| Networking | Public endpoint with Microsoft Entra ID (formerly Azure AD) auth; private endpoint available | Private endpoint, VNet integration, managed VNet |
| Best for | Quick prototyping, variable workloads, multi-model testing | Production workloads with predictable traffic, custom models, strict isolation |
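The billing difference in the table lends itself to a quick break-even check: per-token cost grows with volume, while per-hour cost is flat for a given instance count. The sketch below compares the two for one month. All prices and volumes are made-up placeholders, not published rates, and the 730-hour month is a simplifying assumption.

```python
"""Sketch: compare monthly pay-per-token cost with pay-per-hour VM cost."""

HOURS_PER_MONTH = 730  # simplifying assumption


def monthly_token_cost(tokens_in: int, tokens_out: int,
                       price_in_per_million: float,
                       price_out_per_million: float) -> float:
    """Serverless (MaaS) cost: input and output tokens are priced separately."""
    return (tokens_in / 1_000_000 * price_in_per_million
            + tokens_out / 1_000_000 * price_out_per_million)


def monthly_vm_cost(hourly_rate: float, instances: int) -> float:
    """Managed compute cost: you pay for every instance-hour, busy or idle."""
    return hourly_rate * instances * HOURS_PER_MONTH


# Placeholder numbers for illustration only.
maas = monthly_token_cost(
    tokens_in=500_000_000, tokens_out=100_000_000,
    price_in_per_million=1.0, price_out_per_million=3.0,
)
dedicated = monthly_vm_cost(hourly_rate=4.0, instances=1)
cheaper = "MaaS" if maas < dedicated else "Managed Compute"
```

With these placeholder inputs the serverless option costs 800 per month against 2,920 for one always-on instance. The comparison flips once token volume grows enough, which is why predictable high-volume traffic tends to favor dedicated compute.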
Serverless API Endpoints (MaaS)
Serverless API endpoints are the simplest path from model selection to production. You select a model from the catalog, accept the terms, and receive an API endpoint and key. No compute provisioning, no VM sizing, no scaling configuration.
Key characteristics:
- Pay-per-token pricing -- you pay only for the tokens you consume (input + output)
- No infrastructure to manage -- Microsoft handles scaling, availability, and hardware
- Azure OpenAI-compatible API -- same SDK and API contract as Azure OpenAI deployments
- Immediate availability -- endpoint is live within seconds of deployment
- Regional availability matters -- not all models are available in all regions
Managed Online Endpoints
Managed Online Endpoints give you dedicated compute for model serving. You deploy a model (from the catalog or your own custom model) to a specific VM SKU and control scaling.
Key characteristics:
- Dedicated VMs -- choose from CPU or GPU SKUs (e.g., Standard_NC24ads_A100_v4)
- Autoscaling -- scale based on request count, CPU, or custom metrics
- Blue-green deployments -- traffic splitting across multiple deployments for safe rollouts
- Custom containers -- bring your own inference server (e.g., vLLM, TGI, Triton)
- VNet integration -- deploy endpoints inside a managed VNet with private endpoint access
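The blue-green characteristic above comes down to a traffic map whose percentages sum to 100 and are shifted in steps. The following sketch models that idea in plain Python. It does not call any Azure SDK, and the deployment names and step sizes are hypothetical.

```python
"""Sketch: a blue-green rollout as a sequence of validated traffic splits."""
import random


def validate_traffic(traffic: dict[str, int]) -> None:
    """Reject splits with negative weights or a total other than 100 percent."""
    if any(weight < 0 for weight in traffic.values()):
        raise ValueError("traffic weights must be non-negative")
    if sum(traffic.values()) != 100:
        raise ValueError("traffic weights must sum to 100")


def rollout_steps(old: str, new: str, steps=(10, 50, 100)) -> list[dict[str, int]]:
    """Shift traffic from the old deployment to the new one in stages."""
    return [{old: 100 - pct, new: pct} for pct in steps]


def route(traffic: dict[str, int], rng: random.Random) -> str:
    """Pick a deployment for one request, weighted by the traffic split."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


plan = rollout_steps("blue", "green")
for split in plan:
    validate_traffic(split)
```

Pausing between steps to compare error rate and latency across the two deployments is what makes the rollout safe; the split itself is only the mechanism.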
Comparing Models in the Catalog
The portal provides built-in tools for model comparison:
- Benchmark scores -- view standardized benchmarks (MMLU, HumanEval, MT-Bench) side by side
- Model cards -- detailed descriptions of capabilities, limitations, and intended use cases
- Try it out -- interactive playground to test models with your own prompts before deploying
- Pricing calculator -- estimate costs based on expected token volume
- Region availability -- check which regions support which models
3.4 Model Deployments
Deployment is where architecture decisions meet operational reality. Azure AI Foundry supports three deployment types, each with different performance, cost, and management characteristics.
Deployment Type 1: Azure OpenAI Deployments
These are deployments of Microsoft's proprietary models (GPT-4o, GPT-4.1, o3, etc.) through the Azure OpenAI Service resource connected to your Hub.
| Variant | Description | Billing | SLA | Best For |
|---|---|---|---|---|
| Standard | Shared capacity in a single region | Pay-per-token | 99.9% | Development, moderate production workloads |
| Global Standard | Shared capacity across global regions (auto-routed) | Pay-per-token (same price) | 99.9% | Production workloads that benefit from global capacity and lower latency |
| Provisioned (PTU) | Reserved throughput units in a specific region | Pay-per-PTU-hour (reserved) | 99.9% | Predictable high-volume workloads, latency-sensitive apps |
| Global Provisioned (PTU) | Reserved throughput units across global regions | Pay-per-PTU-hour (reserved) | 99.9% | High-volume global workloads needing guaranteed throughput |
| Data Zone Standard | Shared capacity within a data boundary (e.g., EU) | Pay-per-token | 99.9% | Data residency requirements |
| Data Zone Provisioned | Reserved throughput within a data boundary | Pay-per-PTU-hour | 99.9% | High-volume workloads with data residency requirements |
A Provisioned Throughput Unit (PTU) is a unit of reserved model processing capacity. One PTU does not equal one request -- the relationship depends on the model, prompt size, and generation length. Use the Azure OpenAI capacity calculator to estimate how many PTUs your workload needs. PTUs are committed in monthly or yearly reservations, with significant discounts for longer commitments.
Deployment Type 2: Serverless API Deployments
These are pay-per-token deployments for partner models (Llama, Mistral, Cohere) and some Microsoft models that use the Models as a Service (MaaS) infrastructure.
- No compute to manage
- You accept the model provider's terms of use
- Charged per million input/output tokens
- Endpoint is Azure OpenAI-compatible (same SDK, same API shape)
Deployment Type 3: Managed Online Endpoints
These are dedicated-compute deployments for any model -- from the catalog or custom.
- You select the VM SKU and instance count
- Support for autoscaling (min/max replicas, scaling metric)
- Blue-green deployment with traffic splitting
- Custom inference containers supported
- Full VNet integration and private endpoint access
Deployment Types Comparison
| Dimension | Azure OpenAI (Standard) | Azure OpenAI (PTU) | Serverless API (MaaS) | Managed Online Endpoint |
|---|---|---|---|---|
| Models | GPT-4o, GPT-4.1, o3, etc. | GPT-4o, GPT-4.1, o3, etc. | Llama, Mistral, Cohere, etc. | Any (catalog or custom) |
| Billing | Per token | Per PTU-hour (reserved) | Per token | Per VM-hour |
| Throughput | Shared, rate-limited | Guaranteed (reserved) | Shared, rate-limited | Dedicated (VM-bound) |
| Latency | Variable (shared pool) | Predictable (reserved) | Variable (shared pool) | Predictable (dedicated) |
| Scaling | Automatic (within quota) | Fixed PTU allocation | Automatic (within quota) | Manual or autoscale |
| Networking | Public + Private Endpoint | Public + Private Endpoint | Public + Private Endpoint | Managed VNet + Private Endpoint |
| GPU management | None | None | None | You choose VM SKU |
| Customization | System prompt, parameters | System prompt, parameters | System prompt, parameters | Full (custom container, weights) |
Quotas and Rate Limits
Every deployment type has quotas. Platform architects must plan for quota as a first-class infrastructure concern.
| Quota Type | Applies To | Unit | How to Increase |
|---|---|---|---|
| Tokens per Minute (TPM) | Azure OpenAI Standard | Tokens/min per deployment | Azure portal quota page or support ticket |
| Requests per Minute (RPM) | Azure OpenAI Standard | Requests/min per deployment | Derived from TPM (for many models roughly 6 RPM per 1,000 TPM; the ratio varies by model) |
| PTU allocation | Azure OpenAI Provisioned | PTU count per subscription/region | Capacity reservation via portal or support |
| Endpoint count | Managed Online Endpoints | Endpoints per subscription/region | Support ticket |
| VM cores | Managed Online Endpoints | vCPU cores per subscription/region | Standard Azure quota increase |
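Treating quota as an infrastructure concern usually starts with a back-of-the-envelope estimate of the TPM a workload needs. The sketch below does that arithmetic. The traffic figures, the 20% headroom, and the quota value are all illustrative assumptions, and actual limits depend on model, region, and subscription.

```python
"""Sketch: estimate required tokens-per-minute and compare with a quota."""
import math


def required_tpm(requests_per_minute: int,
                 avg_prompt_tokens: int,
                 avg_completion_tokens: int,
                 headroom_pct: int = 20) -> int:
    """Peak tokens per minute, padded by a safety margin for bursts."""
    base = requests_per_minute * (avg_prompt_tokens + avg_completion_tokens)
    return math.ceil(base * (100 + headroom_pct) / 100)


def fits_quota(needed_tpm: int, quota_tpm: int) -> bool:
    """True when the estimated demand fits inside the granted quota."""
    return needed_tpm <= quota_tpm


# Illustrative workload: 120 requests/min, 1,500 prompt + 500 completion tokens each.
needed = required_tpm(120, 1500, 500)
ok = fits_quota(needed, quota_tpm=300_000)  # hypothetical quota
```

Running the estimate at design time tells you early whether to request a quota increase, split traffic across deployments, or move to provisioned throughput.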
3.5 Prompt Flow
Prompt Flow is Azure AI Foundry's visual orchestration tool for building LLM applications. If you are familiar with Azure Logic Apps or Power Automate, the mental model is similar -- but purpose-built for AI workflows.
What is Prompt Flow?
Prompt Flow lets you build DAG-based flows (Directed Acyclic Graphs) where each node performs a specific operation: call an LLM, execute Python code, process a prompt template, or invoke a tool. The output of one node feeds into the next, creating a composable pipeline.
Node Types
| Node Type | Purpose | Example |
|---|---|---|
| LLM | Call a language model (Azure OpenAI, Serverless API) | Generate a response given context and query |
| Prompt | Define a prompt template with variable substitution | Build a system message with {{context}} and {{query}} placeholders |
| Python | Execute arbitrary Python code | Parse JSON, call an external API, transform data |
| Tool | Invoke a pre-built or custom tool | Azure AI Search retrieval, Bing Search, custom REST calls |
| LLM + Function Calling | Call an LLM with tool definitions for autonomous tool selection | Agent-style node that decides which tools to call |
| Conditional | Branch the flow based on a condition | Route to different LLMs based on query complexity |
Use Case: Building a RAG Pipeline in Prompt Flow
Here is how you would build a Retrieval-Augmented Generation pipeline visually in Prompt Flow:
Step 1: Input Node -- Accept the user's query as a string input.
Step 2: Embedding Node (Python) -- Call the Azure OpenAI embedding model to convert the query into a vector.
Step 3: Search Node (Tool) -- Query Azure AI Search with the vector to retrieve the top-k most relevant document chunks.
Step 4: Prompt Node -- Construct an augmented prompt that injects the retrieved chunks as context, along with the user query.
Step 5: LLM Node -- Send the augmented prompt to GPT-4o for answer generation.
Step 6: Output Node -- Return the generated answer along with source citations.
The entire flow is defined as a YAML file (flow.dag.yaml) that can be version-controlled in Git, making it CI/CD-friendly.
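To make the DAG idea concrete, the six steps can be mimicked in plain Python with the standard library's `graphlib`. This is not Prompt Flow's own format or API. Every node function here is a stub standing in for a real embedding call, search query, or LLM call, and the node names and chunk IDs are invented for illustration.

```python
"""Sketch: the RAG steps as a tiny DAG executed in dependency order."""
from graphlib import TopologicalSorter


def embed(query: str) -> list[float]:
    """Stub for the embedding call: one number per word."""
    return [float(len(word)) for word in query.split()]


def search(vector: list[float], top_k: int = 2) -> list[str]:
    """Stub for vector search: returns fixed, made-up chunk IDs."""
    return ["doc1#chunk3", "doc7#chunk1"][:top_k]


def build_prompt(chunks: list[str], query: str) -> str:
    """Inject retrieved chunks as context ahead of the user question."""
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"


def llm(prompt: str) -> str:
    """Stub for the LLM call."""
    return f"stub-answer[{len(prompt)} chars of prompt]"


# node name -> (names of upstream nodes, function of all outputs so far)
NODES = {
    "embed": (["query"], lambda out: embed(out["query"])),
    "search": (["embed"], lambda out: search(out["embed"])),
    "prompt": (["search", "query"], lambda out: build_prompt(out["search"], out["query"])),
    "llm": (["prompt"], lambda out: llm(out["prompt"])),
}


def run_flow(query: str) -> dict:
    """Run nodes in topological order; 'query' is the flow input, not a node."""
    outputs = {"query": query}
    graph = {name: set(deps) for name, (deps, _) in NODES.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in NODES:
            outputs[name] = NODES[name][1](outputs)
    return {"answer": outputs["llm"], "citations": outputs["search"]}


result = run_flow("what is a hub")
```

The value of the DAG shape is that each node declares only its inputs, so swapping the stub `llm` for a different model leaves the rest of the graph untouched.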
Evaluation Flows
Prompt Flow supports a special type of flow called an evaluation flow. Instead of processing user queries, evaluation flows score the quality of outputs produced by your main flow.
An evaluation flow typically:
- Takes the main flow's output (answer), the ground-truth answer, and the original question as inputs
- Calls an LLM (or runs custom Python logic) to score the output on metrics like groundedness, relevance, and coherence
- Outputs numerical scores that can be aggregated across a test dataset
This enables automated quality gates in your CI/CD pipeline -- if the evaluation scores drop below a threshold, the deployment is blocked.
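A quality gate of this kind reduces to two steps: average each metric over the test dataset, then compare against a threshold. The sketch below shows that logic on hand-written scores. The metric names match the ones mentioned above, but the thresholds and the row format are assumptions for illustration.

```python
"""Sketch: aggregate per-row evaluation scores and apply a deployment gate."""
from statistics import mean


def aggregate(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average every metric across all evaluated test cases."""
    metrics = rows[0].keys()
    return {metric: mean(row[metric] for row in rows) for metric in metrics}


def failed_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose average fell below the required minimum."""
    return [m for m, minimum in thresholds.items() if scores[m] < minimum]


# Hand-written example scores on the 1-5 scale.
rows = [
    {"groundedness": 5, "relevance": 4},
    {"groundedness": 3, "relevance": 4},
]
thresholds = {"groundedness": 4.0, "relevance": 4.5}  # hypothetical baselines

scores = aggregate(rows)
failures = failed_metrics(scores, thresholds)
deploy_allowed = not failures
```

In a pipeline, a non-empty `failures` list would translate into a non-zero exit code so the CI system blocks the release.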
Deploying a Flow as an Endpoint
Once you have built and tested a Prompt Flow:
- Build the flow into a Docker container (Foundry handles this automatically)
- Deploy the container to a Managed Online Endpoint
- Configure autoscaling, traffic splitting, and authentication
- Monitor with built-in metrics (latency, throughput, error rate, token consumption)
The deployed flow exposes a REST API endpoint that your application calls -- just like any other microservice.
3.6 Model Evaluation
Evaluation is often the most underinvested area in AI projects, yet it is among the most important for production readiness. Azure AI Foundry provides built-in evaluation capabilities that let you systematically measure model quality before deployment.
Why Evaluation Matters
| Without Evaluation | With Evaluation |
|---|---|
| "It seems to work okay in my testing" | Quantified quality scores across hundreds of test cases |
| Ship and hope | Ship with confidence backed by metrics |
| Catch problems from user complaints | Catch problems before users see them |
| No regression detection | Automated regression testing in CI/CD |
| Anecdotal quality assessment | Data-driven model selection and prompt optimization |
Built-in Evaluation Metrics
Azure AI Foundry provides LLM-as-a-judge evaluation metrics that use a grader model (typically GPT-4o) to score your application's outputs.
| Metric | What It Measures | Scale | When It Matters |
|---|---|---|---|
| Groundedness | Is the answer supported by the provided context? (Not hallucinated) | 1-5 | RAG applications -- critical for factual accuracy |
| Relevance | Does the answer address the user's actual question? | 1-5 | All applications -- ensures on-topic responses |
| Coherence | Is the answer logically structured and readable? | 1-5 | Long-form generation -- reports, summaries, explanations |
| Fluency | Is the language natural, grammatically correct? | 1-5 | Customer-facing applications |
| Similarity | How close is the answer to a known ground-truth answer? | 1-5 | Applications with deterministic expected outputs |
| F1 Score | Token-level overlap with ground-truth | 0-1 | Extractive QA tasks |
| ROUGE | N-gram overlap with reference text | 0-1 | Summarization tasks |
| BLEU | Precision of n-gram overlap | 0-1 | Translation tasks |
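Of the metrics in the table, the token-level F1 score is simple enough to compute by hand, which helps when sanity-checking reported numbers. The sketch below is a simplified version that lowercases and splits on whitespace. The platform's built-in metric may normalize text differently, so treat this as an approximation rather than a reimplementation.

```python
"""Sketch: token-level F1 between a prediction and a ground-truth answer."""
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the hub owns connections", "the hub owns shared connections")
```

In this example all four predicted tokens appear in the five-token reference, so precision is 1.0, recall is 0.8, and the F1 score is 8/9.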
Custom Evaluation Metrics
When built-in metrics are not sufficient, you can define custom evaluation metrics using:
- Python functions -- Write a Python function that takes the model output and returns a score
- LLM-as-a-judge prompts -- Write a custom prompt that instructs GPT-4o to score the output on your domain-specific criteria (e.g., "Does this medical summary include all required ICD-10 codes?")
- Composite metrics -- Combine multiple metrics into a single quality score with weighted averages
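A composite metric of the kind described in the last bullet can be as small as a weighted average, provided the inputs are first brought onto a common scale. The sketch below rescales 1-5 judge scores to 0-1 before combining them with an F1 score. The weights and the example scores are arbitrary placeholders.

```python
"""Sketch: combine several metrics into one weighted quality score."""


def normalize_likert(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a 1-5 judge score onto the 0-1 range used by F1, ROUGE, and BLEU."""
    return (score - low) / (high - low)


def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average; weights need not sum to 1 because we divide by their total."""
    total_weight = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total_weight


# Placeholder scores and weights for illustration.
scores = {
    "groundedness": normalize_likert(5),  # 1.0
    "relevance": normalize_likert(4),     # 0.75
    "f1": 0.5,
}
weights = {"groundedness": 5, "relevance": 3, "f1": 2}

quality = composite(scores, weights)
```

Keeping the individual scores alongside the composite is worthwhile, since a single number can hide a drop in one metric behind a gain in another.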
Red-Teaming Evaluations
Red-teaming tests whether your AI application can be manipulated into producing harmful, biased, or policy-violating outputs.
Azure AI Foundry supports red-teaming through:
- Automated adversarial testing -- Built-in adversarial datasets that probe for jailbreaks, prompt injections, and content policy violations
- Custom red-team datasets -- Define your own adversarial prompts tailored to your application's domain
- Azure AI Content Safety integration -- Automatically score outputs for hate speech, violence, self-harm, and sexual content severity levels
- Human-in-the-loop review -- Export flagged outputs for manual review by your safety team
Evaluation Datasets and Test Suites
A robust evaluation requires a well-curated test dataset. Best practices:
| Component | Description | Recommended Size |
|---|---|---|
| Golden dataset | Curated question-answer pairs with verified ground-truth | 100-500 examples |
| Edge case dataset | Unusual, ambiguous, or boundary-condition queries | 50-100 examples |
| Adversarial dataset | Prompt injection attempts, jailbreak probes, out-of-scope queries | 50-200 examples |
| Regression dataset | Previously failed cases that were fixed -- prevents regressions | Grows over time |
The highest-maturity AI teams treat evaluation as a deployment gate. Every PR that changes a prompt, updates a RAG pipeline, or swaps a model triggers an automated evaluation run. If scores drop below the baseline, the deployment is blocked, on the same principle as blocking a deployment when unit tests fail.
3.7 Fine-Tuning
Fine-tuning is the process of further training a pre-trained model on your domain-specific data to improve its performance on your specific use case. Azure AI Foundry supports fine-tuning for select models directly within the platform.
When to Fine-Tune (And When NOT To)
Fine-Tuning Decision Matrix
| Technique | Cost | Time to Implement | Data Required | Best For | Risk |
|---|---|---|---|---|---|
| Prompt Engineering | Free | Minutes to hours | 0 - a few examples | Formatting, behavior control, output structure | Low |
| RAG | Medium ($) | Hours to days | Document corpus | Factual grounding, dynamic/changing knowledge | Low-Medium |
| Fine-Tuning | High ($$) | Days to weeks | 50-10,000+ examples | Domain style, specialized vocabulary, consistent tone | Medium (catastrophic forgetting) |
| Pre-Training from scratch | Very High ($$$$$) | Weeks to months | Billions of tokens | Entirely new language or domain | Very High |
Supported Models for Fine-Tuning
| Model | Minimum Examples | Recommended Examples | Max Training Tokens | Fine-Tuning Method |
|---|---|---|---|---|
| GPT-4o | 10 | 50-500 | Varies by tier | Supervised fine-tuning |
| GPT-4o-mini | 10 | 50-500 | Varies by tier | Supervised fine-tuning |
| GPT-4.1-mini | 10 | 50-500 | Varies by tier | Supervised fine-tuning |
| Phi-4 | 10 | 100-1,000 | Varies | LoRA / Full fine-tuning |
| Phi-4-mini | 10 | 100-1,000 | Varies | LoRA / Full fine-tuning |
| Llama 3.3 70B | 10 | 100-1,000 | Varies | LoRA |
| Mistral models | 10 | 100-1,000 | Varies | LoRA |
Fine-Tuning Workflow
Data preparation format (JSONL):
{"messages": [{"role": "system", "content": "You are a medical coding specialist."}, {"role": "user", "content": "Patient presents with acute bronchitis and mild dehydration."}, {"role": "assistant", "content": "ICD-10 Codes:\n- J20.9 (Acute bronchitis, unspecified)\n- E86.0 (Dehydration)"}]}
{"messages": [{"role": "system", "content": "You are a medical coding specialist."}, {"role": "user", "content": "Follow-up for type 2 diabetes, well controlled."}, {"role": "assistant", "content": "ICD-10 Code:\n- E11.9 (Type 2 diabetes mellitus without complications)"}]}
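Malformed lines are a common reason for a fine-tuning job to be rejected, so a quick local check before upload can save a round trip. The sketch below validates one JSONL line against the chat format shown above. The specific rules it enforces (allowed roles, non-empty content, assistant reply last) are assumptions about sensible training data, not the service's official validation logic.

```python
"""Sketch: local sanity check for one line of chat-format JSONL training data."""
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant's target reply")
    return problems


good = ('{"messages": [{"role": "system", "content": "You are a medical coding specialist."}, '
        '{"role": "user", "content": "Patient presents with acute bronchitis."}, '
        '{"role": "assistant", "content": "J20.9"}]}')
```

Running `validate_line` over every line of the file and printing the line number with each problem gives a usable pre-upload report.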
LoRA and Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation) is a technique that fine-tunes only a small number of additional parameters (adapters) rather than updating all model weights. This has major implications for architects:
| Dimension | Full Fine-Tuning | LoRA Fine-Tuning |
|---|---|---|
| Parameters updated | All (billions) | Small adapter matrices (millions) |
| GPU memory required | Very high (40-80 GB+) | Much lower (often fits on a single GPU) |
| Training time | Hours to days | Minutes to hours |
| Storage per model | Full model copy (tens of GB) | Small adapter file (tens of MB) |
| Risk of catastrophic forgetting | Higher | Lower (base model unchanged) |
| Multiple specialties | Need a full copy per specialty | Swap adapters at inference time |
| Quality | Marginally better for large domain shifts | Excellent for most use cases |
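The "billions versus millions" row follows directly from how LoRA factorizes an update. For a weight matrix of shape d x k, full fine-tuning updates d*k values, while a rank-r adapter trains two small matrices with r*(d + k) values in total. The sketch below works through that arithmetic. The 4096 x 4096 layer and rank 8 are illustrative numbers, not the dimensions of any particular model in the catalog.

```python
"""Sketch: trainable-parameter count for full fine-tuning vs a LoRA adapter."""


def full_params(d: int, k: int) -> int:
    """Full fine-tuning updates every entry of the d x k weight matrix."""
    return d * k


def lora_params(d: int, k: int, rank: int) -> int:
    """LoRA trains B (d x rank) and A (rank x k); the base matrix stays frozen."""
    return rank * (d + k)


# Illustrative layer size and rank.
d, k, rank = 4096, 4096, 8
full = full_params(d, k)          # 16,777,216
adapter = lora_params(d, k, rank)  # 65,536
fraction = adapter / full          # 1/256, about 0.39%
```

At rank 8 the adapter for this layer is 256 times smaller than the matrix it adapts, which is why adapters are stored in megabytes and can be swapped at inference time.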
Cost and Compute Requirements
Fine-tuning costs are driven by three factors:
- Training compute -- GPU hours consumed during training (typically Standard_NC24ads_A100_v4 or similar)
- Hosting cost -- Fine-tuned Azure OpenAI models incur higher per-token costs than base models; custom models on Managed Online Endpoints cost per VM-hour
- Data preparation -- Human time to curate, clean, and validate training data (often the most expensive part)
3.8 Azure AI Services (Integrated)
Azure AI Foundry integrates with the broader Azure AI Services family -- pre-built, task-specific AI capabilities that were previously known as Azure Cognitive Services. These services complement generative AI models by handling specialized tasks like speech recognition, document parsing, and content moderation.
Service Overview
| Service | Capabilities | Common Use Cases | Integration with Foundry |
|---|---|---|---|
| Azure AI Speech | Speech-to-text (STT), text-to-speech (TTS), speech translation, speaker recognition | Voice-enabled copilots, call center analytics, accessibility | Prompt Flow speech nodes, real-time conversation APIs |
| Azure AI Vision | Image analysis, OCR, spatial analysis, face detection, custom image classification | Document digitization, visual search, accessibility | Multi-modal RAG (image + text), document processing pipelines |
| Azure AI Language | Named Entity Recognition (NER), sentiment analysis, key phrase extraction, summarization, PII detection | Customer feedback analysis, compliance scanning, content tagging | Pre-processing nodes in Prompt Flow, PII redaction before LLM calls |
| Azure AI Document Intelligence | Form extraction, invoice processing, receipt parsing, layout analysis, custom document models | Accounts payable automation, contract analysis, claims processing | Document ingestion for RAG pipelines, structured data extraction |
| Azure AI Content Safety | Text and image content moderation, prompt shield, groundedness detection, protected material detection | Guardrails for AI applications, user-generated content moderation | Built-in content filtering for Azure OpenAI deployments, evaluation metrics |
| Azure AI Translator | Text translation (100+ languages), document translation, custom terminology | Multi-language copilots, document localization | Pre/post-processing in Prompt Flow |
How These Integrate with Foundry Projects
Key integration patterns:
- Document Intelligence as RAG Ingestion -- Use Document Intelligence to extract text, tables, and structure from PDFs and images, then chunk and embed the output for vector search
- Content Safety as a Guardrail -- Content Safety filters run automatically on Azure OpenAI deployments; you can also invoke them explicitly in Prompt Flow for custom models
- Speech as an I/O Layer -- Add voice input/output to any Prompt Flow by using Speech STT (input) and TTS (output) nodes
- Language for Pre-Processing -- Use PII detection to redact sensitive data before sending to an LLM; use NER to extract entities for structured queries
3.9 Infrastructure Considerations
This section addresses the infrastructure decisions that platform architects must make when deploying Azure AI Foundry in production.
Compute Options
| Compute Type | Used For | Management | GPU | Typical SKUs |
|---|---|---|---|---|
| Serverless (MaaS) | Azure OpenAI and partner model inference | Fully managed by Microsoft | N/A (abstracted) | N/A |
| Managed Compute Instance | Development, Prompt Flow authoring, notebooks | Managed VM (start/stop) | Optional | Standard_DS3_v2, Standard_NC6s_v3 |
| Managed Compute Cluster | Training, fine-tuning, batch inference | Managed cluster (auto-scaling) | Yes | Standard_NC24ads_A100_v4, Standard_ND96amsr_A100_v4 |
| Managed Online Endpoint | Production model serving | Managed deployment (auto-scaling) | Yes, for LLM serving | Standard_NC24ads_A100_v4, Standard_NC48ads_H100_v5 |
| Kubernetes (AKS) | Self-managed model serving via attached AKS | Customer-managed | Yes (GPU node pools) | Any AKS-supported GPU VM |
Networking Deep Dive
Production deployments of Azure AI Foundry require careful networking design. The following table summarizes the network endpoints you need to plan for:
| Resource | Private Endpoint Required? | DNS Zone | Notes |
|---|---|---|---|
| AI Foundry Hub | Yes | privatelink.api.azureml.ms | Controls access to the workspace API |
| Azure OpenAI | Yes | privatelink.openai.azure.com | Must be in the same or peered VNet |
| Azure AI Search | Yes | privatelink.search.windows.net | Required for private RAG pipelines |
| Storage Account (blob) | Yes | privatelink.blob.core.windows.net | Data, artifacts, logs |
| Storage Account (file) | Yes | privatelink.file.core.windows.net | File shares for compute instances |
| Key Vault | Yes | privatelink.vaultcore.azure.net | Secrets and connection strings |
| Container Registry | Yes | privatelink.azurecr.io | Custom model images |
| Managed Online Endpoint | Automatic | Managed by Foundry | When using Managed VNet |
Private endpoints require proper DNS resolution. Use Azure Private DNS Zones linked to your VNet, or configure conditional forwarders in your on-premises DNS infrastructure. Missing or incorrect DNS resolution is one of the most common causes of connectivity failures in private AI Foundry deployments.
Data Residency and Compliance
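A quick way to confirm that DNS is behaving is to resolve each service hostname from inside the VNet and check that every returned address is private. The sketch below does this with the standard library. The hostname in the usage comment is a hypothetical example, and the resolver is injectable so the logic can be exercised without network access.

```python
"""Sketch: check that a hostname resolves only to private IP addresses."""
import ipaddress
import socket


def resolved_addresses(hostname: str, resolver=socket.getaddrinfo) -> list[str]:
    """Return the distinct IP addresses a hostname resolves to."""
    infos = resolver(hostname, 443)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """True only if the name resolves and every address is in a private range."""
    addresses = resolved_addresses(hostname, resolver)
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )


# Usage from a VM inside the VNet (hypothetical hostname):
#   resolves_privately("my-aoai.openai.azure.com")
# A False result suggests the name is still resolving to the public endpoint,
# for example because the Private DNS Zone is not linked to this VNet.
```

Running the check against every row of the table above, from the same subnet the workload uses, narrows a connectivity failure down to a specific zone.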
| Concern | How Azure AI Foundry Addresses It |
|---|---|
| Data residency | Choose Hub region carefully. Data (prompts, completions, training data) stays in the Hub's region. Azure OpenAI processing region depends on deployment type (Standard = single region; Global Standard = Microsoft-routed). |
| Data processing | Prompts and completions are NOT used to train Microsoft models. This is the default behavior for Azure OpenAI Service; no opt-out action is required. |
| Compliance certifications | Azure OpenAI and AI Foundry inherit Azure's compliance portfolio (SOC 2, ISO 27001, HIPAA BAA, FedRAMP, etc.). Verify per-model availability in compliance-scoped regions. |
| Customer-Managed Keys (CMK) | Supported at the Hub level for encrypting data at rest with your own Key Vault key. |
| Managed Identity | Hub and Projects use system-assigned or user-assigned managed identities for authentication to connected resources -- no API keys in code. |
RBAC and Security Model (Expanded)
Security best practices:
- Use Managed Identity for all service-to-service communication. Avoid storing API keys in Key Vault when managed identity is supported.
- Enable Managed VNet with data exfiltration protection for regulated workloads.
- Apply RBAC at the Project level -- not the Hub level -- to enforce least privilege.
- Use Conditional Access policies in Entra ID to enforce MFA and compliant device requirements for Foundry portal access.
- Enable diagnostic logging to send Hub and Project audit logs to a Log Analytics workspace or SIEM.
Cost Management
| Cost Driver | How to Optimize |
|---|---|
| Azure OpenAI Standard (pay-per-token) | Monitor token usage per deployment; set TPM quotas to prevent runaway costs; use smaller models (GPT-4.1-mini, GPT-4.1-nano) for simpler tasks |
| Azure OpenAI Provisioned (PTU) | Right-size PTU allocation using the capacity calculator; commit to 1-year reservations for ~40% discount; consolidate workloads on shared PTUs |
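| Managed Online Endpoints | Enable autoscaling with a low minimum instance count for dev/test, and delete idle deployments (confirm scale-to-zero support for your endpoint type before relying on it); use spot VMs for non-production fine-tuning; right-size GPU SKUs |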
| Storage | Lifecycle management policies for training data and logs; delete unused evaluation datasets |
| Azure AI Search | Right-size the Search SKU; use semantic ranker only when needed; partition indexes by workload |
| Compute Instances | Auto-shutdown schedules for dev instances; use small SKUs for Prompt Flow authoring |
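The Standard versus Provisioned decision in the table above comes down to simple arithmetic: pay-per-token cost grows with volume, while a PTU commitment is a fixed monthly cost. The sketch below works through that comparison. All prices and volumes are placeholder assumptions, not actual Azure rates, and the model ignores PTU throughput limits, so treat it as a first-pass estimate rather than a sizing tool.

```python
def cost_per_request(input_tokens, output_tokens,
                     input_price_per_1k, output_price_per_1k):
    """Cost of one request under pay-per-token pricing."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


def monthly_token_cost(requests_per_day, per_request_cost, days=30):
    """Monthly pay-per-token spend at a steady daily request volume."""
    return requests_per_day * days * per_request_cost


def break_even_requests_per_day(fixed_monthly_cost, per_request_cost, days=30):
    """Daily volume at which a fixed monthly commitment equals pay-per-token spend."""
    return fixed_monthly_cost / (per_request_cost * days)


# Placeholder numbers: substitute current list prices and your own traffic profile.
per_request = cost_per_request(
    input_tokens=1000, output_tokens=500,
    input_price_per_1k=0.002, output_price_per_1k=0.008,
)
standard_monthly = monthly_token_cost(requests_per_day=10_000,
                                      per_request_cost=per_request)
break_even = break_even_requests_per_day(fixed_monthly_cost=5_400,
                                         per_request_cost=per_request)

print(f"Per request:      ${per_request:.4f}")
print(f"Standard / month: ${standard_monthly:,.2f}")
print(f"Break-even:       {break_even:,.0f} requests/day")
```

If sustained traffic sits well below the break-even volume, pay-per-token is usually the cheaper option; above it, a provisioned commitment starts to pay off, provided the PTU allocation can actually serve that throughput.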
3.10 Azure AI Foundry vs AWS Bedrock vs Google Vertex AI
Platform architects in multi-cloud environments need to understand how Azure AI Foundry compares to its counterparts on AWS and Google Cloud.
Feature Comparison
| Capability | Azure AI Foundry | AWS Bedrock | Google Vertex AI |
|---|---|---|---|
| Unified AI platform | Yes (portal, SDK, CLI) | Partial (Bedrock + SageMaker) | Yes (Vertex AI console, SDK, CLI) |
| Model catalog size | 1,800+ models | ~30 models (fewer providers) | 200+ models (Model Garden) |
| Proprietary frontier models | GPT-4.1, o3, o4-mini (Azure OpenAI exclusive) | Claude (Anthropic), Nova (Amazon) | Gemini 2.5 (Google-exclusive) |
| Open-weight models | Llama, Mistral, Phi, Cohere, AI21 | Llama, Mistral, Cohere | Llama, Mistral, Gemma |
| Serverless API (MaaS) | Yes (Azure OpenAI + partner models) | Yes (all Bedrock models) | Yes (Vertex AI endpoint) |
| Dedicated compute deployment | Managed Online Endpoints | SageMaker Endpoints | Vertex AI Endpoints |
| Fine-tuning | GPT-4o, Phi, Llama, Mistral (in-platform) | Claude, Llama, Titan (limited) | Gemini, Llama (in-platform) |
| Prompt orchestration | Prompt Flow (visual + YAML) | Bedrock Agents + Step Functions | Vertex AI Reasoning Engine |
| RAG built-in | Yes (AI Search integration) | Yes (Knowledge Bases) | Yes (Vertex AI Search) |
| Evaluation framework | Built-in (groundedness, relevance, etc.) | Limited (Bedrock Evaluation - preview) | Built-in (AutoSxS, Gen AI Evaluation) |
| Content safety | Azure AI Content Safety (integrated) | Bedrock Guardrails | Vertex AI Safety Filters |
| Enterprise networking | Private Endpoints, Managed VNet, data exfiltration protection | VPC endpoints, PrivateLink | VPC-SC, Private Service Connect |
| RBAC granularity | Hub + Project level RBAC | IAM policies (resource-level) | IAM + Vertex AI roles |
| Compliance portfolio | 100+ certifications | 90+ certifications | 100+ certifications |
| Agent framework | Azure AI Agent Service, Semantic Kernel, AutoGen | Bedrock Agents | Vertex AI Agent Builder |
| Workspace hierarchy | Hub/Project (two-tier) | Flat (per-account) | Flat (per-project) |
Where Azure AI Foundry Shines
- Azure OpenAI exclusivity -- GPT-4.1, o3, and o4-mini are available on Azure with enterprise SLAs, private networking, and content filtering. No other cloud offers these models as managed services.
- Enterprise security posture -- The Hub/Project model, Managed VNet with data exfiltration protection, customer-managed keys, and deep Entra ID integration provide a security story that is hard to match.
- Microsoft ecosystem integration -- Seamless connection to Microsoft 365 Copilot, Copilot Studio, Dynamics 365, Power Platform, and GitHub Copilot. If the customer is a Microsoft shop, the integration advantage is substantial.
- Model catalog breadth -- 1,800+ models (catalog + Hugging Face) give architects maximum flexibility without leaving the platform.
- Prompt Flow -- A mature, visual orchestration tool with YAML-based CI/CD support that neither Bedrock nor Vertex AI matches in capability.
Where Competitors Have Edges
| Area | Competitor Edge | Detail |
|---|---|---|
| Anthropic Claude models | AWS Bedrock | Claude is available on Bedrock as a first-party integration with fine-tuning support. On Azure, Claude is available via the model catalog but with fewer features. |
| Google Gemini models | Google Vertex AI | Gemini 2.5 Pro/Flash are Vertex AI exclusives with massive context windows (up to 1M tokens) and native multi-modal capabilities. |
| SageMaker maturity | AWS SageMaker | For classical ML training and MLOps, SageMaker has a longer track record and deeper feature set than Azure ML components within Foundry. |
| Grounding with search | Google Vertex AI | Google Search grounding in Vertex AI provides real-time web search context natively in API calls -- a unique capability. |
| Simplicity | AWS Bedrock | Bedrock's flat model (no Hub/Project hierarchy) is simpler for teams that do not need multi-project isolation. |
Multi-Cloud Considerations
For organizations running multi-cloud AI strategies:
- Standardize on the OpenAI-compatible API format -- Azure OpenAI, many Serverless API models, and several third-party providers all support the OpenAI chat completions API shape. Building your application against this API makes cloud portability easier.
- Abstract the orchestration layer -- Use Semantic Kernel or LangChain as your orchestration framework rather than cloud-specific tools (Prompt Flow, Bedrock Agents). These frameworks support multiple model providers.
- Separate model selection from infrastructure -- Design your architecture so that swapping a model provider (for example, Azure OpenAI to Claude on Bedrock) requires a configuration change, not a code rewrite.
- Evaluate per workload -- Some workloads may be best served by Azure (GPT-4.1 for enterprise chat), others by AWS (Claude for long-context analysis), and others by Google (Gemini for multi-modal). Let the use case drive the platform decision.
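The "configuration change, not a code rewrite" idea from the list above can be sketched as a provider table plus a single request builder that emits the OpenAI chat completions shape. This is a minimal illustration, assuming every configured endpoint accepts that shape at `<base_url>/chat/completions`. The provider names, URLs, model names, and environment variable names are all hypothetical placeholders.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str     # hypothetical OpenAI-compatible base URL
    model: str        # model or deployment name that endpoint understands
    api_key_env: str  # name of the environment variable holding the credential


# Swapping providers means editing this table (or the config file behind it),
# not the application code that calls build_chat_request().
PROVIDERS = {
    "primary": ProviderConfig(
        "https://primary.example.com/v1", "chat-large", "PRIMARY_API_KEY"),
    "fallback": ProviderConfig(
        "https://fallback.example.com/v1/", "chat-small", "FALLBACK_API_KEY"),
}


def build_chat_request(provider_name, user_message, system_message=None):
    """Build a provider-agnostic chat completions request description."""
    cfg = PROVIDERS[provider_name]
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "url": cfg.base_url.rstrip("/") + "/chat/completions",
        "api_key_env": cfg.api_key_env,
        "body": json.dumps({"model": cfg.model, "messages": messages}),
    }


request = build_chat_request("fallback", "Summarize this incident report.",
                             system_message="You are a concise assistant.")
print(request["url"])
print(request["body"])
```

The builder deliberately stops short of sending the request, so the same structure can be handed to whichever HTTP client or orchestration framework (Semantic Kernel, LangChain) the team has standardized on.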
Key Takeaways
| # | Takeaway |
|---|---|
| 1 | Azure AI Foundry is the unified platform for building, evaluating, and deploying AI applications on Azure -- it consolidates what was previously Azure ML Studio and Azure AI Studio. |
| 2 | The Hub/Project model is your workspace hierarchy: Hub = shared infrastructure and connections, Project = isolated workspace per application. Design your Hub/Project topology like you design your subscription topology. |
| 3 | The Model Catalog offers 1,800+ models via two paths: Serverless API (pay-per-token, zero compute management) and Managed Online Endpoints (dedicated compute, full control). |
| 4 | Azure OpenAI deployments come in Standard (pay-per-token) and Provisioned (PTU, reserved throughput) variants. PTU planning is a first-class infrastructure concern for high-volume workloads. |
| 5 | Prompt Flow is a visual DAG-based orchestration tool for building LLM pipelines (RAG, agents, chatbots) that can be deployed as managed endpoints and version-controlled in Git. |
| 6 | Evaluation is not optional -- use built-in metrics (groundedness, relevance, coherence) and red-teaming evaluations as CI/CD gates before every production deployment. |
| 7 | Fine-tune only when prompt engineering and RAG are insufficient -- fine-tuning is for style, tone, and specialized vocabulary, not for injecting knowledge (use RAG for that). |
| 8 | Azure AI Services (Speech, Vision, Language, Document Intelligence, Content Safety) integrate natively with Foundry projects as pre/post-processing capabilities. |
| 9 | Networking, RBAC, and cost management are the same disciplines you apply to any Azure workload -- private endpoints, managed identity, least-privilege RBAC, right-sized compute, and autoscaling. |
| 10 | In a multi-cloud comparison, Azure AI Foundry's advantages are OpenAI model exclusivity, enterprise security depth, and Microsoft ecosystem integration. Competitors have edges in specific models (Claude on Bedrock, Gemini on Vertex AI) and simplicity. |
What is Next?
You now understand the platform where AI applications are built and deployed. In the next module, we explore the consumer-facing side of Microsoft's AI strategy -- the Copilot ecosystem that sits on top of this platform.
Next: Module 4: Microsoft Copilot Ecosystem -- M365 Copilot, Copilot Studio, Copilot Actions, GitHub Copilot, Copilot for Azure, and the extensibility model that connects them all.
Previous: Module 2: LLM Landscape