T3: Production Architecture Patterns
Duration: 60–90 minutes | Level: Strategic | Part of: FROOT Transformation Layer | Prerequisites: O4 (Azure AI Platform), O5 (AI Infrastructure) | Last Updated: March 2026
Table of Contents
- T3.1 Production AI is Different
- T3.2 The AI Application Architecture Stack
- T3.3 Hosting Patterns: Where Agents Live
- T3.4 API Gateway for AI
- T3.5 Latency Optimization Patterns
- T3.6 Cost Control Architecture
- T3.7 Multi-Agent Production Patterns
- T3.8 Monitoring & Observability for AI
- T3.9 Resilience Patterns
- T3.10 The Production Readiness Checklist
- Key Takeaways
T3.1 Production AI is Different
Your POC worked beautifully in the demo. Now you need to serve 10,000 users concurrently, handle API rate limits, manage costs, respond in under 2 seconds, never hallucinate on financial data, and stay available 99.9% of the time.
Welcome to production AI.
T3.2 The AI Application Architecture Stack
Every production AI system has these layers. Missing any one of them is a production incident waiting to happen.
T3.3 Hosting Patterns: Where Agents Live
Pattern Comparison
Decision Matrix
| Criterion | Container Apps | AKS | App Service | Functions | Copilot Studio |
|---|---|---|---|---|---|
| Complexity | Low-Medium | High | Low | Low | Very Low |
| Scaling | Auto (0–N) | Auto (custom) | Manual/Auto | Auto (0–N) | Managed |
| GPU Support | ✅ Preview | ✅ Full | ❌ | ❌ | N/A |
| Long-running | ✅ | ✅ | ✅ | ⚠️ (max 10 min) | ❌ |
| WebSocket/SSE | ✅ | ✅ | ✅ | ❌ | ❌ |
| Dapr sidecar | ✅ Built-in | ✅ Add-on | ❌ | ❌ | N/A |
| Cost at scale | 💰💰 | 💰💰💰 | 💰💰 | 💰 | 💰💰 |
| Best for | AI APIs, agents | ML serving, multi-model | Simple APIs | Event-driven AI | Business users |
Container Apps: The Sweet Spot for Most AI Workloads
T3.4 API Gateway for AI
Azure API Management (APIM) becomes critical for production AI: it is the control plane for all AI traffic.
AI Gateway Capabilities
| Capability | What It Does | Why It Matters |
|---|---|---|
| Semantic Caching | Cache similar queries, not just identical ones | 30-50% cost reduction on repeated patterns |
| Token Rate Limiting | Limit tokens/minute per user or app | Prevent runaway costs |
| Load Balancing | Distribute across multiple Azure OpenAI instances | Handle rate limits, improve availability |
| Circuit Breaking | Stop calling failing endpoints | Protect against cascading failures |
| Token Metering | Track token consumption per user/team | Cost allocation and chargeback |
| Content Safety | Pre-screen requests before they hit models | Prevent policy violations |
| Prompt Injection Detection | Detect and block injection attempts | Security guardrail |
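Token rate limiting is the simplest of these capabilities to reason about. The sketch below shows a per-user token bucket; this is an illustration of the mechanism only, since in practice APIM enforces this at the gateway with its token-limit policies. The class name and parameters are assumptions, not a real API:

```python
import time

class TokenBucket:
    """Per-user token budget: the allowance refills continuously at
    tokens_per_minute and requests that exceed it are rejected.
    (Illustrative sketch; APIM's token-limit policy does this at the gateway.)"""

    def __init__(self, tokens_per_minute: int, now=time.monotonic):
        self.rate = tokens_per_minute / 60.0       # tokens replenished per second
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.now = now                             # injectable clock for testing
        self.last = now()

    def allow(self, tokens_requested: int) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at the bucket capacity
        self.available = min(self.capacity, self.available + (t - self.last) * self.rate)
        self.last = t
        if tokens_requested <= self.available:
            self.available -= tokens_requested
            return True
        return False
```

In a real deployment you would keep one bucket per user or application ID (backed by Redis or similar for multi-instance services) and estimate the token count of each request before forwarding it.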
Multi-Region AI Gateway
T3.5 Latency Optimization Patterns
Where Latency Hides
| Component | Typical Latency | Optimization |
|---|---|---|
| Network to AOAI | 10-50ms | Private endpoints, regional affinity |
| Token generation | 20-80ms per token | Smaller model, shorter output, PTU |
| Embedding generation | 50-200ms | Batch, cache frequently used |
| Vector search | 10-50ms | HNSW index, filter before search |
| Reranking | 100-500ms | Limit to top-20 candidates |
| Total RAG pipeline | 500ms-3s | Parallel retrieval, streaming |
Streaming for Perceived Performance
Instead of waiting for the full response, stream tokens to the user as they're generated:
Without streaming: [---- 3 seconds of nothing ----] Full response appears
With streaming: H-e-l-l-o-,- -h-e-r-e-'-s- -y-o-u-r- -a-n-s-w-e-r-... (progressive)
TTFT (Time To First Token) drops from 3s to ~200ms. The user sees progress immediately.
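The effect is easy to demonstrate with a simulated stream. The generator below stands in for a model producing tokens (in production this would be the Azure OpenAI SDK with `stream=True`); the consumer measures TTFT against total latency. Token count and delays here are made up for illustration:

```python
import time

def generate_tokens(text: str, per_token_delay: float = 0.05):
    """Simulated model stream: yields one token at a time. The delay stands in
    for per-token generation latency; a real stream comes from the SDK."""
    for token in text.split():
        time.sleep(per_token_delay)
        yield token + " "

def consume_stream(stream):
    """Collect a stream while measuring Time To First Token and total latency."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        chunks.append(chunk)
    total = time.monotonic() - start
    return "".join(chunks), ttft, total

text, ttft, total = consume_stream(generate_tokens("Hello, here's your answer"))
print(f"TTFT: {ttft*1000:.0f}ms, total: {total*1000:.0f}ms")
```

A UI that renders each chunk as it arrives shows the user progress at TTFT rather than at total latency, which is the entire point of streaming.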
Caching Strategies
| Cache Type | What It Caches | Hit Rate | Savings |
|---|---|---|---|
| Exact cache | Identical queries | 5-10% | 100% per hit |
| Semantic cache | Similar queries (embedding similarity) | 20-40% | 100% per hit |
| Embedding cache | Document embeddings | 80%+ | Avoid re-embedding |
| Context cache | RAG retrieval results | 30-50% | Skip retrieval step |
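A semantic cache differs from an exact cache only in the lookup: it compares query embeddings by cosine similarity instead of comparing strings. The sketch below uses a toy hashing "embedding" so it is self-contained; a real implementation would call an embedding model such as text-embedding-3-small, and the 0.85 threshold is an assumption you would tune:

```python
import hashlib
import math

def fake_embed(text: str, dim: int = 64) -> list:
    """Toy stand-in for an embedding model: hashes words into a fixed-size
    vector. Replace with a real embedding call in production."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Linear-scan semantic cache; production systems use a vector index."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query: str):
        q = fake_embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # cache hit: skip the model call entirely
        return None

    def put(self, query: str, response: str):
        self.entries.append((fake_embed(query), response))
```

Because the lookup is a similarity search, "reset my password" and "please reset my password" can share one cached answer, which is where the 20-40% hit rates in the table come from.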
T3.6 Cost Control Architecture
Token Economics
Cost per request = (input_tokens × input_rate) + (output_tokens × output_rate)
Example (GPT-4o, March 2026):
System message: 800 tokens × $2.50/1M = $0.002
User message: 200 tokens × $2.50/1M = $0.0005
RAG context: 2,000 tokens × $2.50/1M = $0.005
Output: 500 tokens × $10.00/1M = $0.005
─────────────────────────────────────────────────
Total per request: $0.0125
At 100K requests/day = $1,250/day = $37,500/month
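The arithmetic above is worth encoding once and reusing for capacity planning. A minimal version (rates are the example's assumed list prices in USD per 1M tokens; check current Azure OpenAI pricing):

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """Rates are USD per 1M tokens."""
    return input_tokens * input_rate / 1e6 + output_tokens * output_rate / 1e6

# The worked example: 3,000 input tokens (system + user + RAG context), 500 output
per_request = cost_per_request(3000, 500, input_rate=2.50, output_rate=10.00)
monthly = per_request * 100_000 * 30   # 100K requests/day, 30 days
print(f"${per_request:.4f}/request -> ${monthly:,.0f}/month")
# -> $0.0125/request -> $37,500/month
```

Note how the 2,800 tokens of fixed overhead (system message plus RAG context) dominate the cost: trimming the system prompt and retrieved context is usually the highest-leverage optimization.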
Cost Optimization Decision Tree
T3.7 Multi-Agent Production Patterns
Pattern 1: Supervisor Agent
When: Clear domain boundaries, need routing intelligence, want centralized control.
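A supervisor's core job is routing. The sketch below uses keyword scoring so it is self-contained and testable; in production the supervisor would typically classify intent with a small model (e.g. GPT-4o-mini) instead. The agent names and keyword lists are hypothetical:

```python
# Hypothetical specialist agents behind the supervisor
AGENTS = {
    "billing": "billing-agent",
    "tech": "tech-support-agent",
    "product": "product-agent",
}

# Toy intent keywords; a real supervisor would use an LLM classifier
KEYWORDS = {
    "billing": ["invoice", "refund", "charge", "payment"],
    "tech": ["error", "crash", "bug", "login"],
    "product": ["feature", "pricing", "plan", "compare"],
}

def route(message: str) -> str:
    """Pick the specialist whose domain keywords best match the message;
    fall back to handling it in the supervisor itself."""
    text = message.lower()
    scores = {domain: sum(kw in text for kw in kws) for domain, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return AGENTS[best] if scores[best] > 0 else "supervisor-agent"
```

The centralized-control benefit is visible here: routing policy, logging, and fallback behavior all live in one place instead of being spread across agents.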
Pattern 2: Pipeline (Sequential Handoff)
When: Document processing, data pipelines, workflows with clear sequential steps.
Pattern 3: Swarm (Peer-to-Peer)
When: Creative tasks, complex reasoning, and research, where agents negotiate and collaborate without a central controller.
Hosting Multi-Agent: The Microservices Approach
Agent 1 (Supervisor) → Container App (scale 2-10)
Agent 2 (Billing) → Container App (scale 0-5)
Agent 3 (Tech Support) → Container App (scale 0-5)
Agent 4 (Product) → Container App (scale 0-5)
Communication: Dapr pub/sub (async) or HTTP (sync)
State: Cosmos DB (conversation memory)
Observability: Application Insights (distributed tracing)
T3.8 Monitoring & Observability for AI
The AI Observability Stack
| Layer | What to Monitor | Tool |
|---|---|---|
| Infrastructure | CPU, memory, GPU, network | Azure Monitor, Container Insights |
| API | Latency, throughput, errors, rate limits | APIM Analytics, App Insights |
| Model | Token usage, TTFT, quality scores | Custom metrics in App Insights |
| Quality | Hallucination rate, groundedness, relevance | LLM evaluation pipeline |
| Cost | Token consumption, cost per request, per user | Cost Management + custom dashboards |
| Safety | Content filter triggers, injection attempts | Azure AI Content Safety logs |
Key Metrics Dashboard
| Metric | Formula | Alert Threshold |
|---|---|---|
| TTFT P95 | 95th percentile Time To First Token | >1 second |
| Total Latency P95 | 95th percentile end-to-end | >5 seconds |
| Token Cost/Request | Total tokens × rate / request count | >$0.05/request |
| Error Rate | Failed requests / total requests | >2% |
| Rate Limit Hits | 429 responses / total requests | >5% |
| Groundedness Score | Avg score from evaluation pipeline | <0.85 |
| User Satisfaction | Thumbs up / (thumbs up + thumbs down) | <80% |
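The P95 metrics in the table are simple to compute from raw latency samples. A minimal nearest-rank percentile (good enough for dashboard alerting; monitoring backends like App Insights compute this for you, and the sample latencies are made up):

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical TTFT samples in milliseconds from one monitoring window
latencies_ms = [180, 220, 250, 300, 310, 320, 400, 450, 900, 1400]
p95 = percentile(latencies_ms, 95)
alert = p95 > 1000   # TTFT P95 alert threshold from the table above
```

Note how the single 1400 ms outlier drives the P95 past the threshold while the median stays around 315 ms; that is exactly why these dashboards alert on tail percentiles rather than averages.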
T3.9 Resilience Patterns
Circuit Breaker for AI Endpoints
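A minimal in-process circuit breaker might look like the sketch below (APIM can also enforce circuit breaking at the gateway, which is usually preferable; the threshold and timeout values here are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, fails fast while
    open, and after `reset_timeout` seconds lets one trial call through."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.now = now                 # injectable clock for testing
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.now() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.now()
            raise
        self.failures = 0              # success closes the circuit
        return result
```

Failing fast matters for AI endpoints specifically because a saturated Azure OpenAI deployment returning 429s will only recover if callers back off instead of hammering it.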
Fallback Chain
Primary: Azure OpenAI (East US) GPT-4o
  ↓ (429 or 500)
Fallback 1: Azure OpenAI (West US) GPT-4o
  ↓ (429 or 500)
Fallback 2: Azure OpenAI (East US) GPT-4o-mini (degraded quality, lower cost)
  ↓ (both down)
Fallback 3: Cached response for similar queries
  ↓ (no cache hit)
Fallback 4: "I'm experiencing high demand. Please try again shortly."
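The chain above can be sketched as an ordered list of handlers tried in sequence. The handlers here are stubs; in production each would call a different Azure OpenAI deployment, and the names are illustrative:

```python
def with_fallbacks(handlers, query, cache=None):
    """Try each (name, handler) in order; on failure fall through to the next
    tier, then a cache lookup, then a static apology message."""
    for name, handler in handlers:
        try:
            return name, handler(query)
        except Exception:
            continue   # 429/500 from this endpoint: try the next tier
    if cache and query in cache:
        return "cache", cache[query]
    return "static", "I'm experiencing high demand. Please try again shortly."
```

Returning the tier name alongside the answer lets the monitoring layer count how often each fallback fires, which is itself a useful health signal.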
T3.10 The Production Readiness Checklist
Before going live, verify every item:
Architecture & Infrastructure
- Hosting platform selected (Container Apps / AKS / App Service)
- Private endpoints for all AI services
- Managed Identity (no API keys in code)
- Multi-region deployment (if SLA requires >99.9%)
- Auto-scaling configured with appropriate min/max
- Load testing completed at 2x expected peak
Security
- Prompt injection detection enabled
- Content Safety filters configured
- Input validation on all user inputs
- PII detection and redaction
- RBAC configured with least privilege
- Audit logging for all AI interactions
Quality & Reliability
- Evaluation pipeline running (offline + online)
- Hallucination rate measured and <5%
- Groundedness score >0.85
- Circuit breaker and fallback chain configured
- Rate limiting per user/application
- Retry with exponential backoff for transient errors
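The last item in the list above, retry with exponential backoff, is small enough to show in full. The `sleep` function is injectable so the policy can be tested without waiting; attempt counts and delays are illustrative defaults:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry transient errors (429s, timeouts) with exponential backoff plus
    jitter, re-raising the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Pair this with the circuit breaker: retry handles brief transients, while the breaker stops retries from piling onto an endpoint that is genuinely down.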
Cost Management
- Token budget per user/team configured
- Cost alerts at 50%, 80%, 100% of budget
- Semantic caching for common queries
- Model tier optimization (mini for simple, full for complex)
- Cost dashboard accessible to stakeholders
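The tiered cost alerts in the checklist above reduce to a one-line check (Azure Cost Management budgets do this natively; this sketch just makes the logic explicit, and the threshold fractions are the checklist's 50/80/100%):

```python
def budget_alerts(spend: float, budget: float, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds that current spend has crossed."""
    return [t for t in thresholds if spend >= t * budget]
```

Each newly crossed threshold would typically fire a notification to the team that owns the budget.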
Monitoring & Operations
- Application Insights with custom AI metrics
- Distributed tracing across agent interactions
- Alerts for latency, errors, cost, quality degradation
- Runbook for common incidents (rate limits, model outage)
- On-call rotation for AI-specific issues
Key Takeaways
- AI is an API problem, not a magic problem. Apply the same engineering rigor you'd use for any production API: rate limiting, caching, circuit breaking, monitoring.
- Container Apps is the sweet spot. For most AI agent workloads, Container Apps gives you auto-scaling (including scale to zero), Dapr integration, and managed infrastructure without Kubernetes complexity.
- APIM is your AI control plane. It gives you semantic caching, token metering, load balancing across regions, and prompt injection detection in one gateway.
- Stream everything. Users don't mind waiting 3 seconds for an answer if they see tokens appearing after 200ms. Streaming is non-negotiable for interactive AI.
- Cost is the silent killer. A runaway AI agent can burn through thousands of dollars in hours. Token budgets, semantic caching, and model tier optimization are as important as the AI logic itself.
FrootAI T3: Production AI is where engineering meets intelligence. Build it like infrastructure. Monitor it like a service. Budget it like a business.