T3: Production Architecture Patterns

Duration: 60–90 minutes | Level: Strategic | Part of: 🍎 FROOT Transformation Layer | Prerequisites: O4 (Azure AI Platform), O5 (AI Infrastructure) | Last Updated: March 2026


T3.1 Production AI is Different​

Your POC worked beautifully in the demo. Now you need to serve 10,000 users concurrently, handle API rate limits, manage costs, respond in under 2 seconds, never hallucinate on financial data, and stay available 99.9% of the time.

Welcome to production AI.


T3.2 The AI Application Architecture Stack​

Every production AI system is built from the same stack of layers. Missing any one of them is a production incident waiting to happen.


T3.3 Hosting Patterns: Where Agents Live​

Pattern Comparison​

Decision Matrix​

| Criterion | Container Apps | AKS | App Service | Functions | Copilot Studio |
|---|---|---|---|---|---|
| Complexity | Low-Medium | High | Low | Low | Very Low |
| Scaling | Auto (0→N) | Auto (custom) | Manual/Auto | Auto (0→N) | Managed |
| GPU Support | ✅ Preview | ✅ Full | ❌ | ❌ | N/A |
| Long-running | ✅ | ✅ | ✅ | ⚠️ (max 10 min) | ✅ |
| WebSocket/SSE | ✅ | ✅ | ✅ | ❌ | ✅ |
| Dapr sidecar | ✅ Built-in | ✅ Add-on | ❌ | ❌ | N/A |
| Cost at scale | 💰💰 | 💰💰💰 | 💰💰 | 💰 | 💰💰 |
| Best for | AI APIs, agents | ML serving, multi-model | Simple APIs | Event-driven AI | Business users |

Container Apps β€” The Sweet Spot for Most AI Workloads​


T3.4 API Gateway for AI​

Azure API Management (APIM) becomes critical for production AI β€” it's the control plane for all AI traffic.

AI Gateway Capabilities​

| Capability | What It Does | Why It Matters |
|---|---|---|
| Semantic Caching | Cache similar queries, not just identical ones | 30-50% cost reduction on repeated patterns |
| Token Rate Limiting | Limit tokens/minute per user or app | Prevent runaway costs |
| Load Balancing | Distribute across multiple Azure OpenAI instances | Handle rate limits, improve availability |
| Circuit Breaking | Stop calling failing endpoints | Protect against cascading failures |
| Token Metering | Track token consumption per user/team | Cost allocation and chargeback |
| Content Safety | Pre-screen requests before they hit models | Prevent policy violations |
| Prompt Injection Detection | Detect and block injection attempts | Security guardrail |

Multi-Region AI Gateway​


T3.5 Latency Optimization Patterns​

Where Latency Hides​

| Component | Typical Latency | Optimization |
|---|---|---|
| Network to AOAI | 10-50ms | Private endpoints, regional affinity |
| Token generation | 20-80ms per token | Smaller model, shorter output, PTU |
| Embedding generation | 50-200ms | Batch, cache frequently used |
| Vector search | 10-50ms | HNSW index, filter before search |
| Reranking | 100-500ms | Limit to top-20 candidates |
| Total RAG pipeline | 500ms-3s | Parallel retrieval, streaming |

Streaming for Perceived Performance​

Instead of waiting for the full response, stream tokens to the user as they're generated:

Without streaming:  [---- 3 seconds of nothing ----] Full response appears
With streaming: H-e-l-l-o-,- -h-e-r-e-'-s- -y-o-u-r- -a-n-s-w-e-r-... (progressive)

TTFT (Time To First Token) drops from 3s to ~200ms. The user sees progress immediately.
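The pattern can be sketched with a simulated token stream. The generator below stands in for an SDK streaming iterator (e.g. a chat-completions call with `stream=True`); the per-token timing is made up for illustration:

```python
import time

def fake_model_stream(text: str, per_token_s: float = 0.002):
    """Simulates a model emitting one token every per_token_s seconds.

    A stand-in for a real streaming response iterator; timings are
    illustrative only.
    """
    for token in text.split():
        time.sleep(per_token_s)
        yield token + " "

def consume_stream(stream):
    """Returns (time-to-first-token, total latency, full text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # the user sees output here
        parts.append(token)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)

ttft, total, text = consume_stream(
    fake_model_stream("Hello, here is your answer " * 10)
)
print(f"TTFT {ttft * 1000:.0f} ms vs total {total * 1000:.0f} ms")
```

The gap between TTFT and total latency is exactly the waiting a non-streaming UI forces on the user.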

Caching Strategies​

| Cache Type | What It Caches | Hit Rate | Savings |
|---|---|---|---|
| Exact cache | Identical queries | 5-10% | 100% per hit |
| Semantic cache | Similar queries (embedding similarity) | 20-40% | 100% per hit |
| Embedding cache | Document embeddings | 80%+ | Avoid re-embedding |
| Context cache | RAG retrieval results | 30-50% | Skip retrieval step |

T3.6 Cost Control Architecture​

Token Economics​

Cost per request = (input_tokens Γ— input_rate) + (output_tokens Γ— output_rate)

Example (GPT-4o, March 2026):
System message: 800 tokens Γ— $2.50/1M = $0.002
User message: 200 tokens Γ— $2.50/1M = $0.0005
RAG context: 2,000 tokens Γ— $2.50/1M = $0.005
Output: 500 tokens Γ— $10.00/1M = $0.005
─────────────────────────────────────────────────────
Total per request: $0.0125

At 100K requests/day = $1,250/day = $37,500/month
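The arithmetic above can be captured in a small helper. The rates are the GPT-4o list prices quoted in the example, expressed per 1M tokens:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost per request in USD; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Reproduce the worked example: 800 system + 200 user + 2,000 RAG
# context tokens in, 500 tokens out.
cost = request_cost(input_tokens=800 + 200 + 2_000, output_tokens=500,
                    input_rate=2.50, output_rate=10.00)
print(f"${cost:.4f} per request")             # $0.0125
print(f"${cost * 100_000:,.0f} per day")      # $1,250
print(f"${cost * 100_000 * 30:,.0f} / month") # $37,500
```

Note how the system message and RAG context dominate input cost: trimming those is usually the cheapest optimization available.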

Cost Optimization Decision Tree​


T3.7 Multi-Agent Production Patterns​

Pattern 1: Supervisor Agent​

When: Clear domain boundaries, need routing intelligence, want centralized control.

Pattern 2: Pipeline (Sequential Handoff)​

When: Document processing, data pipelines, workflows with clear sequential steps.

Pattern 3: Swarm (Peer-to-Peer)​

When: Creative tasks, complex reasoning, research β€” agents negotiate and collaborate without a central controller.

Hosting Multi-Agent: The Microservices Approach​

Agent 1 (Supervisor)   β†’ Container App (scale 2-10)
Agent 2 (Billing) β†’ Container App (scale 0-5)
Agent 3 (Tech Support) β†’ Container App (scale 0-5)
Agent 4 (Product) β†’ Container App (scale 0-5)

Communication: Dapr pub/sub (async) or HTTP (sync)
State: Cosmos DB (conversation memory)
Observability: Application Insights (distributed tracing)
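A minimal sketch of the supervisor's routing step, assuming hypothetical agent names and keyword sets. Production supervisors typically delegate intent classification to an LLM call rather than keyword matching:

```python
# Hypothetical routing table for the agents above.
AGENT_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "billing"},
    "tech_support": {"error", "crash", "login", "bug"},
    "product": {"feature", "pricing", "plan", "product"},
}

def route(query: str, default: str = "supervisor") -> str:
    """Pick the agent whose keyword set best matches the query.

    A deliberately simple stand-in for the supervisor's routing
    decision; an LLM classifier replaces this in production."""
    words = set(query.lower().split())
    scores = {agent: len(words & kws) for agent, kws in AGENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route("I was charged twice on my invoice"))  # billing
print(route("hello there"))                        # supervisor (no match)
```

Whatever implements `route`, the supervisor's dispatch (HTTP or Dapr pub/sub) and the downstream agents stay unchanged, which is the point of keeping routing in one place.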

T3.8 Monitoring & Observability for AI​

The AI Observability Stack​

| Layer | What to Monitor | Tool |
|---|---|---|
| Infrastructure | CPU, memory, GPU, network | Azure Monitor, Container Insights |
| API | Latency, throughput, errors, rate limits | APIM Analytics, App Insights |
| Model | Token usage, TTFT, quality scores | Custom metrics in App Insights |
| Quality | Hallucination rate, groundedness, relevance | LLM evaluation pipeline |
| Cost | Token consumption, cost per request, per user | Cost Management + custom dashboards |
| Safety | Content filter triggers, injection attempts | Azure AI Content Safety logs |

Key Metrics Dashboard​

| Metric | Formula | Alert Threshold |
|---|---|---|
| TTFT P95 | 95th percentile Time To First Token | >1 second |
| Total Latency P95 | 95th percentile end-to-end | >5 seconds |
| Token Cost/Request | Total tokens × rate / request count | >$0.05/request |
| Error Rate | Failed requests / total requests | >2% |
| Rate Limit Hits | 429 responses / total requests | >5% |
| Groundedness Score | Avg score from evaluation pipeline | <0.85 |
| User Satisfaction | Thumbs up / (thumbs up + thumbs down) | <80% |
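The P95 metrics above can be computed with the nearest-rank percentile convention (one common choice; monitoring backends may interpolate differently). The sample values and the 1-second alert check below mirror the TTFT row:

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Illustrative TTFT samples in milliseconds, not real measurements.
ttft_ms = [180, 210, 250, 900, 1200] * 20
value = p95(ttft_ms)
print(f"TTFT P95 = {value} ms", "ALERT" if value > 1000 else "ok")
```

P95 rather than the mean is deliberate: a handful of slow requests can hide behind a healthy average while still burning your worst-affected users.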

T3.9 Resilience Patterns​

Circuit Breaker for AI Endpoints​

Fallback Chain​

Primary:    Azure OpenAI (East US) GPT-4o
↓ (429 or 500)
Fallback 1: Azure OpenAI (West US) GPT-4o
↓ (429 or 500)
Fallback 2: Azure OpenAI (East US) GPT-4o-mini (degraded quality, lower cost)
↓ (both down)
Fallback 3: Cached response for similar queries
↓ (no cache hit)
Fallback 4: "I'm experiencing high demand. Please try again shortly."
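The chain above can be sketched as an ordered list of callables. The endpoint names and stub functions are hypothetical; real fallbacks would be SDK clients for the deployments listed:

```python
def with_fallbacks(query, endpoints, cache_lookup):
    """Try each endpoint in order; on failure fall through, then try
    the semantic cache, then return a static apology."""
    for name, call in endpoints:
        try:
            return call(query)
        except Exception:
            continue  # 429/500: move down the chain
    cached = cache_lookup(query)
    if cached is not None:
        return cached
    return "I'm experiencing high demand. Please try again shortly."

def down(query):
    raise ConnectionError("429 Too Many Requests")

endpoints = [
    ("aoai-eastus-gpt4o", down),  # primary and fallback 1 are down
    ("aoai-westus-gpt4o", down),
    ("aoai-eastus-gpt4o-mini", lambda q: f"[mini] answer to: {q}"),
]
print(with_fallbacks("summarize my bill", endpoints,
                     cache_lookup=lambda q: None))
```

In practice each hop in the chain would sit behind its own circuit breaker, so a dead primary is skipped immediately rather than re-probed on every request.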

T3.10 The Production Readiness Checklist​

Before going live, verify every item:

Architecture & Infrastructure​

  • Hosting platform selected (Container Apps / AKS / App Service)
  • Private endpoints for all AI services
  • Managed Identity (no API keys in code)
  • Multi-region deployment (if SLA requires >99.9%)
  • Auto-scaling configured with appropriate min/max
  • Load testing completed at 2x expected peak

Security​

  • Prompt injection detection enabled
  • Content Safety filters configured
  • Input validation on all user inputs
  • PII detection and redaction
  • RBAC configured with least privilege
  • Audit logging for all AI interactions

Quality & Reliability​

  • Evaluation pipeline running (offline + online)
  • Hallucination rate measured and <5%
  • Groundedness score >0.85
  • Circuit breaker and fallback chain configured
  • Rate limiting per user/application
  • Retry with exponential backoff for transient errors
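The retry item in the checklist can be sketched as exponential backoff with full jitter. The transient-error types and delay parameters are illustrative; a real client would key retries off 429/5xx status codes and the Retry-After header:

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 4,
                       base_s: float = 0.5, cap_s: float = 8.0):
    """Retry fn on transient errors, doubling the delay cap each
    attempt and sleeping a random amount up to it (full jitter)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            delay = min(cap_s, base_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("429 Too Many Requests")
    return "ok"

print(retry_with_backoff(flaky, base_s=0.01))  # ok, after 2 retries
```

Jitter matters: without it, every client that hit the same 429 retries at the same instant and re-creates the spike that caused the throttling.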

Cost Management​

  • Token budget per user/team configured
  • Cost alerts at 50%, 80%, 100% of budget
  • Semantic caching for common queries
  • Model tier optimization (mini for simple, full for complex)
  • Cost dashboard accessible to stakeholders

Monitoring & Operations​

  • Application Insights with custom AI metrics
  • Distributed tracing across agent interactions
  • Alerts for latency, errors, cost, quality degradation
  • Runbook for common incidents (rate limits, model outage)
  • On-call rotation for AI-specific issues

Key Takeaways​

The Five Rules of Production AI Architecture
  1. AI is an API problem, not a magic problem. Apply the same engineering rigor you'd use for any production API β€” rate limiting, caching, circuit breaking, monitoring.
  2. Container Apps is the sweet spot. For most AI agent workloads, Container Apps gives you auto-scaling (including zero), Dapr integration, and managed infrastructure without Kubernetes complexity.
  3. APIM is your AI control plane. Semantic caching, token metering, load balancing across regions, and prompt injection detection β€” all in one gateway.
  4. Stream everything. Users don't mind waiting 3 seconds for an answer if they see tokens appearing after 200ms. Streaming is non-negotiable for interactive AI.
  5. Cost is the silent killer. A runaway AI agent can burn through thousands of dollars in hours. Token budgets, semantic caching, and model tier optimization are as important as the AI logic itself.

FrootAI T3 β€” Production AI is where engineering meets intelligence. Build it like infrastructure. Monitor it like a service. Budget it like a business.