Voice & Speech AI
Real-time conversational systems on Azure — STT, LLM, TTS, telephony, and the 800ms latency budget.
The Voice AI Problem
Voice AI is the only AI workload where every architectural decision is constrained by a single human-perceptible latency budget: 800ms end-to-end. Text RAG can take 3 seconds. A voice agent that pauses 3 seconds feels broken. This module covers the canonical Azure stack for production voice agents — Speech Service, Communication Services, OpenAI — and the patterns that keep you under budget.
The Latency Budget
- User stops speaking → 0ms
- End-of-utterance detection → 200–400ms
- STT finalization → 50–150ms
- LLM first token → 200–500ms
- TTS first audio → 100–300ms
- First sound reaches user → ~850ms total

Most production voice systems land between 700ms (excellent) and 1500ms (acceptable). Below 500ms feels superhuman. Above 2000ms feels broken.
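Where those numbers come from in practice: below is a minimal sketch of per-turn stage timing that can feed the Application Insights latency breakdown listed in the stack below. The stage names and `BUDGET_MS` values are illustrative assumptions mirroring the table above, not a prescribed schema.

```python
import time

# Illustrative per-stage budgets in ms, mirroring the table above.
BUDGET_MS = {
    "eou_detection": 400,
    "stt_finalization": 150,
    "llm_first_token": 500,
    "tts_first_audio": 300,
}

class TurnLatencyTracker:
    """Per-turn wall-clock marks, measured from the moment the user stops speaking."""

    def __init__(self) -> None:
        self.t0 = time.perf_counter()
        self.marks: dict[str, float] = {}

    def mark(self, stage: str) -> None:
        # Cumulative elapsed ms since the user stopped speaking.
        self.marks[stage] = (time.perf_counter() - self.t0) * 1000

    def report(self) -> None:
        prev = 0.0
        for stage, elapsed in self.marks.items():
            stage_ms = elapsed - prev  # per-stage cost, not cumulative
            flag = " OVER" if stage_ms > BUDGET_MS.get(stage, float("inf")) else ""
            print(f"{stage}: {stage_ms:.0f}ms{flag}")
            prev = elapsed
        print(f"total: {prev:.0f}ms (budget 800ms)")
```

Call `mark()` as each stage completes and ship the per-stage numbers as custom metrics: an 850ms turn that is all LLM first token and an 850ms turn that is all end-of-utterance detection need different fixes.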
The Canonical Azure Stack
- Azure AI Speech — streaming STT, neural TTS with 400+ voices, batch transcription
- Azure Communication Services — PSTN connectivity, Call Automation API, SIP/Direct Routing
- Azure OpenAI — gpt-4o-mini for routing, gpt-4o for content
- Container Apps or AKS — pre-warmed pool to absorb call spikes
- Application Insights — per-call latency breakdown
Streaming STT — Python
```python
import azure.cognitiveservices.speech as speechsdk
from azure.identity import DefaultAzureCredential

# Full ARM resource ID of the Speech resource (placeholder values).
RESOURCE_ID = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<name>"

# Use Managed Identity in production — never API keys
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

speech_config = speechsdk.SpeechConfig(
    auth_token=f"aad#{RESOURCE_ID}#{token.token}",
    region="eastus2",
)
speech_config.speech_recognition_language = "en-US"

# Cut end-of-silence detection to 500ms for faster end-of-utterance turnaround.
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "500"
)

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
)

# Partial hypotheses stream in while the user is still speaking;
# finals arrive once end-of-utterance fires.
recognizer.recognizing.connect(lambda evt: print(f"[partial] {evt.result.text}"))
recognizer.recognized.connect(lambda evt: print(f"[final] {evt.result.text}"))
recognizer.start_continuous_recognition()

# Keep the process alive while recognition runs.
input("Listening (press Enter to stop)\n")
recognizer.stop_continuous_recognition()
```
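The default-microphone config above is for local testing. On a real call, audio arrives from Communication Services rather than a microphone; a sketch of feeding those frames to the same recognizer through a push stream follows. `PushAudioInputStream` and `AudioStreamFormat` are real Speech SDK types; `on_call_audio_frame` is a hypothetical hook for wherever your ACS media stream delivers PCM, and 16kHz 16-bit mono is an assumed format.

```python
import azure.cognitiveservices.speech as speechsdk

# Telephony variant: swap the default microphone for a push stream
# fed by the call's media stream.
stream_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=16000, bits_per_sample=16, channels=1
)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

def on_call_audio_frame(pcm_bytes: bytes) -> None:
    # Hypothetical hook: invoke with each PCM frame received from the
    # ACS media-streaming websocket.
    push_stream.write(pcm_bytes)
```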
Patterns That Work
- Sentence-level TTS streaming — start synthesis at the first sentence boundary (. ! ?) and keep accumulating the next sentence in parallel (see the sketch after this list)
- Filler phrases during the LLM call — "Let me check that for you." hides 500ms of think time
- Backchannel filtering — don't treat "uh-huh" or "okay" as a turn
- Short conversation memory — last 5–10 turns only; longer prompts add latency on every turn
- gpt-4o-mini for routing, gpt-4o for content — 80% cost reduction with quality preserved
- Two-region active-active — sub-second failover via Front Door
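A minimal sketch of the sentence-level streaming pattern from the first bullet, assuming `token_stream` is an async iterator of text chunks from the LLM and `synthesize` is an async callable that queues one sentence for TTS; both are hypothetical stand-ins for your OpenAI and Speech clients.

```python
import re

# Naive sentence boundary: punctuation followed by whitespace.
# Production code needs to handle abbreviations, numbers, quotes, etc.
SENTENCE_END = re.compile(r"[.!?]\s")

async def stream_llm_to_tts(token_stream, synthesize) -> None:
    """Flush each completed sentence to TTS while the LLM is still generating."""
    buffer = ""
    async for chunk in token_stream:
        buffer += chunk
        match = SENTENCE_END.search(buffer)
        while match:
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            # First audio starts here, after one sentence, not after the
            # full completion.
            await synthesize(sentence.strip())
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        await synthesize(buffer.strip())  # flush the trailing fragment
```

The effect is that TTS first audio starts at the end of the LLM's first sentence rather than at the end of its full response, which is exactly the 2–3 seconds the first anti-pattern below is about.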
Anti-Patterns to Avoid
- Waiting for the full LLM response before starting TTS — adds 2–3 seconds of perceived delay
- Using gpt-4o for every turn — cost explosion with no quality gain on routing decisions
- Long system prompts (>2000 tokens) — adds 200ms+ per turn
- Buffering audio queues without bounds — eventually breaks barge-in detection (see the bounded-queue sketch after this list)
- API keys in source code — security incident waiting to happen
- Cross-region STT + LLM + TTS — adds 100–200ms per hop and breaks the budget
- No PII redaction on transcripts — regulatory and reputational risk
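On the unbounded-buffer anti-pattern: a minimal sketch of a bounded playback queue that keeps barge-in responsive, assuming asyncio and 20ms PCM frames (the frame size and queue depth are illustrative).

```python
import asyncio

# At most ~1 second of audio (fifty 20ms frames) may sit unplayed;
# anything more and the agent keeps talking long after a barge-in.
playback_queue: asyncio.Queue = asyncio.Queue(maxsize=50)

async def enqueue_tts_audio(frame: bytes) -> None:
    # Blocks when the queue is full: backpressure on the synthesizer
    # instead of unbounded growth.
    await playback_queue.put(frame)

def flush_on_barge_in() -> None:
    # The caller started speaking: drop everything not yet played so
    # the agent goes quiet within one frame, not seconds later.
    while not playback_queue.empty():
        playback_queue.get_nowait()
```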
Related
- Solution Play 04 — Call Center Voice AI (full reference implementation)
- Solution Play 14 — Cost-Optimized AI Gateway (voice routing patterns)
- Solution Play 17 — AI Observability (voice latency dashboards)
- Module R3 — Deterministic AI (matters more in voice — no second chance to read)
- Module O5 — AI Infrastructure (Container Apps for the call handler)
- Module T2 — Responsible AI (content safety on voice output, accent bias audits)
The full V1 module (~30KB, 10 sections, code samples in Python) ships in the FrootAI MCP server. Run npx frootai-mcp and ask your AI assistant "get module V1" for the complete content.