Voice & Speech AI
Real-time conversational systems on Azure — STT, LLM, TTS, telephony, and the 800ms latency budget.
The Voice AI Problem
Voice AI is the only AI workload where every architectural decision is constrained by a single human-perceptible latency budget: 800ms end-to-end. Text RAG can take 3 seconds. A voice agent that pauses 3 seconds feels broken. This module covers the canonical Azure stack for production voice agents — Speech Service, Communication Services, OpenAI — and the patterns that keep you under budget.
The Latency Budget
- User stops speaking → 0ms
- End-of-utterance detection → 200–400ms
- STT finalization → 50–150ms
- LLM first token → 200–500ms
- TTS first audio → 100–300ms
- First sound reaches user → ~850ms total

Most production voice systems land between 700ms (excellent) and 1500ms (acceptable). Below 500ms feels superhuman. Above 2000ms feels broken.
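Where those numbers come from in practice: below is a minimal sketch of per-turn stage timing that can feed the Application Insights latency breakdown listed in the stack below. The stage names and `BUDGET_MS` values are illustrative assumptions mirroring the table above, not a prescribed schema.

```python
import time

# Illustrative per-stage budgets in ms, mirroring the table above.
BUDGET_MS = {
    "eou_detection": 400,
    "stt_finalization": 150,
    "llm_first_token": 500,
    "tts_first_audio": 300,
}

class TurnLatencyTracker:
    """Per-turn wall-clock marks, measured from the moment the user stops speaking."""

    def __init__(self) -> None:
        self.t0 = time.perf_counter()
        self.marks: dict[str, float] = {}

    def mark(self, stage: str) -> None:
        # Cumulative elapsed ms since the user stopped speaking.
        self.marks[stage] = (time.perf_counter() - self.t0) * 1000

    def report(self) -> None:
        prev = 0.0
        for stage, elapsed in self.marks.items():
            stage_ms = elapsed - prev  # per-stage cost, not cumulative
            flag = " OVER" if stage_ms > BUDGET_MS.get(stage, float("inf")) else ""
            print(f"{stage}: {stage_ms:.0f}ms{flag}")
            prev = elapsed
        print(f"total: {prev:.0f}ms (budget 800ms)")
```

Call `mark()` as each stage completes and ship the per-stage numbers as custom metrics: an 850ms turn that is all LLM first token and an 850ms turn that is all end-of-utterance detection need different fixes.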
The Canonical Azure Stack
- Azure AI Speech — streaming STT, neural TTS with 400+ voices, batch transcription
- Azure Communication Services — PSTN connectivity, Call Automation API, SIP/Direct Routing
- Azure OpenAI — gpt-4o-mini for routing, gpt-4o for content
- Container Apps or AKS — pre-warmed pool to absorb call spikes
- Application Insights — per-call latency breakdown
Streaming STT — Python
```python
import azure.cognitiveservices.speech as speechsdk
from azure.identity import DefaultAzureCredential

# Full ARM resource ID of the Speech resource (placeholder values).
RESOURCE_ID = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<name>"

# Use Managed Identity in production — never API keys
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

speech_config = speechsdk.SpeechConfig(
    auth_token=f"aad#{RESOURCE_ID}#{token.token}",
    region="eastus2",
)
speech_config.speech_recognition_language = "en-US"

# Cut end-of-silence detection to 500ms for faster end-of-utterance turnaround.
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "500"
)

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
)

# Partial hypotheses stream in while the user is still speaking;
# finals arrive once end-of-utterance fires.
recognizer.recognizing.connect(lambda evt: print(f"[partial] {evt.result.text}"))
recognizer.recognized.connect(lambda evt: print(f"[final] {evt.result.text}"))
recognizer.start_continuous_recognition()

# Keep the process alive while recognition runs.
input("Listening (press Enter to stop)\n")
recognizer.stop_continuous_recognition()
```
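The default-microphone config above is for local testing. On a real call, audio arrives from Communication Services rather than a microphone; a sketch of feeding those frames to the same recognizer through a push stream follows. `PushAudioInputStream` and `AudioStreamFormat` are real Speech SDK types; `on_call_audio_frame` is a hypothetical hook for wherever your ACS media stream delivers PCM, and 16kHz 16-bit mono is an assumed format.

```python
import azure.cognitiveservices.speech as speechsdk

# Telephony variant: swap the default microphone for a push stream
# fed by the call's media stream.
stream_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=16000, bits_per_sample=16, channels=1
)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

def on_call_audio_frame(pcm_bytes: bytes) -> None:
    # Hypothetical hook: invoke with each PCM frame received from the
    # ACS media-streaming websocket.
    push_stream.write(pcm_bytes)
```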
Patterns That Work
- Sentence-level TTS streaming — start synthesis at the first sentence boundary (. ! ?) and keep accumulating the next sentence in parallel (see the sketch after this list)
- Filler phrases during the LLM call — "Let me check that for you." hides 500ms of think time
- Backchannel filtering — don't treat "uh-huh" or "okay" as a turn
- Short conversation memory — last 5–10 turns only; longer prompts add latency on every turn
- gpt-4o-mini for routing, gpt-4o for content — 80% cost reduction with quality preserved
- Two-region active-active — sub-second failover via Front Door
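A minimal sketch of the sentence-level streaming pattern from the first bullet, assuming `token_stream` is an async iterator of text chunks from the LLM and `synthesize` is an async callable that queues one sentence for TTS; both are hypothetical stand-ins for your OpenAI and Speech clients.

```python
import re

# Naive sentence boundary: punctuation followed by whitespace.
# Production code needs to handle abbreviations, numbers, quotes, etc.
SENTENCE_END = re.compile(r"[.!?]\s")

async def stream_llm_to_tts(token_stream, synthesize) -> None:
    """Flush each completed sentence to TTS while the LLM is still generating."""
    buffer = ""
    async for chunk in token_stream:
        buffer += chunk
        match = SENTENCE_END.search(buffer)
        while match:
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            # First audio starts here, after one sentence, not after the
            # full completion.
            await synthesize(sentence.strip())
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        await synthesize(buffer.strip())  # flush the trailing fragment
```

The effect is that TTS first audio starts at the end of the LLM's first sentence rather than at the end of its full response, which is exactly the 2–3 seconds the first anti-pattern below is about.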
Anti-Patterns to Avoid
- Waiting for the full LLM response before starting TTS — adds 2–3 seconds of perceived delay
- Using gpt-4o for every turn — cost explosion with no quality gain on routing decisions
- Long system prompts (>2000 tokens) — adds 200ms+ per turn
- Buffering audio queues without bounds — eventually breaks barge-in detection (see the bounded-queue sketch after this list)
- API keys in source code — security incident waiting to happen
- Cross-region STT + LLM + TTS — adds 100–200ms per hop and breaks the budget
- No PII redaction on transcripts — regulatory and reputational risk
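On the unbounded-buffer anti-pattern: a minimal sketch of a bounded playback queue that keeps barge-in responsive, assuming asyncio and 20ms PCM frames (the frame size and queue depth are illustrative).

```python
import asyncio

# At most ~1 second of audio (fifty 20ms frames) may sit unplayed;
# anything more and the agent keeps talking long after a barge-in.
playback_queue: asyncio.Queue = asyncio.Queue(maxsize=50)

async def enqueue_tts_audio(frame: bytes) -> None:
    # Blocks when the queue is full: backpressure on the synthesizer
    # instead of unbounded growth.
    await playback_queue.put(frame)

def flush_on_barge_in() -> None:
    # The caller started speaking: drop everything not yet played so
    # the agent goes quiet within one frame, not seconds later.
    while not playback_queue.empty():
        playback_queue.get_nowait()
```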
Related
- Solution Play 04 — Call Center Voice AI (full reference implementation)
- Solution Play 14 — Cost-Optimized AI Gateway (voice routing patterns)
- Solution Play 17 — AI Observability (voice latency dashboards)
- Module R3 — Deterministic AI (matters more in voice — no second chance to read)
- Module O5 — AI Infrastructure (Container Apps for the call handler)
- Module T2 — Responsible AI (content safety on voice output, accent bias audits)
The full V1 module (~30KB, 10 sections, code samples in Python) ships in the FrootAI MCP server. Run npx frootai-mcp and ask your AI assistant "get module V1" for the complete content.