Voice search & voice chat

FrootAI.dev ships browser-native voice input across every search bar and chat surface. Tap the mic icon, speak naturally, and the engine turns your sentence into keywords that find the right primitive, play, or recipe. For Agent FAI it goes one step further: keywords are expanded into a grounded-context block that's injected into the LLM prompt, so the model answers from our source of truth (SoT) instead of guessing.

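🔒 Privacy: transcription is handled by your browser through the Web Speech API, and no audio is ever sent to FrootAI servers. Depending on the browser, recognition runs either on-device or in the browser vendor's cloud service (see the browser matrix in §4). We log only aggregate counters (mic on/off, transcript length bucket, language tag) via cookie-free Plausible analytics. See Data protection §2.5 for the full story.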

§1 — Where the mic appears

Surface	Auto-search on final	Continuous mode	Settings popover
Orchard search	✅	❌	❌
Primitives index	✅	❌	❌
Primitives / [category]	✅	❌	❌
Solution Plays	✅	❌	❌
Marketplace	✅	❌	❌
Registry-site	✅	❌	❌
Playground	✅	❌	❌
Workflows	✅	❌	❌
Cookbook	✅	❌	❌
Chatbot (Agent FAI full)	✅ auto-send	✅	✅
AgentFaiWidget (mini chat)	✅ auto-send	✅	✅

The nine catalog search bars use single-utterance mode in English by default: they are short-form lookup interfaces, and a gear-icon settings popover would add visual noise. Both chat surfaces expose the gear icon next to the mic so you can pick a different language and switch hands-free mode on for a long brainstorming session.

§2 — Supported languages

The settings popover offers eleven BCP-47 language tags out of the box:

`en-US`, `en-GB`, `es-ES`, `fr-FR`, `de-DE`, `pt-BR`, `hi-IN`, `ja-JP`, `ko-KR`, `zh-CN`, `ar-SA`.

Your choice is saved in `localStorage` (`frootai-voice-prefs-v1`) so it persists across visits. The default on first load is whatever your browser reports via `navigator.language`.
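A minimal sketch of that persistence, for orientation only. The storage key comes from this page; the `VoicePrefs` shape, the function names, and the JSON encoding are assumptions, not the shipped implementation.

```typescript
// Minimal storage surface so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical shape of the stored preferences (an assumption).
interface VoicePrefs {
  lang: string;        // BCP-47 tag, e.g. "en-US"
  continuous: boolean; // hands-free mode on/off
}

const VOICE_PREFS_KEY = "frootai-voice-prefs-v1"; // key named on this page

// Read saved prefs, falling back to the browser language on first load
// or when the stored value is missing, malformed, or incomplete.
function loadVoicePrefs(store: KeyValueStore, browserLang: string): VoicePrefs {
  const fallback: VoicePrefs = { lang: browserLang, continuous: false };
  const raw = store.getItem(VOICE_PREFS_KEY);
  if (raw === null) return fallback;
  try {
    const parsed = JSON.parse(raw) as Partial<VoicePrefs> | null;
    return {
      lang: typeof parsed?.lang === "string" ? parsed.lang : fallback.lang,
      continuous:
        typeof parsed?.continuous === "boolean" ? parsed.continuous : fallback.continuous,
    };
  } catch {
    return fallback; // corrupted JSON: ignore it and use defaults
  }
}

function saveVoicePrefs(store: KeyValueStore, prefs: VoicePrefs): void {
  store.setItem(VOICE_PREFS_KEY, JSON.stringify(prefs));
}
```

In a browser this would be called as `loadVoicePrefs(window.localStorage, navigator.language)`. Passing the store in keeps the logic testable without a DOM.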

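Actual transcription quality depends on your browser's engine: Chrome and Edge use Google's cloud recognition, while Safari uses Apple's on-device model. With a cloud engine, it is the browser that sends audio to its vendor's service, not FrootAI. We don't choose the engine for you; we only pass the BCP-47 tag to the browser.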

§3 — Hands-free (continuous) mode

By default the mic listens for one utterance, fires the final transcript on pause, and turns itself off. That's the right behaviour for one-shot questions.

For long brainstorming sessions, open the gear popover next to the mic and toggle Hands-free mode on. The recognition session stays open across pauses; each pause fires a fresh final transcript and Agent FAI replies. Tap the mic again to stop.
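The difference between the two modes mostly comes down to the standard `continuous` flag and how results are read. The sketch below shows one way to pull only the new final transcripts out of a result event; the interface and function names are illustrative, and the structural types stand in for the DOM typings so the code runs anywhere.

```typescript
// Structural stand-ins for the Web Speech API result objects.
interface AlternativeLike { transcript: string }
interface ResultLike {
  isFinal: boolean;
  readonly length: number;
  [index: number]: AlternativeLike;
}
interface ResultListLike {
  readonly length: number;
  [index: number]: ResultLike;
}
interface ResultEventLike {
  resultIndex: number; // index of the first result that changed in this event
  results: ResultListLike;
}

// Return the final transcripts that are new in this event. In continuous
// mode `results` keeps growing across pauses, so we start at `resultIndex`
// instead of re-reading utterances that were already handled.
function newFinalTranscripts(ev: ResultEventLike): string[] {
  const finals: string[] = [];
  for (let i = ev.resultIndex; i < ev.results.length; i++) {
    const result = ev.results[i];
    if (result.isFinal && result.length > 0) {
      finals.push(result[0].transcript.trim());
    }
  }
  return finals;
}
```

In the browser, a handler along the lines of `recognition.onresult = (ev) => newFinalTranscripts(ev).forEach(send)` would forward each utterance, with `recognition.continuous` set from the hands-free toggle. Here `send` is a hypothetical callback.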

§4 — Browser support matrix

The Web Speech API is widely deployed but not universal. The mic button auto-hides itself entirely when the API is unavailable, so the search bar / chat input never shows a broken control.

Browser	Status	Notes
Chrome desktop	✅ full	Uses Google cloud recognition
Chrome Android	✅ full	Uses Google cloud recognition
Edge desktop	✅ full	Same engine as Chrome
Safari 14.1+ (macOS)	✅ full	On-device recognition
Safari on iOS 14.5+	✅ full	On-device recognition
Brave / Vivaldi / Arc	✅ full	Chromium-based
Firefox desktop	❌ off by default	Requires `media.webspeech.recognition.enable` flag in `about:config`. Until flipped, mic auto-hides.
Firefox Android	❌ unsupported	Same as desktop
Tor Browser	❌ unsupported	API blocked by privacy hardening
Older browsers	❌ unsupported	Mic hides automatically
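The auto-hide behaviour boils down to a feature check for the standard and the `webkit`-prefixed constructor. A minimal sketch, assuming the check is done against a global-like object; the function names are hypothetical.

```typescript
type RecognitionCtor = new () => unknown;

// Resolve the recognition constructor from a global-like object, or null
// when neither the standard nor the prefixed name is exposed.
function resolveRecognitionCtor(scope: Record<string, unknown>): RecognitionCtor | null {
  const candidate = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof candidate === "function" ? (candidate as RecognitionCtor) : null;
}

// The mic button would render only when a constructor was found.
function shouldShowMic(scope: Record<string, unknown>): boolean {
  return resolveRecognitionCtor(scope) !== null;
}
```

In a page this would be called as `shouldShowMic(window as unknown as Record<string, unknown>)`. Note that the check only proves the constructor exists; it says nothing about microphone permission.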

You'll also need to grant microphone permission the first time the page asks. If you previously denied it, the mic button still appears but tapping it fails silently; re-enable the permission in your browser's site settings.
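For anyone debugging the silent failure: the browser reports the cause through the `error` field of the recognition error event. The sketch below maps the standard error codes to a hint. The codes are from the Web Speech API; the function name and the wording are illustrative, and the site does not necessarily surface these messages.

```typescript
// Map a SpeechRecognitionErrorEvent `error` code to a human-readable hint.
function voiceErrorHint(code: string): string {
  switch (code) {
    case "not-allowed":
    case "service-not-allowed":
      return "Microphone access is blocked. Re-enable it in your browser's site settings, then reload.";
    case "audio-capture":
      return "No microphone was found.";
    case "no-speech":
      return "No speech was detected. Tap the mic and try again.";
    case "network":
      return "The browser's recognition service could not be reached.";
    default:
      return "Voice input failed (" + code + ").";
  }
}
```

A handler such as `recognition.onerror = (ev) => console.warn(voiceErrorHint(ev.error))` is enough to see which case you are hitting.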

§5 — Agent FAI grounding (what makes voice answers smart)

When you speak (or type) a sentence at Agent FAI, we don't just send the raw text to the LLM. We run a deterministic grounding step first:

1. Tokenize the sentence and strip voice fillers ("could", "please", "show me", "how do I", etc.).
2. Expand the remaining tokens via our shared synonym pack, so "rag" pulls in "retrieval", "vector", "search"; "chatbot" pulls in "agent", "assistant"; "infra" pulls in "infrastructure", "bicep", "terraform".
3. Run smartSearch across five primitive catalogs (agents, skills, instructions, hooks, plugins) using the same per-catalog presets the visual search uses.
4. Inject the top matches as a `[GROUNDED CONTEXT]` block at the end of the user message before it goes to the LLM. The model sees the user's question plus the canonical names/descriptions of the primitives we already shipped that solve it.
5. Show the extracted keywords and matched primitives as chips above the input, so you can see what we matched on (and click through directly to the primitive doc if you don't even need the LLM answer).
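The steps above can be sketched roughly as follows. This is a simplified illustration, not the contents of ground-query.ts: the filler list and synonym entries are copied from the examples in this section, while `searchPrimitives` is a naive substring scorer standing in for smartSearch, and all names and the exact block layout are assumptions.

```typescript
// Hypothetical filler pattern and synonym pack (examples from this page).
const FILLER_RE = /\b(?:show me|how do i|could|please)\b/g;
const SYNONYMS = new Map<string, string[]>([
  ["rag", ["retrieval", "vector", "search"]],
  ["chatbot", ["agent", "assistant"]],
  ["infra", ["infrastructure", "bicep", "terraform"]],
]);

interface Primitive { name: string; description: string }

// Step 1: lowercase, drop punctuation, strip fillers, split into tokens.
function extractKeywords(sentence: string): string[] {
  return sentence
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, " ")
    .replace(FILLER_RE, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Step 2: add synonyms for each keyword, without duplicates.
function expandKeywords(keywords: string[]): string[] {
  const expanded: string[] = [];
  for (const keyword of keywords) {
    for (const term of [keyword, ...(SYNONYMS.get(keyword) ?? [])]) {
      if (!expanded.includes(term)) expanded.push(term);
    }
  }
  return expanded;
}

// Step 3 (stand-in for smartSearch): rank by number of matching terms.
function searchPrimitives(terms: string[], catalog: Primitive[], limit = 3): Primitive[] {
  return catalog
    .map((primitive) => {
      const haystack = (primitive.name + " " + primitive.description).toLowerCase();
      return { primitive, score: terms.filter((t) => haystack.includes(t)).length };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.primitive);
}

// Step 4: append the matches to the end of the user message.
function groundMessage(userMessage: string, catalog: Primitive[]) {
  const keywords = extractKeywords(userMessage);
  const matches = searchPrimitives(expandKeywords(keywords), catalog);
  if (matches.length === 0) return { message: userMessage, keywords, matches };
  const lines = matches.map((m) => "- " + m.name + ": " + m.description);
  const message = userMessage + "\n\n[GROUNDED CONTEXT]\n" + lines.join("\n");
  return { message, keywords, matches };
}
```

The returned `keywords` and `matches` are what step 5 would render as chips. Because every step is a pure function, the same input always yields the same grounding block.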

Together, voice and grounding turn a casual sentence into a query aimed at the right SoT entry, with a deterministic lookup rather than model guesswork deciding which primitives are relevant. See ground-query.ts for the source.

§6 — Privacy and analytics

We track four cookie-free events via Plausible. None carries audio or a full transcript; the only text that is ever sent is the two-word `query_hint` on searches that return nothing:

`voice_start` { surface, lang, continuous }: mic turned on
`voice_final` { surface, transcript_length, lang }: final transcript received (NO content)
`search_quality` { bucket, query_length }: applies to all searches, not just voice
`search_no_results` { query_hint }: first two words only, lowercased, for documentation-gap analysis

No audio, no full transcript or query text, no user identifier. Recognition is handled by the browser; we only count that it happened. See Data protection §2.5 for the legal text.
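To make the "counts, not content" idea concrete, here is a small sketch of how such payloads could be built. The `plausible(eventName, { props })` call shape is Plausible's custom-event API; the helper names and the bucket boundaries are assumptions, since the real ones are not documented on this page.

```typescript
type PlausibleFn = (
  eventName: string,
  options?: { props?: Record<string, string | number | boolean> },
) => void;

// Hypothetical bucketing of transcript length.
function lengthBucket(length: number): string {
  if (length === 0) return "0";
  if (length <= 20) return "1-20";
  if (length <= 80) return "21-80";
  return "81+";
}

// First two words only, lowercased, as described for `search_no_results`.
function queryHint(query: string): string {
  return query
    .trim()
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .slice(0, 2)
    .join(" ");
}

function trackVoiceFinal(
  plausible: PlausibleFn,
  surface: string,
  transcript: string,
  lang: string,
): void {
  // Only the length bucket leaves this function; the transcript does not.
  plausible("voice_final", {
    props: { surface, transcript_length: lengthBucket(transcript.length), lang },
  });
}
```

Passing the `plausible` function in as a parameter makes it easy to assert in a test that no transcript text ends up in the props.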

§7 — Troubleshooting

The mic icon doesn't appear at all. Your browser doesn't expose the Web Speech API. On Firefox, enable the flag in `about:config` (see browser matrix above). Otherwise, switch to a Chromium-based browser or Safari.

The mic appears but tapping it does nothing. You probably denied microphone permission earlier. Re-enable it in your browser's site settings for frootai.dev, then reload.

The transcript is wrong / in the wrong language. Open the gear icon next to the mic and pick the right BCP-47 language tag. Your choice persists across sessions.

Agent FAI didn't use the grounded context. Look at the chips above the input — if they show keywords and primitives, the grounding block was injected. If the LLM ignored it anyway, that's a model limitation; rephrase more directly ("using <primitive name>") to nudge it.

§8 — Automated end-to-end testing (verdict)

Decision: we are NOT shipping Playwright end-to-end coverage for voice. The voice rollout is feature-complete at this point.

The reasoning, documented so future contributors don't re-open the question:

The Web Speech API (`SpeechRecognition` / `webkitSpeechRecognition`) has no headless test path. `getUserMedia` can be fed a fake device through Chromium flags such as `--use-fake-device-for-media-stream`, but there is no equivalent flag for faking recognition results. You have to inject a hand-rolled JS shim via Playwright's `page.addInitScript()` that replaces `window.SpeechRecognition` with a fake constructor and manually fires `onresult` / `onerror` / `onend` events on a timer.
The shim has to reproduce the exact `SpeechRecognitionResultList` shape our `useVoiceSearch` hook reads (array-like, numeric indices, nested `[0].transcript`, `isFinal` boolean) — most off-the-shelf mocks get this wrong.
The shim only works on Chromium; Firefox and WebKit specs would all be `test.skip()` with a doc comment.
The new mic chirp creates an `AudioContext` on toggle, so the test also needs `--autoplay-policy=no-user-gesture-required` or the cue silently no-ops and pollutes assertions.
External signal confirms the niche: as of mid-2026 there are zero Stack Overflow questions tagged `web-speech-api + playwright` and zero open or closed Playwright issues matching `SpeechRecognition mock`. There is no community-validated pattern to copy.
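For reference, the result shape mentioned above looks roughly like this when built by hand. This is only an illustration of why shims are fiddly, not a shim we ship; the type and function names are hypothetical, and a real shim would construct these objects inside the function passed to `page.addInitScript()`.

```typescript
interface FakeAlternative { transcript: string; confidence: number }
interface FakeResult {
  isFinal: boolean;
  length: number;
  [index: number]: FakeAlternative;
}
interface FakeResultList {
  length: number;
  [index: number]: FakeResult;
}

// Build an array-like (not a real Array) list with numeric indices, a
// `length`, and a nested `[0].transcript` plus `isFinal` on each result.
function makeResultList(
  entries: Array<{ transcript: string; isFinal: boolean }>,
): FakeResultList {
  const list: FakeResultList = { length: entries.length };
  entries.forEach((entry, i) => {
    list[i] = {
      0: { transcript: entry.transcript, confidence: 1 },
      length: 1,
      isFinal: entry.isFinal,
    };
  });
  return list;
}
```

A mock that returns a plain array of strings, or puts `isFinal` on the alternative instead of the result, passes a casual glance but breaks any code that reads `results[i][0].transcript` and `results[i].isFinal`.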

What we DO have covering voice today:

38/38 Node `--test` suites pass on the pure-logic layer voice feeds into (`ground-query`, `smart-search`, `search-presets`).
Manual smoke is straightforward — open `/chatbot`, click the mic, speak, observe the keyword chips and the grounded answer. Same on the 9 catalog search bars.
Plausible `voice_start` / `voice_final` counters give us a real-user health signal (mic-toggle rate, transcript-length distribution per surface, language mix).

If a regression ever surfaces, the cheapest mitigation is a jsdom + Vitest unit test on `useVoiceSearch` (mock `SpeechRecognition` at module scope; assert state transitions, transcript accumulation, error handling, continuous-mode loop). That's ~80% of the failure coverage at ~20% of the effort of a Playwright shim. We have not written it yet because nothing has regressed; we'll write it the first time something does.
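If that day comes, the module-scope mock could start from something like the sketch below. It is a hand-written fake with hypothetical helper methods (`emitFinal`, `emitError`), not code from the repo, and it models only the lifecycle the hook cares about: start, final results, and end, in both single-utterance and continuous mode.

```typescript
interface FakeResultEntry { isFinal: boolean; length: number; 0: { transcript: string } }
interface FakeResultEvent { resultIndex: number; results: FakeResultEntry[] }

class FakeRecognition {
  continuous = false;
  lang = "en-US";
  onresult: ((ev: FakeResultEvent) => void) | null = null;
  onerror: ((ev: { error: string }) => void) | null = null;
  onend: (() => void) | null = null;

  private listening = false;
  private delivered: FakeResultEntry[] = [];

  start(): void {
    // Simplified stand-in for the InvalidStateError a browser raises here.
    if (this.listening) throw new Error("recognition already started");
    this.listening = true;
    this.delivered = []; // each session starts with an empty result list
  }

  stop(): void {
    if (!this.listening) return;
    this.listening = false;
    this.onend?.();
  }

  // Test helper: simulate the user finishing an utterance.
  emitFinal(transcript: string): void {
    if (!this.listening) return;
    this.delivered.push({ 0: { transcript }, length: 1, isFinal: true });
    this.onresult?.({ resultIndex: this.delivered.length - 1, results: this.delivered });
    if (!this.continuous) this.stop(); // single-utterance mode ends the session
  }

  // Test helper: simulate a recognition error, which also ends the session.
  emitError(code: string): void {
    if (!this.listening) return;
    this.onerror?.({ error: code });
    this.stop();
  }
}
```

A Vitest spec would install this class as `SpeechRecognition` on the global object before importing the hook, then drive it with `emitFinal` and `emitError` and assert on the hook's state.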