Recipe: Voice Assistant
This recipe builds a talk-to-it assistant. The ears transcribe what a person says, the lab thinks up a reply, and the mouth speaks it back. All three live inside one Intelligence Lab — the platform’s compound element for hearing, reasoning, and speaking.
The problem it solves
A voice interface needs three capabilities that usually mean three separate vendors — speech-to-text, an LLM, and text-to-speech — plus the glue to pass audio and text between them. Wiring those together (and keeping the voices and models consistent) is most of the work. This recipe uses one lab whose ears, brain, and mouth are parts of the same element, so the round trip from spoken question to spoken answer lives in one place.
Elements
| Element | Role |
|---|---|
| ears | Speech-to-text transcriber inside the lab; turns audio into text. |
| lab | Reasons over the transcript and generates the reply. |
| mouth | Text-to-speech voice inside the lab; speaks the answer. |
Flow
- Create a `lab`: the Intelligence Lab that contains the brain (the model) and provides a home for its `ears` and `mouth`.
- Hear the user. Capture their audio and turn it into text with the `ears` `transcribe` operation (one-shot transcription of a complete audio buffer; for live mic capture there is a streaming route).
- Think. Feed the transcript to the lab: `chat` (or `generate`) produces the assistant’s reply, grounded in whatever context and tools you attached.
- Speak. Turn the reply into audio with the `mouth` `synthesize` operation, choosing a `voice_id` (`list-voices` shows the voices configured on it; PCM gives the lowest latency, ~0.7s to first audio).
- Loop: audio in → `transcribe` → `chat` → `synthesize` → audio out, turn after turn.
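
The flow above can be sketched as a small loop. This is a minimal illustration, not the platform's actual client API: the `VoiceLoop` class, the callable signatures, and the `fake_*` stand-ins are all assumptions made for the example. Only the operation names (transcribe, chat, synthesize) and the `voice_id` parameter come from the recipe.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Assumed signatures for the three operations; the real ones may differ.
Transcribe = Callable[[bytes], str]        # ears: audio -> transcript
Chat = Callable[[str], str]                # lab: transcript -> reply text
Synthesize = Callable[[str, str], bytes]   # mouth: (reply, voice_id) -> audio


@dataclass
class VoiceLoop:
    """Hypothetical wrapper tying the three operations into one turn."""
    transcribe: Transcribe
    chat: Chat
    synthesize: Synthesize
    voice_id: str

    def turn(self, audio_in: bytes) -> bytes:
        # One round trip: audio in -> transcribe -> chat -> synthesize -> audio out.
        transcript = self.transcribe(audio_in)
        reply = self.chat(transcript)
        return self.synthesize(reply, self.voice_id)

    def run(self, audio_turns: Iterable[bytes]) -> Iterator[bytes]:
        # Repeat the round trip, turn after turn.
        for audio_in in audio_turns:
            yield self.turn(audio_in)


# Stand-ins so the sketch runs without any service; replace with real calls.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_synthesize(text: str, voice_id: str) -> bytes:
    return f"[{voice_id}] {text}".encode("utf-8")


loop = VoiceLoop(fake_transcribe, fake_chat, fake_synthesize, voice_id="demo-voice")
for audio_out in loop.run([b"hello", b"what time is it"]):
    print(audio_out)
```

Passing the three operations in as plain callables keeps the loop independent of how the lab is actually reached, so the stand-ins can be swapped for real calls without changing `turn` or `run`.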
What this shows
The lab is a compound element: its ears and mouth are not loose, separately integrated services but parts of the same Intelligence Lab, alongside the brain. That is what makes a voice loop tractable: hearing, reasoning, and speaking share one element, so the transcript flows straight from the ears to the lab’s chat, and the reply to the mouth’s synthesize, without cross-vendor plumbing. Change the voice once on the mouth and every spoken answer changes; swap the brain and the assistant’s reasoning changes, all without touching the speech path.
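
The independence of the parts can be illustrated with a toy model. This is a sketch under stated assumptions: the `Ears`, `Brain`, `Mouth`, and `Lab` classes and their fields (`language`, `model`) are hypothetical and do not reflect the platform's real configuration schema; only `voice_id` is named in the recipe.

```python
from dataclasses import dataclass, replace


# Hypothetical configuration objects, for illustration only.
@dataclass(frozen=True)
class Ears:
    language: str


@dataclass(frozen=True)
class Brain:
    model: str


@dataclass(frozen=True)
class Mouth:
    voice_id: str


@dataclass(frozen=True)
class Lab:
    """One compound element that owns all three parts."""
    ears: Ears
    brain: Brain
    mouth: Mouth


lab = Lab(Ears("en"), Brain("model-a"), Mouth("voice-1"))

# Change the voice once on the mouth; the ears and brain are left as they were.
new_voice = replace(lab, mouth=Mouth("voice-2"))

# Swap the brain; the speech path (ears and mouth) is left as it was.
new_brain = replace(lab, brain=Brain("model-b"))
```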