Building Voice Mode for the Emissary
The Emissary can hear you now.
We shipped voice mode last month — you can speak to the Planetary Emissary instead of typing, and it speaks back. This is a devlog on how it works, what went wrong, and a few decisions worth explaining.
Why voice
Journaling by typing is fine. But there are moments — standing somewhere, noticing something — when getting your phone out and typing feels like the wrong move. It breaks the spell. Voice is faster, more immediate, and for many people more natural when describing a place.
We also wanted voice to feel like a different mode of being with the Emissary. Not faster journaling. A different register of conversation.
The architecture
The voice stack is built on OpenAI's Realtime API, which uses WebRTC to establish a direct audio channel between the browser and OpenAI's servers.
The flow:
- The browser asks our server for a voice session.
- Our server assembles the system prompt and mints an ephemeral key from OpenAI with that prompt embedded.
- The browser uses the ephemeral key to open a WebRTC connection directly to OpenAI — audio in both directions, plus a data channel for events.
- When the session ends, the browser posts the transcript back to our server.
The key detail: the ephemeral key is minted server-side with the full system prompt already embedded. OpenAI's server knows who the Emissary is, what stone traits shape its voice, and what the user's recent journal history looks like — before the user says a word.
This matters because the system prompt includes context we do not want to expose client-side (recent journal entries, location, stone personality). The ephemeral key approach keeps all of that server-side while still allowing the browser to connect directly to OpenAI without proxying audio through our server.
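A sketch of the server-side minting step. The helper names, context shape, and prompt contents are illustrative, and the model name is an assumption; the point is that everything sensitive is folded into the key before the browser is involved.

```typescript
// Context the server already has; none of it should reach the client.
type EmissaryContext = { stoneTraits: string[]; recentEntries: string[] };

// Hypothetical helper: assembles the system prompt from server-side data.
function buildEmissaryPrompt(ctx: EmissaryContext): string {
  return [
    "You are the Planetary Emissary. You are not the user's stone.",
    `Stone traits shaping your register: ${ctx.stoneTraits.join(", ")}`,
    `Recent journal context: ${ctx.recentEntries.join(" / ")}`,
  ].join("\n");
}

// Mints a short-lived key with the prompt already embedded, so the browser
// never sees journal entries, location, or stone personality.
async function mintEphemeralKey(apiKey: string, ctx: EmissaryContext): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview", // assumption: whichever realtime model is configured
      instructions: buildEmissaryPrompt(ctx),
    }),
  });
  const session = await res.json();
  return session.client_secret.value; // the key the browser uses to connect directly
}
```

The browser only ever holds the short-lived `client_secret`, never our real API key or the prompt.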
Transcript handling
During the conversation, two event types come over the WebRTC data channel:
- conversation.item.input_audio_transcription.completed — what the user said (Whisper transcription)
- response.audio_transcript.done — what the Emissary said
We collect these in a ref (not state — avoids stale closures in the cleanup path), and when the user ends the session, we POST the transcript to /api/conversations/[id]/voice-transcript. This saves it to the conversation's message history as regular messages, marked with inputMode: 'voice' in metadata.
The result: voice conversations appear in the same timeline as typed ones. The Emissary remembers what you said whether you typed it or spoke it.
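The accumulation step reduces to a small pure function. A sketch with simplified event shapes (the real Realtime API payloads carry more fields than this):

```typescript
// One transcript entry, matching how we store messages with voice metadata.
type TranscriptEntry = { role: "user" | "assistant"; text: string; inputMode: "voice" };

// Pure handler: appends an entry for the two transcript events we care about
// and ignores everything else on the data channel.
function applyTranscriptEvent(
  transcript: TranscriptEntry[],
  event: { type: string; transcript?: string }
): TranscriptEntry[] {
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    return [...transcript, { role: "user", text: event.transcript ?? "", inputMode: "voice" }];
  }
  if (event.type === "response.audio_transcript.done") {
    return [...transcript, { role: "assistant", text: event.transcript ?? "", inputMode: "voice" }];
  }
  return transcript;
}
```

In the component, the resulting array lives in a ref; when the user ends the session, its contents are POSTed to the voice-transcript route in one request.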
What broke
Double sessions
In development, React 18's Strict Mode mounts components twice to catch side effects. Our auto-start useEffect fired twice, creating two WebRTC connections — two Emissary voices talking simultaneously, with different system prompts (because two ephemeral keys were minted in sequence, and the context may differ slightly between requests).
The fix was a startingRef — a ref (not state) that guards against concurrent starts. Refs are immune to the stale closure problem that makes state-based guards unreliable in this situation. The second call checks startingRef.current synchronously before doing anything async.
We also added proper cleanup to the effect: return () => { dismiss(); }. When Strict Mode unmounts and remounts the component, dismiss() tears down any in-flight peer connection and resets the ref cleanly.
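The guard logic, extracted from the component so it stands alone (a plain object stands in for React's useRef, and the names are ours, not the library's):

```typescript
// Sketch of the concurrent-start guard. The synchronous check-and-set means the
// second Strict Mode invocation bails out before any await can interleave.
type Ref<T> = { current: T };

async function startGuarded(startingRef: Ref<boolean>, start: () => Promise<void>): Promise<boolean> {
  if (startingRef.current) return false; // a start is already in flight
  startingRef.current = true; // set synchronously, before the first await
  try {
    await start();
    return true;
  } finally {
    startingRef.current = false; // allow a fresh start after teardown
  }
}
```

A state-based guard fails here because the second invocation closes over the old `false` value; the ref is read at call time, so it cannot go stale.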
AudioContext suspension
WebKit starts AudioContexts in the suspended state and only lets them resume from within a user gesture. Our visualizer (which shows the user's mic level as an animated ring) created its AudioContext correctly, but the audio element for the Emissary's voice was routed through a separate context that stayed suspended.
The symptom: the playhead moved, duration was correct, but no sound. The fix: await ctx.resume() before el.play() on the first user interaction.
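The fix as a small helper, called from the first user interaction. Structural types stand in for the DOM interfaces so the sketch is self-contained; in the app these are the real AudioContext and audio element:

```typescript
// Minimal shapes of the two browser objects involved.
type AudioCtxLike = { state: string; resume(): Promise<void> };
type AudioElLike = { play(): Promise<void> };

// WebKit creates contexts suspended; resume must complete before play(),
// or the element "plays" (playhead moves, duration correct) but runs silent.
async function unlockAndPlay(ctx: AudioCtxLike, el: AudioElLike): Promise<void> {
  if (ctx.state === "suspended") {
    await ctx.resume();
  }
  await el.play();
}
```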
The Emissary sounded like a pebble
The system prompt included the stone's name and traits as raw context, and the model was interpreting this as an instruction to become the stone. Responses were laced with pebble metaphors ("as quiet as a pebble at the bottom of a still lake...").
The fix was explicit: "You are the Planetary Emissary. You are not the user's stone. You are not a pebble." The stone's traits now inform the Emissary's conversational register — not its identity. The prompt explicitly prohibits roleplaying as the stone.
The UI
The voice overlay is a full-screen dark panel. An animated ring pulses with the mic's audio level. Transcript entries appear as a live chat as the conversation unfolds.
Three controls:
- A Cancel button during connection (before audio is flowing — you should always be able to abort a slow connection)
- A mute toggle when active (mic track enabled/disabled without stopping the stream)
- An End button to stop the session and save the transcript
This is more than the minimal implementation. But voice interactions feel different to users — more exposed, more consequential — and the controls need to match that.
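The mute toggle is the smallest of the three controls: it flips `enabled` on the mic's audio tracks rather than stopping the stream, so unmuting needs no renegotiation. A sketch (in the component, the tracks would come from `stream.getAudioTracks()`; the function name is ours):

```typescript
// Structural type standing in for MediaStreamTrack.
type TrackLike = { enabled: boolean };

// A disabled audio track transmits silence but stays live, so toggling back
// on is instant and the WebRTC connection is untouched.
function setMicMuted(tracks: TrackLike[], muted: boolean): void {
  for (const track of tracks) {
    track.enabled = !muted;
  }
}
```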
What's next
We want to add location context to voice sessions: if you are standing somewhere new, the Emissary should know that when it responds. Currently location is optional and not surfaced in voice mode.
We also want to explore voice-initiated journal entries — the Emissary could ask "should I save that as a post?" after a meaningful exchange and create a draft if you agree. The tool-use infrastructure for this already exists in the text-based interface.
The Emissary is listening. That feels like the right direction.