Half-Cascade: The Sweet Spot for Voice AI Architecture
There's a dirty secret in voice AI: "speech-to-speech" models aren't actually speech-to-speech.
When OpenAI announced the Realtime API with "native audio," we assumed it was end-to-end audio processing. Voice in, voice out, no text in the middle.
We were wrong. And understanding why we were wrong led us to the architecture that powers Ferni today—an architecture built not just for speed, but for something more important: giving each persona their own authentic voice.
Because when Maya coaches you through a tough moment, she shouldn't sound like a generic AI. She should sound like Maya.
The Three Architectures
Let's define terms clearly:
Full Cascade (Traditional Pipeline)
User speaks → STT → Text → LLM → Text → TTS → User hears
Every step is sequential. User audio becomes text (STT), text goes to LLM, LLM output becomes audio (TTS). Each handoff adds latency.
| Component | Typical Latency |
|---|---|
| STT | 100-500ms |
| LLM | 350ms-1s+ |
| TTS | 75-200ms |
| Total | ~525-1700ms |
This is how most voice assistants worked until 2024. It's simple, debuggable, and... slow.
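To make the sequential handoffs concrete, here's a minimal sketch of one full-cascade turn. The `stt`, `llm`, and `tts` clients are hypothetical stand-ins for whatever providers you'd plug in; the point is that each `await` blocks the next stage, so the latencies in the table above simply add up.
// Minimal sketch of a single full-cascade turn. The client interfaces are hypothetical.
interface SttClient { transcribe(audio: ArrayBuffer): Promise<string>; }
interface LlmClient { complete(prompt: string): Promise<string>; }
interface TtsClient { synthesize(text: string): Promise<ArrayBuffer>; }
async function handleTurn(
  userAudio: ArrayBuffer,
  stt: SttClient,
  llm: LlmClient,
  tts: TtsClient,
): Promise<ArrayBuffer> {
  const transcript = await stt.transcribe(userAudio); // 100-500ms, prosody discarded here
  const reply = await llm.complete(transcript);       // 350ms-1s+
  return tts.synthesize(reply);                       // 75-200ms before any audio exists
}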
Native S2S (End-to-End)
User speaks → Unified Audio Model → User hears
True speech-to-speech. The model processes audio natively from input to output. No text representation in the middle.
Examples: Kyutai's Moshi, early research models.
| Metric | Value |
|---|---|
| Latency | 200-250ms |
| Debuggability | Very low (no text layer) |
| Voice quality | Limited (model generates voice) |
| Cost | Very high (massive models) |
This is the "holy grail" but has serious practical limitations we'll discuss.
Half-Cascade (Hybrid)
User speaks → Native Audio Understanding → Text Reasoning → TTS → User hears
The model understands audio natively but reasons in text. This is what OpenAI Realtime and Gemini Live actually do.
From Softcery's analysis:
"Contrary to popular belief, models like OpenAI's Realtime API aren't true end-to-end speech models. They operate as what the industry calls 'Half-Cascades': Audio understanding happens natively, but the model still performs text-based reasoning before synthesizing speech output."
| Metric | Value |
|---|---|
| Latency | 250-350ms |
| Debuggability | Medium (text reasoning is inspectable) |
| Voice quality | High (can use external TTS) |
| Cost | Medium |
This is what we use. Here's why.
Why Not Full Cascade?
The obvious question: why not just use STT → LLM → TTS? It's simple and well-understood.
Latency Kills Conversation
Human conversation has a rhythm. Research shows we expect responses within 300-500ms. Beyond 500ms feels unnatural. Beyond 1 second feels broken.
Full cascade averages 800-1200ms in production. That's not conversation; it's taking turns.
Full Cascade Timeline:
0ms ─┬─ User finishes speaking
100ms ─┤ STT processing...
400ms ─┤ STT complete, send to LLM
500ms ─┤ LLM processing...
900ms ─┤ LLM response complete
1000ms ─┤ TTS processing...
1200ms ─┴─ User finally hears response
Prosody Loss
When you convert speech to text, you lose:
- Tone: Is "great" enthusiastic or sarcastic?
- Pacing: Fast and anxious or slow and contemplative?
- Emphasis: Which word was stressed?
- Emotion: Happy, sad, frustrated?
STT gives you the words. It doesn't give you the meaning behind the words.
Streaming Complexity
To reduce latency, you need streaming at every stage:
- Streaming STT (partial transcripts)
- Streaming LLM (token-by-token)
- Streaming TTS (chunked audio)
Each streaming interface adds complexity and potential failure points.
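Here's a rough sketch of what that wiring looks like, using hypothetical streaming interfaces (the method names are illustrative, not any particular SDK). Every boundary is a place where buffering, ordering, or cancellation can go wrong.
// Illustrative streaming pipeline: partial transcripts feed the LLM, tokens feed the TTS.
interface StreamingStt { onTranscript(cb: (text: string, isFinal: boolean) => void): void; }
interface StreamingLlm { streamTokens(prompt: string, onToken: (token: string) => void): Promise<void>; }
interface StreamingTts { pushText(chunk: string): void; flush(): void; }
function wirePipeline(stt: StreamingStt, llm: StreamingLlm, tts: StreamingTts): void {
  stt.onTranscript(async (text, isFinal) => {
    if (!isFinal) return;                      // wait for a stable transcript
    await llm.streamTokens(text, (token) => {  // then stream tokens as they arrive
      tts.pushText(token);                     // and feed TTS incrementally
    });
    tts.flush();                               // emit any buffered audio at turn end
  });
}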
Why Not Native S2S?
Pure speech-to-speech sounds ideal. Why didn't we go that route?
Voice Customization
Native S2S models generate their own voice. You get whatever voice the model was trained on.
We have 6 distinct personas, each with their own voice:
- Ferni (warm, supportive)
- Peter (analytical, measured)
- Maya (energetic, coaching)
- Alex (professional, clear)
- Jordan (creative, expressive)
- Nayan (wise, contemplative)
Native S2S can't do this. The model would sound the same regardless of which persona is speaking.
Cost Explosion
From Softcery's research:
"A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context."
Native S2S models re-tokenize all previous audio on every turn. The context window fills with audio tokens (audio takes far more tokens than the equivalent text), and you pay for all of them repeatedly.
| Conversation Length | Full Cascade | Native S2S |
|---|---|---|
| 5 minutes | ~$0.75 | ~$1.50 |
| 30 minutes | ~$4.50 | ~$45.00+ |
That 10x cost multiplier makes native S2S impractical for extended conversations.
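A back-of-the-envelope calculation shows where that multiplier comes from. The token rate below is a placeholder, not real provider pricing, and the model ignores any provider-side context caching; the shape of the growth is the point.
// Why native S2S cost grows superlinearly: each turn re-sends all prior audio as context.
function estimateS2sInputTokens(turns: number, audioTokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += turn * audioTokensPerTurn; // turn N pays for N turns' worth of audio
  }
  return total; // grows roughly quadratically with turn count
}
// Example with a hypothetical 1,000 audio tokens per turn:
// 30 turns -> 465,000 input tokens billed, vs 30,000 if prior audio weren't re-sent.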
Debugging Opacity
When something goes wrong with native S2S, you have no visibility:
User: "Play some jazz"
Model: [audio output that sounds wrong]
What went wrong? Did it:
- Mishear "jazz" as "jacks"?
- Understand correctly but generate wrong response?
- Generate correct response but synthesize wrong audio?
You can't tell because there's no text layer to inspect.
With half-cascade, we have the text reasoning to examine:
User: "Play some jazz" → [audio in]
Text reasoning: "User wants to play jazz music"
Response: "Here's some smooth jazz for you" ← We can see this!
Audio out: [persona voice]
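In practice this means every turn can carry a small trace object; a sketch (with hypothetical field names) is below. If the text is right but the audio sounds wrong, the bug is in TTS; if the text is wrong, the bug is upstream.
// Sketch of a per-turn trace; field names are illustrative, not from a specific SDK.
interface TurnTrace {
  userAudioMs: number;  // how much audio we received
  modelText?: string;   // the inspectable text-reasoning layer
  ttsVoiceId?: string;  // which persona voice rendered the reply
  error?: string;
}
function logTurn(trace: TurnTrace): void {
  console.log(JSON.stringify(trace)); // one line per turn makes bad turns easy to localize
}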
Production Readiness
As of January 2026:
| Architecture | Production Status |
|---|---|
| Full Cascade | Mature, widespread |
| Half-Cascade | Generally available (OpenAI, Gemini) |
| Native S2S | Experimental (Moshi, research) |
Native S2S isn't ready for production voice AI at scale.
Half-Cascade: The Sweet Spot
Half-cascade gives us:
- Native audio understanding - No STT latency, preserves prosody
- Text reasoning - Debuggable, controllable, efficient
- External TTS - Custom persona voices, high quality
- Reasonable cost - Text tokens, not audio tokens, for reasoning
Our Implementation
// gemini-live.ts - Our half-cascade configuration
/**
* Default Gemini model for TEXT modality with external TTS (half-cascade architecture).
*
* IMPORTANT: Native-audio models (gemini-*-native-audio-*) do NOT support TEXT modality!
* They fail with "Cannot extract voices from a non-audio request" error.
*
* For TEXT mode with Cartesia TTS, use standard models:
* - gemini-2.0-flash-exp (recommended - stable, fast)
*/
const DEFAULT_GEMINI_MODEL = 'gemini-2.0-flash-exp';
// Session configuration
const session = await gemini.live.connect({
model: DEFAULT_GEMINI_MODEL,
config: {
responseModalities: ['TEXT'], // Half-cascade: text output
// Audio INPUT is still native - Gemini processes raw audio
}
});
The key insight: Gemini receives raw audio (native audio understanding) but outputs text (which we send to Cartesia TTS).
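The handoff itself is small. Here's a simplified sketch; `onTextChunk` and `speakWithPersonaVoice` are placeholder names for however your Gemini session surfaces streamed text and however you call your TTS, not real SDK methods.
// Simplified bridge: Gemini's streamed text becomes Cartesia's input, in the persona's voice.
function bridgeToTts(
  onTextChunk: (cb: (chunk: string, turnComplete: boolean) => void) => void,
  speakWithPersonaVoice: (text: string, personaId: string) => Promise<void>,
  personaId: string,
): void {
  let buffer = '';
  onTextChunk(async (chunk, turnComplete) => {
    buffer += chunk;                                // accumulate the model's text output
    if (!turnComplete) return;
    await speakWithPersonaVoice(buffer, personaId); // external TTS renders the persona voice
    buffer = '';
  });
}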
The Audio Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ USER SPEAKS │
│ (raw audio stream) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ GEMINI LIVE API │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Native Audio │ → │ Text-Based │ → │ Text Output │ │
│ │ Understanding │ │ Reasoning │ │ (not audio) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Prosody preserved ✓ Debuggable ✓ Efficient ✓ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CARTESIA TTS │
│ │
│ Text → Persona Voice → Audio │
│ │
│ - Ferni voice (warm, supportive) │
│ - Peter voice (analytical, measured) │
│ - Maya voice (energetic, coaching) │
│ - ... 6 distinct persona voices │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ USER HEARS │
│ (persona-specific voice) │
└─────────────────────────────────────────────────────────────────────────┘
Latency Breakdown
| Stage | Half-Cascade | Full Cascade | Savings |
|---|---|---|---|
| Audio → Understanding | ~50ms (native) | ~300ms (STT) | 250ms |
| Understanding → Response | ~200ms | ~200ms | 0ms |
| Response → Audio | ~100ms (Cartesia) | ~100ms | 0ms |
| Total | ~350ms | ~600ms | ~250ms |
That 250ms savings comes from skipping the STT step. The audio goes directly into Gemini's native understanding.
Why Cartesia for TTS?
We chose Cartesia over Gemini's built-in TTS for several reasons:
1. Voice Cloning
Cartesia allows us to create custom voices for each persona:
const PERSONA_VOICES: Record<string, string> = {
ferni: 'voice_abc123', // Warm, supportive
peter: 'voice_def456', // Analytical, measured
maya: 'voice_ghi789', // Energetic, coaching
// ...
};
async function speak(text: string, personaId: string) {
return cartesia.tts.generate({
text,
voice: PERSONA_VOICES[personaId],
model: 'sonic-2',
});
}
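Switching personas is then just a matter of which voice ID gets passed through (the IDs above are placeholders):
// Same words, different presence: the persona determines the voice.
const mayaAudio = await speak("You've got this. One rep at a time.", 'maya');
const peterAudio = await speak("You've got this. One rep at a time.", 'peter');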
2. Emotion Control
Cartesia supports natural language emotion control:
// Add emotion through SSML-like tags
const textWithEmotion = `<emotion name="excitement">${text}</emotion>`;
This lets us match the TTS emotion to the conversation context.
3. Latency
Cartesia's streaming TTS is consistently under 100ms time-to-first-byte:
| TTS Provider | TTFB | Quality (MOS) |
|---|---|---|
| Cartesia Sonic-2 | ~80ms | 4.7 |
| ElevenLabs Turbo | ~138ms | 4.84 |
| Gemini Built-in | ~150ms | 4.2 |
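These numbers are easy to sanity-check against your own stack. Here's a small measurement sketch; `streamTts` is a placeholder for whichever provider's streaming call you're benchmarking.
// Measure time-to-first-byte for any streaming TTS call that yields audio chunks.
async function measureTtfb(
  streamTts: (text: string) => AsyncIterable<Uint8Array>,
  text: string,
): Promise<number> {
  const start = performance.now();
  for await (const _chunk of streamTts(text)) {
    return performance.now() - start; // stop at the first audio chunk
  }
  throw new Error('TTS stream produced no audio');
}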
4. Cost Efficiency
Text-to-speech is priced per character of output, regardless of conversation length. There's no context accumulation cost.
The Tradeoffs
Half-cascade isn't perfect. Here's what we give up:
1. Interruption Handling
Native S2S handles interruptions naturally: the model processes overlapping audio streams. With half-cascade, we need explicit interruption detection:
// Detect when user starts speaking during agent response
session.on('user_speech_start', () => {
// Stop current TTS
cartesia.stop();
// Cancel pending LLM generation
gemini.cancelResponse();
});
2. Prosody in Output
Native S2S preserves input prosody through to output. Our text layer loses some of that:
- User speaks sarcastically
- Gemini understands the sarcasm (native audio)
- Gemini outputs text response
- Cartesia doesn't know it should sound sarcastic
We work around this with emotion detection:
// Detect emotion from user input
const emotion = await detectEmotion(userAudio);
// Pass to TTS
const response = await cartesia.tts.generate({
text: llmResponse,
emotion: mapEmotionToVoice(emotion),
});
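The mapping itself can be a simple lookup with a safe fallback. The labels below are illustrative; the real ones depend on what your emotion detector emits and what your TTS provider accepts.
// Illustrative emotion mapping; keys and values are hypothetical.
const EMOTION_TO_VOICE: Record<string, string> = {
  frustrated: 'calm',      // de-escalate rather than mirror
  sad: 'warm',
  excited: 'excitement',
  neutral: 'neutral',
};
function mapEmotionToVoice(detected: string): string {
  return EMOTION_TO_VOICE[detected] ?? 'neutral'; // unknown labels fall back to neutral
}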
3. Some Latency Overhead
Pure native S2S could theoretically be faster since there's no TTS step. But in practice, the difference is minimal (~50ms) and the benefits of custom TTS outweigh it.
When to Use Which Architecture
| Use Case | Recommended Architecture |
|---|---|
| Quick commands (lights, timers) | Full Cascade (simplest) |
| Extended conversations | Half-Cascade |
| Single-voice assistant | Native S2S (when mature) |
| Multi-persona platform | Half-Cascade (required) |
| Cost-sensitive applications | Full Cascade or Half-Cascade |
| Maximum quality/latency | Half-Cascade with premium TTS |
For Ferni—an extended-conversation, multi-persona platform—half-cascade is the clear winner.
But the real reason we chose half-cascade isn't in any of these tables. It's this: our personas are real characters. Peter sounds different than Ferni sounds different than Maya. Each has their own voice, their own way of speaking, their own presence.
Native S2S would force them all to sound the same. Half-cascade lets them be who they are.
That's not a technical choice. That's a values choice. And we'd make it again.
2026 Outlook
The industry is converging on half-cascade as the production standard:
- OpenAI Realtime API: Half-cascade (text reasoning, optional built-in TTS)
- Gemini Live API: Half-cascade (native audio in, text out option)
- Anthropic: expected to follow a similar pattern
Native S2S will continue improving but likely remain specialized for:
- Simple, single-voice assistants
- Research applications
- Use cases where voice customization doesn't matter
For sophisticated voice AI with personas, emotions, and extended conversations, half-cascade is the architecture of 2026.
But more than that: for AI that feels like someone who actually shows up for you—with their own voice, their own presence, their own way of caring—half-cascade is what makes it possible.
Learn more about our architecture in The Movie Production Paradigm or how we handle tool calling without LLMs.