How We Trained 11 Models to Replace LLM Tool Selection
The problem was simple: when someone asks Ferni to play music, they shouldn't wait 800ms wondering if they were heard.
That pause breaks the magic of conversation. It breaks presence. And presence is everything we're building toward.
After months of battling OpenAI's Realtime API function calling failures and Gemini's JSON workaround quirks, we discovered the root cause: LLMs are terrible at deciding which tool to call. Not "sometimes unreliable." Not "occasionally slow." Terrible.
We asked ourselves a heretical question:
What if the LLM didn't decide which tool to call at all?
The Problem: 500ms of Uncertainty
Here's what tool calling looked like before FTIS V2:
User: "Play some jazz"
↓
LLM receives user transcript + 60 tool schemas (~4000 tokens)
↓
LLM "thinks" about which tool to call (~300-500ms)
↓
LLM outputs: { "fn": "playMusic", "args": { "query": "jazz" } }
↓
We parse JSON, execute tool
↓
Total latency: 500-800ms just for tool selection
But it gets worse. The LLM doesn't always call a tool. Sometimes it says "I'd be happy to play some jazz for you!" and just... talks about playing music instead of actually doing it.
We called this tool call leakage - the LLM speaking about tools instead of calling them.
The Leakage Problem
// What we wanted:
{ "fn": "playMusic", "args": { "query": "jazz" } }
// What we sometimes got:
"I'd love to help you with that! Let me play some jazz music for you.
Jazz is a wonderful genre that originated in..."
Our sanitizer would catch JSON in the TTS stream, but it couldn't force the LLM to output JSON in the first place. We tried:
- Prompt engineering ("ALWAYS output JSON for tool calls")
- Few-shot examples
- System prompt reinforcement
- OpenAI's `tool_choice: "required"`
- Gemini's `functionCallingMode: "ANY"`
None of it was reliable enough for production voice AI, where every extra fraction of a second of latency erodes the conversational feel.
The Insight: Classification, Not Generation
Then we realized something: tool selection is a classification problem, not a generation problem.
When a user says "play some jazz," they're expressing an intent. That intent maps to a category. That category maps to tools. This is a well-understood NLP problem with mature, fast solutions.
| Approach | Latency | Accuracy | Cost |
|---|---|---|---|
| LLM function calling | 300-500ms | ~80% | $0.01-0.03/call |
| Finetuned classifier | ~20ms | 93%+ | ~$0.0001/call |
The math was obvious. We just had to build it.
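In practice, that mapping is just a lookup table. Here's a minimal TypeScript sketch of the idea; the category and tool names other than playMusic are illustrative placeholders, not our full registry:

// Sketch: intents resolve to tools via a static registry, not LLM generation.
// Category and tool names (other than playMusic) are illustrative placeholders.
type SuperCategory = "media" | "calendar" | "home"; // ...and the other seven

interface FineCategory {
  id: string;     // e.g. "play_music"
  toolId: string; // the tool executed when this intent is detected
}

const registry: Record<SuperCategory, FineCategory[]> = {
  media: [
    { id: "play_music", toolId: "playMusic" },
    { id: "music_control", toolId: "controlMusic" }, // illustrative
  ],
  calendar: [{ id: "create_event", toolId: "createEvent" }], // illustrative
  home: [{ id: "set_lights", toolId: "setLights" }],         // illustrative
};

// Once the classifiers pick a fine category, tool selection is a dictionary lookup.
function resolveTool(superCat: SuperCategory, fineId: string): string | undefined {
  return registry[superCat].find((f) => f.id === fineId)?.toolId;
}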
FTIS V2: The Architecture
FTIS stands for Ferni Tool Intent Selection. Version 2 uses a two-stage hierarchical classification pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ USER SPEAKS │
│ "Play some jazz music" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: Super-Category Classifier │
│ │
│ Input: "play some jazz music" │
│ Output: media (95% confidence) │
│ Latency: ~8ms │
│ │
│ Categories: calendar, communication, emotional, finance, │
│ health, home, media, productivity, system, travel │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: Fine-Category Classifier (media-specific) │
│ │
│ Input: "play some jazz music" │
│ Output: play_music (92% confidence) │
│ Latency: ~12ms │
│ │
│ Media categories: play_music, music_control, find_music, │
│ podcast_play, audiobook_play │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Combined confidence: 0.95 × 0.92 = 0.87 │
│ Threshold: 0.85 ✓ │
│ Action: DIRECT EXECUTION │
└─────────────────────────────────────────────────────────────────┘
Why Two Stages?
A single 60+ class classifier would be:
- Harder to train - need balanced data across all classes
- Less accurate - confusion between similar intents across domains
- Harder to update - adding a new tool requires retraining everything
With hierarchical classification:
- Stage 1 is a simple 10-class classifier (97% accuracy)
- Stage 2 has 10 specialized models, each handling 5-15 classes (92-99% accuracy)
- Adding a new tool only requires retraining one Stage 2 model
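In code, the two-stage routing is small. A minimal sketch, assuming each model is wrapped in a classify(text) helper that returns a label and a probability (the interfaces are illustrative, not our exact internal API; the production version also resolves the winning fine category to a tool ID):

// Sketch of the two-stage routing; interfaces are illustrative, not the exact internal API.
interface Classifier {
  classify(text: string): Promise<{ label: string; confidence: number }>;
}

async function classifyIntent(
  text: string,
  superClassifier: Classifier,
  fineClassifiers: Record<string, Classifier>, // one specialized model per super-category
) {
  // Stage 1: 10-class super-category classifier (~8ms)
  const stage1 = await superClassifier.classify(text);

  // Stage 2: route to the specialized model for that super-category (~12ms)
  const stage2 = await fineClassifiers[stage1.label].classify(text);

  return {
    superCategory: stage1.label,
    fineCategory: stage2.label,
    // Combined confidence is the product of both stages, e.g. 0.95 × 0.92 ≈ 0.87
    confidence: stage1.confidence * stage2.confidence,
  };
}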
The 11 ONNX Models
| Model | Classes | Accuracy | Size |
|---|---|---|---|
| super-category | 10 | 97% | 44MB |
| calendar-fine | 8 | 98% | 44MB |
| communication-fine | 6 | 98% | 44MB |
| emotional-fine | 12 | 94% | 44MB |
| finance-fine | 5 | 96% | 44MB |
| health-fine | 9 | 92% | 44MB |
| home-fine | 7 | 98% | 44MB |
| media-fine | 5 | 95% | 44MB |
| productivity-fine | 11 | 94% | 44MB |
| system-fine | 6 | 99% | 44MB |
| travel-fine | 5 | 99% | 44MB |
Total: ~480MB of models loaded at startup, providing sub-20ms inference.
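Serving them is plain ONNX Runtime. A minimal sketch with onnxruntime-node, assuming the model path and tensor names shown here (the real pipeline also runs a tokenizer to produce input IDs, and transformer exports typically need an attention_mask input as well):

import * as ort from "onnxruntime-node";

// Load every model once at startup (~480MB total) and keep the sessions warm.
// The path and tensor names ("input_ids", "logits") are assumptions; they depend
// on how the models were exported.
const superSession = await ort.InferenceSession.create("./models/super-category.onnx");

async function runClassifier(session: ort.InferenceSession, inputIds: bigint[]) {
  const feeds: Record<string, ort.Tensor> = {
    input_ids: new ort.Tensor("int64", BigInt64Array.from(inputIds), [1, inputIds.length]),
  };
  const outputs = await session.run(feeds);
  const logits = outputs.logits.data as Float32Array;

  // Softmax over the logits to get per-class probabilities
  const max = Math.max(...logits);
  const exps = Array.from(logits, (l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}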
Training Data: The Hard Part
Here's what nobody tells you about finetuning classifiers: the model is only as good as your training data.
We generated training data three ways:
1. Real User Transcripts (Gold Standard)
{
"text": "can you play that song again",
"super_category": "media",
"fine_category": "music_control",
"source": "production_logs"
}
We had 6 months of production logs with tool execution data. If a user said X and tool Y was called, that's a training example.
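Extracting those examples was essentially a join between transcripts and successful tool executions. A hedged sketch; the log record shape and the tool-to-category map are assumptions, not our actual schema:

// Sketch: derive labeled examples from production logs.
// TurnLog and toolToCategory are illustrative assumptions about the log schema.
interface TurnLog {
  transcript: string;
  executedTool?: string; // present only when a tool ran successfully, e.g. "playMusic"
}

const toolToCategory: Record<string, { super_category: string; fine_category: string }> = {
  playMusic: { super_category: "media", fine_category: "play_music" },
  // ...one entry per tool
};

function toTrainingExamples(logs: TurnLog[]) {
  return logs
    .filter((t) => t.executedTool && toolToCategory[t.executedTool])
    .map((t) => ({
      text: t.transcript.toLowerCase(),
      ...toolToCategory[t.executedTool!],
      source: "production_logs",
    }));
}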
2. Synthetic Generation (Scale)
# Generate variations for each intent
prompt = f"""
Generate 50 natural variations of how a user might ask to {intent}.
Include:
- Casual speech ("play some jazz")
- Polite requests ("could you please play jazz")
- Contextual ("I'm in the mood for jazz")
- Incomplete ("jazz... something chill")
"""
This gave us ~5,000 examples per fine category.
3. Hard Negatives (Robustness)
The hardest cases are near-misses - utterances that sound like one intent but are actually another:
| Utterance | Looks Like | Actually Is |
|---|---|---|
| "What time is the concert?" | calendar | information |
| "Can you call my mom?" | communication | phone |
| "I need to save this" | productivity | memory |
| "Turn up the music" | media | music_control |
We specifically mined for these confusing cases and ensured balanced representation in training.
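Most of those cases surfaced from the model's own confident mistakes. A minimal mining sketch; the labeled evaluation set and the classify() helper are illustrative assumptions:

// Sketch: mine hard negatives by collecting confidently wrong predictions.
// The labeled eval set and classify() helper are illustrative assumptions.
interface LabeledUtterance {
  text: string;
  super_category: string;
}

async function mineHardNegatives(
  evalSet: LabeledUtterance[],
  classify: (text: string) => Promise<{ label: string; confidence: number }>,
) {
  const hardNegatives: Array<LabeledUtterance & { predicted: string }> = [];
  for (const example of evalSet) {
    const prediction = await classify(example.text);
    // High-confidence misclassifications are the most valuable additions to training data
    if (prediction.label !== example.super_category && prediction.confidence > 0.7) {
      hardNegatives.push({ ...example, predicted: prediction.label });
    }
  }
  return hardNegatives;
}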
The Integration: LLM as "Responder"
The key architectural shift: the LLM no longer decides tools - it responds to tool results.
// OLD: LLM decides
const response = await llm.generateWithTools(transcript, tools);
if (response.toolCall) {
await executeTool(response.toolCall);
}
// NEW: FTIS decides, LLM responds
const classification = await ftisClassify(transcript);
if (classification.confidence >= 0.85) {
// Direct execution - LLM never sees tool schemas
const result = await executeTool(classification.toolId, extractArgs(transcript));
// Inject result into LLM context
await llm.generate(`
[TOOL_RESULT: ${classification.toolId}]
Status: ${result.success ? 'SUCCESS' : 'FAILED'}
Result: ${result.naturalResponse}
Respond naturally to this result. Do NOT call any tools.
`);
}
The LLM's job is now much simpler: take a tool result and respond naturally. No decision-making, no JSON generation, no uncertainty.
The System Prompt Difference
Old prompt (with function calling):
You have access to the following tools:
- playMusic: Play music by query or URI
- setTimer: Set a timer for X minutes
- getWeather: Get weather for a location
...60 more tools...
When the user wants to use a tool, output JSON: { "fn": "toolName", "args": {...} }
New prompt (FTIS V2):
You are a conversational AI. Tools are handled externally.
When you see [TOOL_RESULT], respond naturally to what happened.
Do NOT output JSON. Do NOT discuss tools. Just be conversational.
Example:
[TOOL_RESULT: playMusic]
Result: Now playing Jazz Vibes by Miles Davis
Your response: "Here we go, some smooth jazz coming right up!"
The LLM is liberated from tool selection entirely.
Results: Before and After
Latency
| Metric | Before (LLM FC) | After (FTIS V2) |
|---|---|---|
| Tool selection | 300-500ms | ~20ms |
| Total response time | 800-1200ms | 400-600ms |
| P99 latency | 2.5s | 900ms |
Reliability
| Metric | Before | After |
|---|---|---|
| Tool call success rate | ~80% | 99.2% |
| Leakage rate | ~15% | <0.5% |
| Wrong tool selection | ~5% | <1% |
Cost
| Metric | Before | After |
|---|---|---|
| Tokens for tool schemas | ~4000/turn | 0 |
| LLM cost for tool selection | ~$0.02/turn | ~$0.0001/turn |
| Monthly savings (1M turns) | - | ~$19,000 |
Confidence Thresholds: The Safety Net
Not every utterance maps cleanly to a tool. We use confidence thresholds:
| Confidence | Action |
|---|---|
| ≥ 0.85 | Direct execution - bypass LLM entirely |
| 0.50 - 0.85 | Tool hint - inject hint into LLM context |
| < 0.50 | Conversation - pure LLM response |
The "tool hint" middle ground is crucial:
if (confidence >= 0.50 && confidence < 0.85) {
// Not confident enough for direct execution
// But confident enough to hint
await llm.generate(`
[TOOL_HINT: User may want ${classification.toolId}]
Confidence: ${confidence}
If appropriate, you may use this tool. Otherwise, continue conversation.
`);
}
This catches edge cases where FTIS isn't sure but wants to nudge the LLM in the right direction.
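Put together, the three bands collapse into one routing function. A sketch using the same calls as the earlier snippets (ftisClassify, executeTool, extractArgs, llm.generate), with the thresholds from the table above:

// Sketch: full three-band dispatch. Thresholds match the table above.
const DIRECT_THRESHOLD = 0.85;
const HINT_THRESHOLD = 0.50;

async function routeTurn(transcript: string) {
  const classification = await ftisClassify(transcript);

  if (classification.confidence >= DIRECT_THRESHOLD) {
    // Band 1: direct execution - the LLM only narrates the result
    const result = await executeTool(classification.toolId, extractArgs(transcript));
    return llm.generate(`
      [TOOL_RESULT: ${classification.toolId}]
      Result: ${result.naturalResponse}
      Respond naturally to this result. Do NOT call any tools.
    `);
  }

  if (classification.confidence >= HINT_THRESHOLD) {
    // Band 2: tool hint - let the LLM decide whether the tool fits
    return llm.generate(`
      [TOOL_HINT: User may want ${classification.toolId}]
      Confidence: ${classification.confidence}
      If appropriate, you may use this tool. Otherwise, continue conversation.
    `);
  }

  // Band 3: pure conversation - no tool involvement at all
  return llm.generate(transcript);
}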
Lessons Learned
1. Classification beats generation for structured decisions
LLMs are amazing at open-ended generation. They're mediocre at structured decision-making with many options. Use the right tool for the job.
2. Hierarchical classification scales better
One giant classifier is fragile. A hierarchy of specialized classifiers is robust, updateable, and easier to debug.
3. The LLM is happier without tool schemas
Our LLM responses became more natural once we removed 4000 tokens of tool schemas from every turn. Less context pollution = better responses.
4. Hard negatives are worth the effort
The difference between 90% and 93% accuracy comes down to hard negatives. Mine for confusing cases relentlessly.
5. ONNX is production-ready
We were skeptical about deploying 11 ONNX models in production. It's been rock solid. ONNX Runtime is battle-tested.
What's Next
FTIS V2 is in production today. We're exploring:
- Streaming classification - start classifying before the user finishes speaking
- Confidence calibration - better probability estimates for threshold tuning
- Multi-intent detection - "play jazz and set a timer for 30 minutes"
- Personalization - user-specific priors based on usage patterns
The fundamental insight remains: tool selection is classification, not generation. Once you accept that, everything gets simpler.
But here's what really matters: that 500ms we saved isn't just a number. It's the difference between "Ferni heard me" and "did Ferni hear me?" It's the difference between presence and awkwardness.
Every millisecond of latency is a moment where connection wavers. FTIS V2 gives us those moments back.
Want to learn more about our voice AI architecture? Check out our posts on half-cascade architecture and the movie production paradigm.