
Performance Optimization: Achieving Sub-200ms Response Times

In voice AI, latency is everything.

Humans perceive delays over 200ms as unnatural. Above 500ms, conversations feel broken. Our goal is to make voice AI feel as natural as talking to a friend.

Here's how we achieve it - and how you can too.

The Latency Budget

A voice interaction has multiple stages:

User speaks → STT → LLM → Your code → TTS → User hears
              50ms   80ms   ???ms     60ms

Total budget: 200ms
Your budget:  ~10ms

Every millisecond in "Your code" directly impacts user experience.
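
Before optimizing anything, measure just your slice. Here's a minimal sketch, assuming an Express-style handler and a hypothetical buildContext helper standing in for "your code" (how context is merged into the Ferni request is up to your integration):

import { performance } from 'node:perf_hooks';

app.post('/voice', async (req, res) => {
  const start = performance.now();

  // "Your code": context building, lookups, business logic
  const context = await buildContext(req.body.userId);

  const prepMs = performance.now() - start;
  if (prepMs > 10) {
    console.warn(`Context prep took ${prepMs.toFixed(1)}ms (budget is ~10ms)`);
  }

  // Hand off to Ferni; the payload shape depends on your integration
  const response = await client.process({ ...req.body, context });
  res.json(response);
});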

Connection Optimization

Use Persistent Connections

Don't reconnect for each interaction:

// ❌ Bad: New connection per request
app.post('/voice', async (req, res) => {
  const client = new FerniClient({ apiKey });  // 50-100ms overhead
  const response = await client.process(req.body);
  res.json(response);
});

// ✅ Good: Reuse connection pool
const clientPool = new FerniClientPool({
  apiKey,
  poolSize: 10,
  idleTimeout: 300000,  // 5 minutes
});

app.post('/voice', async (req, res) => {
  const client = await clientPool.acquire();  // <1ms
  try {
    const response = await client.process(req.body);
    res.json(response);
  } finally {
    clientPool.release(client);
  }
});

Pre-warm Connections

Initialize before users need them:

// On server start
async function warmup() {
  const client = new FerniClient({ apiKey });

  // Pre-authenticate
  await client.authenticate();

  // Pre-load common contexts
  await client.preloadContext(['greeting', 'faq', 'help']);

  console.log('Ferni client warmed up');
}

warmup().catch(console.error);

Context Window Optimization

The LLM context window is your biggest performance lever.

Keep Context Lean

// ❌ Bad: Everything in context
const context = {
  fullConversationHistory: [...],  // 10,000 tokens
  allUserPreferences: {...},        // 2,000 tokens
  entireKnowledgeBase: {...},       // 50,000 tokens
};

// ✅ Good: Relevant context only
const context = {
  recentMessages: conversationHistory.slice(-10),  // 500 tokens
  relevantPreferences: getRelevantPrefs(intent),   // 100 tokens
  semanticSearchResults: await search(query, 3),   // 300 tokens
};
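
If you want a hard cap rather than a fixed message count, trim to a token budget. A rough sketch, assuming ~4 characters per token; trimToTokenBudget is a hypothetical helper, not part of the SDK:

function trimToTokenBudget(messages, maxTokens = 500) {
  const estimateTokens = (msg) => Math.ceil(msg.content.length / 4);

  const kept = [];
  let total = 0;

  // Walk backwards so the most recent messages survive
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i]);
    if (total + cost > maxTokens) break;
    kept.unshift(messages[i]);
    total += cost;
  }
  return kept;
}

const recentMessages = trimToTokenBudget(conversationHistory, 500);

Swap in a real tokenizer if you need exact counts; the approximation is usually close enough for budgeting.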

Use Semantic Compression

// Compress long conversations
async function compressConversation(messages) {
  if (messages.length < 20) return messages;

  const summary = await client.summarize(messages.slice(0, -10));

  return [
    { role: 'system', content: `Previous conversation summary: ${summary}` },
    ...messages.slice(-10),
  ];
}
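
Summarization is itself an LLM call, so avoid paying for it on every turn. Here's a sketch of memoizing the summary per conversation and re-summarizing only when the cutoff moves; conversationId and the in-memory cache are assumptions:

const summaryCache = new Map();  // conversationId -> { upTo, summary }

async function getSummary(conversationId, messages) {
  const cutoff = messages.length - 10;
  const cached = summaryCache.get(conversationId);
  if (cached && cached.upTo === cutoff) return cached.summary;

  const summary = await client.summarize(messages.slice(0, cutoff));
  summaryCache.set(conversationId, { upTo: cutoff, summary });
  return summary;
}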

Progressive Context Loading

Load context based on conversation stage:

const contextLoaders = {
  greeting: () => ({
    userBasics: user.name,
  }),

  deepConversation: async () => ({
    userBasics: user.name,
    recentHistory: await getRecentHistory(user.id, 10),
    preferences: await getPreferences(user.id),
  }),

  taskExecution: async (task) => ({
    userBasics: user.name,
    taskContext: await getTaskContext(task),
    relevantTools: await getToolsForTask(task),
  }),
};

// Load only what's needed; pass the task through (only taskExecution uses it)
const context = await contextLoaders[conversationStage](task);

Parallel Processing

Don't wait for things that can run concurrently:

// ❌ Bad: Sequential
const preferences = await getPreferences(userId);      // 20ms
const history = await getConversationHistory(userId);  // 30ms
const calendar = await getCalendarContext(userId);     // 25ms
// Total: 75ms

// ✅ Good: Parallel
const [preferences, history, calendar] = await Promise.all([
  getPreferences(userId),       // 20ms
  getConversationHistory(userId), // 30ms
  getCalendarContext(userId),   // 25ms
]);
// Total: 30ms (max of all)
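
If any of these lookups can be slow or flaky, consider Promise.allSettled with per-source fallbacks so one failure doesn't stall the whole turn. A sketch of that variant; the fallback values are placeholders:

const fallbacks = [{}, [], null];  // preferences, history, calendar

const [preferences, history, calendar] = (
  await Promise.allSettled([
    getPreferences(userId),
    getConversationHistory(userId),
    getCalendarContext(userId),
  ])
).map((result, i) => (result.status === 'fulfilled' ? result.value : fallbacks[i]));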

Speculative Execution

Predict what you'll need:

// While user is speaking, prepare likely responses
client.on('speech_start', async (partialTranscript) => {
  // Start loading potential contexts in background
  const predictions = predictIntent(partialTranscript);

  predictions.forEach(intent => {
    prefetchContext(intent);  // Non-blocking
  });
});

client.on('speech_end', async (fullTranscript) => {
  // Context likely already cached
  const context = await getContext(detectIntent(fullTranscript));
});
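
One way to wire this up is to have prefetchContext and getContext share an in-flight cache, so the speech_end handler awaits the same promise the prediction already started. A minimal sketch; loadContextFor and the Map are assumptions, not SDK APIs:

const contextPromises = new Map();  // intent -> Promise<context>

function prefetchContext(intent) {
  if (!contextPromises.has(intent)) {
    contextPromises.set(intent, loadContextFor(intent));  // start, don't await
  }
}

async function getContext(intent) {
  prefetchContext(intent);             // no-op if a prediction already started it
  return contextPromises.get(intent);  // awaits the promise already in flight
}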

Caching Strategies

Multi-Layer Cache

const cache = new MultiLayerCache({
  layers: [
    // L1: In-memory (fastest)
    {
      type: 'memory',
      maxSize: 1000,
      ttl: 60,  // 1 minute
    },
    // L2: Redis (shared across instances)
    {
      type: 'redis',
      url: process.env.REDIS_URL,
      ttl: 300,  // 5 minutes
    },
    // L3: Database (persistent)
    {
      type: 'postgres',
      ttl: 3600,  // 1 hour
    },
  ],
});

// Automatic cascading lookup
const userContext = await cache.get(`context:${userId}`);
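
Conceptually, the cascading lookup checks each layer in order and backfills the faster layers on a hit. A sketch of that logic, assuming each layer exposes async get/set:

async function cascadingGet(layers, key) {
  for (let i = 0; i < layers.length; i++) {
    const value = await layers[i].get(key);
    if (value != null) {
      // Promote into the faster layers that just missed
      await Promise.all(layers.slice(0, i).map((layer) => layer.set(key, value)));
      return value;
    }
  }
  return null;
}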

Cache Warming

Pre-populate cache for active users:

// Warm cache before daily peak hours
async function warmCaches() {
  const activeUsers = await getActiveUsers({ lastDay: true });

  await Promise.all(
    activeUsers.map(async (user) =>
      cache.set(`context:${user.id}`, await buildContext(user.id))
    )
  );
}

// Run at 6 AM local time
schedule('0 6 * * *', warmCaches);
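
For large user lists, warm in bounded batches so you don't hammer the database or the Ferni API all at once. A sketch; the batch size is arbitrary:

async function warmCachesInBatches(users, batchSize = 50) {
  for (let i = 0; i < users.length; i += batchSize) {
    await Promise.all(
      users.slice(i, i + batchSize).map(async (user) =>
        cache.set(`context:${user.id}`, await buildContext(user.id))
      )
    );
  }
}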

Database Optimization

Use Connection Pooling

import { Pool } from 'pg';

const pool = new Pool({
  max: 20,                    // Max connections
  idleTimeoutMillis: 30000,   // Close idle connections
  connectionTimeoutMillis: 2000,
});

// Reuse connections
async function query(sql, params) {
  const client = await pool.connect();
  try {
    return await client.query(sql, params);
  } finally {
    client.release();
  }
}

Index Your Queries

Most voice AI queries filter by user + time:

-- Essential indexes
CREATE INDEX idx_conversations_user_time
ON conversations(user_id, created_at DESC);

CREATE INDEX idx_memories_user_relevance
ON memories(user_id, relevance_score DESC);

-- Partial index for active conversations
CREATE INDEX idx_active_conversations
ON conversations(user_id, id)
WHERE status = 'active';
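
As a sanity check that the indexes match your access pattern, here's the kind of lookup they serve, using the pooled query helper from above; only columns referenced by the index definitions are selected:

const recent = await query(
  `SELECT id, created_at
     FROM conversations
    WHERE user_id = $1
    ORDER BY created_at DESC
    LIMIT 10`,
  [userId]
);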

Measuring Performance

Track the Right Metrics

import { metrics } from '@ferni/sdk';

// Track end-to-end latency
const timer = metrics.startTimer('voice_response_latency');

const response = await client.process(input);

timer.end({
  intent: response.intent,
  cached: response.fromCache,
  contextSize: response.contextTokens,
});

// Alert on p95 > 200ms
metrics.alert('voice_response_latency', {
  percentile: 95,
  threshold: 200,
  action: 'page',
});

Profile Regularly

// Enable detailed profiling in development
const client = new FerniClient({
  apiKey,
  profiling: process.env.NODE_ENV === 'development',
});

client.on('profile', (data) => {
  console.table({
    stage: data.stage,
    duration: `${data.durationMs}ms`,
    tokens: data.tokens,
  });
});

Performance Checklist

Before going live:

  • [ ] Connection pooling enabled
  • [ ] Context window under 4,000 tokens
  • [ ] Database queries under 10ms (p95)
  • [ ] Multi-layer caching implemented
  • [ ] Parallel processing for independent operations
  • [ ] Latency metrics and alerts configured
  • [ ] Cache warming for peak hours
  • [ ] Connection pre-warming on startup

Benchmarks

Our target metrics for production deployments:

Metric                      Target    Acceptable
End-to-end latency (p50)    <150ms    <200ms
End-to-end latency (p95)    <200ms    <300ms
Context load time           <10ms     <25ms
Database queries            <5ms      <10ms
Cache hit rate              >80%      >60%

Next Steps

Questions? Join us on Discord.