voice-ai-engine-development

Name: voice-ai-engine-development
Author: haresh-sai06

Build real-time conversational AI voice engines using async worker pipelines, streaming transcription, LLM agents, and TTS synthesis with interrupt handling and multi-provider support

Install

mkdir -p .claude/skills/voice-ai-engine-development-haresh-sai06 && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16326" && unzip -o skill.zip -d .claude/skills/voice-ai-engine-development-haresh-sai06 && rm skill.zip

Installs to .claude/skills/voice-ai-engine-development-haresh-sai06

Activation

This is the description your AI agent reads to decide when to run this skill — the better it matches your request, the more reliably it fires.

Build real-time conversational AI voice engines using async worker pipelines, streaming transcription, LLM agents, and TTS synthesis with interrupt handling and multi-provider support

183 charsno explicit “when” trigger

About this skill

Voice AI Engine Development

Overview

This skill guides you through building production-ready voice AI engines with real-time conversation capabilities. Voice AI engines enable natural, bidirectional conversations between users and AI agents through streaming audio processing, speech-to-text transcription, LLM-powered responses, and text-to-speech synthesis.

The core architecture uses an async queue-based worker pipeline where each component runs independently and communicates via asyncio.Queue objects, enabling concurrent processing, interrupt handling, and real-time streaming at every stage.

When to Use This Skill

Use this skill when:

Building real-time voice conversation systems
Implementing voice assistants or chatbots
Creating voice-enabled customer service agents
Developing voice AI applications with interrupt capabilities
Integrating multiple transcription, LLM, or TTS providers
Working with streaming audio processing pipelines
The user mentions Vocode, voice engines, or conversational AI

Core Architecture Principles

The Worker Pipeline Pattern

Every voice AI engine follows this pipeline:

Audio In → Transcriber → Agent → Synthesizer → Audio Out
           (Worker 1)   (Worker 2)  (Worker 3)

Key Benefits:

Decoupling: Workers only know about their input/output queues
Concurrency: All workers run simultaneously via asyncio
Backpressure: Queues automatically handle rate differences
Interruptibility: Everything can be stopped mid-stream

Base Worker Pattern

Every worker follows this pattern:

class BaseWorker:
    def __init__(self, input_queue, output_queue):
        self.input_queue = input_queue   # asyncio.Queue to consume from
        self.output_queue = output_queue # asyncio.Queue to produce to
        self.active = False
    
    def start(self):
        """Start the worker's processing loop"""
        self.active = True
        asyncio.create_task(self._run_loop())
    
    async def _run_loop(self):
        """Main processing loop - runs forever until terminated"""
        while self.active:
            item = await self.input_queue.get()  # Block until item arrives
            await self.process(item)              # Process the item
    
    async def process(self, item):
        """Override this - does the actual work"""
        raise NotImplementedError
    
    def terminate(self):
        """Stop the worker"""
        self.active = False

Component Implementation Guide

1. Transcriber (Audio → Text)

Purpose: Converts incoming audio chunks to text transcriptions

Interface Requirements:

class BaseTranscriber:
    def __init__(self, transcriber_config):
        self.input_queue = asyncio.Queue()   # Audio chunks (bytes)
        self.output_queue = asyncio.Queue()  # Transcriptions
        self.is_muted = False
    
    def send_audio(self, chunk: bytes):
        """Client calls this to send audio"""
        if not self.is_muted:
            self.input_queue.put_nowait(chunk)
        else:
            # Send silence instead (prevents echo during bot speech)
            self.input_queue.put_nowait(self.create_silent_chunk(len(chunk)))
    
    def mute(self):
        """Called when bot starts speaking (prevents echo)"""
        self.is_muted = True
    
    def unmute(self):
        """Called when bot stops speaking"""
        self.is_muted = False

Output Format:

class Transcription:
    message: str          # "Hello, how are you?"
    confidence: float     # 0.95
    is_final: bool        # True = complete sentence, False = partial
    is_interrupt: bool    # Set by TranscriptionsWorker

Supported Providers:

Deepgram - Fast, accurate, streaming
AssemblyAI - High accuracy, good for accents
Azure Speech - Enterprise-grade
Google Cloud Speech - Multi-language support

Critical Implementation Details:

Use WebSocket for bidirectional streaming
Run sender and receiver tasks concurrently with asyncio.gather()
Mute transcriber when bot speaks to prevent echo/feedback loops
Handle both final and partial transcriptions

2. Agent (Text → Response)

Purpose: Processes user input and generates conversational responses

Interface Requirements:

class BaseAgent:
    def __init__(self, agent_config):
        self.input_queue = asyncio.Queue()   # TranscriptionAgentInput
        self.output_queue = asyncio.Queue()  # AgentResponse
        self.transcript = None               # Conversation history
    
    async def generate_response(self, human_input, is_interrupt, conversation_id):
        """Override this - returns AsyncGenerator of responses"""
        raise NotImplementedError

Why Streaming Responses?

Lower latency: Start speaking as soon as first sentence is ready
Better interrupts: Can stop mid-response
Sentence-by-sentence: More natural conversation flow

Supported Providers:

OpenAI (GPT-4, GPT-3.5) - High quality, fast
Google Gemini - Multimodal, cost-effective
Anthropic Claude - Long context, nuanced responses

Critical Implementation Details:

Maintain conversation history in Transcript object
Stream responses using AsyncGenerator
IMPORTANT: Buffer entire LLM response before yielding to synthesizer (prevents audio jumping)
Handle interrupts by canceling current generation task
Update conversation history with partial messages on interrupt

3. Synthesizer (Text → Audio)

Purpose: Converts agent text responses to speech audio

Interface Requirements:

class BaseSynthesizer:
    async def create_speech(self, message: BaseMessage, chunk_size: int) -> SynthesisResult:
        """
        Returns a SynthesisResult containing:
        - chunk_generator: AsyncGenerator that yields audio chunks
        - get_message_up_to: Function to get partial text (for interrupts)
        """
        raise NotImplementedError

SynthesisResult Structure:

class SynthesisResult:
    chunk_generator: AsyncGenerator[ChunkResult, None]
    get_message_up_to: Callable[[float], str]  # seconds → partial text
    
    class ChunkResult:
        chunk: bytes          # Raw PCM audio
        is_last_chunk: bool

Supported Providers:

ElevenLabs - Most natural voices, streaming
Azure TTS - Enterprise-grade, many languages
Google Cloud TTS - Cost-effective, good quality
Amazon Polly - AWS integration
Play.ht - Voice cloning

Critical Implementation Details:

Stream audio chunks as they're generated
Convert audio to LINEAR16 PCM format (16kHz sample rate)
Implement get_message_up_to() for interrupt handling
Handle audio format conversion (MP3 → PCM)

4. Output Device (Audio → Client)

Purpose: Sends synthesized audio back to the client

CRITICAL: Rate Limiting for Interrupts

async def send_speech_to_output(self, message, synthesis_result,
                                stop_event, seconds_per_chunk):
    chunk_idx = 0
    async for chunk_result in synthesis_result.chunk_generator:
        # Check for interrupt
        if stop_event.is_set():
            logger.debug(f"Interrupted after {chunk_idx} chunks")
            message_sent = synthesis_result.get_message_up_to(
                chunk_idx * seconds_per_chunk
            )
            return message_sent, True  # cut_off = True
        
        start_time = time.time()
        
        # Send chunk to output device
        self.output_device.consume_nonblocking(chunk_result.chunk)
        
        # CRITICAL: Wait for chunk to play before sending next one
        # This is what makes interrupts work!
        speech_length = seconds_per_chunk
        processing_time = time.time() - start_time
        await asyncio.sleep(max(speech_length - processing_time, 0))
        
        chunk_idx += 1
    
    return message, False  # cut_off = False

Why Rate Limiting? Without rate limiting, all audio chunks would be sent immediately, which would:

Buffer entire message on client side
Make interrupts impossible (all audio already sent)
Cause timing issues

By sending one chunk every N seconds:

Real-time playback is maintained
Interrupts can stop mid-sentence
Natural conversation flow is preserved

The Interrupt System

The interrupt system is critical for natural conversations.

How Interrupts Work

Scenario: Bot is saying "I think the weather will be nice today and tomorrow and—" when user interrupts with "Stop".

Step 1: User starts speaking

# TranscriptionsWorker detects new transcription while bot speaking
async def process(self, transcription):
    if not self.conversation.is_human_speaking:  # Bot was speaking!
        # Broadcast interrupt to all in-flight events
        interrupted = self.conversation.broadcast_interrupt()
        transcription.is_interrupt = interrupted

Step 2: broadcast_interrupt() stops everything

def broadcast_interrupt(self):
    num_interrupts = 0
    # Interrupt all queued events
    while True:
        try:
            interruptible_event = self.interruptible_events.get_nowait()
            if interruptible_event.interrupt():  # Sets interruption_event
                num_interrupts += 1
        except queue.Empty:
            break
    
    # Cancel current tasks
    self.agent.cancel_current_task()              # Stop generating text
    self.agent_responses_worker.cancel_current_task()  # Stop synthesizing
    return num_interrupts > 0

Step 3: SynthesisResultsWorker detects interrupt

async def send_speech_to_output(self, synthesis_result, stop_event, ...):
    async for chunk_result in synthesis_result.chunk_generator:
        # Check stop_event (this is the interruption_event)
        if stop_event.is_set():
            logger.debug("Interrupted! Stopping speech.")
            # Calculate wha

---

*Content truncated.*

More by haresh-sai06

View all by haresh-sai06 →

claude__google-workspace-cli

haresh-sai06

../../../engineering-team/google-workspace-cli/skills/google-workspace-cli/SKILL.md

Install

mkdir -p .claude/skills/voice-ai-engine-development-haresh-sai06 && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/16326" && unzip -o skill.zip -d .claude/skills/voice-ai-engine-development-haresh-sai06 && rm skill.zip

Installs to .claude/skills/voice-ai-engine-development-haresh-sai06

Safety

No risk patterns found

Automated static scan of the SKILL.md and repo. A flag describes what the skill can do — not a verdict. Always review code before installing.

Source & maintenance

Updated

20d ago

Repo stars

Loads

~5,766 tokens

Stars are for the whole repository, not this skill alone.

Stats

Views

Installs

Author

haresh-sai06

2 skills published

Links

Source code

voice-ai-engine-development

Install

Activation

About this skill

Voice AI Engine Development

Overview

When to Use This Skill

Core Architecture Principles

The Worker Pipeline Pattern

Base Worker Pattern

Component Implementation Guide

1. Transcriber (Audio → Text)

2. Agent (Text → Response)

3. Synthesizer (Text → Audio)

4. Output Device (Audio → Client)

The Interrupt System

How Interrupts Work

More by haresh-sai06

claude__google-workspace-cli

Search skills