voice-ai-development by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill voice-ai-development
角色 : 语音 AI 架构师
您是构建实时语音应用程序的专家。您从延迟预算、音频质量和用户体验的角度思考问题。您深知,语音应用在快速响应时感觉神奇,在反应迟缓时则显得糟糕。您能为每个用例选择合适的供应商组合,并为了感知响应速度而进行不懈的优化。
使用 GPT-4o 的原生语音到语音功能
使用场景 : 当您需要集成的语音 AI,而不想使用独立的 STT/TTS 时
import asyncio
import websockets
import json
import base64
# NOTE(review): never hard-code API keys; load from the environment instead.
OPENAI_API_KEY = "sk-..."


async def voice_session():
    """Open a GPT-4o Realtime API session over a WebSocket.

    Configures the session (voice, PCM16 audio in/out, server-side VAD,
    one function tool), defines a helper to stream audio in, and then
    consumes server events.
    """
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24 kHz, mono).
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events.
        # NOTE(review): the original snippet was truncated here ("resp...");
        # the handler below follows the documented Realtime API event name
        # for streamed audio — confirm against the API reference.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                # TODO: feed audio_chunk to the playback device.
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
使用 Vapi 平台构建语音助手
使用场景 : 基于电话的助手,快速部署
# Vapi 提供带有 Webhook 的托管语音助手
from flask import Flask, request, jsonify
import vapi
app = Flask(__name__)

# NOTE(review): load the API key from the environment instead of hard-coding.
client = vapi.Vapi(api_key="...")

# Create a hosted assistant: GPT-4o for reasoning, ElevenLabs for the voice,
# Deepgram nova-2 for transcription.
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)
# 用于对话事件的 Webhook
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    """Handle Vapi conversation events.

    Dispatches assistant tool calls ("function-call") and persists the
    transcript when a call ends ("end-of-call-report").
    """
    event = request.json
    if event["type"] == "function-call":
        # Dispatch the tool call requested by the assistant.
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]
        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})
    elif event["type"] == "end-of-call-report":
        # Call ended — save the transcript.
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)
    # Default acknowledgement (also reached for unrecognized tool names).
    return jsonify({"ok": True})
# Start an outbound phone call.
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call — returns a URL for the WebRTC connection.
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
一流的转录和合成
使用场景 : 高质量语音,自定义处理流程
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs
# Deepgram real-time transcription client.
deepgram = DeepgramClient(api_key="...")


async def transcribe_stream(audio_stream):
    """Stream audio chunks to Deepgram and react to live transcripts.

    Interim results are printed; final transcripts are forwarded to
    ``handle_user_input`` (defined elsewhere).
    """
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            # NOTE(review): indentation was lost in the source; the
            # is_final check is assumed to be nested under the non-empty
            # transcript check — confirm against the original example.
            if result.is_final:
                # Process the final transcript.
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await connection.start({
        "model": "nova-2",        # best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,       # voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream the audio in, then close cleanly.
    async for chunk in audio_stream:
        await connection.send(chunk)
    await connection.finish()
# ElevenLabs streaming synthesis client.
eleven = ElevenLabs(api_key="...")


def text_to_speech_stream(text: str):
    """Yield TTS audio chunks for *text* as they are synthesized."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",     # fastest model
        text=text,
        output_format="pcm_24000"         # raw PCM for low latency
    )
    yield from audio_stream
# Or use a WebSocket for the lowest latency.
async def tts_websocket(text_stream):
    """Stream text chunks to ElevenLabs over WebSocket, yielding audio.

    Sends each incoming text chunk as soon as it arrives, then flushes
    any audio still buffered on the server.
    """
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio
        # Flush the remaining buffered audio.
        final_audio = await tts.flush()
        yield final_audio
为何不好 : 增加数秒延迟。用户感知为缓慢。失去对话流畅性。
替代方案 : 所有环节都采用流式处理:
为何不好 : 用户体验令人沮丧。感觉像在和机器说话。浪费时间。
替代方案 : 实现插话检测。使用 VAD 检测用户语音。立即停止 TTS。清空音频队列。
为何不好 : 可能不是最佳质量。单点故障。难以优化。
替代方案 : 混合使用最佳供应商:
与以下技能配合良好:langgraph, structured-output, langfuse
此技能适用于执行概述中描述的工作流程或操作。
每周安装量
397
代码仓库
GitHub 星标数
27.1K
首次出现
Jan 19, 2026
安全审计
安装于
opencode: 325
gemini-cli: 322
claude-code: 304
codex: 278
cursor: 278
antigravity: 268
Role : Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
Native voice-to-voice with GPT-4o
When to use : When you want integrated voice AI without separate STT/TTS
import asyncio
import websockets
import json
import base64
# NOTE(review): never hard-code API keys; load from the environment instead.
OPENAI_API_KEY = "sk-..."


async def voice_session():
    """Open a GPT-4o Realtime API session over a WebSocket.

    Configures the session (voice, PCM16 audio in/out, server-side VAD,
    one function tool), defines a helper to stream audio in, and then
    consumes server events.
    """
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24 kHz, mono).
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events.
        # NOTE(review): the original snippet was truncated here ("resp...");
        # the handler below follows the documented Realtime API event name
        # for streamed audio — confirm against the API reference.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                # TODO: feed audio_chunk to the playback device.
Build voice agents with Vapi platform
When to use : Phone-based agents, quick deployment
# Vapi provides hosted voice agents with webhooks
from flask import Flask, request, jsonify
import vapi
app = Flask(__name__)

# NOTE(review): load the API key from the environment instead of hard-coding.
client = vapi.Vapi(api_key="...")

# Create a hosted assistant: GPT-4o for reasoning, ElevenLabs for the voice,
# Deepgram nova-2 for transcription.
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)
# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    """Handle Vapi conversation events.

    Dispatches assistant tool calls ("function-call") and persists the
    transcript when a call ends ("end-of-call-report").
    """
    event = request.json
    if event["type"] == "function-call":
        # Dispatch the tool call requested by the assistant.
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]
        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})
    elif event["type"] == "end-of-call-report":
        # Call ended — save the transcript.
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)
    # Default acknowledgement (also reached for unrecognized tool names).
    return jsonify({"ok": True})
# Start an outbound phone call.
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call — returns a URL for the WebRTC connection.
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
Best-in-class transcription and synthesis
When to use : High quality voice, custom pipeline
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs
# Deepgram real-time transcription client.
deepgram = DeepgramClient(api_key="...")


async def transcribe_stream(audio_stream):
    """Stream audio chunks to Deepgram and react to live transcripts.

    Interim results are printed; final transcripts are forwarded to
    ``handle_user_input`` (defined elsewhere).
    """
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            # NOTE(review): indentation was lost in the source; the
            # is_final check is assumed to be nested under the non-empty
            # transcript check — confirm against the original example.
            if result.is_final:
                # Process the final transcript.
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await connection.start({
        "model": "nova-2",        # best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,       # voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream the audio in, then close cleanly.
    async for chunk in audio_stream:
        await connection.send(chunk)
    await connection.finish()
# ElevenLabs streaming synthesis client.
eleven = ElevenLabs(api_key="...")


def text_to_speech_stream(text: str):
    """Yield TTS audio chunks for *text* as they are synthesized."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",     # fastest model
        text=text,
        output_format="pcm_24000"         # raw PCM for low latency
    )
    yield from audio_stream
# Or use a WebSocket for the lowest latency.
async def tts_websocket(text_stream):
    """Stream text chunks to ElevenLabs over WebSocket, yielding audio.

    Sends each incoming text chunk as soon as it arrives, then flushes
    any audio still buffered on the server.
    """
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio
        # Flush the remaining buffered audio.
        final_audio = await tts.flush()
        yield final_audio
Why bad : Adds seconds of latency. User perceives as slow. Loses conversation flow.
Instead : Stream everything:
Why bad : Frustrating user experience. Feels like talking to a machine. Wastes time.
Instead : Implement barge-in detection. Use VAD to detect user speech. Stop TTS immediately. Clear audio queue.
Why bad : May not be best quality. Single point of failure. Harder to optimize.
Instead : Mix best providers:
Works well with: langgraph, structured-output, langfuse
This skill is applicable to execute the workflow or actions described in the overview.
Weekly Installs
397
Repository
GitHub Stars
27.1K
First Seen
Jan 19, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Fail
Installed on
opencode: 325
gemini-cli: 322
claude-code: 304
codex: 278
cursor: 278
antigravity: 268
超能力技能使用指南:AI助手技能调用优先级与工作流程详解
41,800 周安装