Important prerequisite
The key prerequisite for installing AI Skills is a working network proxy (VPN) with TUN mode enabled; this directly determines whether the installation succeeds. See the full installation guide →
voice-ai by scientiacapital/skills
npx skills add https://github.com/scientiacapital/skills --skill voice-ai
CRITICAL: NO OPENAI - Never use `from openai import OpenAI`
Key deliverables:
<quick_start> Minimal Voice Pipeline (~50 lines, <500ms):
import os
import asyncio
from groq import AsyncGroq
from deepgram import AsyncDeepgramClient
from cartesia import AsyncCartesia
# NEVER: from openai import OpenAI

async def voice_pipeline(user_audio: bytes) -> bytes:
    """Process audio input, return audio response."""
    # 1. STT: Deepgram Nova-3 (~150ms)
    dg = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
    result = await dg.listen.rest.v1.transcribe(
        {"buffer": user_audio, "mimetype": "audio/wav"},
        {"model": "nova-3", "language": "en-US"}
    )
    user_text = result.results.channels[0].alternatives[0].transcript

    # 2. LLM: Groq (~220ms) - NOT OpenAI
    groq = AsyncGroq(api_key=os.getenv("GROQ_API_KEY"))
    response = await groq.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": "Keep responses under 2 sentences."},
            {"role": "user", "content": user_text}
        ],
        max_tokens=150
    )
    response_text = response.choices[0].message.content

    # 3. TTS: Cartesia Sonic-2 (~90ms)
    cartesia = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))
    audio_chunks = []
    # AsyncCartesia yields chunks asynchronously, so iterate with async for
    async for chunk in cartesia.tts.sse(
        model_id="sonic-2",
        transcript=response_text,
        voice={"id": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94"},
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 8000}
    ):
        if chunk.audio:
            audio_chunks.append(chunk.audio)
    return b"".join(audio_chunks)  # Total: ~460ms
</quick_start>
<success_criteria> A voice AI agent is successful when:
<optimal_stack>
| Component | Provider | Model | Latency | Notes |
|---|---|---|---|---|
| STT | Deepgram | Nova-3 | ~150ms | Streaming, VAD, utterance detection |
| LLM | Groq | llama-3.1-8b-instant | ~220ms | LPU hardware, fastest inference |
| TTS | Cartesia | Sonic-2 | ~90ms | Streaming, emotions, bilingual |
| TOTAL | - | - | ~460ms | Sub-500ms target achieved |
LLM_PRIORITY = [
("groq", "GROQ_API_KEY", "~220ms"), # Primary
("cerebras", "CEREBRAS_API_KEY", "~200ms"), # Fallback
("anthropic", "ANTHROPIC_API_KEY", "~500ms"), # Quality fallback
]
# NEVER: from openai import OpenAI
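A minimal sketch of how this priority list could drive fallback selection at startup — pick the first provider whose API key is configured; this helper is illustrative, not part of any SDK:

import os

def select_llm_provider(priority=LLM_PRIORITY):
    """Return the first (provider, api_key) pair whose key is configured."""
    for provider, env_var, _latency in priority:
        api_key = os.getenv(env_var)
        if api_key:
            return provider, api_key
    raise RuntimeError("No LLM API key configured (GROQ/CEREBRAS/ANTHROPIC)")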
| Tier | Latency | STT | LLM | TTS | Features |
|---|---|---|---|---|---|
| Free | 3000ms | TwiML Gather | Groq | Polly | Basic IVR |
| Pro | 600ms | Deepgram Nova | Groq | Cartesia | Media Streams |
| Enterprise | 400ms | Deepgram + VAD | Groq | Cartesia | Barge-in |
</optimal_stack>
<deepgram_stt>
import os
from deepgram import AsyncDeepgramClient
from deepgram.core.events import EventType
from deepgram.extensions.types.sockets import (
    ListenV1SocketClientResponse,
    ListenV1MediaMessage,
    ListenV1ControlMessage
)

async def streaming_stt():
    client = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
    async with client.listen.v1.connect(model="nova-3") as connection:
        def on_message(message: ListenV1SocketClientResponse):
            msg_type = getattr(message, "type", None)
            if msg_type == "Results":
                channel = getattr(message, "channel", None)
                if channel and channel.alternatives:
                    text = channel.alternatives[0].transcript
                    is_final = getattr(message, "is_final", False)
                    if text:
                        print(f"{'[FINAL]' if is_final else '[INTERIM]'} {text}")
            elif msg_type == "UtteranceEnd":
                print("[USER FINISHED SPEAKING]")
            elif msg_type == "SpeechStarted":
                print("[USER STARTED SPEAKING - barge-in trigger]")

        connection.on(EventType.MESSAGE, on_message)
        await connection.start_listening()
        # Send audio chunks (audio_bytes comes from your transport layer)
        await connection.send_media(ListenV1MediaMessage(data=audio_bytes))
        # Keep alive for long sessions
        await connection.send_control(ListenV1ControlMessage(type="KeepAlive"))

options = {
    "model": "nova-3",
    "language": "en-US",
    "encoding": "mulaw",        # Twilio format
    "sample_rate": 8000,        # Telephony standard
    "interim_results": True,    # Get partial transcripts
    "utterance_end_ms": 1000,   # Silence to end utterance
    "vad_events": True,         # Voice activity detection
}
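Assuming the connect helper forwards these as keyword arguments (worth verifying against your deepgram-sdk version), the options dict could replace the bare model argument:

async with client.listen.v1.connect(**options) as connection:
    ...  # same handler registration as above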
See reference/deepgram-setup.md for full streaming setup. </deepgram_stt>
<groq_llm>
from groq import AsyncGroq

class GroqVoiceLLM:
    def __init__(self, model: str = "llama-3.1-8b-instant"):
        self.client = AsyncGroq()
        self.model = model
        self.system_prompt = (
            "You are a helpful voice assistant. "
            "Keep responses to 2-3 sentences max. "
            "Speak naturally as if on a phone call."
        )

    async def generate_stream(self, user_input: str):
        """Streaming for lowest TTFB."""
        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_input}
            ],
            max_tokens=150,
            temperature=0.7,
            stream=True,
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield content  # Pipe to TTS immediately
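"Pipe to TTS immediately" in practice usually means flushing at sentence boundaries rather than per token; a consumer sketch under that assumption (synthesize is a placeholder async callable, not an SDK function):

import re

async def speak_stream(llm: GroqVoiceLLM, user_input: str, synthesize):
    """Buffer streamed tokens and flush complete sentences to TTS."""
    buffer = ""
    async for token in llm.generate_stream(user_input):
        buffer += token
        # Flush on sentence-ending punctuation to keep TTS start latency low
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            await synthesize(sentence.strip())
    if buffer.strip():
        await synthesize(buffer.strip())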
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| llama-3.1-8b-instant | ~220ms | Good | Primary voice |
| llama-3.3-70b-versatile | ~500ms | Best | Complex queries |
| mixtral-8x7b-32768 | ~300ms | Good | Long context |
See reference/groq-voice-llm.md for context management. </groq_llm>
<cartesia_tts>
from cartesia import AsyncCartesia

class CartesiaTTS:
    VOICES = {
        "en": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94",  # Warm female
        "es": "5c5ad5e7-1020-476b-8b91-fdcbe9cc313c",  # Mexican Spanish
    }
    EMOTIONS = {
        "greeting": "excited",
        "confirmation": "grateful",
        "info": "calm",
        "complaint": "sympathetic",
        "apology": "apologetic",
    }

    def __init__(self, api_key: str):
        self.client = AsyncCartesia(api_key=api_key)

    async def synthesize_stream(
        self,
        text: str,
        language: str = "en",
        emotion: str = "neutral"
    ):
        voice_id = self.VOICES.get(language, self.VOICES["en"])
        response = self.client.tts.sse(
            model_id="sonic-2",
            transcript=text,
            voice={
                "id": voice_id,
                "experimental_controls": {
                    "speed": "normal",
                    "emotion": [emotion] if emotion != "neutral" else []
                }
            },
            language=language,
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 8000,  # Telephony
            },
        )
        # AsyncCartesia streams chunks asynchronously
        async for chunk in response:
            if chunk.audio:
                yield chunk.audio
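The EMOTIONS map above is not wired into synthesize_stream; a small helper sketch that picks the emotion from a message intent (intent labels match the map's keys):

async def speak(tts: CartesiaTTS, text: str, intent: str = "info", language: str = "en"):
    """Yield audio for text, using the emotion mapped to the message intent."""
    emotion = CartesiaTTS.EMOTIONS.get(intent, "neutral")
    async for audio in tts.synthesize_stream(text, language=language, emotion=emotion):
        yield audio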
# Word timestamps (shown with a synchronous Cartesia client named `client`
# and a voice_id taken from VOICES above)
response = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, world!",
    voice={"id": voice_id},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    add_timestamps=True,
)
for chunk in response:
    if chunk.word_timestamps:
        for word, start, end in zip(
            chunk.word_timestamps.words,
            chunk.word_timestamps.start,
            chunk.word_timestamps.end
        ):
            print(f"'{word}': {start:.2f}s - {end:.2f}s")
See reference/cartesia-tts.md for all 57 emotions. </cartesia_tts>
<twilio_media_streams>
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream
import json, base64, audioop  # audioop was removed in Python 3.13; use the audioop-lts backport there

app = FastAPI()

@app.post("/voice/incoming")
async def incoming_call(request: Request):
    """Route to Media Streams WebSocket."""
    form = await request.form()
    caller = form.get("From", "")
    lang = "es" if caller.startswith("+52") else "en"
    response = VoiceResponse()
    connect = Connect()
    connect.append(Stream(url=f"wss://your-app.com/voice/stream?lang={lang}"))
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/voice/stream")
async def media_stream(websocket: WebSocket, lang: str = "en"):
    await websocket.accept()
    stream_sid = None
    while True:
        message = await websocket.receive_text()
        data = json.loads(message)
        if data["event"] == "start":
            stream_sid = data["start"]["streamSid"]
            # Initialize STT, send greeting
        elif data["event"] == "media":
            audio = base64.b64decode(data["media"]["payload"])
            # Send to Deepgram STT
        elif data["event"] == "stop":
            break

async def send_audio(websocket, stream_sid: str, pcm_audio: bytes):
    """Convert PCM to mu-law and send to Twilio."""
    mulaw = audioop.lin2ulaw(pcm_audio, 2)
    await websocket.send_text(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw).decode()}
    }))
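For the Enterprise tier's barge-in, Twilio Media Streams accepts a clear event that drops any buffered outbound audio; a sketch of the handler you might call when Deepgram fires SpeechStarted:

async def handle_barge_in(websocket, stream_sid: str):
    """Stop queued playback immediately when the caller starts speaking."""
    await websocket.send_text(json.dumps({
        "event": "clear",
        "streamSid": stream_sid
    }))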
See reference/twilio-webhooks.md for complete handler. </twilio_media_streams>
<bilingual_support>
def detect_language(caller_number: str) -> str:
    if caller_number.startswith("+52"):
        return "es"  # Mexico
    elif caller_number.startswith("+1"):
        return "en"  # US/Canada
    return "es"  # Default Spanish

GREETINGS = {
    "en": "Hello! How can I help you today?",
    "es": "¡Hola! ¿En qué puedo ayudarle hoy?",  # Use "usted" for respect
}
</bilingual_support>
<voice_prompts>
VOICE_PROMPT = """
# Role
You are a bilingual voice assistant for {business_name}.
# Tone
- 2-3 sentences max for phone clarity
- NEVER use bullet points, lists, or markdown
- Spell out emails: "john at company dot com"
- Phone numbers with pauses: "five one two... eight seven seven..."
- Spanish: Use "usted" for formal respect
# Guardrails
- Never make up information
- Transfer to human after 3 failed attempts
- Match caller's language
# Error Recovery
English: "I want to make sure I got that right. Did you say [repeat]?"
Spanish: "Quiero asegurarme de entender bien. Dijo [repetir]?"
"""
See reference/voice-prompts.md for full template. </voice_prompts>
<file_locations>
reference/deepgram-setup.md - Full streaming STT setup
reference/groq-voice-llm.md - Groq patterns for voice
reference/cartesia-tts.md - All 57 emotions, voice cloning
reference/twilio-webhooks.md - Complete Media Streams handler
reference/latency-optimization.md - Sub-500ms techniques
reference/voice-prompts.md - Voice-optimized prompts
</file_locations>
User wants voice agent: → Provide full stack (Deepgram + Groq + Cartesia + Twilio) → Start with quick_start pipeline
User wants STT only: → Provide Deepgram streaming pattern → Reference: reference/deepgram-setup.md
User wants TTS only: → Provide Cartesia pattern with emotions → Reference: reference/cartesia-tts.md
User wants latency optimization: → Audit current stack, identify bottlenecks → Reference: reference/latency-optimization.md
User mentions OpenAI: → REDIRECT to Groq immediately → Explain: "NO OPENAI - Use Groq for lowest latency"
<env_setup>
# Required (NEVER OpenAI)
DEEPGRAM_API_KEY=your_key
GROQ_API_KEY=gsk_xxxx
CARTESIA_API_KEY=your_key
# Twilio
TWILIO_ACCOUNT_SID=ACxxxx
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+15551234567
# Fallbacks
ANTHROPIC_API_KEY=sk-ant-xxxx
CEREBRAS_API_KEY=csk_xxxx
pip install deepgram-sdk groq cartesia twilio fastapi
</env_setup>
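A startup check sketch that fails fast when a required key is missing (variable names as in the .env above):

import os

REQUIRED = ["DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY",
            "TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER"]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing required env vars: {', '.join(missing)}")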
<quick_reference>
STACK:
STT: Deepgram Nova-3 (~150ms)
LLM: Groq llama-3.1-8b-instant (~220ms) - NOT OPENAI
TTS: Cartesia Sonic-2 (~90ms)
LATENCY TARGETS:
Pro: 600ms (Media Streams)
Enterprise: 400ms (Full streaming + barge-in)
BILINGUAL:
+52 -> Spanish (es)
+1 -> English (en)
Default -> Spanish
EMOTIONS (Cartesia):
greeting -> excited
confirmation -> grateful
complaint -> sympathetic
Write to ~/.claude/skill-analytics/last-outcome-voice-ai.json: {"ts":"[UTC ISO8601]","skill":"voice-ai","version":"1.0.0","variant":"default","status":"[success|partial|error]","runtime_ms":[ms],"metrics":{"agents_built":[n],"endpoints_configured":[n],"latency_targets_met":[n]},"error":null,"session_id":"[YYYY-MM-DD]"} </quick_reference>
Weekly Installs: 67
Repository: github.com/scientiacapital/skills
GitHub Stars: 6
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on
gemini-cli: 63
codex: 61
opencode: 59
cursor: 57
github-copilot: 56
amp: 50