Voice Agent Architecture Design Guide: Latency Optimization for Production Voice AI and an S2S vs. Pipeline Comparison

voice-agents by sickn33/antigravity-awesome-skills

382 weekly installs

28,500 GitHub Stars

GitHub

Install command

npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill voice-agents

AI/Machine Learning, Audio Processing, System Architecture




Voice Agents

You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency - every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.

Your core insight: Two architectures exist. Speech-to-speech (S2S) models like the OpenAI Realtime API preserve emotion and achieve the lowest latency but are less controllable. Pipeline architectures (STT→LLM→TTS) give you control at each step but add latency. Mos

Capabilities

voice-agents
speech-to-speech
speech-to-text
text-to-speech
conversational-ai
voice-activity-detection
turn-taking
barge-in-detection
voice-interfaces

Patterns

Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

Pipeline Architecture

Separate STT → LLM → TTS for maximum control
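
A minimal sketch of one pipeline turn, assuming each stage is an ordinary callable. The names `run_pipeline_turn`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins, not part of any real SDK; a production system would plug in actual STT, LLM, and TTS clients and would likely stream between stages rather than wait for each to finish. The point is only to show where per-stage control and per-stage latency measurement sit.

```python
import time
from typing import Callable, Dict, Tuple


def run_pipeline_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one STT -> LLM -> TTS turn and record per-stage latency in ms."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = stt(audio)  # control point: filter or correct the transcript here
    timings["stt_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply_text = llm(transcript)  # control point: guardrails, tool calls
    timings["llm_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply_audio = tts(reply_text)  # control point: voice, pacing
    timings["tts_ms"] = (time.perf_counter() - start) * 1000

    # The right-hand side is evaluated before "total_ms" is stored,
    # so the total covers only the three stage timings.
    timings["total_ms"] = sum(timings.values())
    return reply_audio, timings


# Hypothetical stand-in stages so the sketch runs without external services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio, timings = run_pipeline_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Because every stage boundary is explicit, each one can be logged, swapped, or wrapped with validation, which is the control the pipeline buys at the cost of added latency.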

Voice Activity Detection Pattern

Detect when user starts/stops speaking
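
One simple way to detect speech start and stop is an energy threshold with a hangover period, sketched below. This is an illustrative baseline, not the method the skill prescribes: the function names, the `energy_threshold` of 0.01, and the `hangover_frames` of 3 are all assumptions, and production systems typically use a trained VAD model instead.

```python
from typing import List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame of samples in [-1.0, 1.0]."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def detect_speech_segments(
    frames: List[Sequence[float]],
    energy_threshold: float = 0.01,  # assumed value; tune per microphone and noise floor
    hangover_frames: int = 3,        # assumed value; quiet frames tolerated inside speech
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) for each detected speech segment.

    A segment starts at the first frame at or above the threshold and ends
    once `hangover_frames` consecutive frames fall below it.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                # End the segment just after the last loud frame.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(frames) - quiet_run))
    return segments


loud = [0.5] * 4
quiet = [0.0] * 4
segments = detect_speech_segments(
    [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, quiet]
)
```

The hangover keeps a single quiet frame in the middle of a word from splitting the utterance in two.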

Anti-Patterns

❌ Ignoring Latency Budget
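
A latency budget can be made concrete by assigning each component a ceiling and checking measurements against it. The sketch below is hypothetical: `check_latency_budget` and the numbers in `EXAMPLE_BUDGET_MS` are illustrative assumptions, not targets taken from the skill.

```python
from typing import Dict, List

# Assumed example ceilings in milliseconds; real targets depend on the product.
EXAMPLE_BUDGET_MS: Dict[str, float] = {
    "vad": 100,
    "stt": 200,
    "llm": 300,
    "tts": 150,
    "network": 50,
}


def check_latency_budget(
    measured_ms: Dict[str, float], budget_ms: Dict[str, float]
) -> List[str]:
    """Return one message per component over budget, plus one for the total."""
    violations: List[str] = []
    for component, limit in budget_ms.items():
        actual = measured_ms.get(component, 0.0)
        if actual > limit:
            violations.append(f"{component}: {actual:.0f} ms > {limit:.0f} ms")
    total_actual = sum(measured_ms.values())
    total_limit = sum(budget_ms.values())
    if total_actual > total_limit:
        violations.append(f"total: {total_actual:.0f} ms > {total_limit:.0f} ms")
    return violations


measured = {"vad": 80, "stt": 250, "llm": 280, "tts": 120, "network": 40}
violations = check_latency_budget(measured, EXAMPLE_BUDGET_MS)
```

Checking each component separately, rather than only the total, shows which stage to optimize when a turn feels slow.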

❌ Silence-Only Turn Detection
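
One way to move beyond silence-only detection is to combine the pause length with a check on whether the partial transcript looks finished. The sketch below uses a crude word list as a stand-in for a real semantic end-of-turn model; `INCOMPLETE_ENDINGS`, `looks_complete`, `is_end_of_turn`, and both pause thresholds are assumptions chosen for illustration.

```python
# Words that usually signal the speaker has more to say (assumed, English only).
INCOMPLETE_ENDINGS = {"and", "but", "or", "so", "because", "um", "uh", "the", "a", "to"}


def looks_complete(transcript: str) -> bool:
    """Heuristic: an utterance ending on a connective or filler is unfinished."""
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False
    return words[-1].strip(",") not in INCOMPLETE_ENDINGS


def is_end_of_turn(
    transcript: str,
    silence_ms: float,
    short_pause_ms: float = 400,   # assumed: enough when the utterance looks complete
    long_pause_ms: float = 1500,   # assumed: hard timeout even if it looks unfinished
) -> bool:
    """Decide whether the agent should start responding."""
    if not transcript.strip():
        return False  # nothing was said, so there is no turn to end
    if silence_ms >= long_pause_ms:
        return True
    return silence_ms >= short_pause_ms and looks_complete(transcript)


decision = is_end_of_turn("I want to book a flight to", silence_ms=500)
```

With silence alone, the 500 ms pause after "flight to" would cut the caller off mid-thought; the completeness check makes the agent wait for the destination.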

❌ Long Responses
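
Response length can be constrained in the prompt and then enforced with a guard before synthesis. The sketch below assumes a two-sentence cap; `SPOKEN_STYLE_INSTRUCTION` and `cap_sentences` are hypothetical names, and the regex splitter is deliberately naive (it will mis-split abbreviations such as "Dr.").

```python
import re

# Assumed wording; append to the system prompt of whichever LLM is in use.
SPOKEN_STYLE_INSTRUCTION = (
    "Answer in at most two short sentences. "
    "Use plain spoken language with no lists, markdown, or URLs."
)


def cap_sentences(text: str, max_sentences: int = 2) -> str:
    """Keep only the first `max_sentences` sentences as a safety net before TTS."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


short_reply = cap_sentences(
    "Your order shipped. It arrives Friday. Anything else? Thanks."
)
```

The prompt does most of the work; the guard only catches replies where the model ignored the instruction.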

⚠️ Sharp Edges

Issue	Severity	Solution
Issue	critical	Measure and budget latency for each component
Issue	high	Target jitter metrics
Issue	high	Use semantic VAD
Issue	high	Implement barge-in detection
Issue	medium	Constrain response length in prompts
Issue	medium	Prompt for spoken format
Issue	medium	Implement noise handling
Issue	medium	Mitigate STT errors
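
Barge-in detection, listed in the table above, can be sketched as a small state machine: while the agent is speaking, sustained user speech stops playback. `BargeInController` and its `min_speech_frames` default are assumptions for illustration; a real implementation would also cancel the in-flight TTS and LLM requests and would need echo cancellation so the agent does not interrupt itself.

```python
class BargeInController:
    """Tracks agent playback and signals an interrupt on sustained user speech."""

    def __init__(self, min_speech_frames: int = 3):
        # Require several consecutive speech frames so a cough or click
        # does not cut the agent off (assumed default).
        self.min_speech_frames = min_speech_frames
        self.agent_speaking = False
        self._speech_run = 0

    def start_playback(self) -> None:
        self.agent_speaking = True
        self._speech_run = 0

    def finish_playback(self) -> None:
        self.agent_speaking = False
        self._speech_run = 0

    def on_vad_frame(self, user_is_speaking: bool) -> bool:
        """Feed one VAD decision; return True on the frame where playback should stop."""
        if not self.agent_speaking:
            return False
        self._speech_run = self._speech_run + 1 if user_is_speaking else 0
        if self._speech_run >= self.min_speech_frames:
            self.finish_playback()
            return True
        return False


controller = BargeInController(min_speech_frames=3)
controller.start_playback()
decisions = [
    controller.on_vad_frame(flag)
    for flag in [True, True, False, True, True, True, True]
]
```

In this run the isolated two-frame burst is ignored, and the interrupt fires on the third consecutive speech frame.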

Related Skills

Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend

When to Use

Use this skill when carrying out the workflow or actions described in the overview.

Weekly Installs

352

Repository

sickn33/antigra…e-skills

GitHub Stars

27.1K

First Seen

Jan 19, 2026

Security Audits

Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass

Installed on

opencode: 289

gemini-cli: 277

claude-code: 263

codex: 248

cursor: 235

antigravity: 225