voice-audio-engineer by erichowens/some_claude_skills
npx skills add https://github.com/erichowens/some_claude_skills --skill voice-audio-engineer
Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.
✅ Use for:
❌ Do NOT use for:
| MCP Tool | Purpose |
|---|---|
| text_to_speech | Generate speech from text with voice selection |
| speech_to_speech | Transform voice recordings to different voices |
| voice_clone | Create instant voice clones from audio samples |
| search_voices | Find voices in ElevenLabs library |
| speech_to_text | Transcribe audio with speaker diarization |
| isolate_audio | Separate voice from background noise |
| create_agent | Build conversational AI agents with voice |
| Topic | Novice | Expert |
|---|---|---|
| TTS quality | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| Voice cloning | "Upload any audio" | Knows 30s-3min of clean, varied speech needed; single speaker |
| Loudness | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| De-essing | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| Compression | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| High-pass | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| True peak | "Peak is peak" | Knows intersample peaks exceed 0dBFS; targets -1 dBTP |
| ElevenLabs models | "Use default" | eleven_multilingual_v2 for quality; eleven_flash_v2_5 for speed |
**What it looks like**: Voice clone made from a phone recording with background noise and echo
**Why it's wrong**: The clone learns the noise; the output has artifacts
**What to do instead**: Use isolate_audio first; record in a quiet space; provide 1-3 min of varied speech

**What it looks like**: Podcast mastered at -6 LUFS, then normalized by the platform → crushed dynamics
**Why it's wrong**: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
**What to do instead**: Master to -16 LUFS for podcasts, -14 LUFS for streaming; always check true peak < -1 dBTP

**What it looks like**: Using a default robotic voice for a premium product
**Why it's wrong**: Voice IS brand; wrong voice = wrong emotional connection
**What to do instead**: Use search_voices to find a matching tone; consider a custom clone for brand consistency

**What it looks like**: "SSSSibilant" speech after compression and an EQ boost
**Why it's wrong**: Compression brings up sibilance; an EQ boost at 3-5kHz makes it worse
**What to do instead**: De-ess at 5-8kHz before compression; use frequency-selective compression

**What it looks like**: Podcast with 20 "ums", breath sounds, and long pauses
**Why it's wrong**: Listeners fatigue; it sounds unprofessional; it reduces engagement
**What to do instead**: Edit out filler words; gate or manually cut breaths; tighten pacing
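The "gate or manually cut breaths" advice can be sketched as a simple frame-based energy gate. This is a minimal illustration with numpy; the threshold and frame size are arbitrary assumptions, and a real editing pass would crossfade rather than hard-zero frames:

```python
import numpy as np

def energy_gate(y, sr, threshold_db=-45.0, frame_ms=20):
    """Zero frames whose RMS falls below threshold_db (crude breath/pause gate)."""
    frame = int(sr * frame_ms / 1000)
    out = y.copy()
    for start in range(0, len(y) - frame + 1, frame):
        seg = out[start:start + frame]
        rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        if rms_db < threshold_db:
            out[start:start + frame] = 0.0
    return out

sr = 16000
rng = np.random.default_rng(0)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # speech stand-in, ~-9 dB RMS
breath = 0.001 * rng.standard_normal(sr)                    # quiet breath noise, ~-60 dB RMS
gated = energy_gate(np.concatenate([tone, breath]), sr)     # breath section is silenced
```

The hard gate is deliberately naive: it shows why the text recommends gating *or* manual cuts, since an aggressive threshold will also chop quiet word endings.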
Model comparison:
| Model | Quality | Latency | Languages | Use Case |
|---|---|---|---|---|
| eleven_multilingual_v2 | Best | Higher | 29 | Production, quality-critical |
| eleven_flash_v2_5 | Good | Lowest | 32 | Real-time, voice UI |
| eleven_turbo_v2_5 | Better | Low | 32 | Balanced |
Voice parameters:
```python
# Stability: 0-1 (lower = more expressive, higher = more consistent)
# Similarity boost: 0-1 (higher = closer to original voice)
# Style: 0-1 (higher = more exaggerated style)

# For natural speech:
stability = 0.5    # Balanced expression
similarity = 0.75  # Close to voice but natural
style = 0.0        # Neutral (increase for dramatic)
```
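As a sketch of how these ranges might be guarded in practice, here is a hypothetical helper (not part of any SDK) that validates the documented 0-1 bounds before the values are handed to a text_to_speech call:

```python
def voice_settings(stability=0.5, similarity_boost=0.75, style=0.0):
    """Validate the documented 0-1 ranges and package the values.

    Hypothetical helper -- it only guards the ranges; the parameter
    names mirror the comments above, not a specific API signature.
    """
    params = {"stability": stability,
              "similarity_boost": similarity_boost,
              "style": style}
    for name, value in params.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return params

natural = voice_settings()                            # balanced defaults from above
dramatic = voice_settings(stability=0.3, style=0.6)   # more expressive read
```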
Audio requirements:
Cloning workflow:
1. isolate_audio to clean source material
2. voice_clone with cleaned audio

Standard voice chain (order matters!):
```
[Raw Recording]
      ↓
[High-Pass Filter @ 80Hz]      ← Remove rumble, plosives
      ↓
[De-esser @ 5-8kHz]            ← Before compression!
      ↓
[Compressor 3:1, 10ms/100ms]   ← Smooth dynamics
      ↓
[EQ: +2dB @ 3kHz presence]     ← Clarity boost
      ↓
[Limiter -1 dBTP]              ← Prevent clipping
      ↓
[Loudness Norm -16 LUFS]       ← Target loudness
```
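The first stage of the chain can be sketched with scipy (80 Hz Butterworth high-pass; the filter order and test frequencies are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highpass_80(y, sr, cutoff=80.0, order=2):
    """80 Hz high-pass: removes DC offset, rumble, and plosive low end."""
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, y)

sr = 48000
t = np.arange(sr) / sr
rumble = 0.5 * np.sin(2 * np.pi * 30 * t)    # 30 Hz rumble, below the cutoff
voice = 0.5 * np.sin(2 * np.pi * 300 * t)    # 300 Hz, typical voice energy
r_out = highpass_80(rumble, sr)              # heavily attenuated
v_out = highpass_80(voice, sr)               # passes almost unchanged
```

A second-order filter gives the 12 dB/oct slope quoted in the quick-reference settings later in this document; steeper slopes remove more rumble but can thin out low male voices.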
| Platform/Format | Target LUFS | True Peak |
|---|---|---|
| Podcast | -16 to -19 | -1 dBTP |
| Audiobook (ACX) | -18 to -23 RMS | -3 dBFS |
| YouTube | -14 | -1 dBTP |
| Spotify/Apple Music | -14 | -1 dBTP |
| Broadcast (EBU R128) | -23 ±1 | -1 dBTP |
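The True Peak column exists because sample peaks understate the reconstructed waveform. A quick demonstration of an intersample peak via 4x oversampling (the oversampling factor and edge trim are illustrative):

```python
import numpy as np
from scipy.signal import resample_poly

sr = 48000
n = np.arange(4800)
# Sine at fs/4 with a 45-degree phase offset: every sample lands at
# +/-0.707, but the continuous waveform peaks at 1.0 between samples.
y = np.sin(np.pi / 2 * n + np.pi / 4)

sample_peak = np.max(np.abs(y))                    # ~0.707 (about -3 dBFS)
oversampled = resample_poly(y, 4, 1)               # 4x oversampling
true_peak = np.max(np.abs(oversampled[200:-200]))  # ~1.0 (about 0 dBTP)
dbfs = 20 * np.log10(sample_peak)
dbtp = 20 * np.log10(true_peak)
```

This is why a limiter set by sample peaks alone can still clip a DAC; metering to -1 dBTP leaves headroom for these reconstructed peaks.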
Measurement:
ElevenLabs agent configuration:
```python
create_agent(
    name="Support Agent",
    first_message="Hi, how can I help you today?",
    system_prompt="You are a helpful customer support agent...",
    voice_id="your_voice_id",
    language="en",
    llm="gemini-2.0-flash-001",   # Fast for conversation
    temperature=0.5,
    asr_quality="high",           # Speech recognition quality
    turn_timeout=7,               # Seconds before agent responds
    max_duration_seconds=300      # 5 minute call limit
)
```
Voice UI considerations:
- Real-time interaction: use the low-latency eleven_flash_v2_5 model
- Quality-critical, non-real-time audio: use the eleven_multilingual_v2 model
- Noisy source recordings: run isolate_audio first

Starting settings for the processing chain:

```
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling
```
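The compressor setting (3:1 ratio over a -20 dB threshold) can be expressed as a static gain computer. This is a hard-knee sketch that ignores the attack/release time constants, so it shows only the level mapping:

```python
def compressor_gain_db(level_db, threshold_db=-20.0, ratio=3.0):
    """Static hard-knee gain computer: output level (dB) for an input level (dB)."""
    if level_db <= threshold_db:
        return level_db                                  # below threshold: unity gain
    return threshold_db + (level_db - threshold_db) / ratio

# A -10 dB input is 10 dB over threshold, so only 10/3 dB passes through:
out = compressor_gain_db(-10.0)   # ≈ -16.67 dB
reduction = -10.0 - out           # ≈ 6.67 dB of gain reduction
```

In a real compressor the 10 ms attack and 100 ms release smooth this curve over time, which is what preserves transients on dialogue.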
| Type | Characteristics | ASR Impact |
|---|---|---|
| Stuttering | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| Cluttering | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |
Most ASR models are trained on fluent speech. Disfluencies cause:
1. Model selection (best to worst for disfluencies):
2. Pre-processing:
```python
# Normalize speech rate before ASR
# Use librosa to stretch irregular segments toward a target rate
import librosa

y, sr = librosa.load("disfluent.wav")
y_stretched = librosa.effects.time_stretch(y, rate=0.9)  # rate < 1.0 slows down
```
3. Post-processing:
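One simple post-processing pass is collapsing immediate word repetitions in the transcript. This is a naive regex sketch; it handles repetitions like "I-I-I" but not prolongations ("wwwant"), and it will also collapse legitimate doubled words ("had had"), so real systems use disfluency-aware language models instead:

```python
import re

def collapse_repetitions(text):
    """Collapse immediate repeats like 'I-I-I want' or 'to to go' to one token."""
    # Hyphenated stutters: "I-I-I" -> "I"
    text = re.sub(r"\b(\w+)(?:-\1)+\b", r"\1", text, flags=re.IGNORECASE)
    # Repeated whole words: "to to" -> "to"
    text = re.sub(r"\b(\w+)(?:\s+\1)+\b", r"\1", text, flags=re.IGNORECASE)
    return text

cleaned = collapse_repetitions("I-I-I want to to go")  # -> "I want to go"
```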
4. Fine-tuning Whisper (advanced):
```python
# Fine-tune on a disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Fine-tune on your speech samples with corrected transcripts
# Training loop with disfluent audio → fluent transcript pairs
```
5. ElevenLabs voice cloning approach:
| Operation | Typical Time |
|---|---|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |
For detailed implementations: see /references/implementations.md
Remember : Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.
Weekly Installs: 100
GitHub Stars: 85
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (87), cursor (87), codex (86), gemini-cli (86), github-copilot (77), claude-code (70)