voice-audio-engineer by erichowens/some_claude_skills
npx skills add https://github.com/erichowens/some_claude_skills --skill voice-audio-engineer
Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.
✅ Use for:
❌ Do NOT use for:
| MCP Tool | Purpose |
|---|---|
| text_to_speech | Generate speech from text with voice selection |
| speech_to_speech | Transform voice recordings to different voices |
| voice_clone | Create instant voice clones from audio samples |
| search_voices | Find voices in ElevenLabs library |
| speech_to_text | Transcribe audio with speaker diarization |
| isolate_audio | Separate voice from background noise |
| create_agent | Build conversational AI agents with voice |
| Topic | Novice | Expert |
|---|---|---|
| TTS quality | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| Voice cloning | "Upload any audio" | Knows 30s-3min of clean, varied speech needed; single speaker |
| Loudness | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| De-essing | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| Compression | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| High-pass | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| True peak | "Peak is peak" | Knows intersample peaks exceed 0dBFS; targets -1 dBTP |
| ElevenLabs models | "Use default" | eleven_multilingual_v2 for quality; eleven_flash_v2_5 for speed |
**What it looks like**: Voice clone made from a phone recording with background noise and echo
**Why it's wrong**: The clone learns the noise; the output has artifacts
**What to do instead**: Use isolate_audio first; record in a quiet space; provide 1-3 min of varied speech

**What it looks like**: Podcast mastered at -6 LUFS, then normalized by the platform → crushed dynamics
**Why it's wrong**: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
**What to do instead**: Master to -16 LUFS for podcasts, -14 LUFS for streaming; always check true peak < -1 dBTP

**What it looks like**: Using a default robotic voice for a premium product
**Why it's wrong**: Voice IS brand; wrong voice = wrong emotional connection
**What to do instead**: Use search_voices to find a matching tone; consider a custom clone for brand consistency

**What it looks like**: "SSSSibilant" speech after compression and an EQ boost
**Why it's wrong**: Compression brings up sibilance; an EQ boost at 3-5kHz makes it worse
**What to do instead**: De-ess at 5-8kHz before compression; use frequency-selective compression

**What it looks like**: Podcast with 20 "ums", breath sounds, and long pauses
**Why it's wrong**: Listeners fatigue; it sounds unprofessional; it reduces engagement
**What to do instead**: Edit out filler words; gate or manually cut breaths; tighten pacing
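The "gate or manually cut breaths" advice can be sketched as a simple frame-based energy gate. This is a minimal illustration with numpy; the threshold and frame size are arbitrary assumptions, and a real editing pass would crossfade rather than hard-zero frames:

```python
import numpy as np

def energy_gate(y, sr, threshold_db=-45.0, frame_ms=20):
    """Zero frames whose RMS falls below threshold_db (crude breath/pause gate)."""
    frame = int(sr * frame_ms / 1000)
    out = y.copy()
    for start in range(0, len(y) - frame + 1, frame):
        seg = out[start:start + frame]
        rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        if rms_db < threshold_db:
            out[start:start + frame] = 0.0
    return out

sr = 16000
rng = np.random.default_rng(0)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # speech stand-in, ~-9 dB RMS
breath = 0.001 * rng.standard_normal(sr)                    # quiet breath noise, ~-60 dB RMS
gated = energy_gate(np.concatenate([tone, breath]), sr)     # breath section is silenced
```

The hard gate is deliberately naive: it shows why the text recommends gating *or* manual cuts, since an aggressive threshold will also chop quiet word endings.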
Model comparison:
| Model | Quality | Latency | Languages | Use Case |
|---|---|---|---|---|
| eleven_multilingual_v2 | Best | Higher | 29 | Production, quality-critical |
| eleven_flash_v2_5 | Good | Lowest | 32 | Real-time, voice UI |
| eleven_turbo_v2_5 | Better | Low | 32 | Balanced |
Voice parameters:
```python
# Stability: 0-1 (lower = more expressive, higher = more consistent)
# Similarity boost: 0-1 (higher = closer to original voice)
# Style: 0-1 (higher = more exaggerated style)

# For natural speech:
stability = 0.5    # Balanced expression
similarity = 0.75  # Close to voice but natural
style = 0.0        # Neutral (increase for dramatic)
```
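As a sketch of how these ranges might be guarded in practice, here is a hypothetical helper (not part of any SDK) that validates the documented 0-1 bounds before the values are handed to a text_to_speech call:

```python
def voice_settings(stability=0.5, similarity_boost=0.75, style=0.0):
    """Validate the documented 0-1 ranges and package the values.

    Hypothetical helper -- it only guards the ranges; the parameter
    names mirror the comments above, not a specific API signature.
    """
    params = {"stability": stability,
              "similarity_boost": similarity_boost,
              "style": style}
    for name, value in params.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return params

natural = voice_settings()                            # balanced defaults from above
dramatic = voice_settings(stability=0.3, style=0.6)   # more expressive read
```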
Audio requirements:
Cloning workflow:
1. isolate_audio to clean source material
2. voice_clone with cleaned audio

Standard voice chain (order matters!):
```
[Raw Recording]
      ↓
[High-Pass Filter @ 80Hz]      ← Remove rumble, plosives
      ↓
[De-esser @ 5-8kHz]            ← Before compression!
      ↓
[Compressor 3:1, 10ms/100ms]   ← Smooth dynamics
      ↓
[EQ: +2dB @ 3kHz presence]     ← Clarity boost
      ↓
[Limiter -1 dBTP]              ← Prevent clipping
      ↓
[Loudness Norm -16 LUFS]       ← Target loudness
```
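The first stage of the chain can be sketched with scipy (80 Hz Butterworth high-pass; the filter order and test frequencies are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highpass_80(y, sr, cutoff=80.0, order=2):
    """80 Hz high-pass: removes DC offset, rumble, and plosive low end."""
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, y)

sr = 48000
t = np.arange(sr) / sr
rumble = 0.5 * np.sin(2 * np.pi * 30 * t)    # 30 Hz rumble, below the cutoff
voice = 0.5 * np.sin(2 * np.pi * 300 * t)    # 300 Hz, typical voice energy
r_out = highpass_80(rumble, sr)              # heavily attenuated
v_out = highpass_80(voice, sr)               # passes almost unchanged
```

A second-order filter gives the 12 dB/oct slope quoted in the quick-reference settings later in this document; steeper slopes remove more rumble but can thin out low male voices.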
| Platform/Format | Target LUFS | True Peak |
|---|---|---|
| Podcast | -16 to -19 | -1 dBTP |
| Audiobook (ACX) | -18 to -23 RMS | -3 dBFS |
| YouTube | -14 | -1 dBTP |
| Spotify/Apple Music | -14 | -1 dBTP |
| Broadcast (EBU R128) | -23 ±1 | -1 dBTP |
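The True Peak column exists because sample peaks understate the reconstructed waveform. A quick demonstration of an intersample peak via 4x oversampling (the oversampling factor and edge trim are illustrative):

```python
import numpy as np
from scipy.signal import resample_poly

sr = 48000
n = np.arange(4800)
# Sine at fs/4 with a 45-degree phase offset: every sample lands at
# +/-0.707, but the continuous waveform peaks at 1.0 between samples.
y = np.sin(np.pi / 2 * n + np.pi / 4)

sample_peak = np.max(np.abs(y))                    # ~0.707 (about -3 dBFS)
oversampled = resample_poly(y, 4, 1)               # 4x oversampling
true_peak = np.max(np.abs(oversampled[200:-200]))  # ~1.0 (about 0 dBTP)
dbfs = 20 * np.log10(sample_peak)
dbtp = 20 * np.log10(true_peak)
```

This is why a limiter set by sample peaks alone can still clip a DAC; metering to -1 dBTP leaves headroom for these reconstructed peaks.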
Measurement:
ElevenLabs agent configuration:
```python
create_agent(
    name="Support Agent",
    first_message="Hi, how can I help you today?",
    system_prompt="You are a helpful customer support agent...",
    voice_id="your_voice_id",
    language="en",
    llm="gemini-2.0-flash-001",   # Fast for conversation
    temperature=0.5,
    asr_quality="high",           # Speech recognition quality
    turn_timeout=7,               # Seconds before agent responds
    max_duration_seconds=300      # 5 minute call limit
)
```
Voice UI considerations:
- Real-time interaction: use the low-latency eleven_flash_v2_5 model
- Quality-critical, non-real-time audio: use the eleven_multilingual_v2 model
- Noisy source recordings: run isolate_audio first

Starting settings for the processing chain:

```
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling
```
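The compressor setting (3:1 ratio over a -20 dB threshold) can be expressed as a static gain computer. This is a hard-knee sketch that ignores the attack/release time constants, so it shows only the level mapping:

```python
def compressor_gain_db(level_db, threshold_db=-20.0, ratio=3.0):
    """Static hard-knee gain computer: output level (dB) for an input level (dB)."""
    if level_db <= threshold_db:
        return level_db                                  # below threshold: unity gain
    return threshold_db + (level_db - threshold_db) / ratio

# A -10 dB input is 10 dB over threshold, so only 10/3 dB passes through:
out = compressor_gain_db(-10.0)   # ≈ -16.67 dB
reduction = -10.0 - out           # ≈ 6.67 dB of gain reduction
```

In a real compressor the 10 ms attack and 100 ms release smooth this curve over time, which is what preserves transients on dialogue.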
| Type | Characteristics | ASR Impact |
|---|---|---|
| Stuttering | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| Cluttering | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |
Most ASR models are trained on fluent speech. Disfluencies cause:
1. Model selection (best to worst for disfluencies):
2. Pre-processing:
```python
# Normalize speech rate before ASR
# Use librosa to stretch irregular segments toward a target rate
import librosa

y, sr = librosa.load("disfluent.wav")
y_stretched = librosa.effects.time_stretch(y, rate=0.9)  # rate < 1.0 slows down
```
3. Post-processing:
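One simple post-processing pass is collapsing immediate word repetitions in the transcript. This is a naive regex sketch; it handles repetitions like "I-I-I" but not prolongations ("wwwant"), and it will also collapse legitimate doubled words ("had had"), so real systems use disfluency-aware language models instead:

```python
import re

def collapse_repetitions(text):
    """Collapse immediate repeats like 'I-I-I want' or 'to to go' to one token."""
    # Hyphenated stutters: "I-I-I" -> "I"
    text = re.sub(r"\b(\w+)(?:-\1)+\b", r"\1", text, flags=re.IGNORECASE)
    # Repeated whole words: "to to" -> "to"
    text = re.sub(r"\b(\w+)(?:\s+\1)+\b", r"\1", text, flags=re.IGNORECASE)
    return text

cleaned = collapse_repetitions("I-I-I want to to go")  # -> "I want to go"
```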
4. Fine-tuning Whisper (advanced):
```python
# Fine-tune on a disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Fine-tune on your speech samples with corrected transcripts
# Training loop with disfluent audio → fluent transcript pairs
```
5. ElevenLabs voice cloning approach:
| Operation | Typical Time |
|---|---|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |
For detailed implementations: see /references/implementations.md
Remember : Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.
Weekly Installs: 100
GitHub Stars: 85
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (87), cursor (87), codex (86), gemini-cli (86), github-copilot (77), claude-code (70)