speak-tts by emzod/speak
npx skills add https://github.com/emzod/speak --skill speak-tts
Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.
| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | - |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
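The checks in the table can be bundled into one preflight step. The following is a minimal sketch, not part of `speak` itself; the function names `missing_tools` and `check_requirements` are made up for illustration.

```shell
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report on the requirements from the table: CPU architecture plus
# sox, ffmpeg, and pdftotext (installed by the poppler package).
check_requirements() {
  [ "$(uname -m)" = "arm64" ] || echo "warning: not an Apple Silicon (arm64) machine"
  missing=$(missing_tools sox ffmpeg pdftotext)
  if [ -n "$missing" ]; then
    echo "missing tools:" $missing
    return 1
  fi
  echo "all required tools found"
}
```

Run `check_requirements` before the first generation. Anything it reports as missing can be installed with the matching `brew install` command from the table.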
| Source | Example |
|---|---|
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | `pbpaste \| speak` |
| Stdin | `cat file.txt \| speak` |
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
| Format | Convert Command |
|---|---|
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
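When a script has to handle mixed inputs, the table can be turned into a lookup by file extension. This is a small sketch under that assumption; `convert_cmd` is a hypothetical helper that only prints the command from the table and does not run it.

```shell
# Print the conversion command from the table above for a given input file.
# Only the three formats listed in the table are handled, and file names
# containing spaces are not quoted in the printed command.
convert_cmd() {
  file=$1
  base=${file%.*}
  case "$file" in
    *.pdf)  echo "pdftotext $file $base.txt" ;;
    *.docx) echo "textutil -convert txt $file" ;;
    *.html) echo "pandoc -f html -t plain $file > $base.txt" ;;
    *)      echo "unsupported format: $file" >&2; return 1 ;;
  esac
}

convert_cmd doc.pdf   # prints: pdftotext doc.pdf doc.txt
```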
| Goal | Command |
|---|---|
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
speak article.txt # → ~/Audio/speak/article.wav (no playback)
speak "Hello" # → ~/Audio/speak/speak_<timestamp>.wav
| Directory | Auto-Created? |
|---|---|
| ~/Audio/speak/ | ✓ Yes |
| ~/.chatter/voices/ | ✗ No |
| Custom directories | ✗ No |
Always create custom directories first:
mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Using QuickTime:
Using sox (command line):
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav
# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav
# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav
# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
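Instead of reading the `ffprobe` output by eye, the three requirements (24000 Hz, mono, 10-30 seconds) can be checked mechanically. The sketch below assumes `ffprobe` is asked for `key=value` output; `check_sample_specs` is a hypothetical helper, not a `speak` command.

```shell
# Validate ffprobe key=value output (read from stdin) against the stated
# requirements: 24000 Hz, mono, 10-30 seconds.
check_sample_specs() {
  awk -F= '
    $1 == "sample_rate" { rate = $2 }
    $1 == "channels"    { ch = $2 }
    $1 == "duration"    { dur = $2 }
    END {
      ok = 1
      if (rate != 24000)        { print "sample rate must be 24000 Hz, got " rate; ok = 0 }
      if (ch != 1)              { print "must be mono, got " ch " channel(s)"; ok = 0 }
      if (dur < 10 || dur > 30) { print "duration must be 10-30 s, got " dur; ok = 0 }
      if (ok) print "sample OK"
      exit !ok
    }'
}

# Usage with a real file (requires ffprobe):
# ffprobe -v error -select_streams a:0 \
#   -show_entries stream=sample_rate,channels:format=duration \
#   -of default=noprint_wrappers=1 voice.wav | check_sample_specs
```

The function exits non-zero when any requirement fails, so it can gate the `mv` into `~/.chatter/voices/`.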
# Create directory
mkdir -p ~/.chatter/voices/
# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav
# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream
# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- ✓ ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ /Users/name/.chatter/voices/my_voice.wav
- ✗ my_voice.wav (relative path)
- ✗ ./voices/my_voice.wav (relative path)

| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
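Because `--voice` expects a full path, a script that receives a relative path can resolve it first. This is a minimal sketch; `abs_path` is a hypothetical helper and does not check that the file exists.

```shell
# Turn a possibly relative path into an absolute one.
# Absolute paths are returned unchanged; a leading "./" is dropped.
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s\n' "$PWD/${1#./}" ;;
  esac
}

# Example: speak "Testing" --voice "$(abs_path my_voice.wav)" --stream
```

A path starting with `~` needs no special handling, since the shell expands it to an absolute path before the function is called.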
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Tags produce audible effects (actual sounds), not spoken words:
speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."
| Tag | Effect |
|---|---|
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
NOT supported: [pause], [whisper] (ignored)
For pauses: Use punctuation: "Wait... let me think."
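Text written for other TTS tools may already contain the unsupported tags. One way to prepare it is a small filter, sketched here; mapping `[pause]` to an ellipsis is an assumption based on the punctuation advice above, and `normalize_tags` is a hypothetical name.

```shell
# Replace the unsupported [pause] tag with an ellipsis and drop [whisper].
# Supported tags such as [sigh] pass through untouched.
normalize_tags() {
  sed -e 's/ *\[pause\] */... /g' -e 's/\[whisper\] *//g'
}

printf '%s\n' "Wait [pause] let me think. [sigh]" | normalize_tags
# prints: Wait... let me think. [sigh]
```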
mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav
# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk
# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
When using --auto-chunk with batch processing, the result is one .wav per input file (e.g., ch01.wav), and intermediate chunk files are not kept unless you pass --keep-chunks. You don't need to manually concatenate chunks; only concatenate the final chapter files.
# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav
# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav
Critical for correct concatenation order:
| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |
Why: Shell glob expansion sorts names lexicographically, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
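Zero-padded names are easy to generate with `printf`. The helper below is a small sketch; `zero_pad` is a hypothetical name, and it expects a plain decimal number without leading zeros (a value such as `08` would be read as invalid octal).

```shell
# Print a number left-padded with zeros to the given width.
zero_pad() {
  printf "%0${2}d\n" "$1"
}

# Build chapter file names that glob in numeric order.
for n in 1 2 10; do
  echo "ch$(zero_pad "$n" 2).wav"
done
# prints ch01.wav, ch02.wav, ch10.wav (one per line)
```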
# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt # Note chapter page numbers
# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"
# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed
# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
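The quick estimates above reduce to simple arithmetic, sketched here for use before `speak --estimate` is available. `estimate_pages` is a hypothetical helper, and the per-page figures are the rules of thumb stated above, not measurements.

```shell
# Rough estimate from a page count, using the rules of thumb above:
# 2 min of audio per page, 1 min of generation per page, 2.5 MB per minute.
estimate_pages() {
  pages=$1
  audio_min=$((pages * 2))
  gen_min=$((pages * 1))
  storage_mb=$((audio_min * 5 / 2))   # 2.5 MB/min in integer arithmetic
  echo "${audio_min} min audio, ${gen_min} min generation, ~${storage_mb} MB"
}

estimate_pages 100   # prints: 200 min audio, 100 min generation, ~500 MB
```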
mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav
| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF — use OCR: brew install tesseract |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | `pdftotext doc.pdf - \| wc -w` |
mkdir -p podcast/scripts podcast/wav
echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt
speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav
speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
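For more than a couple of segments, the per-file commands above can be generated from the script names. The sketch below only prints the commands so they can be reviewed first. It assumes scripts are named `NN_speaker.txt` and that a matching sample exists at `<voices_dir>/<speaker>.wav`; `podcast_commands` is a hypothetical helper.

```shell
# Print (not run) the speak commands for every script in a directory.
# Assumption: files are named NN_speaker.txt and a matching voice sample
# exists at <voices_dir>/<speaker>.wav.
podcast_commands() {
  scripts_dir=$1; voices_dir=$2; out_dir=$3
  segments=""
  for script in "$scripts_dir"/*.txt; do
    [ -e "$script" ] || continue
    name=$(basename "$script" .txt)   # e.g. 01_host
    num=${name%%_*}                   # 01
    speaker=${name#*_}                # host
    echo "speak $script --voice $voices_dir/$speaker.wav --output $out_dir/$num.wav"
    segments="$segments $out_dir/$num.wav"
  done
  if [ -n "$segments" ]; then
    echo "speak concat$segments --output podcast.wav"
  fi
}

# Example: podcast_commands podcast/scripts ~/.chatter/voices podcast/wav
```

Zero-padded prefixes keep the glob in the right order, so the final `speak concat` line lists the segments in sequence.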
| Option | Description | Default |
|---|---|---|
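| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | - |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | - |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |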
| Command | Description |
|---|---|
speak setup | Set up environment |
speak health | Check system status |
speak models | List TTS models |
speak concat | Concatenate audio |
speak daemon kill | Stop TTS server |
speak config | Show configuration |
| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
For interrupted long generations:
# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json
# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
| Error | Cause | Solution |
|---|---|---|
| "Voice file not found" | Relative path | Use full path: ~/.chatter/voices/x.wav |
| "Invalid WAV format" | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | mkdir -p dirname/ |
| "sox not found" | Not installed | brew install sox |
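| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| "Server not running" | Stale daemon | speak daemon kill && speak health |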
speak "test" # Auto-setup on first run (downloads model ~500MB)
speak setup # Or manual setup
speak health # Verify everything works
The server starts automatically and shuts down after 1 hour of idle time.
speak health # Check status
speak daemon kill # Stop manually
Weekly Installs: 694
Repository: emzod/speak
GitHub Stars: 6
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: github-copilot (659), gemini-cli (649), opencode (648), codex (645), cursor (640), cline (626)