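speak-tts: Local text-to-speech and voice cloning on Apple Silicon, so Claude agents can talk to you in real time

speak-tts by emzod/speak

694 weekly installs

6 GitHub Stars

Install command

npx skills add https://github.com/emzod/speak --skill speak-tts

AI/Machine Learning, Automation, Audio Processing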

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement	Check	Install
Apple Silicon Mac	`uname -m` → arm64	Intel not supported
macOS 12.0+	`sw_vers`	-
sox	`which sox`	`brew install sox`
ffmpeg	`which ffmpeg`	`brew install ffmpeg`
poppler (PDF)	`which pdftotext`	`brew install poppler`
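The checks in the table can be combined into one small script. This is a minimal sketch, not part of speak itself; the helper names `missing_tools` and `is_apple_silicon` are hypothetical.

```shell
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if the given architecture string is Apple Silicon (arm64).
is_apple_silicon() {
  [ "$1" = "arm64" ]
}

# Check the architecture and the tools listed in the table above.
if is_apple_silicon "$(uname -m)"; then
  echo "architecture: ok"
else
  echo "architecture: $(uname -m) (arm64 required)"
fi

missing=$(missing_tools sox ffmpeg pdftotext)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing
else
  echo "all tools found"
fi
```

Anything reported as missing can be installed with the matching `brew install` command from the table.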

Input Sources

Source	Example
Text file	`speak article.txt`
Markdown	`speak doc.md`
Direct string	`speak "Hello"`
Clipboard	`pbpaste | speak`
Stdin	`cat file.txt | speak`

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format	Convert Command
PDF	`pdftotext doc.pdf doc.txt`
DOCX	`textutil -convert txt doc.docx`
HTML	`pandoc -f html -t plain doc.html > doc.txt`

Output Modes

Goal	Command
Save for later	`speak text.txt --output file.wav`
Listen now (streaming)	`speak text.txt --stream`
Listen now (complete)	`speak text.txt --play`
Both	`speak text.txt --stream --output file.wav`

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav
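The naming rule for file inputs can be sketched as a helper that predicts where the output will land. The function name `default_output_path` is hypothetical, and it assumes that any input extension (not only `.txt`) is replaced by `.wav`, which the examples above do not confirm.

```shell
# Predict the default output path for a file input: same base name with a
# .wav extension under ~/Audio/speak/. Assumption: every extension is replaced.
default_output_path() {
  name=$(basename "$1")
  printf '%s/Audio/speak/%s.wav\n' "$HOME" "${name%.*}"
}

default_output_path article.txt
```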

Directory Auto-Creation

Directory	Auto-Created?
`~/Audio/speak/`	✓ Yes
`~/.chatter/voices/`	✗ No
Custom directories	✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

Output captures general voice characteristics but is not a perfect replica
Quality depends heavily on sample quality
15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

Open QuickTime Player → File → New Audio Recording
Record 20 seconds of clear speech
File → Export As → Audio Only (.m4a)
Convert to WAV (see below)

Using sox (command line):

# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
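The three requirements can also be checked mechanically once the values are known. The sketch below takes the sample rate, channel count, and duration as arguments; `voice_sample_ok` is a hypothetical helper, not a speak command. The values can be read off the `ffprobe` output above.

```shell
# Succeed if a sample meets the stated spec: 24000 Hz, mono, 10-30 seconds.
# Arguments: sample rate in Hz, channel count, duration in seconds
# (the duration may be fractional, so the range check is done in awk).
voice_sample_ok() {
  [ "$1" = "24000" ] || { echo "sample rate is $1 Hz, expected 24000"; return 1; }
  [ "$2" = "1" ] || { echo "channel count is $2, expected 1 (mono)"; return 1; }
  awk -v d="$3" 'BEGIN { exit !(d >= 10 && d <= 30) }' ||
    { echo "duration is $3 s, expected 10-30"; return 1; }
  echo "ok"
}

voice_sample_ok 24000 1 20.5
```

A sample that fails the rate or channel check can be fixed with the `ffmpeg` conversion commands above; one that is too long can be trimmed with `-t`.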

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
✓ Works: /Users/name/.chatter/voices/my_voice.wav
✗ Fails: my_voice.wav (relative path)
✗ Fails: ./voices/my_voice.wav (relative path)
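A script that accepts a voice path from a user can normalize it before calling speak. This is a minimal sketch with hypothetical helper names; it resolves relative paths against the current directory and does not check that the file exists.

```shell
# Succeed only for absolute paths. A leading ~ must already have been
# expanded by the shell, which happens only when the tilde is unquoted.
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Turn a relative path into an absolute one based on the current directory.
to_absolute_path() {
  if is_absolute_path "$1"; then
    printf '%s\n' "$1"
  else
    printf '%s/%s\n' "$PWD" "${1#./}"
  fi
}

to_absolute_path ./voices/my_voice.wav
```

Note that a quoted tilde such as `"~/.chatter/voices/my_voice.wav"` is passed through literally, so it counts as a relative path here; leave the tilde unquoted or use `$HOME`.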

Voice Sample Tips

Good Sample	Bad Sample
Quiet room	Background noise
Natural pace	Rushed or monotone
Clear diction	Mumbling
Varied content	Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

Tag	Effect
`[laugh]`	Laughter
`[chuckle]`	Light chuckle
`[sigh]`	Sighing
`[gasp]`	Gasping
`[groan]`	Groaning
`[clear throat]`	Throat clearing
`[cough]`	Coughing
`[crying]`	Crying
`[singing]`	Singing-style voice

NOT supported: [pause], [whisper] (ignored)

For pauses: Use punctuation: "Wait... let me think."
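Text written for another TTS tool may already contain the unsupported tags. One possible cleanup, sketched below with a hypothetical `normalize_tags` filter, replaces `[pause]` with an ellipsis and drops `[whisper]` before the text reaches speak. The choice of an ellipsis follows the punctuation advice above; it is not a speak feature.

```shell
# Read text on stdin; replace [pause] with "..." and remove [whisper].
# Supported tags such as [sigh] pass through unchanged.
normalize_tags() {
  sed -e 's/ *\[pause\] */... /g' -e 's/\[whisper\] *//g'
}

printf '%s\n' "Wait [pause] let me think. [whisper] Okay." | normalize_tags
# prints: Wait... let me think. Okay.
```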

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

Each input file is chunked independently
Chunks are generated and automatically concatenated per file
Final output: one .wav per input file (e.g., ch01.wav)
Intermediate chunks deleted (unless --keep-chunks)

You don't need to manually concatenate chunks — only concatenate final chapter files.
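To get a feel for how many chunks a file will produce, a rough count is the character total divided by the chunk size, rounded up. The sketch below assumes a chunk size of 6000 characters (the documented `--chunk-size` default) and a hypothetical helper name. The real chunker may split on sentence or paragraph boundaries, so treat the result as a lower bound.

```shell
# Estimate the number of chunks for a text of the given character count
# using ceiling division. Second argument: chunk size (default 6000).
estimate_chunks() {
  chars=$1
  size=${2:-6000}
  echo $(( (chars + size - 1) / size ))
}

estimate_chunks 15000   # prints 3
```

For a real file, the character count can be taken from `wc -m < ch01.txt`.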

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files	Correct	Wrong
1-9	`01`, `02`, ..., `09`	`1`, `2`, ..., `9`
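10-99	`01`, `02`, ..., `99`	`1`, `10`, `2`, ...
100+	`001`, `002`, ..., `999`	`1`, `100`, `2`, ...

Why: Shell glob expansion sorts names lexicographically, character by character. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.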
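The effect is easy to reproduce with `sort`, which orders names the same way a glob does in the C locale. The `pad` helper below is a hypothetical convenience for generating padded names; it expects an unpadded decimal number.

```shell
# Zero-pad a number to a fixed width (default 2).
# Pass unpadded input: a leading zero such as 08 would be read as octal.
pad() {
  printf "%0${2:-2}d\n" "$1"
}

# Unpadded names sort in the wrong order:
printf '%s\n' ch1 ch10 ch2 | LC_ALL=C sort
# prints ch1, ch10, ch2 (one per line)

# Padded names sort in numeric order:
for n in 1 10 2; do echo "ch$(pad "$n")"; done | LC_ALL=C sort
# prints ch01, ch02, ch10 (one per line)
```

For 100 or more files, pass a width of 3, for example `pad 7 3`.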

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
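The rule of thumb can be written as a few lines of arithmetic. This is only a back-of-the-envelope sketch with a hypothetical helper name; it assumes about 2.5 MB per minute of audio, and `speak --estimate` remains the more reliable source.

```shell
# Quick estimate from a page count: 2 min of audio and 1 min of generation
# per page, and roughly 2.5 MB of WAV per minute of audio.
quick_estimate() {
  pages=$1
  audio_min=$(( pages * 2 ))
  gen_min=$pages
  storage_mb=$(( audio_min * 5 / 2 ))
  echo "${audio_min} min audio, ${gen_min} min generation, ${storage_mb} MB"
}

quick_estimate 100   # prints: 200 min audio, 100 min generation, 500 MB
```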

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue	Solution
Empty/garbled text	Scanned PDF — use OCR: `brew install tesseract`
Wrong encoding	Try: `pdftotext -enc UTF-8 doc.pdf`
Check word count	`pdftotext doc.pdf - | wc -w`

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
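With more than a couple of lines, the per-file commands can be derived from the script filenames. The sketch below prints the commands instead of running them, so they can be reviewed first. It assumes a hypothetical naming convention, `NN_role.txt`, where `role` matches a voice file in `~/.chatter/voices/`.

```shell
# Print (not run) the speak command for one script file named NN_role.txt.
# Assumed convention: the part after the underscore names the voice file.
voice_command() {
  file=$(basename "$1" .txt)
  index=${file%%_*}
  role=${file#*_}
  printf 'speak %s --voice %s/.chatter/voices/%s.wav --output podcast/wav/%s.wav\n' \
    "$1" "$HOME" "$role" "$index"
}

for script in podcast/scripts/01_host.txt podcast/scripts/02_guest.txt; do
  voice_command "$script"
done
```

Because the output files keep the zero-padded index, they can be passed to `speak concat` in glob order afterwards.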

Options Reference

Option	Description	Default
`--stream`	Stream as it generates	false
`--play`	Play after complete	false
`--output <path>`	Output file	~/Audio/speak/
`--output-dir <dir>`	Batch output directory	-
`--voice <path>`	Voice sample (full path)	default
`--timeout <sec>`	Timeout per file	300
`--auto-chunk`	Split long documents	false
`--chunk-size <n>`	Characters per chunk	6000
`--resume <file>`	Resume from manifest file	-
`--keep-chunks`	Keep intermediate files	false
`--skip-existing`	Skip if output exists	false
`--estimate`	Show duration estimate	false
`--dry-run`	Preview only	false
`--quiet`	Suppress output	false

Commands

Command	Description
`speak setup`	Set up environment
`speak health`	Check system status
`speak models`	List TTS models
`speak concat`	Concatenate audio
`speak daemon kill`	Stop TTS server
`speak config`	Show configuration

Performance

Metric	Value
Cold start	~4-8s
Warm start	~3-8s
Speed	0.3-0.5x RTF (faster than real-time)
Storage	~2.5 MB/min, ~150 MB/hour
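RTF (real-time factor) is generation time divided by audio duration, so an RTF of 0.5 means one minute of audio takes about 30 seconds to generate. The sketch below uses the upper bound of 0.5 to judge whether a job is likely to exceed a timeout. The helper names are hypothetical, the 300-second default is the documented `--timeout` default, and startup time is ignored.

```shell
# Worst-case generation time in seconds for a given audio duration,
# using the upper RTF bound of 0.5 (generation = audio * 0.5, rounded up).
worst_case_generation_s() {
  echo $(( ($1 + 1) / 2 ))
}

# Succeed if the worst-case generation time exceeds the timeout
# (second argument, default 300 seconds).
exceeds_timeout() {
  [ "$(worst_case_generation_s "$1")" -gt "${2:-300}" ]
}

worst_case_generation_s 1200   # 20 min of audio, prints 600
if exceeds_timeout 1200; then
  echo "raise --timeout or use --auto-chunk"
fi
```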

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
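For single-file jobs, the choice between starting fresh and resuming can be automated by checking for the manifest. The sketch below prints the command rather than running it. `next_command` is a hypothetical helper, and it assumes the manifest is only present after an interrupted run, which is not confirmed here.

```shell
# Print the command to run next for a single-file job: resume from the
# manifest if one exists, otherwise start a fresh auto-chunked run.
# Arguments: manifest path, input text file, output WAV file.
next_command() {
  manifest=$1
  input=$2
  output=$3
  if [ -f "$manifest" ]; then
    printf 'speak --resume %s\n' "$manifest"
  else
    printf 'speak %s --auto-chunk --output %s\n' "$input" "$output"
  fi
}

next_command "$HOME/Audio/speak/manifest.json" long.txt book.wav
```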

Common Errors

Error	Cause	Solution
"Voice file not found"	Relative path	Use full path: `~/.chatter/voices/x.wav`
"Invalid WAV format"	Wrong specs	Convert: `ffmpeg -i in.wav -ar 24000 -ac 1 out.wav`
"Voice sample too short"	<10 seconds	Record 15-25 seconds
"Output directory doesn't exist"	Not created	`mkdir -p dirname/`
"sox not found"	Not installed	`brew install sox`
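Scrambled concat order	No zero-padding	Use `01`, `02`, not `1`, `2`
Timeout	Generation takes >5 minutes	Use `--auto-chunk` or `--timeout 600`
"Server not running"	Stale daemon	`speak daemon kill && speak health`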

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)
speak setup      # Or manual setup
speak health     # Verify everything works

Server Management

Server auto-starts and shuts down after 1 hour idle.

speak health        # Check status
speak daemon kill   # Stop manually

First Seen: Jan 27, 2026

Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)

Installed on: github-copilot (659), gemini-cli (649), opencode (648), codex (645), cursor (640), cline (626)
