tts text-to-speech tool: voice cloning, emotion control, SRT timeline-accurate dubbing, with Kokoro and Noiz backends

tts by noizai/skills

2,400 weekly installs

402 GitHub Stars

GitHub

Install command

npx skills add https://github.com/noizai/skills --skill tts

AI/Machine Learning, Audio Processing, Productivity

🇨🇳 Chinese introduction

tts

Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.

Triggers

text to speech / tts / speak / say
voice clone / dubbing
epub to audio / srt to audio / convert to audio
语音 / 说 / 讲 / 说话

Simple Mode: text to audio

speak is the default subcommand and can be omitted:

# Basic usage (speak is implicit)
python3 skills/tts/scripts/tts.py -t "Hello world"          # add -o path to save
python3 skills/tts/scripts/tts.py -f article.txt -o out.mp3

# Voice cloning: local file path or URL
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio ./ref.wav
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio https://example.com/my_voice.wav -o clone.wav

# Voice message format
python3 skills/tts/scripts/tts.py -t "Hello" --format opus -o voice.opus
python3 skills/tts/scripts/tts.py -t "Hello" --format ogg -o voice.ogg

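Third-party integration (Feishu/Telegram/Discord) is documented in ref_3rd_party.md.

Timeline Mode: SRT to time-aligned audio

For precise per-segment timing (dubbing, subtitles, video narration).

Step 1: Get or create an SRT file

If the user doesn't have an SRT file, generate one from text:

python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt --cps 15 --gap 500

--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write the SRT manually.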

Step 2: Create a voice map

JSON file controlling default and per-segment voice settings. segments keys support a single index "3" or a range "5-8".

Kokoro voice map:

{
  "default": { "voice": "zf_xiaoni", "lang": "cmn" },
  "segments": {
    "1": { "voice": "zm_yunxi" },
    "5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
  }
}

Noiz voice map (adds emo and reference_audio support). reference_audio can be a local path or a URL (the user's own audio; Noiz only):

{
  "default": { "voice_id": "voice_123", "target_lang": "zh" },
  "segments": {
    "1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
    "2-4": { "reference_audio": "./refs/guest.wav" }
  }
}

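Dynamic reference audio slicing: if you are translating or dubbing a video and want each sentence to automatically use the audio from the original video at the exact same timestamp as its reference audio, use the --ref-audio-track argument instead of setting reference_audio in the map: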

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --ref-audio-track original_video.mp4 -o output.wav

See examples/ for full samples.

Step 3: Render

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json -o output.wav
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav

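When to Choose Which

Need	Recommended
Just read text aloud, no fuss	Kokoro (default)
EPUB/PDF audiobook with chapters	Kokoro (native support)
Voice blending (`"v1:60,v2:40"`)	Kokoro
Voice cloning from reference audio	Noiz
Emotion control (`emo` param)	Noiz
Exact server-side duration per segment	Noiz

When the user needs emotion control, voice cloning, and precise duration together, Noiz is the only backend that supports all three.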

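Guest Mode (no API key)

When no API key is configured, tts.py automatically falls back to guest mode, a limited Noiz endpoint that requires no authentication. Guest mode supports only --voice-id, --speed, and --format; voice cloning, emotion control, duration control, and timeline rendering are not available.

# Guest mode (auto-detected when no API key is set)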
python3 skills/tts/scripts/tts.py -t "Hello" --voice-id 883b6b7c -o hello.wav

# Explicit backend override to use Kokoro instead
python3 skills/tts/scripts/tts.py -t "Hello" --backend kokoro

Available guest voices (15 built-in):

voice_id	name	lang	gender	tone
`063a4491`	販売員（なおみ）	ja	F	喜び
`4252b9c8`	落ち着いた女性	ja	F	穏やか
`578b4be2`	熱血漢（たける）	ja	M	怒り
`a9249ce7`	安らぎ（みなと）	ja	M	穏やか
`f00e45a1`	旅人（かいと）	ja	M	穏やか
`b4775100`	悦悦｜社交分享	zh	F	Joyful
`77e15f2c`	婉青｜情绪抚慰	zh	F	Calm
`ac09aeb4`	阿豪｜磁性主持	zh	M	Calm
`87cb2405`	建国｜知识科普	zh	M	Calm
`3b9f1e27`	小明｜科技达人	zh	M	Joyful
`95814add`	Science Narration	en	M	Calm
`883b6b7c`	The Mentor (Alex)	en	M	Joyful
`a845c7de`	The Naturalist (Silas)	en	M	Calm
`5a68d66b`	The Healer (Serena)	en	F	Calm
`0e4ab6ec`	The Mentor (Maya)	en	F	Calm

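Security & data disclosure

This skill performs the following file and network operations at runtime:

Credential storage: When you run config --set-api-key, the key is saved to ~/.config/noiz/api_key (permissions 0600). The NOIZ_API_KEY environment variable is also supported as an alternative.
Legacy key migration: If ~/.noiz_api_key exists and ~/.config/noiz/api_key does not, the key is copied (not deleted) to the new location. A message is printed; the old file is left untouched for you to remove manually.
Network calls (Noiz backend): Text and optional reference audio are uploaded to https://noiz.ai/v1/ for synthesis. No data is sent unless you invoke a Noiz command.
Reference audio download: When --ref-audio is a URL, the file is downloaded to a temp file, used for the API call, then deleted. If no voice-id or ref-audio is provided, a default reference audio is downloaded from storage.googleapis.com or noiz.ai.
Temp files: Temporary audio/text files may be created during synthesis and are cleaned up after use.
ffmpeg: Invoked only in timeline render mode to assemble the final audio.

No files outside the output path and ~/.config/noiz/ are modified. The Kokoro backend runs entirely offline with no network access.

Requirements

ffmpeg in PATH (timeline mode only)
requests package: uv pip install requests (required for the Noiz backend)
Get your API key at Noiz Developer, then run python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY (guest mode works without a key but has limited features)
Kokoro: if already installed, pass --backend kokoro to use the local backend

Noiz API authentication

Use only the base64-encoded API key as the Authorization header, with no prefix (for example, no APIKEY or Bearer). Any prefix causes a 401 error.

For backend details and the full argument reference, see reference.md.

February 28, 2026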

🇺🇸English

tts

Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.

Triggers

text to speech / tts / speak / say
voice clone / dubbing
epub to audio / srt to audio / convert to audio
语音 / 说 / 讲 / 说话

Simple Mode — text to audio

speak is the default — the subcommand can be omitted:

# Basic usage (speak is implicit)
python3 skills/tts/scripts/tts.py -t "Hello world"          # add -o path to save
python3 skills/tts/scripts/tts.py -f article.txt -o out.mp3

# Voice cloning — local file path or URL
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio ./ref.wav
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio https://example.com/my_voice.wav -o clone.wav

# Voice message format
python3 skills/tts/scripts/tts.py -t "Hello" --format opus -o voice.opus
python3 skills/tts/scripts/tts.py -t "Hello" --format ogg -o voice.ogg

Third-party integration (Feishu/Telegram/Discord) is documented in ref_3rd_party.md.

Timeline Mode — SRT to time-aligned audio

For precise per-segment timing (dubbing, subtitles, video narration).

Step 1: Get or create an SRT

If the user doesn't have one, generate from text:

python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt --cps 15 --gap 500

--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write SRT manually.
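The relationship between --cps, --gap, and the resulting timestamps can be illustrated with a small sketch. This is not the actual to-srt implementation: the function names, the assumption that --gap is a pause in milliseconds between segments, and the one-sentence-per-segment split are all hypothetical.

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def text_to_srt(sentences, cps, gap_ms):
    """Build SRT text, giving each sentence len(sentence) / cps seconds.

    Assumption: gap_ms is a silent pause inserted between segments.
    """
    blocks = []
    cursor = 0  # start time of the next segment, in milliseconds
    for index, sentence in enumerate(sentences, start=1):
        duration = round(len(sentence) / cps * 1000)
        start, end = cursor, cursor + duration
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{sentence}\n"
        )
        cursor = end + gap_ms
    return "\n".join(blocks)


if __name__ == "__main__":
    print(text_to_srt(["Hello there, reader.", "This is a test."], cps=15, gap_ms=500))
```

Under these assumptions, a 15-character sentence at --cps 15 occupies one second, which is why a higher value suits English text better than the default of 4.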

Step 2: Create a voice map

JSON file controlling default + per-segment voice settings. segments keys support single index "3" or range "5-8".

Kokoro voice map:

{
  "default": { "voice": "zf_xiaoni", "lang": "cmn" },
  "segments": {
    "1": { "voice": "zm_yunxi" },
    "5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
  }
}

Noiz voice map (adds emo, reference_audio support). reference_audio can be a local path or a URL (user’s own audio; Noiz only):

{
  "default": { "voice_id": "voice_123", "target_lang": "zh" },
  "segments": {
    "1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
    "2-4": { "reference_audio": "./refs/guest.wav" }
  }
}
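To make the key syntax concrete, here is a minimal sketch of how a voice map could be resolved for one segment. It assumes ranges are inclusive and that segment settings override the default key by key; both are assumptions, and expand_segment_keys and settings_for are hypothetical helpers, not functions from tts.py.

```python
def expand_segment_keys(segments):
    """Expand keys like "3" or "5-8" into a dict keyed by integer index."""
    expanded = {}
    for key, settings in segments.items():
        if "-" in key:
            first, last = (int(part) for part in key.split("-", 1))
        else:
            first = last = int(key)
        for index in range(first, last + 1):  # assumed inclusive range
            expanded[index] = settings
    return expanded


def settings_for(voice_map, index):
    """Merge the default settings with any override for this segment."""
    merged = dict(voice_map.get("default", {}))
    overrides = expand_segment_keys(voice_map.get("segments", {}))
    merged.update(overrides.get(index, {}))
    return merged


if __name__ == "__main__":
    voice_map = {
        "default": {"voice": "zf_xiaoni", "lang": "cmn"},
        "segments": {
            "1": {"voice": "zm_yunxi"},
            "5-8": {"voice": "af_sarah", "lang": "en-us", "speed": 0.9},
        },
    }
    for i in (1, 3, 6):
        print(i, settings_for(voice_map, i))
```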

Dynamic Reference Audio Slicing: If you are translating or dubbing a video and want each sentence to automatically use the audio from the original video at the exact same timestamp as its reference audio, use the --ref-audio-track argument instead of setting reference_audio in the map:

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --ref-audio-track original_video.mp4 -o output.wav
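Conceptually, per-segment slicing means reading each SRT time range and cutting the same span out of the original track. The sketch below only illustrates that idea; how tts.py actually performs the slicing is not documented here, and parse_time_ranges, slice_command, and the output file names are hypothetical. The ffmpeg options used (-ss, -to, -vn, -y) are standard ones.

```python
import re

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" as written in SRT files.
TIME_RANGE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_time_ranges(srt_text):
    """Return a list of (start_seconds, end_seconds), one per SRT segment."""
    ranges = []
    for match in TIME_RANGE.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        ranges.append((start, end))
    return ranges


def slice_command(track, start, end, out_path):
    """Build an ffmpeg command that extracts audio between start and end."""
    return [
        "ffmpeg", "-y", "-i", track,
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-vn",  # drop the video stream, keep audio only
        out_path,
    ]


if __name__ == "__main__":
    srt = "1\n00:00:01,500 --> 00:00:04,000\nHello\n"
    for number, (start, end) in enumerate(parse_time_ranges(srt), start=1):
        print(slice_command("original_video.mp4", start, end, f"ref_{number}.wav"))
```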

See examples/ for full samples.

Step 3: Render

python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json -o output.wav
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav

When to Choose Which

Need	Recommended
Just read text aloud, no fuss	Kokoro (default)
EPUB/PDF audiobook with chapters	Kokoro (native support)
Voice blending (`"v1:60,v2:40"`)	Kokoro
Voice cloning from reference audio	Noiz
Emotion control (`emo` param)	Noiz
Exact server-side duration per segment	Noiz

When the user needs emotion control + voice cloning + precise duration together, Noiz is the only backend that supports all three.

Guest Mode (no API key)

When no API key is configured, tts.py automatically falls back to guest mode — a limited Noiz endpoint that requires no authentication. Guest mode only supports --voice-id, --speed, and --format; voice cloning, emotion, duration, and timeline rendering are not available.

# Guest mode (auto-detected when no API key is set)
python3 skills/tts/scripts/tts.py -t "Hello" --voice-id 883b6b7c -o hello.wav

# Explicit backend override to use kokoro instead
python3 skills/tts/scripts/tts.py -t "Hello" --backend kokoro
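A caller that may end up in guest mode can check ahead of time which requested options would be unavailable. The sketch below is illustrative only: the option names mirror the flags listed above, while GUEST_SUPPORTED and split_guest_options are hypothetical and not part of tts.py.

```python
# Options the text above lists as available in guest mode.
GUEST_SUPPORTED = {"voice_id", "speed", "format"}


def split_guest_options(options):
    """Return (usable, rejected): options guest mode accepts, and names it does not."""
    usable = {name: value for name, value in options.items() if name in GUEST_SUPPORTED}
    rejected = sorted(name for name in options if name not in GUEST_SUPPORTED)
    return usable, rejected


if __name__ == "__main__":
    requested = {"voice_id": "883b6b7c", "speed": 1.0, "ref_audio": "./ref.wav"}
    usable, rejected = split_guest_options(requested)
    print("usable:", usable)
    print("needs an API key:", rejected)
```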

Available guest voices (15 built-in):

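voice_id	name	lang	gender	tone
`063a4491`	販売員（なおみ）	ja	F	喜び
`4252b9c8`	落ち着いた女性	ja	F	穏やか
`578b4be2`	熱血漢（たける）	ja	M	怒り
`a9249ce7`	安らぎ（みなと）	ja	M	穏やか
`f00e45a1`	旅人（かいと）	ja	M	穏やか
`b4775100`	悦悦｜社交分享	zh	F	Joyful
`77e15f2c`	婉青｜情绪抚慰	zh	F	Calm
`ac09aeb4`	阿豪｜磁性主持	zh	M	Calm
`87cb2405`	建国｜知识科普	zh	M	Calm
`3b9f1e27`	小明｜科技达人	zh	M	Joyful
`95814add`	Science Narration	en	M	Calm
`883b6b7c`	The Mentor (Alex)	en	M	Joyful
`a845c7de`	The Naturalist (Silas)	en	M	Calm
`5a68d66b`	The Healer (Serena)	en	F	Calm
`0e4ab6ec`	The Mentor (Maya)	en	F	Calm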

Security & data disclosure

This skill performs the following file and network operations at runtime:

Credential storage: When you run config --set-api-key, the key is saved to ~/.config/noiz/api_key (permissions 0600). The NOIZ_API_KEY environment variable is also supported as an alternative.
Legacy key migration: If ~/.noiz_api_key exists and ~/.config/noiz/api_key does not, the key is copied (not deleted) to the new location. A message is printed; the old file is left untouched for you to remove manually.
Network calls (Noiz backend): Text and optional reference audio are uploaded to https://noiz.ai/v1/ for synthesis. No data is sent unless you invoke a Noiz command.
Reference audio download: When --ref-audio is a URL, the file is downloaded to a temp file, used for the API call, then deleted. If no voice-id or ref-audio is provided, a default reference audio is downloaded from storage.googleapis.com or noiz.ai.

No files outside the output path and ~/.config/noiz/ are modified. The Kokoro backend runs entirely offline with no network access.
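The credential handling described above can be sketched as follows. This is an illustration under assumptions, not the code in tts.py: load_api_key is a hypothetical helper, and the precedence of NOIZ_API_KEY over the stored file is a guess, since the text only says the variable is supported as an alternative.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def load_api_key(home: Path, env: Mapping[str, str]) -> Optional[str]:
    """Return the API key, migrating a legacy key file if needed."""
    new_path = home / ".config" / "noiz" / "api_key"
    legacy_path = home / ".noiz_api_key"

    # Legacy migration: copy (never delete) the old file to the new location.
    if legacy_path.exists() and not new_path.exists():
        new_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(legacy_path, new_path)
        os.chmod(new_path, 0o600)  # owner read/write only
        print(f"Copied {legacy_path} to {new_path}; remove the old file manually.")

    # Assumed precedence: environment variable first, then the stored file.
    if env.get("NOIZ_API_KEY"):
        return env["NOIZ_API_KEY"].strip()
    if new_path.exists():
        return new_path.read_text().strip()
    return None  # no key configured: the caller would fall back to guest mode


if __name__ == "__main__":
    key = load_api_key(Path.home(), os.environ)
    print("API key configured" if key else "No API key: guest mode")
```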

Requirements

ffmpeg in PATH (timeline mode only)
requests package: uv pip install requests (required for Noiz backend)
Get your API key at Noiz Developer, then run python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY (guest mode works without a key but has limited features)
Kokoro: if already installed, pass --backend kokoro to use the local backend

Noiz API authentication

Use only the base64-encoded API key as the Authorization header, with no prefix (for example, no APIKEY or Bearer). Any prefix causes a 401 error.
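As a hedged illustration of the rule above, the sketch below builds the header and rejects a prefixed value before any request is sent. It assumes the key as issued is already the base64-encoded string in the standard alphabet; if that assumption is wrong for real keys, the validation step should be dropped. authorization_header is a hypothetical helper.

```python
import base64
import binascii


def authorization_header(api_key: str) -> dict:
    """Return request headers carrying the bare API key, with no scheme prefix."""
    value = api_key.strip()
    if " " in value:
        # Something like "Bearer <key>" or "APIKEY <key>" was passed in.
        raise ValueError("Authorization must be the bare key, without a prefix")
    try:
        # Assumption: the issued key is standard base64 text.
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("API key does not look like base64") from exc
    return {"Authorization": value}


if __name__ == "__main__":
    print(authorization_header("QUJD"))  # "QUJD" is a placeholder, not a real key
```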

For backend details and full argument reference, see reference.md.

Weekly Installs: 2.4K

Repository: noizai/skills

GitHub Stars: 402

First Seen: Feb 28, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)

Installed on: gemini-cli (2.4K), opencode (2.4K), cursor (2.4K), kimi-cli (2.4K), codex (2.4K), cline (2.4K)

Temp files: Temporary audio/text files may be created during synthesis and are cleaned up after use.

ffmpeg: Invoked only in timeline render mode to assemble the final audio.