Install:

```
npx skills add https://github.com/marswaveai/skills --skill tts
```
Convert text into natural-sounding speech audio. Two paths:

- Quick (`/v1/tts`): single voice, low latency, synchronous MP3 stream. For casual chat, reading snippets, instant audio.
- Script (`/v1/speech`): multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.

Conventions:

- See shared/authentication.md for the API key and request headers.
- See shared/common-patterns.md for error handling and interaction patterns.
- Use the built-in defaults in shared/speaker-selection.md as a fallback only; fetch from the speakers API when the user wants to change voice.
- Read the config per shared/config-pattern.md before any interaction.
- Use shared/speaker-selection.md for speaker selection (text table + free-text input).
- Avoid ~/Downloads/ or /tmp/ as primary output; save artifacts to the current working directory with friendly topic-based names (see shared/config-pattern.md § Artifact Naming).

Determine the mode from the user's input automatically, before asking any questions:
| Signal | Mode |
|---|---|
| "多角色", "脚本", "对话", "script", "dialogue", "multi-speaker" | Script |
| Multiple characters mentioned by name or role | Script |
| Input contains structured segments (A: ..., B: ...) | Script |
| Single paragraph of text, no character markers | Quick |
| "读一下", "read this", "TTS", "朗读" with plain text | Quick |
| Ambiguous | Quick (default) |
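The signal table above can be approximated in shell for quick checks. This is a heuristic sketch only, not part of the upstream skill; the real decision is made by the agent reading the input.

```shell
# Heuristic mode detection mirroring the signal table (illustrative).
detect_mode() {
  input="$1"
  # Explicit multi-speaker keywords (English or Chinese)
  if printf '%s' "$input" | grep -qE 'script|dialogue|multi-speaker|多角色|脚本|对话'; then
    echo "script"
  # Two or more structured segments like "A: ..." / "B: ..."
  elif [ "$(printf '%s' "$input" | grep -cE '^[[:alnum:]]+:')" -ge 2 ]; then
    echo "script"
  else
    echo "quick"   # default for plain or ambiguous input
  fi
}
```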
Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.
Follow shared/config-pattern.md Step 0 (Zero-Question Boot).
If the config file doesn't exist, silently create it with defaults and proceed:

```
mkdir -p ".listenhub/tts"
echo '{"outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")
```
Do NOT ask any setup questions. Proceed directly to the Interaction Flow.
If the config file exists, read it silently and proceed:

```
CONFIG_PATH=".listenhub/tts/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")
```
Only run when the user explicitly asks to reconfigure. Display current settings:
Current config (tts):
  Output mode: {inline / download / both}
  Language preference: {zh / en / not set}
  Default speaker: {speakerName / built-in default}
Then ask:
outputMode : Follow shared/output-mode.md § Setup Flow Question.
Language (optional): "Default language?" Leave `language` as null if the user prefers to choose each time.

After collecting answers, save immediately:

```
NEW_CONFIG=$(echo "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')
# Save language if the user chose one (not "choose manually each time")
if [ "$LANGUAGE" != "null" ]; then
  NEW_CONFIG=$(echo "$NEW_CONFIG" | jq --arg lang "$LANGUAGE" '. + {"language": $lang}')
fi
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
```
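Later steps read OUTPUT_MODE (and the language preference) back out of this config; with jq, already used above, that is a one-liner. An example config value is shown inline:

```shell
# Read settings back out of the saved config (example value inline).
CONFIG='{"outputMode":"inline","language":null,"defaultSpeakers":{}}'
OUTPUT_MODE=$(echo "$CONFIG" | jq -r '.outputMode')
LANGUAGE=$(echo "$CONFIG" | jq -r '.language // "unset"')   # null falls back to "unset"
```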
Quick mode (POST /v1/tts)

Step 1: Extract text
Get the text to convert. If the user hasn't provided it, ask:
"What text would you like me to read aloud?"
Step 2: Determine voice
- If config.defaultSpeakers.{language}[0] is set → use it silently (skip to Step 4)
- Otherwise → use the built-in default from shared/speaker-selection.md for the detected language (skip to Step 4)

Step 3: Save preference
After the user explicitly selects a new voice (not when using defaults):
Question: "Save {voice name} as your default voice for {language}?"
Options:
- "Yes" — update .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Text: "{first 80 chars}..."
Voice: {voice name}
Proceed?
Step 5: Generate
```
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-output.mp3
```
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
Use a timestamped jobId: $(date +%s)
inline or both (TTS quick returns a sync audio stream — no audioUrl):
```
JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-${JOB_ID}.mp3
```
Then use the Read tool on /tmp/tts-{jobId}.mp3.
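Because the quick endpoint streams bytes directly, a failed request can leave a JSON error body in the .mp3 file. A small sanity check before presenting helps; this is an illustrative sketch, not part of the upstream skill:

```shell
# Sanity-check a downloaded TTS file: an empty file or a JSON body
# means the request failed and the error should be surfaced instead.
check_audio() {
  file="$1"
  if [ ! -s "$file" ]; then
    echo "error: empty response"
  elif [ "$(head -c 1 "$file")" = "{" ]; then
    echo "error: JSON error body"
  else
    echo "ok"
  fi
}
```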
Present:
Audio generated!
download or both: Generate a topic slug from the text content following shared/config-pattern.md § Artifact Naming.
```
SLUG="{topic-slug}"   # e.g. "server-maintenance-notice"
NAME="${SLUG}.mp3"
# Dedup: if the file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output "$NAME"
```
Present:
Audio generated!
Saved to the current directory:
{NAME}
Script mode (POST /v1/speech)

Step 1: Get scripts
Determine whether the user already has a scripts array:
Already provided (JSON or clear segments): parse and display for confirmation
Not yet provided : help the user structure segments. Ask:
"Please provide the script with speaker assignments. Format: each line as
SpeakerName: text content. I'll convert it."
Once the user provides the script, parse it into the scripts JSON format.
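Parsing "SpeakerName: text" lines into the scripts array can be sketched with jq. This is illustrative only: the speakerId emitted here is the raw speaker name and still needs to be mapped to a real voice ID in Step 2.

```shell
# Turn "Speaker: content" lines on stdin into a JSON scripts array.
# speakerId initially holds the speaker *name*; Step 2 replaces it
# with an actual voice ID.
parse_scripts() {
  jq -Rn '[inputs
           | capture("^(?<speaker>[^:]+):[[:space:]]*(?<content>.*)$")
           | {content: .content, speakerId: .speaker}]'
}
```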
Step 2: Assign voices per character
For each unique character in the script:
- If config.defaultSpeakers.{language} has saved voices → auto-assign silently (one per character, in order)
- Otherwise → use the built-in defaults from shared/speaker-selection.md (Primary for the first character, Secondary for the second)

Step 3: Save preferences
After all voices are assigned (if any were new):
Question: "Save these voice assignments for future sessions?"
Options:
- "Yes" — update defaultSpeakers in .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Characters:
{name}: {voice}
{name}: {voice}
Segments: {count}
Title: (auto-generated)
Proceed?
Step 5: Generate
Write the request body to a temp file, then submit:
```
# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
  "scripts": [
    {"content": "...", "speakerId": "..."},
    {"content": "...", "speakerId": "..."}
  ]
}
ENDJSON

# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d @/tmp/lh-speech-request.json
rm /tmp/lh-speech-request.json
```
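The curl above prints the JSON response to stdout; capturing it and formatting the fields Step 6 presents can be sketched as below. The flat response shape with these exact key names is an assumption inferred from the fields referenced in Step 6.

```shell
# Format the /v1/speech JSON response for display. Assumes a flat
# object with audioUrl, subtitlesUrl, audioDuration (ms), credits.
present_speech_result() {
  jq -r '"Listen online: \(.audioUrl)\nSubtitles: \(.subtitlesUrl)\nDuration: \(.audioDuration / 1000)s\nCredits used: \(.credits)"'
}
```

Usage: `RESPONSE=$(curl ... -d @/tmp/lh-speech-request.json); echo "$RESPONSE" | present_speech_result`.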
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
inline or both: Display the audioUrl and subtitlesUrl as clickable links.
Present:
Audio generated!
Listen online: {audioUrl}
Subtitles: {subtitlesUrl}
Duration: {audioDuration / 1000}s
Credits used: {credits}
download or both: Also download the file. Generate a topic slug following shared/config-pattern.md § Artifact Naming.
```
SLUG="{topic-slug}"   # e.g. "welcome-dialogue"
NAME="${SLUG}.mp3"
# Dedup: if the file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -o "$NAME" "{audioUrl}"
```
Present:
Saved to the current directory:
{NAME}
When saving preferences, merge into .listenhub/tts/config.json — do not overwrite unchanged keys.
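The merge described above can be sketched with jq: assigning into one path leaves every other key intact. LANG_KEY and SPEAKER_ID are illustrative values standing in for the session's selections.

```shell
# Merge a new default speaker into the config without clobbering
# other keys. CONFIG / LANG_KEY / SPEAKER_ID are example values.
CONFIG='{"outputMode":"inline","defaultSpeakers":{"zh":["voice-zh-1"]}}'
LANG_KEY="en"
SPEAKER_ID="voice-en-1"
NEW_CONFIG=$(echo "$CONFIG" | jq --arg l "$LANG_KEY" --arg id "$SPEAKER_ID" \
  '.defaultSpeakers[$l] = [$id]')
# then write NEW_CONFIG back to $CONFIG_PATH as in the reconfigure flow
```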
- Quick mode: set defaultSpeakers.{language}[0] to the selected speakerId
- Script mode: set defaultSpeakers.{language} to the full array assigned this session
- Set language if the user explicitly specifies it

References:

- shared/api-tts.md
- shared/api-speakers.md
- shared/speaker-selection.md
- shared/common-patterns.md § Error Handling
- shared/common-patterns.md § Long Text Input

Examples:

Quick mode:

"TTS this: The server will be down for maintenance at midnight."

1. quickVoice is null → use the default voice
2. POST /v1/tts with input + voice
3. Save to /tmp/tts-output.mp3

Script mode:

"帮我做一段双人对话配音,A说:欢迎大家,B说:谢谢邀请" ("Make a two-person dialogue voiceover; A says: Welcome everyone, B says: Thanks for the invitation")

1. scriptVoices is empty → fetch zh speakers, assign voices to A and B
2. POST /v1/speech with the scripts array
3. Present audioUrl, subtitlesUrl, duration