OpenAI Speech Generation Skill: text-to-speech, narration and voiceover, batch TTS generation, with GPT-4o mini TTS support

speech by openai/skills

527 weekly installs

15,300 GitHub Stars

Install command

npx skills add https://github.com/openai/skills --skill speech

AI/Machine Learning, Automation, Audio Processing

Speech Generation Skill

Generate spoken audio for the current project (narration, product demo voiceover, IVR prompts, accessibility reads). Defaults to gpt-4o-mini-tts-2025-12-15 and built-in voices, and prefers the bundled CLI for deterministic, reproducible runs.

When to use

Generate a single spoken clip from text
Generate a batch of prompts (many lines, many files)

Decision tree (single vs batch)

If the user provides multiple lines/prompts or wants many outputs -> batch
Else -> single
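
The decision tree above can be sketched as a small function. This is only an illustration: the name choose_mode and its parameters are hypothetical and are not part of the skill or its CLI.

```python
def choose_mode(texts: list[str], outputs_wanted: int = 1) -> str:
    """Return "batch" for several prompts or several outputs, else "single"."""
    # Ignore empty entries so a stray blank line does not force batch mode.
    prompts = [text for text in texts if text.strip()]
    if len(prompts) > 1 or outputs_wanted > 1:
        return "batch"
    return "single"


print(choose_mode(["Welcome to the demo."]))
print(choose_mode(["Please hold.", "For sales, press 1."]))
```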

Workflow

Decide intent: single vs batch (see decision tree above).
Collect inputs up front: exact text (verbatim), desired voice, delivery style, format, and any constraints.
If batch: write a temporary JSONL under tmp/ (one job per line), run once, then delete the JSONL.
Augment instructions into a short labeled spec without rewriting the input text.
Run the bundled CLI (scripts/text_to_speech.py) with sensible defaults (see references/cli.md).
For important clips, validate: intelligibility, pacing, pronunciation, and adherence to constraints.
Iterate with a single targeted change (voice, speed, or instructions), then re-check.
Save/return final outputs and note the final text + instructions + flags used.
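
For the last workflow step, one way to note the final text, instructions, and flags is a small record serialized next to the audio file. A minimal sketch, assuming a hypothetical RunRecord class; the flag names shown are illustrative and not a list of the CLI's real options.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """What was used to produce one final clip (hypothetical helper)."""

    text: str
    instructions: str
    flags: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII input text readable in the note.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


record = RunRecord(
    text="Welcome to the demo. Today we'll show how it works.",
    instructions="Tone: Friendly and confident.\nPacing: Steady and moderate.",
    flags={"voice": "cedar", "response_format": "mp3"},  # illustrative only
)
print(record.to_json())
```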

Temp and output conventions

Use tmp/speech/ for intermediate files (for example JSONL batches); delete when done.
Write final artifacts under output/speech/ when working in this repo.
Use --out or --out-dir to control output paths; keep filenames stable and descriptive.
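
The temp-file convention above (write the batch JSONL under tmp/speech/, run once, delete it) might be wrapped in a context manager so the file is removed even if the run fails. This is a sketch under assumptions: temp_jobs_file and the file name jobs.jsonl are hypothetical, and the actual CLI invocation is left as a comment because its flags are documented in references/cli.md.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_jobs_file(jobs, tmp_dir="tmp/speech"):
    """Write one JSON job per line, yield the path, then delete the file."""
    folder = Path(tmp_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "jobs.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for job in jobs:
            handle.write(json.dumps(job, ensure_ascii=False) + "\n")
    try:
        yield path
    finally:
        # Clean up the intermediate file whether or not the run succeeded.
        path.unlink(missing_ok=True)


jobs = [
    {"input": "Thank you for calling. Please hold.", "voice": "cedar", "out": "hold.mp3"},
]
with temp_jobs_file(jobs) as jobs_path:
    # Run the bundled CLI against jobs_path here (see references/cli.md).
    print(f"wrote {jobs_path}")
```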

Dependencies (install if missing)

Prefer uv for dependency management.

Python packages:

uv pip install openai

If uv is unavailable:

python3 -m pip install openai

Environment

OPENAI_API_KEY must be set for live API calls.

If the key is missing, give the user these steps:

Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys
Set OPENAI_API_KEY as an environment variable in their system.
Offer to guide them through setting the environment variable for their OS/shell if needed.

Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready.

If installation isn't possible in this environment, tell the user which dependency is missing and how to install it locally.
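
A pre-flight check for the key could look like the sketch below. It only reports whether OPENAI_API_KEY is present and never prints its value, in line with the rule about not exposing the key. The function name api_key_status and its return strings are hypothetical.

```python
import os


def api_key_status(env=None) -> str:
    """Report whether OPENAI_API_KEY is set, without revealing the key."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        return "missing: set OPENAI_API_KEY locally, then confirm"
    return "set"


# Passing a dict makes the check easy to try without touching the real environment.
print(api_key_status({}))
print(api_key_status({"OPENAI_API_KEY": "placeholder-value"}))
```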

Defaults & rules

Use gpt-4o-mini-tts-2025-12-15 unless the user requests another model.
Default voice: cedar. If the user wants a brighter tone, prefer marin.
Built-in voices only. Custom voices are out of scope for this skill.
The instructions parameter is supported by GPT-4o mini TTS models, but not by tts-1 or tts-1-hd.
Input length must be <= 4096 characters per request. Split longer text into chunks.
Enforce 50 requests/minute. The CLI caps --rpm at 50.
Require OPENAI_API_KEY before any live API call.
Provide a clear disclosure to end users that the voice is AI-generated.
Use the OpenAI Python SDK (openai package) for all API calls; do not use raw HTTP.
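
Two of the rules above are numeric: at most 4096 characters per request and at most 50 requests per minute. The sketch below shows one possible way to split long text and to derive a delay between requests. The helper names are hypothetical, and the sentence-boundary heuristic is an assumption, not how the bundled CLI necessarily behaves.

```python
MAX_INPUT_CHARS = 4096
MAX_RPM = 50


def split_text(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        window = remaining[:limit]
        # Prefer to cut after the last sentence end inside the window.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        if cut != -1:
            cut += 1  # keep the punctuation with the chunk
        else:
            cut = window.rfind(" ")  # fall back to the last space
        if cut <= 0:
            cut = limit  # no natural break: hard cut at the limit
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


def seconds_between_requests(rpm: int) -> float:
    """Delay that keeps the request rate at or below min(rpm, 50)."""
    rpm = max(1, min(rpm, MAX_RPM))
    return 60.0 / rpm


long_text = "Hello world. " * 1000
print([len(chunk) for chunk in split_text(long_text)])
print(seconds_between_requests(200))
```

Cutting at sentence ends keeps each chunk's audio from stopping mid-sentence; the hard cut only applies to text with no spaces at all.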

Instruction augmentation

Reformat user direction into a short, labeled spec. Only make implicit details explicit; do not invent new requirements.

Quick clarification (augmentation vs invention):

If the user says "narration for a demo", you may add implied delivery constraints (clear, steady pacing, friendly tone).
Do not introduce a new persona, accent, or emotional style the user did not request.

Template (include only relevant lines):

Voice Affect: <overall character and texture of the voice>
Tone: <attitude, formality, warmth>
Pacing: <slow, steady, brisk>
Emotion: <key emotions to convey>
Pronunciation: <words to enunciate or emphasize>
Pauses: <where to add intentional pauses>
Emphasis: <key words or phrases to stress>
Delivery: <cadence or rhythm notes>

Augmentation rules:

Keep it short; add only details the user already implied or provided elsewhere.
Do not rewrite the input text.
If any critical detail is missing and blocks success, ask a question; otherwise proceed.
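
The template and rules above can be expressed as a small builder that emits only the labels that were supplied, in the template's order, and rejects labels that are not in the template. The function name build_instructions is hypothetical; the labels come from the template.

```python
FIELD_ORDER = [
    "Voice Affect",
    "Tone",
    "Pacing",
    "Emotion",
    "Pronunciation",
    "Pauses",
    "Emphasis",
    "Delivery",
]


def build_instructions(fields: dict[str, str]) -> str:
    """Format supplied fields as 'Label: value' lines in template order."""
    unknown = sorted(set(fields) - set(FIELD_ORDER))
    if unknown:
        raise ValueError(f"labels not in the template: {unknown}")
    lines = [
        f"{label}: {fields[label].strip()}"
        for label in FIELD_ORDER
        if fields.get(label, "").strip()  # include only relevant lines
    ]
    return "\n".join(lines)


print(build_instructions({
    "Tone": "Friendly and confident.",
    "Voice Affect": "Warm and composed.",
    "Pacing": "Steady and moderate.",
}))
```

The builder never touches the input text itself, which keeps the "do not rewrite the input text" rule easy to honor.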

Examples

Single example (narration)

Input text: "Welcome to the demo. Today we'll show how it works."
Instructions:
Voice Affect: Warm and composed.
Tone: Friendly and confident.
Pacing: Steady and moderate.
Emphasis: Stress "demo" and "show".

Batch example (IVR prompts)

{"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
{"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: Clear and neutral. Pacing: Slow.","response_format":"wav"}
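
Before running a batch, the JSONL lines can be checked locally. The sketch below assumes only what the examples show, namely that each line is a JSON object with an "input" string, plus the 4096-character limit; validate_jobs is a hypothetical helper and does not verify the other keys.

```python
import json


def validate_jobs(jsonl_text: str, max_chars: int = 4096) -> list[str]:
    """Return a list of problems found in a JSONL batch (empty if none)."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            job = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        text = job.get("input") if isinstance(job, dict) else None
        if not isinstance(text, str) or not text.strip():
            problems.append(f"line {number}: missing input text")
        elif len(text) > max_chars:
            problems.append(f"line {number}: input is {len(text)} chars, limit is {max_chars}")
    return problems


batch = "\n".join([
    '{"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}',
    '{"input":"For sales, press 1. For support, press 2.","voice":"marin","response_format":"wav"}',
])
print(validate_jobs(batch))
```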

Instruction best practices (short list)

Structure directions as: affect -> tone -> pacing -> emotion -> pronunciation/pauses -> emphasis.
Keep 4 to 8 short lines; avoid conflicting guidance.
For names/acronyms, add pronunciation hints (e.g., "enunciate A-I") or supply a phonetic spelling in the text.
For edits/iterations, repeat invariants (e.g., "keep pacing steady") to reduce drift.
Iterate with single-change follow-ups.

More principles: references/prompting.md. Copy/paste specs: references/sample-prompts.md.

Guidance by use case

Use these modules when the request is for a specific delivery style. They provide targeted defaults and templates.

Narration / explainer: references/narration.md
Product demo / voiceover: references/voiceover.md
IVR / phone prompts: references/ivr.md
Accessibility reads: references/accessibility.md

CLI + environment notes

Prefer the bundled CLI (scripts/text_to_speech.py) over writing new one-off scripts.
Never modify scripts/text_to_speech.py. If something is missing, ask the user before doing anything else.
CLI commands + examples: references/cli.md
API parameter quick reference: references/audio-api.md
Instruction patterns + examples: references/voice-directions.md
If network approvals / sandbox settings are getting in the way: references/codex-network.md

Reference map

references/cli.md : how to run speech generation/batches via scripts/text_to_speech.py (commands, flags, recipes).
references/audio-api.md : API parameters, limits, voice list.
references/voice-directions.md : instruction patterns and examples.
references/prompting.md : instruction best practices (structure, constraints, iteration patterns).
references/sample-prompts.md : copy/paste instruction recipes (examples only; no extra theory).
references/narration.md : templates + defaults for narration and explainers.
references/voiceover.md : templates + defaults for product demo voiceovers.
references/ivr.md : templates + defaults for IVR/phone prompts.
references/accessibility.md : templates + defaults for accessibility reads.
references/codex-network.md : environment/sandbox/network-approval troubleshooting.

First Seen

Jan 28, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Installed on

codex: 468
opencode: 445
gemini-cli: 436
github-copilot: 425
cursor: 421
kimi-cli: 410