⚠️ Important Prerequisite

Installing AI Skills requires unrestricted internet access through a proxy with TUN mode enabled. This is essential: it determines whether the installation can complete.

Gemini TTS Text-to-Speech Tool: Multi-Voice Dialogue and Streaming Audio Generation

gemini-tts by akrindev/google-studio-skills

50 weekly installs

1 GitHub star

Install command

npx skills add https://github.com/akrindev/google-studio-skills --skill gemini-tts

AI/Machine Learning, Content Creation, Audio Processing


Gemini Text-to-Speech

Generate natural-sounding speech from text using Gemini's TTS models through executable scripts with support for multiple voices and multi-speaker conversations.

When to Use This Skill

Use this skill when you need to:

Convert text to natural speech
Create audio for podcasts, audiobooks, or videos
Generate multi-speaker conversations
Stream audio for long content
Choose from multiple voice options
Create accessible audio content
Generate voiceovers for presentations
Batch convert text to audio files

Available Scripts

scripts/tts.js

Purpose: Convert text to speech using Gemini TTS models

When to use:

Any text-to-speech conversion
Multi-speaker conversation generation
Streaming audio for long texts
Voiceovers for content creation
Accessible audio generation

Key parameters:

Parameter	Description	Example
`text`	Text to convert (required)	`"Hello, world!"`
`--voice`, `-v`	Voice name	`Kore`
`--output`, `-o`	Base name for output file	`welcome`
`--output-dir`	Output directory for audio	`audio/`
`--no-timestamp`	Disable auto timestamp	Flag
`--model`, `-m`	TTS model	`gemini-2.5-flash-preview-tts`
`--stream`, `-s`	Enable streaming	Flag
`--speakers`	Multi-speaker mapping	`"Joe:Kore,Jane:Puck"`

Output: path of the generated WAV audio file
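The README documents tts.js only through its flags. As a rough sketch of what a single-voice request could look like, the function below assumes the `@google/genai` SDK request shape (`models.generateContent` with `responseModalities: ["AUDIO"]` and `speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName`) shown in the Gemini TTS documentation. `synthesizeSpeech` is a hypothetical name, not the script's actual code, and the client is passed in so the logic can be exercised without a network call.

```javascript
/**
 * Hypothetical helper: request speech for `text` and return raw PCM bytes.
 * Assumes a @google/genai-style client (client.models.generateContent) that
 * returns base64 audio in candidates[0].content.parts[*].inlineData.data.
 */
async function synthesizeSpeech(client, {
  text,
  voice = "Kore",                          // README's default voice
  model = "gemini-2.5-flash-preview-tts",  // example model from the parameter table
}) {
  const response = await client.models.generateContent({
    model,
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } },
      },
    },
  });

  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  const audioPart = parts.find((p) => p.inlineData?.data);
  if (!audioPart) {
    throw new Error("No audio generated");
  }
  // Decode the base64 payload into raw PCM bytes.
  return Buffer.from(audioPart.inlineData.data, "base64");
}
```

With the real SDK, the client would come from `new GoogleGenAI({ apiKey })` in `@google/genai`. The returned bytes are raw PCM and still need a WAV header before most players will open them.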

Workflows

Workflow 1: Basic Text-to-Speech

node scripts/tts.js "Hello, world! Have a wonderful day."

Best for: Quick audio generation, simple messages
Voice: Kore (default, clear and professional)
Output: audio/tts_output_YYYYMMDD_HHMMSS.wav (auto timestamp)
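The timestamp suffix follows a `YYYYMMDD_HHMMSS` pattern. A minimal sketch of producing such a name, assuming local time (the script may use a different clock or time zone); `formatTimestamp` and `timestampedName` are hypothetical helpers:

```javascript
/**
 * Format a Date as YYYYMMDD_HHMMSS using local time.
 * Hypothetical: mirrors the filename pattern shown above.
 */
function formatTimestamp(date = new Date()) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}` +
    `_${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`
  );
}

/** Hypothetical helper: build "<base>_<timestamp>.wav". */
function timestampedName(base = "tts_output", date = new Date()) {
  return `${base}_${formatTimestamp(date)}.wav`;
}
```

For example, a run at 14:30:05 on 29 January 2026 would produce `tts_output_20260129_143005.wav` with the default base name.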

Workflow 2: Choose Different Voice

node scripts/tts.js "Welcome to our podcast about technology trends" --voice Puck --output welcome

Best for: Friendly, conversational content
Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
Output: audio/welcome_YYYYMMDD_HHMMSS.wav

Workflow 3: Multi-Speaker Conversation

node scripts/tts.js "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation

Best for: Dialogues, interviews, role-playing content
Format: each turn prefixed with the speaker's name and a colon (e.g. Joe:)
Each speaker's lines are voiced with the voice mapped to that speaker in --speakers
Output: audio/conversation_YYYYMMDD_HHMMSS.wav
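The `--speakers` mapping has to reach the API as per-speaker voice settings. A sketch of building that request config, assuming the `multiSpeakerVoiceConfig.speakerVoiceConfigs` shape described in the Gemini TTS documentation; `buildMultiSpeakerConfig` is hypothetical, and the speaker names must match the labels used in the dialogue text:

```javascript
/**
 * Hypothetical helper: turn { Joe: "Kore", Jane: "Puck" } into a
 * multi-speaker request config (shape assumed from the Gemini TTS docs).
 */
function buildMultiSpeakerConfig(speakerVoices) {
  const entries = Object.entries(speakerVoices);
  if (entries.length === 0) {
    throw new Error("At least one speaker is required");
  }
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: entries.map(([speaker, voiceName]) => ({
          speaker, // must match the label in the text, e.g. "Joe"
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })),
      },
    },
  };
}
```

The result would be sent as the request `config`, with the labeled dialogue as the text content.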

Workflow 4: Long Content with Streaming

node scripts/tts.js "This is a very long text that would benefit from streaming..." --stream --output long-form

Best for: Podcasts, audiobooks, long articles
Streaming: Processes audio in chunks for long texts
Output: audio/long-form_YYYYMMDD_HHMMSS.wav
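Streaming means consuming the response as a sequence of partial chunks instead of waiting for one payload. A sketch, assuming the `@google/genai` `models.generateContentStream` method resolves to an async iterable of partial responses carrying base64 audio; `streamSpeech` and its `onChunk` callback are hypothetical:

```javascript
/**
 * Hypothetical sketch: collect streamed audio chunks into one PCM buffer.
 * Assumes client.models.generateContentStream() resolves to an async
 * iterable of responses shaped like generateContent() responses.
 */
async function streamSpeech(client, { text, voice = "Kore", model = "gemini-2.5-flash-preview-tts", onChunk } = {}) {
  const stream = await client.models.generateContentStream({
    model,
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } } },
    },
  });

  const chunks = [];
  for await (const response of stream) {
    const parts = response?.candidates?.[0]?.content?.parts ?? [];
    for (const part of parts) {
      if (part.inlineData?.data) {
        const bytes = Buffer.from(part.inlineData.data, "base64");
        chunks.push(bytes);
        if (onChunk) onChunk(bytes); // e.g. start playback before generation ends
      }
    }
  }
  if (chunks.length === 0) throw new Error("No audio generated");
  return Buffer.concat(chunks); // one continuous PCM stream
}
```

Concatenating the decoded chunks yields a single PCM stream; `onChunk` is where a real-time application could begin playback early.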

Workflow 5: Professional Voiceover

node scripts/tts.js "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover

Best for: Corporate content, presentations, formal announcements
Voice: Charon (deep, authoritative)
Use when: Professional, serious tone required

Workflow 6: Custom Output Directory

node scripts/tts.js "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1

Best for: Organized project structures
Directory created automatically if it doesn't exist
Output: ./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav
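Creating the output directory on demand takes one call in Node: `fs.mkdirSync(dir, { recursive: true })` creates missing parents and does nothing if the directory already exists. A sketch; `writeAudioFile` is a hypothetical helper, not part of tts.js:

```javascript
const fs = require("fs");
const path = require("path");

/**
 * Hypothetical helper: write audio bytes into outputDir, creating the
 * directory (and any missing parents) first.
 */
function writeAudioFile(outputDir, fileName, bytes) {
  fs.mkdirSync(outputDir, { recursive: true }); // no-op if it already exists
  const filePath = path.join(outputDir, fileName);
  fs.writeFileSync(filePath, bytes);
  return filePath;
}
```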

Workflow 7: Content Creation Pipeline (Text → Audio)

# 1. Generate script (gemini-text skill)
node skills/gemini-text/scripts/generate.js "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
node scripts/tts.js "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast

Best for: Podcasts, audiobooks, video narration
Combines with: gemini-text for script generation
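Pasting a generated script into a double-quoted shell argument breaks on embedded quotes, `$`, or backticks. One way around this is to pass the text to tts.js as a single argument with Node's `child_process.execFileSync`, which does not go through a shell. `buildTtsArgs` and `runTts` are hypothetical, and the relative path `scripts/tts.js` assumes the working directory is the skill folder.

```javascript
const { execFileSync } = require("child_process");

/**
 * Hypothetical helper: build argv for scripts/tts.js. The text stays a
 * single argument, so quotes, $ and newlines need no shell escaping.
 */
function buildTtsArgs(text, { voice, output, script = "scripts/tts.js" } = {}) {
  const args = [script, text];
  if (voice) args.push("--voice", voice);
  if (output) args.push("--output", output);
  return args;
}

/** Hypothetical: run tts.js with the current Node binary, sharing this terminal's output. */
function runTts(text, options) {
  execFileSync(process.execPath, buildTtsArgs(text, options), { stdio: "inherit" });
}
```

For example, `runTts(fs.readFileSync("intro.txt", "utf8"), { voice: "Fenrir", output: "podcast-intro" })`, where `intro.txt` is a hypothetical file holding the script produced in step 1.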

Workflow 8: Accessible Content

node scripts/tts.js "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility

Best for: Web accessibility, screen reader alternatives
Voice: Aoede (melodic, pleasant)
Use when: Making content accessible to visually impaired users

Workflow 9: Educational Content

node scripts/tts.js "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1

Best for: Educational materials, tutorials, e-learning
Voice: Zephyr (light, airy)
Combines well with: gemini-text for content generation

Workflow 10: Disable Timestamp

node scripts/tts.js "Fixed filename." --output my-audio --no-timestamp

Best for: complete control over the filename
Output: audio/my-audio.wav (no timestamp)
Use when: Generating files for specific naming schemes
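Deriving the final name from `--output` involves two decisions: whether to append a timestamp, and what to do when the value already ends in `.wav`. A sketch that strips a trailing `.wav` before adding the suffix, so `my-audio` and `my-audio.wav` produce the same file. Whether tts.js does this is not documented; `outputFileName` is hypothetical, and its timestamp is taken in UTC for simplicity.

```javascript
/**
 * Hypothetical helper: build the output file name from the --output value.
 * A trailing ".wav" (any case) is stripped first so it is not doubled.
 */
function outputFileName(base, { timestamp = true, date = new Date() } = {}) {
  const clean = base.replace(/\.wav$/i, "");
  if (!timestamp) return `${clean}.wav`; // --no-timestamp behaviour
  // "2026-01-29T14:30:05.000Z" -> "20260129_143005" (UTC)
  const stamp = date.toISOString().slice(0, 19).replace(/[-:]/g, "").replace("T", "_");
  return `${clean}_${stamp}.wav`;
}
```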

Parameters Reference

Model Selection

Model	Quality	Speed	Best For
`gemini-2.5-flash-preview-tts`	Good	Fast	General use, high volume
`gemini-2.5-pro-preview-tts`	Higher	Slower	Premium content, voiceovers

Voice Selection

Voice	Characteristics	Best For
Kore	Clear, professional	Announcements, general purpose (default)
Puck	Friendly, conversational	Casual content, interviews
Charon	Deep, authoritative	Corporate, serious content
Fenrir	Warm, expressive	Storytelling, narratives
Aoede	Melodic, pleasant	Educational, accessibility
Zephyr	Light, airy	Gentle content, tutorials
Sulafat	Neutral, balanced	Documentaries, factual content

Audio Format

Specification	Value
Format	WAV (PCM)
Sample rate	24000 Hz
Channels	1 (mono)
Bit depth	16-bit
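The Gemini TTS documentation describes the API's audio as raw PCM, while this skill writes `.wav` files, so a WAV header has to be added somewhere. A minimal sketch of that step using the canonical 44-byte RIFF header, assuming little-endian signed 16-bit PCM input; `pcmToWav` is hypothetical and defaults to the values in the table above.

```javascript
/**
 * Hypothetical helper: wrap raw little-endian PCM in a 44-byte RIFF/WAVE
 * header. Defaults: 24000 Hz, mono, 16-bit.
 */
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitDepth = 16 } = {}) {
  const blockAlign = channels * (bitDepth / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;     // bytes per second
  const header = Buffer.alloc(44);

  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);     // RIFF chunk size (file size minus 8)
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);                 // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);                  // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);         // data chunk size

  return Buffer.concat([header, pcm]);
}
```

At these defaults the byte rate is 24000 × 1 × 2 = 48,000 bytes per second, so one second of audio adds 48,000 bytes after the 44-byte header.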

Token Limits

Limit	Type	Description
8,192	Input	Maximum input text tokens
16,384	Output	Maximum output audio tokens

Output Interpretation

Audio File

Format: WAV (compatible with most players)
Mono channel (single audio track)
Sample rate: 24000 Hz (well suited to speech; broadcast audio typically uses 48000 Hz)
Can be converted to MP3/AAC if needed
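Because the format is fixed, file size maps directly to duration: 24000 samples/s × 1 channel × 2 bytes = 48,000 bytes per second, or about 2.9 MB per minute, which is one reason to convert long output to MP3/AAC. A sketch that reads the byte rate and data size from the header, assuming the canonical 44-byte layout (WAV files with extra chunks would need a proper chunk parser); `wavDurationSeconds` is hypothetical.

```javascript
/**
 * Hypothetical helper: playback length of a PCM WAV buffer, assuming the
 * canonical 44-byte header with the "data" chunk starting at byte 36.
 */
function wavDurationSeconds(wav) {
  if (wav.toString("ascii", 0, 4) !== "RIFF" || wav.toString("ascii", 8, 12) !== "WAVE") {
    throw new Error("Not a RIFF/WAVE buffer");
  }
  const byteRate = wav.readUInt32LE(28); // sampleRate * channels * bytesPerSample
  const dataSize = wav.readUInt32LE(40); // size of the PCM payload in bytes
  return dataSize / byteRate;
}
```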

Multi-Speaker Files

Single WAV file with multiple voices
Speakers take turns within the same mono track (there is no separate track per voice)
Use --speakers parameter to map speakers to voices
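Speakers that appear in the text but are missing from `--speakers` have no voice to map to. A heuristic sketch that collects `Name:` labels (a capitalized word followed by a colon and whitespace) and reports the unmapped ones. The label pattern is an assumption and can misfire on text such as `Note: ...`; `unmappedSpeakers` is hypothetical.

```javascript
/**
 * Hypothetical check: list "Name:" labels in the text that have no entry in
 * the speaker -> voice mapping. Heuristic label pattern: a capitalized word
 * followed by ":" and whitespace, at the start or after whitespace.
 */
function unmappedSpeakers(text, speakerVoices) {
  const labels = new Set();
  const re = /(?:^|\s)([A-Z][A-Za-z0-9_]*):\s/g;
  let match;
  while ((match = re.exec(text)) !== null) {
    labels.add(match[1]);
  }
  return [...labels].filter(
    (name) => !Object.prototype.hasOwnProperty.call(speakerVoices, name)
  );
}
```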

Streaming Output

Audio processed in chunks during generation
Script shows "Streaming audio..." message
Useful for very long texts or real-time applications

Common Issues

"google-genai not installed"

npm install @google/genai@latest dotenv@latest

"Voice name not found"

Check voice name spelling
Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
Voice names are case-sensitive
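Since voice names are case-sensitive, a pre-flight check can catch the most common mistake (wrong capitalization) and suggest the correct spelling. A sketch against the seven voices listed in this README; the API documentation lists additional voices, and `checkVoice` is hypothetical.

```javascript
// Voices listed in this README; the API may offer more.
const KNOWN_VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat"];

/**
 * Hypothetical pre-flight check: require an exact (case-sensitive) match,
 * and offer a case-insensitive match as a hint.
 */
function checkVoice(name) {
  if (KNOWN_VOICES.includes(name)) return { ok: true, voice: name };
  const hint = KNOWN_VOICES.find((v) => v.toLowerCase() === String(name).toLowerCase());
  return {
    ok: false,
    message: hint
      ? `Voice name not found: "${name}". Did you mean "${hint}"?`
      : `Voice name not found: "${name}". Available: ${KNOWN_VOICES.join(", ")}`,
  };
}
```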

"No audio generated"

Check that the text is not empty
Verify that the text does not exceed the input token limit (8,192)
Try shorter text segments
Check your API quota limits
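The first two checks can run before any API call. The Gemini documentation gives roughly 4 characters per token as a rule of thumb for English text, which is enough for a cheap estimate; the SDK's token-counting call gives an exact figure when it matters. A sketch, with `preflight` as a hypothetical helper:

```javascript
const MAX_INPUT_TOKENS = 8192; // input limit from the Token Limits table

/**
 * Hypothetical pre-flight check. The token count is a rough estimate
 * (~4 characters per token for English), not an exact tokenizer result.
 */
function preflight(text) {
  const trimmed = (text ?? "").trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "Text is empty" };
  }
  const estimatedTokens = Math.ceil(trimmed.length / 4);
  if (estimatedTokens > MAX_INPUT_TOKENS) {
    return {
      ok: false,
      reason: `~${estimatedTokens} tokens exceeds ${MAX_INPUT_TOKENS}; split the text`,
      estimatedTokens,
    };
  }
  return { ok: true, estimatedTokens };
}
```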

"Multi-speaker format error"

Format: SpeakerName:VoiceName,Speaker2:Voice2
Separate speakers with commas
Use colon between speaker and voice
Example: "Joe:Kore,Jane:Puck,Host:Charon"
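A parser for this format can report exactly which entry is malformed. The sketch below trims whitespace, tolerates a trailing comma, and rejects duplicates; `parseSpeakers` is hypothetical. Note that the Gemini TTS documentation has described multi-speaker output as supporting up to two speakers, so a three-speaker mapping like the example above may be rejected by the API; check the current docs.

```javascript
/**
 * Hypothetical parser for --speakers "Joe:Kore,Jane:Puck".
 * Returns { Joe: "Kore", Jane: "Puck" } or throws a descriptive error.
 */
function parseSpeakers(spec) {
  const mapping = {};
  for (const entry of spec.split(",")) {
    const pair = entry.trim();
    if (!pair) continue; // tolerate empty entries such as a trailing comma
    const parts = pair.split(":");
    if (parts.length !== 2 || !parts[0].trim() || !parts[1].trim()) {
      throw new Error(`Multi-speaker format error: "${pair}" (expected Speaker:Voice)`);
    }
    const [speaker, voice] = parts.map((s) => s.trim());
    if (Object.prototype.hasOwnProperty.call(mapping, speaker)) {
      throw new Error(`Duplicate speaker "${speaker}"`);
    }
    mapping[speaker] = voice;
  }
  if (Object.keys(mapping).length === 0) {
    throw new Error("Multi-speaker format error: no speakers given");
  }
  return mapping;
}
```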

"Output file already exists"

The script overwrites existing files with the same name
Collisions are most likely with --no-timestamp, where the filename is fixed
Change the --output name to avoid conflicts
Use unique names for batch generation
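If overwriting must be avoided, a free name can be picked before writing by appending a counter. A sketch; `uniquePath` is hypothetical. The check-then-write sequence is not atomic, so concurrent writers would additionally need `fs.writeFileSync(path, data, { flag: "wx" })`, which fails if the file already exists.

```javascript
const fs = require("fs");
const path = require("path");

/**
 * Hypothetical helper: return a path that does not exist yet by appending
 * -1, -2, ... before the extension (episode.wav -> episode-1.wav).
 */
function uniquePath(filePath) {
  if (!fs.existsSync(filePath)) return filePath;
  const { dir, name, ext } = path.parse(filePath);
  for (let i = 1; ; i += 1) {
    const candidate = path.join(dir, `${name}-${i}${ext}`);
    if (!fs.existsSync(candidate)) return candidate;
  }
}
```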

Audio quality issues

Check input text for unusual characters
Try different voice for better pronunciation
Consider splitting long text into smaller segments
Verify audio playback software compatibility
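A cleanup pass can remove invisible or unusual characters before synthesis. The sketch applies Unicode NFKC normalization (folding, for example, full-width letters and the "ﬁ" ligature into plain ASCII), drops control characters other than tab and newline, and collapses runs of spaces while keeping line breaks intact for speaker labels. Whether this improves pronunciation for a given text is untested; `cleanTtsText` is hypothetical.

```javascript
/**
 * Hypothetical cleanup for TTS input: NFKC-normalize, strip control
 * characters (keeping \t, \n, \r), collapse spaces/tabs, and trim.
 */
function cleanTtsText(text) {
  return text
    .normalize("NFKC")
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    .replace(/[ \t]+/g, " ")
    .trim();
}
```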

Best Practices

Voice Selection

Kore: General purpose, clear articulation
Puck: Conversational, engaging tone
Charon: Professional, authoritative
Fenrir: Emotional, storytelling
Aoede: Soft, gentle for accessibility
Zephyr: Educational, clear explanations

Text Preparation

Use natural language and punctuation
Include pauses with commas and periods
Spell out difficult words if needed
Break very long text into logical segments
Add speaker labels for multi-speaker content

Performance Optimization

Use streaming for very long texts
Generate shorter segments for better control
Use the flash model for faster generation
Batch process multiple files for efficiency
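Batch processing usually means running several requests at once without exceeding API rate limits. A sketch of a small concurrency-limited runner that keeps results in input order; `task` stands for whatever performs one TTS job (for example, one of the calls sketched earlier in this document or an invocation of tts.js), and both `runBatch` and the default limit of 2 are hypothetical choices.

```javascript
/**
 * Hypothetical batch runner: at most `limit` tasks in flight at once,
 * results returned in the same order as `items`.
 */
async function runBatch(items, task, limit = 2) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const index = next++; // claim the next item (JS is single-threaded here)
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```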

Quality Tips

Test different voices for your content type
Use appropriate pacing with punctuation
Consider context when selecting voice
Listen to output before final use
Multi-speaker content requires clear speaker labels

Use Cases by Voice

Voice	Ideal Use Cases
Kore	Announcements, navigation, general info
Puck	Podcasts, interviews, casual content
Charon	Corporate, news, formal presentations
Fenrir	Audiobooks, stories, emotional content
Aoede	Accessibility, educational, gentle content
Zephyr	Tutorials, explanations, guides
Sulafat	Documentaries, factual presentations

Related Skills

gemini-text: Generate scripts and text for TTS
gemini-image: Create visuals to accompany audio
gemini-batch: Process multiple TTS requests efficiently
gemini-files: Upload audio files for processing

Quick Reference

# Basic
node scripts/tts.js "Your text here"

# Custom voice
node scripts/tts.js "Your text" --voice Puck --output audio

# Multi-speaker
node scripts/tts.js "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"

# Streaming
node scripts/tts.js "Long text..." --stream --output long

# Professional
node scripts/tts.js "Corporate announcement" --voice Charon

Reference

See references/voices.md for complete voice documentation
Get API key: https://aistudio.google.com/apikey
Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
Sample rate: 24000 Hz, widely supported by audio players and editing tools

Repository: akrindev/google…o-skills

First Seen: Jan 29, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Fail), Snyk (Pass)

Installed on: gemini-cli (32), opencode (28), codex (27), cursor (23), github-copilot (22), openclaw (21)
