baoyu-youtube-transcript by jimliu/baoyu-skills
npx skills add https://github.com/jimliu/baoyu-skills --skill baoyu-youtube-transcript
Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly.
Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.
Scripts in scripts/ subdirectory. {baseDir} = this SKILL.md's directory path. Resolve ${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun. Replace {baseDir} and ${BUN_X} with actual values.
| Script | Purpose |
|---|---|
| scripts/main.ts | Transcript download CLI |
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>
# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja
# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps
# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters
# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers
# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt
# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans
# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list
# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
| Option | Description | Default |
|---|---|---|
| <url-or-id> | YouTube URL or video ID (multiple allowed) | Required |
| --languages <codes> | Language codes, comma-separated, in priority order | en |
| --format <fmt> | Output format: text, srt | text |
| --translate <code> | Translate to the specified language code | |
| --list | List available transcripts instead of fetching | |
| --timestamps | Include [HH:MM:SS → HH:MM:SS] timestamps per paragraph | on |
| --no-timestamps | Disable timestamps | |
| --chapters | Chapter segmentation from the video description | |
| --speakers | Raw transcript with metadata for speaker identification | |
| --exclude-generated | Skip auto-generated transcripts | |
| --exclude-manually-created | Skip manually created transcripts | |
| --refresh | Force re-fetch, ignore cached data | |
| -o, --output <path> | Save to a specific file path | auto-generated |
| --output-dir <dir> | Base output directory | youtube-transcript |
Accepts any of these as video input:
- https://www.youtube.com/watch?v=dQw4w9WgXcQ
- https://youtu.be/dQw4w9WgXcQ
- https://www.youtube.com/embed/dQw4w9WgXcQ
- https://www.youtube.com/shorts/dQw4w9WgXcQ
- dQw4w9WgXcQ (bare video ID)

| Format | Extension | Description |
|---|---|---|
| text | .md | Markdown with frontmatter (incl. description), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| srt | .srt | SubRip subtitle format for video players |
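The accepted input forms above can all be reduced to a bare 11-character video ID. A minimal sketch of that normalization (a hypothetical helper for illustration, not the skill's actual code):

```typescript
// Hypothetical helper: extract an 11-character video ID from the
// accepted input forms (watch, youtu.be, embed, shorts, or bare ID).
function extractVideoId(input: string): string | null {
  // A bare 11-character ID passes through unchanged.
  if (/^[\w-]{11}$/.test(input)) return input;
  try {
    const url = new URL(input);
    // watch URLs carry the ID in the ?v= query parameter.
    const v = url.searchParams.get("v");
    if (v && /^[\w-]{11}$/.test(v)) return v;
    // youtu.be, /embed/, /shorts/ carry it as a full path segment.
    const m = url.pathname.match(/(?:^|\/)([\w-]{11})(?:\/|$)/);
    return m ? m[1] : null;
  } catch {
    return null; // neither a URL nor a bare ID
  }
}
```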
youtube-transcript/
├── .index.json # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
├── meta.json # Video metadata (title, channel, description, duration, chapters, etc.)
├── transcript-raw.json # Raw transcript snippets from YouTube API (cached)
├── transcript-sentences.json # Sentence-segmented transcript (split by punctuation, merged across snippets)
├── imgs/
│ └── cover.jpg # Video thumbnail
├── transcript.md # Markdown transcript (generated from sentences)
└── transcript.srt # SRT subtitle (generated from raw snippets, if --format srt)
- {channel-slug}: Channel name in kebab-case
- {title-full-slug}: Full video title in kebab-case

The --list mode outputs to stdout only (no file saved).
On first fetch, the script saves:
- meta.json — video metadata, chapters, cover image path, language info
- transcript-raw.json — raw transcript snippets from the YouTube API ({ text, start, duration }[])
- transcript-sentences.json — sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。?! etc.), with timestamps allocated proportionally by character length and CJK-aware text merging
- imgs/cover.jpg — video thumbnail

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force a re-fetch. If a different language is requested, the cache is refreshed automatically.
SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
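Turning a raw snippet into an SRT cue is mostly a matter of formatting times as HH:MM:SS,mmm. A minimal sketch, assuming start and duration are in seconds (function names are hypothetical):

```typescript
// Hypothetical sketch: format seconds as an SRT timecode and
// emit one SRT cue from a raw snippet.
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const frac = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  // SRT uses a comma before the milliseconds field.
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(frac, 3)}`;
}

function srtCue(index: number, text: string, start: number, duration: number): string {
  return `${index}\n${srtTime(start)} --> ${srtTime(start + duration)}\n${text}\n`;
}
```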
When the user provides a YouTube URL and wants the transcript:
- Run --list first if the user hasn't specified a language, to show the available options
- The shell treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found": quote the URL, e.g. 'https://www.youtube.com/watch?v=ID'
- Run --chapters --speakers for the richest output (chapters + speaker identification)
- In --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
- When the user only wants a cover image or metadata, running the script with any option also caches meta.json and imgs/cover.jpg
When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.
Chapter segmentation (--chapters): the script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript at chapter boundaries, groups snippets into readable paragraphs, and saves the result as .md with a Table of Contents. No further processing is needed.
If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
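Chapter lines in a description typically look like "0:00 Introduction" or "1:02:30 Q&A". A parsing sketch under that assumed pattern (the skill's exact matching rules may differ):

```typescript
// Hypothetical sketch: parse "m:ss" / "h:mm:ss" chapter lines from
// a video description into { title, start } entries (start in seconds).
interface Chapter { title: string; start: number; }

function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split("\n")) {
    // Optional hours group, then minutes:seconds, then the chapter title.
    const m = line.match(/^(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+)$/);
    if (!m) continue;
    const [, h, min, sec, title] = m;
    chapters.push({
      title: title.trim(),
      start: (h ? parseInt(h, 10) * 3600 : 0) + parseInt(min, 10) * 60 + parseInt(sec, 10),
    });
  }
  return chapters;
}
```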
Speaker identification (--speakers): speaker identification requires AI processing. The script outputs a raw .md file with the transcript and the metadata needed for post-processing.
After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:
1. Read the raw .md file
2. Apply the prompt template at {baseDir}/prompts/speaker-transcript.md
3. Format the transcript with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
4. Overwrite the .md file with the processed transcript (keep the YAML frontmatter)

When --speakers is used, --chapters is implied: the processed output always includes chapter segmentation.
| Error | Meaning |
|---|---|
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |
Weekly Installs: 1.0K
Repository: https://github.com/jimliu/baoyu-skills
GitHub Stars: 10.8K
First Seen: 2 days ago
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: opencode (999), gemini-cli (998), codex (998), github-copilot (997), amp (997), kimi-cli (996)