Gemini API audio/video transcription tool: supports YouTube URLs and local files, outputs structured Markdown captions

omnicaptions-transcribe by lattifai/omni-captions-skills

103 weekly installs

22 GitHub Stars

Install command

npx skills add https://github.com/lattifai/omni-captions-skills --skill omnicaptions-transcribe

Tags: AI/Machine Learning, Content Creation, Audio Processing

🇺🇸English

Gemini Transcription

Transcribe audio/video using Google Gemini API with structured markdown output.

YouTube Video Workflow

Important: Check for existing captions before transcribing:

1. Check captions: yt-dlp --list-subs "URL"
2. Captions exist → use /omnicaptions:download to get the existing captions (better quality)
3. No captions → transcribe directly from the URL (do not download first)

Confirm with the user: Before transcribing, ask whether they want to check for existing captions first.
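The decision in the steps above can be sketched in Python. This is a minimal illustration, not part of omnicaptions: the function names are hypothetical, and the marker text it searches for ("Available subtitles for") is an assumption about what yt-dlp prints for `--list-subs`, so verify it against your yt-dlp version. Automatic captions are deliberately ignored here, which is also an assumption.

```python
import subprocess


def has_existing_captions(list_subs_output: str) -> bool:
    """Return True if the --list-subs output reports uploaded subtitles.

    Assumption: yt-dlp prints a line containing "Available subtitles for"
    when subtitles exist. Automatic captions are not counted.
    """
    return "Available subtitles for" in list_subs_output


def choose_action(list_subs_output: str) -> str:
    """Map the caption check onto the two branches of the workflow."""
    if has_existing_captions(list_subs_output):
        return "download"    # use /omnicaptions:download
    return "transcribe"      # pass the URL straight to omnicaptions transcribe


def list_subs(url: str) -> str:
    """Run the check from step 1 and return its combined output."""
    result = subprocess.run(
        ["yt-dlp", "--list-subs", url],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr
```

In practice, `choose_action(list_subs(url))` gives the branch to take once the user has agreed to the caption check.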

URL & Local File Support

Gemini natively supports YouTube URLs, so there is no need to download the video; pass the URL directly:

# YouTube URL (recommended, no download needed)
omnicaptions transcribe "https://www.youtube.com/watch?v=VIDEO_ID"

# Local files
omnicaptions transcribe video.mp4

Note: Output defaults to the current directory unless the user specifies -o.

When to Use

Video URLs - YouTube, direct video links (Gemini native support)
Transcribing podcasts, interviews, lectures
Need verbatim transcript with timestamps and speaker labels
Want auto-generated chapters from content
Mixed-language audio (code-switching preserved)

When NOT to Use

Video has existing captions - Use /omnicaptions:download to get existing captions first
Need real-time streaming transcription (use Whisper)
Audio >2 hours (Gemini upload limit)
Want translation instead of transcription

Quick Reference

Method	Description
`transcribe(path)`	Transcribe file or URL (sync)
`translate(in, out, lang)`	Translate captions
`write(text, path)`	Save text to file

Setup

pip install omni-captions-skills --extra-index-url https://lattifai.github.io/pypi/simple/

API Key

Priority: GEMINI_API_KEY environment variable → .env file → ~/.config/omnicaptions/config.json

If the key is not set, ask the user: Please enter your Gemini API key (get from https://aistudio.google.com/apikey):

Then run with -k <key>. The key is saved to the config file automatically.
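The lookup order above can be sketched as follows. This is an illustration of the stated priority only, not the tool's actual implementation: the function names are hypothetical, and the JSON field name `gemini_api_key` is an assumption, since the source does not describe the layout of config.json.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG = Path.home() / ".config" / "omnicaptions" / "config.json"


def read_dotenv_key(dotenv_path: Path, name: str = "GEMINI_API_KEY") -> Optional[str]:
    """Return the value of NAME=... from a .env file, or None if absent."""
    if not dotenv_path.is_file():
        return None
    for line in dotenv_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'") or None
    return None


def read_config_key(config_path: Path, field: str = "gemini_api_key") -> Optional[str]:
    """Return the key stored in the JSON config (field name is hypothetical)."""
    if not config_path.is_file():
        return None
    try:
        data = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
    value = data.get(field) if isinstance(data, dict) else None
    return value or None


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    dotenv_path: Path = Path(".env"),
    config_path: Path = DEFAULT_CONFIG,
) -> Optional[str]:
    """Apply the priority: environment variable, then .env, then config.json."""
    return (
        env.get("GEMINI_API_KEY")
        or read_dotenv_key(dotenv_path)
        or read_config_key(config_path)
    )
```

If `resolve_api_key()` returns `None`, that is the point at which to ask the user for a key and pass it with `-k`.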

CLI Usage

Important: The CLI requires a subcommand (transcribe, translate, convert).

# Transcribe (auto-output to same directory)
omnicaptions transcribe video.mp4              # → ./video_GeminiUnd.md
omnicaptions transcribe "https://youtu.be/abc" # → ./abc_GeminiUnd.md

# Specify output file or directory
omnicaptions transcribe video.mp4 -o output/   # → output/video_GeminiUnd.md
omnicaptions transcribe video.mp4 -o my.md     # → my.md

# Options
omnicaptions transcribe -m gemini-3-pro-preview video.mp4
omnicaptions transcribe -l zh video.mp4  # Force Chinese

Option	Description
`-k, --api-key`	Gemini API key (auto-prompted if missing)
`-o, --output`	Output file or directory (default: auto)
`-m, --model`	Model (default: gemini-3-flash-preview)
`-l, --language`	Force language (zh, en, ja)
`-t, --translate LANG`	Translate to language (one-step)
`--bilingual`	Bilingual output (use with -t)
`-v, --verbose`	Verbose output
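The output naming shown in the CLI examples above (a `_GeminiUnd.md` suffix, with `-o` accepting either a directory or a file) can be sketched like this. The function names are hypothetical and the rules are inferred from the examples only. Two points are assumptions: that a `watch?v=` URL is named after its video ID, and that default output always lands in the current directory.

```python
from pathlib import Path, PurePosixPath
from typing import Optional
from urllib.parse import parse_qs, urlparse

SUFFIX = "_GeminiUnd.md"


def source_stem(source: str) -> str:
    """Derive the base name for a local file or a video URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        video_id = parse_qs(parsed.query).get("v")
        if video_id:
            # Assumption: watch?v=ID URLs are named after the ID.
            return video_id[0]
        # https://youtu.be/abc -> "abc", matching the documented example.
        return PurePosixPath(parsed.path).stem or parsed.netloc
    return Path(source).stem


def output_path(source: str, output: Optional[str] = None) -> Path:
    """Mirror the documented -o behaviour: directory, explicit file, or default."""
    name = source_stem(source) + SUFFIX
    if output is None:
        return Path(name)              # current directory
    if output.endswith(("/", "\\")) or Path(output).is_dir():
        return Path(output) / name     # -o output/
    return Path(output)                # -o my.md
```

For example, `output_path("video.mp4", "output/")` yields `output/video_GeminiUnd.md`, the same path the CLI example shows.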

Bilingual Captions (Optional)

If the user requests bilingual output, add -t <lang> --bilingual:

omnicaptions transcribe video.mp4 -t zh --bilingual

For precise timing, use a separate workflow: transcribe → LaiCut → translate (see Related Skills).

Output Format

## Table of Contents
* [00:00:00] Introduction
* [00:02:15] Main Topic

## [00:00:00] Introduction

**Host:** Welcome to the show. [00:00:01]

**Guest:** Thanks for having me. [00:00:05]

[Applause] [00:00:08]

Key features:

`## [HH:MM:SS] Title` chapter headers
`**Speaker:**` labels (auto-detected)
`[HH:MM:SS]` timestamp at paragraph end
`[Event]` for non-speech (laughter, music)
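Because the format is regular, the transcript is straightforward to post-process. The sketch below parses the layout described above into segments; it is not part of omnicaptions, the names `Segment` and `parse_transcript` are hypothetical, and it assumes each paragraph sits on one line and ends with its timestamp, as in the sample.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# "## [HH:MM:SS] Title" chapter headers
CHAPTER_RE = re.compile(r"^## \[(\d{2}:\d{2}:\d{2})\] (.+)$")
# Optional "**Speaker:** " prefix, then text, then a trailing "[HH:MM:SS]"
LINE_RE = re.compile(
    r"^(?:\*\*(?P<speaker>[^*]+?):\*\* )?(?P<text>.+?) ?\[(?P<ts>\d{2}:\d{2}:\d{2})\]$"
)


@dataclass
class Segment:
    chapter: Optional[str]
    speaker: Optional[str]   # None for non-speech events such as [Applause]
    text: str
    seconds: int


def to_seconds(timestamp: str) -> int:
    """Convert HH:MM:SS to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_transcript(markdown: str) -> List[Segment]:
    """Collect timestamped paragraphs, tagging each with its chapter title."""
    segments: List[Segment] = []
    chapter: Optional[str] = None
    for raw in markdown.splitlines():
        line = raw.strip()
        chapter_match = CHAPTER_RE.match(line)
        if chapter_match:
            chapter = chapter_match.group(2)
            continue
        line_match = LINE_RE.match(line)
        if line_match:
            segments.append(
                Segment(
                    chapter=chapter,
                    speaker=line_match.group("speaker"),
                    text=line_match.group("text").strip(),
                    seconds=to_seconds(line_match.group("ts")),
                )
            )
    return segments
```

Table-of-contents entries are skipped because their timestamp comes at the start of the line rather than the end.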

Common Mistakes

Mistake	Fix
No API key error	Use `-k YOUR_KEY` or follow the prompt
Empty response	Check file format (mp3/mp4/wav/m4a supported)
Upload timeout	File too large (>2GB); split first
Wrong language	Use `-l en` to force language
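Two of the mistakes above (unsupported format, oversized file) can be caught before any upload. The following pre-flight check is a hypothetical helper, not something omnicaptions provides. It takes the extension list and the 2 GB threshold from the table, and reading "2GB" as 2 × 1024³ bytes is an assumption.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a"}
# Assumption: the ">2GB" in the table is interpreted as 2 GiB.
MAX_UPLOAD_BYTES = 2 * 1024**3


def preflight(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> List[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    file_path = Path(path)
    if not file_path.is_file():
        return [f"{path}: file not found"]

    problems: List[str] = []
    if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(
            f"{path}: unsupported format {file_path.suffix or '(none)'}; "
            "expected mp3/mp4/wav/m4a"
        )
    if file_path.stat().st_size > max_bytes:
        problems.append(f"{path}: larger than {max_bytes} bytes; split the file first")
    return problems
```

Running `preflight("video.mp4")` before `omnicaptions transcribe` surfaces these two problems locally instead of as an empty response or an upload timeout.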

Related Skills

Skill	Use When
`/omnicaptions:convert`	Convert output to SRT/VTT/ASS
`/omnicaptions:translate`	Translate (Gemini API or Claude native)
`/omnicaptions:download`	Download video/audio first

Workflow Examples

# Basic transcription
omnicaptions transcribe video.mp4
# → video_GeminiUnd.md

# Precise timing needed: transcribe → LaiCut align → convert
omnicaptions transcribe video.mp4
omnicaptions LaiCut video.mp4 video_GeminiUnd.md
# → video_GeminiUnd_LaiCut.json
omnicaptions convert video_GeminiUnd_LaiCut.json -o video_GeminiUnd_LaiCut.srt

Note: For translation, use /omnicaptions:translate (default: Claude, optional: Gemini API).
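The precise-timing workflow above can also be driven from a script. This sketch only assembles and runs the three commands exactly as they appear in the example; the helper names are hypothetical, and it assumes the intermediate files land in the current directory with the names shown.

```python
import subprocess
from pathlib import Path
from typing import List


def precise_timing_commands(media: str) -> List[List[str]]:
    """Build the transcribe -> LaiCut -> convert command sequence for one file."""
    stem = Path(media).stem
    transcript = f"{stem}_GeminiUnd.md"
    aligned = f"{stem}_GeminiUnd_LaiCut.json"
    subtitles = f"{stem}_GeminiUnd_LaiCut.srt"
    return [
        ["omnicaptions", "transcribe", media],
        ["omnicaptions", "LaiCut", media, transcript],
        ["omnicaptions", "convert", aligned, "-o", subtitles],
    ]


def run_pipeline(media: str) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv in precise_timing_commands(media):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to print or review the commands before anything is run.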

Weekly Installs

103

Repository

lattifai/omni-c…s-skills

GitHub Stars

22

First Seen

Jan 24, 2026

Security Audits

Gen Agent Trust Hub: Warn
Socket: Pass
Snyk: Fail

Installed on

claude-code: 81
gemini-cli: 49
opencode: 49
codex: 45
cursor: 37
antigravity: 36
