faster-whisper：本地语音转文本工具，速度提升4-6倍，支持多语言转录与字幕生成

faster-whisper by theplasmak/faster-whisper

828 周安装量

4 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/theplasmak/faster-whisper --skill faster-whisper

AI/机器学习音频处理生产力

🇨🇳中文介绍

Faster Whisper

使用 faster-whisper 进行本地语音转文本——这是 OpenAI Whisper 的 CTranslate2 重新实现，在保持相同准确性的前提下，运行速度快 4-6 倍。借助 GPU 加速，可实现约 20 倍实时的转录速度（一个 10 分钟的音频文件约需 30 秒）。

使用场景

在以下场景中使用此技能：

转录音频/视频文件 —— 会议、访谈、播客、讲座、YouTube 视频
生成字幕 —— SRT、VTT、ASS、LRC 或 TTML 广播标准字幕
识别说话人 —— 通过声纹分离标记谁说了什么 (--diarize)
转录 URL 内容 —— YouTube 链接和直接音频 URL（通过 yt-dlp 自动下载）
转录播客源 —— --rss <feed-url> 获取并转录剧集
批量处理文件 —— 支持通配符模式、目录、跳过已存在文件；自动显示预计完成时间
本地语音转文本 —— 无需 API 费用，可离线工作（模型下载后）
翻译成英语 —— 使用 --translate 将任何语言翻译成英语
多语言转录 —— 支持 99 种以上语言，自动检测
批量转录不同语言的文件 —— --language-map 为每个文件分配不同的语言

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

相关 Skills

find-skills 技能搜索工具 - Vercel Labs 开源智能体技能包管理器

733,500 周安装

Vercel Web界面规范检查工具 - 自动检测代码是否符合Web设计指南

202,600 周安装

agent-browser 浏览器自动化工具 - Vercel Labs 命令行网页操作与测试

133,200 周安装

React 组合模式指南：Vercel 组件架构最佳实践，提升代码可维护性

102,200 周安装

转录多语言音频 —— --multilingual 用于混合语言音频

转录包含特定术语的音频 —— 使用 --initial-prompt 处理专业术语密集的内容或任何其他需要注意的术语

预处理嘈杂音频（转录前） —— 转录前使用 --normalize 和 --denoise

流式输出 —— --stream 在转录时显示片段

剪辑时间范围 —— --clip-timestamps 转录特定部分

搜索转录文本 —— --search "term" 查找单词/短语出现的所有时间戳

检测章节 —— --detect-chapters 根据静默间隙查找章节断点

导出说话人音频 —— --export-speakers DIR 将每个说话人的发言保存为单独的 WAV 文件

电子表格输出 —— --format csv 生成带时间戳且正确引用的 CSV

触发短语： "转录此音频"、"语音转文本"、"他们说了什么"、"制作转录稿"、"音频转文本"、"为视频加字幕"、"谁在说话"、"翻译此音频"、"翻译成英语"、"查找 X 被提及的位置"、"搜索转录稿中的"、"他们什么时候说的"、"在什么时间戳"、"添加章节"、"检测章节"、"查找音频中的断点"、"此录音的目录"、"TTML 字幕"、"DFXP 字幕"、"广播格式字幕"、"Netflix 格式"、"ASS 字幕"、"aegisub 格式"、"advanced substation alpha"、"mpv 字幕"、"LRC 字幕"、"带时间戳的歌词"、"卡拉 OK 字幕"、"音乐播放器歌词"、"HTML 转录稿"、"置信度着色转录稿"、"彩色编码转录稿"、"按说话人分离音频"、"导出说话人音频"、"按说话人分割"、"CSV 格式转录稿"、"电子表格输出"、"转录播客"、"播客 RSS 源"、"批量不同语言"、"每个文件的语言"、"以多种格式转录"、"同时生成 srt 和 txt"、"同时输出 srt 和文本"、"移除填充词"、"清理 um 和 uh"、"去除犹豫音"、"移除 you know 和 I mean"、"转录左声道"、"转录右声道"、"立体声声道"、"仅左声道"、"字幕换行"、"每行字符限制"、"每行字幕最大字符数"、"检测段落"、"段落分隔"、"分组为段落"、"添加段落间距"

⚠️ 智能体指导原则 —— 保持调用简洁：

核心规则：默认命令 (./scripts/transcribe audio.mp3) 是最快路径 —— 仅在用户明确要求该功能时才添加标志。

仅当用户询问"谁说了什么" / "识别说话人" / "标记说话人"时才添加 --diarize
仅当用户要求特定格式的字幕/字幕时才添加 --format srt/vtt/ass/lrc/ttml
仅当用户要求 CSV 或电子表格输出时才添加 --format csv
仅当用户需要词级时间戳时才添加 --word-timestamps
仅当有特定领域的术语需要引导时才添加 --initial-prompt
仅当用户希望将非英语音频翻译成英语时才添加 --translate
仅当用户提到音频质量差或噪音时才添加 --normalize/--denoise
仅当用户希望对长文件进行实时/渐进式输出时才添加 --stream
仅当用户需要特定时间范围时才添加 --clip-timestamps
仅当模型在音乐/静默上产生幻觉时才添加 --temperature 0.0
仅当 VAD 过度切割语音或包含噪音时才添加 --vad-threshold
仅当您知道说话人数量时才添加 --min-speakers/--max-speakers
仅当令牌未缓存在 ~/.cache/huggingface/token 时才添加 --hf-token
仅当需要提高长片段字幕的可读性时才添加 --max-words-per-line
仅当转录稿包含明显伪影（音乐标记、重复）时才添加 --filter-hallucinations
仅当用户要求句子级字幕提示时才添加 --merge-sentences
仅当用户要求移除填充词（um, uh, you know, I mean, 犹豫音）时才添加 --clean-filler
仅当用户提到立体声轨道、双声道录音或要求特定声道时才添加 --channel left|right
仅当用户指定每行字幕的字符限制（例如，"Netflix 格式"、"每行 42 个字符"）时才添加 --max-chars-per-line N；优先级高于 --max-words-per-line
仅当用户要求段落分隔或结构化文本输出时才添加 --detect-paragraphs；仅当他们想要自定义间隔时才添加 --paragraph-gap（默认 3.0 秒）
仅当用户提供真实姓名来替换 SPEAKER_1/2 时才添加 --speaker-names "Alice,Bob" —— 始终需要 --diarize
仅当用户指定了 --initial-prompt 无法很好处理的特定罕见术语时才添加 --hotwords WORDS；对于一般领域术语，优先使用 --initial-prompt
仅当用户确切知道音频开头的单词时才添加 --prefix TEXT
仅当用户只想识别语言而不转录时才添加 --detect-language-only
仅当用户要求性能统计、RTF 或基准测试信息时才添加 --stats-file PATH
仅针对大型 CPU 批量作业添加 --parallel N；GPU 本身能高效处理单个文件 —— 对于单个文件或小批量不要添加
仅针对不可靠输入（URL、网络文件）且预计会出现瞬时故障时才添加 --retries N
仅当用户明确要求将字幕嵌入/烧录到视频中时才添加 --burn-in OUTPUT；需要 ffmpeg 和视频文件输入
仅当用户可能重新处理同一 URL 以避免重复下载时才添加 --keep-temp
仅在批量模式下用户指定自定义命名模式时才添加 --output-template
多格式输出 (--format srt,text)：仅当用户明确希望一次生成多种格式时使用；始终与 -o <dir> 配对使用
任何词级功能都会自动运行 wav2vec2 对齐（约 5-10 秒开销）
--diarize 在此基础上增加约 20-30 秒

仅当用户要求在音频中查找/定位/搜索特定单词或短语时才添加 --search "term"
--search 会替换正常的转录输出 —— 它只打印带有时间戳的匹配片段
仅当用户提到近似/部分匹配或拼写错误时才添加 --search-fuzzy
要将搜索结果保存到文件，请使用 -o results.txt

仅当用户要求章节、部分、目录或"话题在哪里改变"时才添加 --detect-chapters
默认 --chapter-gap 8（8 秒静默 = 新章节）适用于大多数播客/讲座；对于密集内容可调低
--chapter-format youtube（默认）输出 YouTube 就绪的时间戳；使用 json 用于编程用途
当结合章节与转录输出时，始终使用 --chapters-file PATH —— 避免将章节标记混入转录文本
如果用户只想要章节（不要转录稿），请使用 -o /dev/null 将标准输出重定向到文件，并使用 --chapters-file
批量模式限制： --chapters-file 接受单个路径 —— 在批量模式下，每个文件的章节会覆盖前一个。对于批量章节检测，请省略 --chapters-file（章节会打印到标准输出，位于 === CHAPTERS (N) === 下）或为每个文件单独运行

说话人音频导出：

仅当用户明确要求将每个说话人的音频单独保存时才添加 --export-speakers DIR
始终与 --diarize 配对使用 —— 如果没有说话人标签，它会静默跳过
需要 ffmpeg；输出 SPEAKER_1.wav、SPEAKER_2.wav 等（如果设置了 --speaker-names 则为真实姓名）

仅在批量模式下，当用户确认文件间语言不同时才添加 --language-map
内联格式："interview*.mp3=en,lecture*.mp3=fr" —— 对文件名进行 fnmatch 通配符匹配
JSON 文件格式：@/path/to/map.json，其中文件内容为 {"pattern": "lang_code"}

仅当用户提供播客 RSS 源 URL 时才添加 --rss URL
默认获取最新的 5 个剧集；--rss-latest 0 获取所有；--skip-existing 安全地恢复
始终与 -o <dir> 一起使用 —— 没有它，所有剧集的转录稿会连接打印到标准输出，难以使用；设置 -o <dir> 后，每个剧集都会有自己的文件

智能体中继的输出格式：

搜索结果 (--search) → 直接打印给用户；输出是人类可读的
章节输出 → 如果没有 --chapters-file，章节会出现在标准输出中，位于转录稿后的 === CHAPTERS (N) === 标题下；使用 --format json 时，章节也会嵌入 JSON 的 "chapters" 键下
字幕格式 (SRT, VTT, ASS, LRC, TTML) → 始终写入 -o 文件；告诉用户输出路径，切勿粘贴原始字幕内容
数据格式 (CSV, HTML, TTML, JSON) → 始终写入 -o 文件；告诉用户输出路径，不要粘贴原始 XML/CSV/HTML
ASS 格式 → 用于 Aegisub, VLC, mpv；写入文件并告诉用户可以在 Aegisub 中打开或在 VLC/mpv 中播放
LRC 格式 → 用于音乐播放器（Foobar2000, AIMP, VLC）的带时间戳歌词；写入文件
多格式 (--format srt,text) → 需要 -o <dir>；每种格式写入单独的文件；告诉用户所有写入的路径
JSON 格式 → 适用于编程后处理；不适合完整粘贴给用户
文本/转录稿 → 对于短文件，可以直接安全地显示给用户；对于长文件进行总结
统计输出 (--stats-file) → 为用户总结关键字段（时长、处理时间、RTF），而不是粘贴原始 JSON
语言检测 (--detect-language-only) → 直接打印结果；它是单行
预计完成时间 对于批量作业会自动打印到标准错误；无需操作

何时不使用：

没有本地计算能力的纯云端环境
小于 10 秒的文件，此时 API 调用延迟无关紧要

faster-whisper 与 whisperx： 此技能涵盖了 whisperx 的所有功能 —— 声纹分离 (--diarize)、词级时间戳 (--word-timestamps)、SRT/VTT 字幕 —— 因此不需要 whisperx。仅当您特别需要其 pyannote 流水线或此处未涵盖的批量 GPU 功能时才使用 whisperx。

任务	命令	备注
基本转录	`./scripts/transcribe audio.mp3`	批量推理，VAD 开启，distil-large-v3.5
SRT 字幕	`./scripts/transcribe audio.mp3 --format srt -o subs.srt`	自动启用词级时间戳
VTT 字幕	`./scripts/transcribe audio.mp3 --format vtt -o subs.vtt`	WebVTT 格式
词级时间戳	`./scripts/transcribe audio.mp3 --word-timestamps --format srt`	wav2vec2 对齐（约 10 毫秒）
说话人声纹分离	`./scripts/transcribe audio.mp3 --diarize`	需要 pyannote.audio
翻译 → 英语	`./scripts/transcribe audio.mp3 --translate`	任何语言 → 英语
流式输出	`./scripts/transcribe audio.mp3 --stream`	实时显示转录片段
剪辑时间范围	`./scripts/transcribe audio.mp3 --clip-timestamps "30,60"`	仅 30 秒–60 秒
降噪 + 归一化	`./scripts/transcribe audio.mp3 --denoise --normalize`	先清理嘈杂音频
减少幻觉	`./scripts/transcribe audio.mp3 --hallucination-silence-threshold 1.0`	跳过幻觉产生的静默
YouTube/URL	`./scripts/transcribe https://youtube.com/watch?v=...`	通过 yt-dlp 自动下载
批量处理	`./scripts/transcribe *.mp3 -o ./transcripts/`	输出到目录
批量处理并跳过	`./scripts/transcribe *.mp3 --skip-existing -o ./out/`	恢复中断的批量任务
领域术语	`./scripts/transcribe audio.mp3 --initial-prompt 'Kubernetes gRPC'`	提升罕见术语识别
热词提升	`./scripts/transcribe audio.mp3 --hotwords 'JIRA Kubernetes'`	使解码器偏向特定单词
前缀条件	`./scripts/transcribe audio.mp3 --prefix 'Good morning,'`	用已知的开场词引导第一个片段
固定模型版本	`./scripts/transcribe audio.mp3 --revision v1.2.0`	使用固定的修订版进行可重现的转录
调试库日志	`./scripts/transcribe audio.mp3 --log-level debug`	显示 faster_whisper 内部日志
Turbo 模型	`./scripts/transcribe audio.mp3 -m turbo`	large-v3-turbo 的别名
更快的英语转录	`./scripts/transcribe audio.mp3 --model distil-medium.en -l en`	仅限英语，快 6.8 倍
最高准确度	`./scripts/transcribe audio.mp3 --model large-v3 --beam-size 10`	完整模型
JSON 输出	`./scripts/transcribe audio.mp3 --format json -o out.json`	编程访问，含统计信息
过滤噪音	`./scripts/transcribe audio.mp3 --min-confidence 0.6`	丢弃低置信度片段
混合量化	`./scripts/transcribe audio.mp3 --compute-type int8_float16`	节省 VRAM，质量损失最小
减小批大小	`./scripts/transcribe audio.mp3 --batch-size 4`	如果 GPU 内存不足
TSV 输出	`./scripts/transcribe audio.mp3 --format tsv -o out.tsv`	OpenAI Whisper 兼容的 TSV
修复幻觉	`./scripts/transcribe audio.mp3 --temperature 0.0 --no-speech-threshold 0.8`	锁定温度 + 跳过静默
调整 VAD 灵敏度	`./scripts/transcribe audio.mp3 --vad-threshold 0.6 --min-silence-duration 500`	更严格的语音检测
已知说话人数量	`./scripts/transcribe meeting.wav --diarize --min-speakers 2 --max-speakers 3`	约束声纹分离
字幕词换行	`./scripts/transcribe audio.mp3 --format srt --word-timestamps --max-words-per-line 8`	分割长字幕提示
私有/门控模型	`./scripts/transcribe audio.mp3 --hf-token hf_xxx`	直接传递令牌
显示版本	`./scripts/transcribe --version`	打印 faster-whisper 版本
原地升级	`./setup.sh --update`	无需完全重新安装即可升级
系统检查	`./setup.sh --check`	验证 GPU、Python、ffmpeg、venv、yt-dlp、pyannote
仅检测语言	`./scripts/transcribe audio.mp3 --detect-language-only`	快速语言识别，不转录
JSON 格式语言检测	`./scripts/transcribe audio.mp3 --detect-language-only --format json`	机器可读的语言检测
LRC 字幕	`./scripts/transcribe audio.mp3 --format lrc -o lyrics.lrc`	用于音乐播放器的带时间戳歌词格式
ASS 字幕	`./scripts/transcribe audio.mp3 --format ass -o subtitles.ass`	Advanced SubStation Alpha (Aegisub, mpv, VLC)
合并句子	`./scripts/transcribe audio.mp3 --format srt --merge-sentences`	将片段合并为句子块
统计侧文件	`./scripts/transcribe audio.mp3 --stats-file stats.json`	转录后写入性能统计 JSON
批量统计	`./scripts/transcribe *.mp3 --stats-file ./stats/`	目录中每个输入一个统计文件
模板命名	`./scripts/transcribe audio.mp3 -o ./out/ --output-template "{stem}_{lang}.{ext}"`	自定义批量输出文件名
标准输入输入	`ffmpeg -i input.mp4 -f wav -	./scripts/transcribe -`
自定义模型目录	`./scripts/transcribe audio.mp3 --model-dir ~/my-models`	自定义 HuggingFace 缓存目录
本地模型	`./scripts/transcribe audio.mp3 -m ./my-model-ct2`	CTranslate2 模型目录
HTML 转录稿	`./scripts/transcribe audio.mp3 --format html -o out.html`	置信度着色
烧录字幕	`./scripts/transcribe video.mp4 --burn-in output.mp4`	需要 ffmpeg + 视频输入
命名说话人	`./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"`	替换 SPEAKER_1/2
过滤幻觉	`./scripts/transcribe audio.mp3 --filter-hallucinations`	移除伪影
保留临时文件	`./scripts/transcribe https://... --keep-temp`	用于 URL 重新处理
并行批量	`./scripts/transcribe *.mp3 --parallel 4 -o ./out/`	CPU 多文件
推荐 RTX 3070	`./scripts/transcribe audio.mp3 --compute-type int8_float16`	节省约 1GB VRAM，质量损失最小
CPU 线程数	`./scripts/transcribe audio.mp3 --threads 8`	强制 CPU 线程数（默认：自动）
播客 RSS（最新 5 个）	`./scripts/transcribe --rss https://feeds.example.com/podcast.xml`	下载并转录最新的 5 个剧集
播客 RSS（所有剧集）	`./scripts/transcribe --rss https://... --rss-latest 0 -o ./episodes/`	所有剧集，每个一个文件
播客 + SRT 字幕	`./scripts/transcribe --rss https://... --format srt -o ./subs/`	为所有剧集生成字幕
失败重试	`./scripts/transcribe *.mp3 --retries 3 -o ./out/`	出错时最多重试 3 次，带退避
CSV 输出	`./scripts/transcribe audio.mp3 --format csv -o out.csv`	电子表格就绪，带标题行；正确引用
带说话人的 CSV	`./scripts/transcribe audio.mp3 --diarize --format csv -o out.csv`	添加说话人列
语言映射（内联）	`./scripts/transcribe .mp3 --language-map "interview.mp3=en,lecture.wav=fr"`	批量中每个文件的语言
语言映射（JSON）	`./scripts/transcribe *.mp3 --language-map @langs.json`	JSON 文件：{"pattern": "lang"}
批量显示预计完成时间	`./scripts/transcribe *.mp3 -o ./out/`	批量中每个文件自动显示预计完成时间
TTML 字幕	`./scripts/transcribe audio.mp3 --format ttml -o subtitles.ttml`	广播标准 DFXP/TTML (Netflix, BBC, Amazon)
带说话人标签的 TTML	`./scripts/transcribe audio.mp3 --diarize --format ttml -o subtitles.ttml`	带说话人标签的 TTML
搜索转录稿	`./scripts/transcribe audio.mp3 --search "keyword"`	查找关键词出现的时间戳
搜索到文件	`./scripts/transcribe audio.mp3 --search "keyword" -o results.txt`	保存搜索结果
模糊搜索	`./scripts/transcribe audio.mp3 --search "aproximate" --search-fuzzy`	近似/部分匹配
检测章节	`./scripts/transcribe audio.mp3 --detect-chapters`	从静默间隙自动检测章节
章节间隔调整	`./scripts/transcribe audio.mp3 --detect-chapters --chapter-gap 5`	间隔 ≥5 秒时开始新章节（默认：8 秒）
章节到文件	`./scripts/transcribe audio.mp3 --detect-chapters --chapters-file ch.txt`	保存 YouTube 格式章节列表
章节 JSON	`./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json`	机器可读的章节列表
导出说话人音频	`./scripts/transcribe audio.mp3 --diarize --export-speakers ./speakers/`	将每个说话人的音频保存到单独的 WAV 文件
多格式输出	`./scripts/transcribe audio.mp3 --format srt,text -o ./out/`	一次运行写入 SRT + TXT
移除填充词	`./scripts/transcribe audio.mp3 --clean-filler`	去除 um/uh/er/ah/hmm 等犹豫音和话语标记
仅左声道	`./scripts/transcribe audio.mp3 --channel left`	转录前提取左立体声声道
仅右声道	`./scripts/transcribe audio.mp3 --channel right`	转录前提取右立体声声道
每行最大字符数	`./scripts/transcribe audio.mp3 --format srt --max-chars-per-line 42`	基于字符的字幕换行
检测段落	`./scripts/transcribe audio.mp3 --detect-paragraphs`	在文本输出中插入段落分隔
段落间隔调整	`./scripts/transcribe audio.mp3 --detect-paragraphs --paragraph-gap 5.0`	调整间隔阈值（默认 3.0 秒）

根据需求选择合适的模型：

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="开始", shape=doublecircle];
    need_accuracy [label="需要最高\n准确度？", shape=diamond];
    multilingual [label="多语言\n内容？", shape=diamond];
    resource_constrained [label="资源\n受限？", shape=diamond];

    large_v3 [label="large-v3\n或\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3.5\n(默认)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="是"];
    need_accuracy -> multilingual [label="否"];
    multilingual -> large_turbo [label="是"];
    multilingual -> resource_constrained [label="否 (英语)"];
    resource_constrained -> distil_small [label="移动/边缘设备"];
    resource_constrained -> distil_medium [label="有一定限制"];
    resource_constrained -> distil_large [label="否"];
}

标准模型（完整 Whisper）

模型	大小	速度	准确度	使用场景
`tiny` / `tiny.en`	39M	最快	基础	快速草稿
`base` / `base.en`	74M	非常快	良好	一般用途
`small` / `small.en`	244M	快	更好	大多数任务
`medium` / `medium.en`	769M	中等	高	高质量转录
`large-v1/v2/v3`	1.5GB	较慢	最佳	最高准确度
`large-v3-turbo`	809M	快	优秀	高准确度（比 distil 慢）

蒸馏模型（约快 6 倍，WER 差异约 1%）

模型	大小	相对于标准模型的速度	准确度	使用场景
`distil-large-v3.5`	756M	约快 6.3 倍	7.08% WER	默认，最佳平衡
`distil-large-v3`	756M	约快 6.3 倍	7.53% WER	之前的默认值
`distil-large-v2`	756M	约快 5.8 倍	10.1% WER	备用
`distil-medium.en`	394M	约快 6.8 倍	11.1% WER	仅限英语，资源受限
`distil-small.en`	166M	约快 5.6 倍	12.1% WER	移动/边缘设备

.en 模型仅限英语，对于英语内容稍快/稍好。

关于蒸馏模型的注意事项： HuggingFace 建议为所有蒸馏模型禁用 condition_on_previous_text 以防止重复循环。脚本在检测到 distil-* 模型时会自动应用 --no-condition-on-previous-text。如果需要，可以传递 --condition-on-previous-text 来覆盖。

自定义和微调模型

WhisperModel 接受本地 CTranslate2 模型目录和 HuggingFace 仓库名称 —— 无需更改代码。

加载本地 CTranslate2 模型

./scripts/transcribe audio.mp3 --model /path/to/my-model-ct2

将 HuggingFace 模型转换为 CTranslate2

pip install ctranslate2
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16
./scripts/transcribe audio.mp3 --model ./whisper-large-v3-ct2

通过 HuggingFace 仓库名称加载模型（自动下载）

./scripts/transcribe audio.mp3 --model username/whisper-large-v3-ct2

自定义模型缓存目录

默认情况下，模型缓存在 ~/.cache/huggingface/。使用 --model-dir 覆盖：

./scripts/transcribe audio.mp3 --model-dir ~/my-models

Linux / macOS / WSL2

# 基础安装（创建 venv，安装依赖，自动检测 GPU）
./setup.sh

# 带说话人声纹分离支持
./setup.sh --diarize

Python 3.10+
ffmpeg 不是基本转录所必需的 —— PyAV（与 faster-whisper 捆绑）处理音频解码。ffmpeg 仅在需要 --burn-in、--normalize 和 --denoise 时才需要。
可选：yt-dlp（用于 URL/YouTube 输入）
可选：pyannote.audio（用于 --diarize，通过 setup.sh --diarize 安装）

平台	加速	速度
Linux + NVIDIA GPU	CUDA	约 20 倍实时 🚀
WSL2 + NVIDIA GPU	CUDA	约 20 倍实时 🚀
macOS Apple Silicon	CPU*	约 3-5 倍实时
macOS Intel	CPU	约 1-2 倍实时
Linux（无 GPU）	CPU	约 1 倍实时

*faster-whisper 使用 CTranslate2，在 macOS 上仅限 CPU，但 Apple Silicon 足以满足实际使用。

GPU 支持（重要！）

安装脚本会自动检测您的 GPU 并安装带 CUDA 的 PyTorch。如果可用，请始终使用 GPU —— CPU 转录极其缓慢。

硬件	速度	9 分钟视频
RTX 3070 (GPU)	约 20 倍实时	约 27 秒
CPU (int8)	约 0.3 倍实时	约 30 分钟

RTX 3070 提示：使用 --compute-type int8_float16 进行混合量化 —— 节省约 1GB VRAM，质量损失最小。非常适合在转录的同时运行声纹分离。

如果安装脚本没有检测到您的 GPU，请手动安装带 CUDA 的 PyTorch：

# 对于 CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# 对于 CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

WSL2 用户：确保在 Windows 上安装了 NVIDIA CUDA drivers for WSL

# 基本转录
./scripts/transcribe audio.mp3

# SRT 字幕
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt

# WebVTT 字幕
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt

# 从 YouTube URL 转录
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en

# 说话人声纹分离
./scripts/transcribe meeting.wav --diarize

# 带声纹分离的 VTT 字幕
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt

# 用领域术语引导
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"

# 批量处理目录
./scripts/transcribe ./recordings/ -o ./transcripts/

# 使用通配符批量处理，跳过已完成的文件
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/

# 过滤低置信度片段
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6

# JSON 输出，包含完整元数据
./scripts/transcribe audio.mp3 --format json -o result.json

# 指定语言（比自动检测更快）
./scripts/transcribe audio.mp3 --language en

输入：
  AUDIO                 音频文件、目录、通配符模式或 URL
                        接受：mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
                        URL 通过 yt-dlp 自动下载（YouTube、直接链接等）

模型和语言：
  -m, --model NAME      Whisper 模型（默认：distil-large-v3.5；"turbo" = large-v3-turbo）
  --revision REV        模型修订版（git 分支/标签/提交）以固定特定版本
  -l, --language CODE   语言代码，例如 en, es, fr（如果省略则自动检测）
  --initial-prompt TEXT  用于引导模型的提示（术语、格式样式）
  --prefix TEXT         用于引导第一个片段的前缀（例如已知的开头词）
  --hotwords WORDS      用于提升识别率的热词，以空格分隔
  --translate           将任何语言翻译成英语（而不是转录）
  --multilingual        启用多语言/语码转换模式（有助于较小模型）
  --hf-token TOKEN      HuggingFace 令牌，用于私有/门控模型和声纹分离
  --model-dir PATH      自定义模型缓存目录（默认：~/.cache/huggingface/）

🇺🇸English

Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).

When to Use

Use this skill when you need to:

Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
Generate subtitles — SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
Identify speakers — diarization labels who said what (--diarize)
Transcribe from URLs — YouTube links and direct audio URLs (auto-downloads via yt-dlp)
Transcribe podcast feeds — --rss <feed-url> fetches and transcribes episodes
Batch process files — glob patterns, directories, skip-existing support; ETA shown automatically
Convert speech to text locally — no API costs, works offline (after model download)
Translate to English — translate any language to English with --translate
Do multilingual transcription — supports 99+ languages with auto-detection
Transcribe a batch of files in different languages — --language-map assigns a different language per file
Transcribe multilingual audio — --multilingual for mixed-language audio
Transcribe audio with specific terms — use --initial-prompt for jargon-heavy content or any other terms to look out for
Preprocess noisy audio (before transcription) — --normalize and --denoise before transcription
Stream output — --stream shows segments as they're transcribed
Clip time ranges — --clip-timestamps to transcribe specific sections
Search the transcript — --search "term" finds all timestamps where a word/phrase appears
Detect chapters — --detect-chapters finds section breaks from silence gaps
Export speaker audio — --export-speakers DIR saves each speaker's turns as separate WAV files
Spreadsheet output — --format csv produces a properly-quoted CSV with timestamps

Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English", "find where X is mentioned", "search transcript for", "when did they say", "at what timestamp", "add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording", "TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format", "ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles", "LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics", "HTML transcript", "confidence-colored transcript", "color-coded transcript", "separate audio per speaker", "export speaker audio", "split by speaker", "transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed", "different languages in batch", "per-file language", "transcribe in multiple formats", "srt and txt at the same time", "output both srt and text", "remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean", "transcribe left channel", "transcribe right channel", "stereo channel", "left track only", "wrap subtitle lines", "character limit per line", "max chars per subtitle", "detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"

⚠️ Agent guidance — keep invocations minimal:

CORE RULE: default command (./scripts/transcribe audio.mp3) is the fastest path — add flags only when the user explicitly asks for that capability.

Transcription:

Only add --diarize if the user asks "who said what" / "identify speakers" / "label speakers"
Only add --format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format
Only add --format csv if the user asks for CSV or spreadsheet output
Only add --word-timestamps if the user needs word-level timing
Only add --initial-prompt if there's domain-specific jargon to prime
Only add --translate if the user wants non-English audio translated to English
Only add --normalize/--denoise if the user mentions bad audio quality or noise
Only add --stream if the user wants live/progressive output for long files

Search:

Only add --search "term" when the user asks to find/locate/search for a specific word or phrase in audio
--search replaces the normal transcript output — it prints only matching segments with timestamps
Add --search-fuzzy only when the user mentions approximate/partial matching or typos
To save search results to a file, use -o results.txt

Chapter detection:

Only add --detect-chapters when the user asks for chapters, sections, a table of contents, or "where does the topic change"
Default --chapter-gap 8 (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
--chapter-format youtube (default) outputs YouTube-ready timestamps; use json for programmatic use
Always use--chapters-file PATH when combining chapters with a transcript output — avoids mixing chapter markers into the transcript text
If the user only wants chapters (not the transcript), pipe stdout to a file with -o /dev/null and use --chapters-file
Batch mode limitation: --chapters-file takes a single path — in batch mode, each file's chapters overwrite the previous. For batch chapter detection, omit (chapters print to stdout under ) or use a separate run per file

Speaker audio export:

Only add --export-speakers DIR when the user explicitly asks to save each speaker's audio separately
Always pair with --diarize — it silently skips if no speaker labels are present
Requires ffmpeg; outputs SPEAKER_1.wav, SPEAKER_2.wav, etc. (or real names if --speaker-names is set)

Language map:

Only add --language-map in batch mode when the user has confirmed different languages across files
Inline format: "interview*.mp3=en,lecture*.mp3=fr" — fnmatch globs on filename
JSON file format: @/path/to/map.json where the file is {"pattern": "lang_code"}

RSS / Podcast:

Only add --rss URL when the user provides a podcast RSS feed URL
Default fetches 5 newest episodes; --rss-latest 0 for all; --skip-existing to resume safely
Always use-o <dir> with --rss — without it, all episode transcripts print to stdout concatenated, which is hard to use; each episode gets its own file when -o <dir> is set

Output format for agent relay:

Search results (--search) → print directly to user; output is human-readable
Chapter output → if no --chapters-file, chapters appear in stdout under === CHAPTERS (N) === header after the transcript; with --format json, chapters are also embedded in the JSON under "chapters" key
Subtitle formats (SRT, VTT, ASS, LRC, TTML) → always write to -o file; tell the user the output path, never paste raw subtitle content
Data formats (CSV, HTML, TTML, JSON) → always write to -o file; tell the user the output path, don't paste raw XML/CSV/HTML
ASS format → for Aegisub, VLC, mpv; write to file and tell user they can open it in Aegisub or play it in VLC/mpv
LRC format → timed lyrics for music players (Foobar2000, AIMP, VLC); write to file

When NOT to use:

Cloud-only environments without local compute
Files <10 seconds where API call latency doesn't matter

faster-whisper vs whisperx: This skill covers everything whisperx does — diarization (--diarize), word-level timestamps (--word-timestamps), SRT/VTT subtitles — so whisperx is not needed. Use whisperx only if you specifically need its pyannote pipeline or batch-GPU features not covered here.

Quick Reference

Task	Command	Notes
Basic transcription	`./scripts/transcribe audio.mp3`	Batched inference, VAD on, distil-large-v3.5
SRT subtitles	`./scripts/transcribe audio.mp3 --format srt -o subs.srt`	Word timestamps auto-enabled
VTT subtitles	`./scripts/transcribe audio.mp3 --format vtt -o subs.vtt`	WebVTT format
Word timestamps	`./scripts/transcribe audio.mp3 --word-timestamps --format srt`

Model Selection

Choose the right model for your needs:

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3.5\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}

Model Table

Standard Models (Full Whisper)

Model	Size	Speed	Accuracy	Use Case
`tiny` / `tiny.en`	39M	Fastest	Basic	Quick drafts
`base` / `base.en`	74M	Very fast	Good	General use
`small` / `small.en`

Distilled Models (~6x Faster, ~1% WER difference)

Model	Size	Speed vs Standard	Accuracy	Use Case
`distil-large-v3.5`	756M	~6.3x faster	7.08% WER	Default, best balance
`distil-large-v3`	756M	~6.3x faster	7.53% WER	Previous default
`distil-large-v2`	756M	~5.8x faster	10.1% WER	Fallback
`distil-medium.en`

.en models are English-only and slightly faster/better for English content.

Note for distil models: HuggingFace recommends disabling condition_on_previous_text for all distil models to prevent repetition loops. The script auto-applies --no-condition-on-previous-text whenever a distil-* model is detected. Pass --condition-on-previous-text to override if needed.

Custom & Fine-tuned Models

WhisperModel accepts local CTranslate2 model directories and HuggingFace repo names — no code changes needed.

Load a local CTranslate2 model

./scripts/transcribe audio.mp3 --model /path/to/my-model-ct2

Convert a HuggingFace model to CTranslate2

pip install ctranslate2
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16
./scripts/transcribe audio.mp3 --model ./whisper-large-v3-ct2

Load a model by HuggingFace repo name (auto-downloads)

./scripts/transcribe audio.mp3 --model username/whisper-large-v3-ct2

Custom model cache directory

By default, models are cached in ~/.cache/huggingface/. Use --model-dir to override:

./scripts/transcribe audio.mp3 --model-dir ~/my-models

Setup

Linux / macOS / WSL2

# Base install (creates venv, installs deps, auto-detects GPU)
./setup.sh

# With speaker diarization support
./setup.sh --diarize

Requirements:

Python 3.10+
ffmpeg is not required for basic transcription — PyAV (bundled with faster-whisper) handles audio decoding. ffmpeg is only needed for --burn-in, --normalize, and --denoise.
Optional: yt-dlp (for URL/YouTube input)
Optional: pyannote.audio (for --diarize, installed via setup.sh --diarize)

Platform Support

Platform	Acceleration	Speed
Linux + NVIDIA GPU	CUDA	~20x realtime 🚀
WSL2 + NVIDIA GPU	CUDA	~20x realtime 🚀
macOS Apple Silicon	CPU*	~3-5x realtime
macOS Intel	CPU	~1-2x realtime
Linux (no GPU)	CPU	~1x realtime

*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

Hardware	Speed	9-min video
RTX 3070 (GPU)	~20x realtime	~27 sec
CPU (int8)	~0.3x realtime	~30 min

RTX 3070 tip : Use --compute-type int8_float16 for hybrid quantization — saves ~1GB VRAM with minimal quality loss. Ideal for running diarization alongside transcription.

If setup didn't detect your GPU, manually install PyTorch with CUDA:

# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

WSL2 users : Ensure you have the NVIDIA CUDA drivers for WSL installed on Windows

Usage

# Basic transcription
./scripts/transcribe audio.mp3

# SRT subtitles
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt

# WebVTT subtitles
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt

# Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en

# Speaker diarization
./scripts/transcribe meeting.wav --diarize

# Diarized VTT subtitles
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt

# Prime with domain terminology
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"

# Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/

# Batch with glob, skip already-done files
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/

# Filter low-confidence segments
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6

# JSON output with full metadata
./scripts/transcribe audio.mp3 --format json -o result.json

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

Options

Input:
  AUDIO                 Audio file(s), directory, glob pattern, or URL
                        Accepts: mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
                        URLs auto-download via yt-dlp (YouTube, direct links, etc.)

Model & Language:
  -m, --model NAME      Whisper model (default: distil-large-v3.5; "turbo" = large-v3-turbo)
  --revision REV        Model revision (git branch/tag/commit) to pin a specific version
  -l, --language CODE   Language code, e.g. en, es, fr (auto-detects if omitted)
  --initial-prompt TEXT  Prompt to condition the model (terminology, formatting style)
  --prefix TEXT         Prefix to condition the first segment (e.g. known starting words)
  --hotwords WORDS      Space-separated hotwords to boost recognition
  --translate           Translate any language to English (instead of transcribing)
  --multilingual        Enable multilingual/code-switching mode (helps smaller models)
  --hf-token TOKEN      HuggingFace token for private/gated models and diarization
  --model-dir PATH      Custom model cache directory (default: ~/.cache/huggingface/)

Output Format:
  -f, --format FMT      text | json | srt | vtt | tsv | lrc | html | ass | ttml (default: text)
                        Accepts comma-separated list: --format srt,text writes both in one pass
                        Multi-format requires -o <dir> when saving to files
  --word-timestamps     Include word-level timestamps (wav2vec2 aligned automatically)
  --stream              Output segments as they are transcribed (disables diarize/alignment)
  --max-words-per-line N  For SRT/VTT, split segments into sub-cues of at most N words
  --max-chars-per-line N  For SRT/VTT/ASS/TTML, split lines so each fits within N characters
                        Takes priority over --max-words-per-line when both are set
  --clean-filler        Remove hesitation fillers (um, uh, er, ah, hmm, hm) and discourse markers
                        (you know, I mean, you see) from transcript text. Off by default.
  --detect-paragraphs   Insert paragraph breaks (blank lines) in text output at natural boundaries.
                        A new paragraph starts when: silence gap ≥ --paragraph-gap, OR the previous
                        segment ends a sentence AND the gap ≥ 1.5s.
  --paragraph-gap SEC   Minimum silence gap in seconds to start a new paragraph (default: 3.0).
                        Used with --detect-paragraphs.
  --channel {left,right,mix}
                        Stereo channel to transcribe: left (c0), right (c1), or mix (default: mix).
                        Extracts the channel via ffmpeg before transcription. Requires ffmpeg.
  --merge-sentences     Merge consecutive segments into sentence-level chunks
                        (improves SRT/VTT readability; groups by terminal punctuation or >2s gap)
  -o, --output PATH     Output file or directory (directory for batch mode)
  --output-template TEMPLATE
                        Batch output filename template. Variables: {stem}, {lang}, {ext}, {model}
                        Example: "{stem}_{lang}.{ext}" → "interview_en.srt"

Inference Tuning:
  --beam-size N         Beam search size; higher = more accurate but slower (default: 5)
  --temperature T       Sampling temperature or comma-separated fallback list, e.g.
                        '0.0' or '0.0,0.2,0.4' (default: faster-whisper's schedule)
  --no-speech-threshold PROB
                        Probability threshold to mark segments as silence (default: 0.6)
  --batch-size N        Batched inference batch size (default: 8; reduce if OOM)
  --no-vad              Disable voice activity detection (on by default)
  --vad-threshold T     VAD speech probability threshold (default: 0.5)
  --vad-neg-threshold T VAD negative threshold for ending speech (default: auto)
  --vad-onset T         Alias for --vad-threshold (legacy)
  --vad-offset T        Alias for --vad-neg-threshold (legacy)
  --min-speech-duration MS  Minimum speech segment duration in ms (default: 0)
  --max-speech-duration SEC Maximum speech segment duration in seconds (default: unlimited)
  --min-silence-duration MS Minimum silence before splitting a segment in ms (default: 2000)
  --speech-pad MS       Padding around speech segments in ms (default: 400)
  --no-batch            Disable batched inference (use standard WhisperModel)
  --hallucination-silence-threshold SEC
                        Skip silent sections where model hallucinates (e.g. 1.0)
  --no-condition-on-previous-text
                        Don't condition on previous text (reduces repetition/hallucination loops;
                        auto-enabled for distil models per HuggingFace recommendation)
  --condition-on-previous-text
                        Force-enable conditioning on previous text (overrides auto-disable for distil models)
  --compression-ratio-threshold RATIO
                        Filter segments above this compression ratio (default: 2.4)
  --log-prob-threshold PROB
                        Filter segments below this avg log probability (default: -1.0)
  --max-new-tokens N    Maximum tokens per segment (prevents runaway generation)
  --clip-timestamps RANGE
                        Transcribe specific time ranges: '30,60' or '0,30;60,90' (seconds)
  --progress            Show transcription progress bar
  --best-of N           Candidates when sampling with non-zero temperature (default: 5)
  --patience F          Beam search patience factor (default: 1.0)
  --repetition-penalty F  Penalty for repeated tokens (default: 1.0)
  --no-repeat-ngram-size N  Prevent n-gram repetitions of this size (default: 0 = off)

Advanced Inference:
  --no-timestamps       Output text without timing info (faster; incompatible with
                        --word-timestamps, --format srt/vtt/tsv, --diarize)
  --chunk-length N      Audio chunk length in seconds for batched inference (default: auto)
  --language-detection-threshold T
                        Confidence threshold for language auto-detection (default: 0.5)
  --language-detection-segments N
                        Audio segments to sample for language detection (default: 1)
  --length-penalty F    Beam search length penalty; >1 favors longer, <1 favors shorter (default: 1.0)
  --prompt-reset-on-temperature T
                        Reset initial prompt when temperature fallback hits threshold (default: 0.5)
  --no-suppress-blank   Disable blank token suppression (may help soft/quiet speech)
  --suppress-tokens IDS Comma-separated token IDs to suppress in addition to default -1
  --max-initial-timestamp T
                        Maximum timestamp for the first segment in seconds (default: 1.0)
  --prepend-punctuations CHARS
                        Punctuation characters merged into preceding word (default: "'¿([{-)
  --append-punctuations CHARS
                        Punctuation characters merged into following word (default: "'.。,，!！?？:：")]}、")

Preprocessing:
  --normalize           Normalize audio volume (EBU R128 loudnorm) before transcription
  --denoise             Apply noise reduction (high-pass + FFT denoise) before transcription

Advanced:
  --diarize             Speaker diarization (requires pyannote.audio)
  --min-speakers N      Minimum number of speakers hint for diarization
  --max-speakers N      Maximum number of speakers hint for diarization
  --speaker-names NAMES Comma-separated names to replace SPEAKER_1, SPEAKER_2 (e.g. 'Alice,Bob')
                        Requires --diarize
  --min-confidence PROB Filter segments below this avg word confidence (0.0–1.0)
  --skip-existing       Skip files whose output already exists (batch mode)
  --detect-language-only
                        Detect language and exit (no transcription). Output: "Language: en (probability: 0.984)"
                        With --format json: {"language": "en", "language_probability": 0.984}
  --stats-file PATH     Write JSON stats sidecar after transcription (processing time, RTF, word count, etc.)
                        Directory path → writes {stem}.stats.json inside; file path → exact path
  --burn-in OUTPUT      Burn subtitles into the original video (single-file mode only; requires ffmpeg)
  --filter-hallucinations
                        Filter common Whisper hallucinations: music/applause markers, duplicate segments,
                        'Thank you for watching', lone punctuation, etc.
  --keep-temp           Keep temp files from URL downloads (useful for re-processing without re-downloading)
  --parallel N          Number of parallel workers for batch processing (default: sequential)
  --retries N           Retry failed files up to N times with exponential backoff (default: 0;
                        incompatible with --parallel)

Batch ETA:
  Automatically shown for sequential batch jobs (no flag needed). After each file completes,
  the next file's progress line includes:  [current/total] filename | ETA: Xm Ys
  ETA is calculated from average time per file × remaining files.
  Shown to stderr (surfaced to users via OpenClaw/Clawdbot output).

Language Map (per-file language override):
  --language-map MAP    Per-file language override for batch mode. Two forms:
                          Inline: "interview*.mp3=en,lecture.wav=fr,keynote.wav=de"
                          JSON file: "@/path/to/map.json"  (must be {pattern: lang} dict)
                        Patterns support fnmatch globs on filename or stem.
                        Priority: exact filename > exact stem > glob on filename > glob on stem > fallback.
                        Files not matched fall back to --language (or auto-detect if not set).

Transcript Search:
  --search TERM         Search the transcript for TERM and print matching segments with timestamps.
                        Replaces normal transcript output (use -o to save results to a file).
                        Case-insensitive exact substring match by default.
  --search-fuzzy        Enable fuzzy/approximate matching with --search (useful for typos, phonetic
                        near-misses, or partial words; uses SequenceMatcher ratio ≥ 0.6)

Chapter Detection:
  --detect-chapters     Auto-detect chapter/section breaks from silence gaps and print chapter markers.
                        Output is printed after the transcript (or to --chapters-file).
  --chapter-gap SEC     Minimum silence gap in seconds between consecutive segments to start a new
                        chapter (default: 8.0). Tune down for dense speech, up for sparse content.
  --chapters-file PATH  Write chapter markers to this file (default: stdout after transcript)
  --chapter-format FMT  youtube | text | json — chapter output format:
                          youtube: "0:00 Chapter 1" (YouTube description ready)
                          text:    "Chapter 1: 00:00:00"
                          json:    JSON array with chapter, start, title fields
                        (default: youtube)

Speaker Audio Export:
  --export-speakers DIR After diarization, export each speaker's audio turns concatenated into
                        separate WAV files saved in DIR. Requires --diarize and ffmpeg.
                        Output: SPEAKER_1.wav, SPEAKER_2.wav, … (or real names if --speaker-names set)

RSS / Podcast:
  --rss URL             Podcast RSS feed URL — extracts audio enclosures and transcribes them.
                        AUDIO positional is optional when --rss is used.
  --rss-latest N        Number of most-recent episodes to process (default: 5; 0 = all episodes)

Device:
  --device DEV          auto | cpu | cuda (default: auto)
  --compute-type TYPE   auto | int8 | int8_float16 | float16 | float32 (default: auto)
                        int8_float16 = hybrid mode for GPU (saves VRAM, minimal quality loss)
  --threads N           CPU thread count for CTranslate2 (default: auto)
  -q, --quiet           Suppress progress and status messages
  --log-level LEVEL     Set faster_whisper library logging level: debug | info | warning | error
                        (default: warning; use debug to see CTranslate2/VAD internals)

Utility:
  --version             Print installed faster-whisper version and exit
  --update              Upgrade faster-whisper in the skill venv and exit

Output Formats

Text (default)

Plain transcript text. With --diarize, speaker labels are inserted:

[SPEAKER_1]
 Hello, welcome to the meeting.
[SPEAKER_2]
 Thanks for having me.

JSON (`--format json`)

Full metadata including segments, timestamps, language detection, and performance stats:

{
  "file": "audio.mp3",
  "text": "Hello, welcome...",
  "language": "en",
  "language_probability": 0.98,
  "duration": 600.5,
  "segments": [...],
  "speakers": ["SPEAKER_1", "SPEAKER_2"],
  "stats": {
    "processing_time": 28.3,
    "realtime_factor": 21.2
  }
}

SRT (`--format srt`)

Standard subtitle format for video players:

1
00:00:00,000 --> 00:00:02,500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02,800 --> 00:00:04,200
[SPEAKER_2] Thanks for having me.

VTT (`--format vtt`)

WebVTT format for web video players:

WEBVTT

1
00:00:00.000 --> 00:00:02.500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02.800 --> 00:00:04.200
[SPEAKER_2] Thanks for having me.

TSV (`--format tsv`)

Tab-separated values, OpenAI Whisper–compatible. Columns: start_ms, end_ms, text:

0	2500	Hello, welcome to the meeting.
2800	4200	Thanks for having me.

Useful for piping into other tools or spreadsheets. No header row.

ASS/SSA (`--format ass`)

Advanced SubStation Alpha format — supported by Aegisub, VLC, mpv, MPC-HC, and most video editors. Offers richer styling than SRT (font, size, color, position) via the [V4+ Styles] section:

[Script Info]
ScriptType: v4.00+
...

[V4+ Styles]
Style: Default,Arial,20,&H00FFFFFF,...

[Events]
Format: Layer, Start, End, Style, Name, ..., Text
Dialogue: 0,0:00:00.00,0:00:02.50,Default,,[SPEAKER_1] Hello, welcome.
Dialogue: 0,0:00:02.80,0:00:04.20,Default,,[SPEAKER_2] Thanks for having me.

Timestamps use H:MM:SS.cc (centiseconds). Edit the [V4+ Styles] block in Aegisub to customise font, color, and position without re-transcribing.

LRC (`--format lrc`)

Timed lyrics format used by music players (e.g., Foobar2000, VLC, AIMP). Timestamps use [mm:ss.xx] where xx = centiseconds:

[00:00.50]Hello, welcome to the meeting.
[00:02.80]Thanks for having me.

With diarization, speaker labels are included:

[00:00.50][SPEAKER_1] Hello, welcome to the meeting.
[00:02.80][SPEAKER_2] Thanks for having me.

Default file extension: .lrc. Useful for music transcription, karaoke, and any workflow requiring timed text with music-player compatibility.

Speaker Diarization

Identifies who spoke when using pyannote.audio.

Setup:

./setup.sh --diarize

Requirements:

HuggingFace token at ~/.cache/huggingface/token (huggingface-cli login)
Accepted model agreements:
- https://hf.co/pyannote/speaker-diarization-3.1
- https://hf.co/pyannote/segmentation-3.0

Usage:

# Basic diarization (text output)
./scripts/transcribe meeting.wav --diarize

# Diarized subtitles
./scripts/transcribe meeting.wav --diarize --format srt -o meeting.srt

# Diarized JSON (includes speakers list)
./scripts/transcribe meeting.wav --diarize --format json

Speakers are labeled SPEAKER_1, SPEAKER_2, etc. in order of first appearance. Diarization runs on GPU automatically if CUDA is available.

Precise Word Timestamps

Whenever word-level timestamps are computed (--word-timestamps, --diarize, or --min-confidence), a wav2vec2 forced alignment pass automatically refines them from Whisper's ~100-200ms accuracy to ~10ms. No extra flag needed.

# Word timestamps with automatic wav2vec2 alignment
./scripts/transcribe audio.mp3 --word-timestamps --format json

# Diarization also gets precise alignment automatically
./scripts/transcribe meeting.wav --diarize

# Precise subtitles
./scripts/transcribe audio.mp3 --word-timestamps --format srt -o subtitles.srt

Uses the MMS (Massively Multilingual Speech) model from torchaudio — supports 1000+ languages. The model is cached after first load, so batch processing stays fast.

URL & YouTube Input

Pass any URL as input — audio is downloaded automatically via yt-dlp:

# YouTube video
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ

# Direct audio URL
./scripts/transcribe https://example.com/podcast.mp3

# With options
./scripts/transcribe https://youtube.com/watch?v=... --language en --format srt -o subs.srt

Requires yt-dlp (checks PATH and ~/.local/share/pipx/venvs/yt-dlp/bin/yt-dlp).

Batch Processing

Process multiple files at once with glob patterns, directories, or multiple paths:

# All MP3s in current directory
./scripts/transcribe *.mp3

# Entire directory (auto-filters audio files)
./scripts/transcribe ./recordings/

# Output to directory (one file per input)
./scripts/transcribe *.mp3 -o ./transcripts/

# Skip already-transcribed files (resume interrupted batch)
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/

# Mixed inputs
./scripts/transcribe file1.mp3 file2.wav ./more-recordings/

# Batch SRT subtitles
./scripts/transcribe *.mp3 --format srt -o ./subtitles/

When outputting to a directory, files are named {input-stem}.{ext} (e.g., audio.mp3 → audio.srt).

Batch mode prints a summary after all files complete:

📊 Done: 12 files, 3h24m audio in 10m15s (19.9× realtime)

Workflows

End-to-end pipelines for common use cases.

Podcast Transcription Pipeline

Fetch and transcribe the latest 5 episodes from any podcast RSS feed:

# Transcribe latest 5 episodes → one .txt per episode
./scripts/transcribe --rss https://feeds.megaphone.fm/mypodcast -o ./transcripts/

# All episodes, as SRT subtitles
./scripts/transcribe --rss https://... --rss-latest 0 --format srt -o ./subtitles/

# Skip already-done episodes (safe to re-run)
./scripts/transcribe --rss https://... --skip-existing -o ./transcripts/

# With diarization (who said what) + retry on flaky network
./scripts/transcribe --rss https://... --diarize --retries 2 -o ./transcripts/

Meeting Notes Pipeline

Transcribe a meeting recording with speaker labels, then output clean text:

# Diarize + name speakers (replace SPEAKER_1/2 with real names)
./scripts/transcribe meeting.wav --diarize --speaker-names "Alice,Bob" -o meeting.txt

# Diarized JSON for post-processing (summaries, action items)
./scripts/transcribe meeting.wav --diarize --format json -o meeting.json

# Stream live while it transcribes (long meetings)
./scripts/transcribe meeting.wav --stream

Video Subtitle Pipeline

Generate ready-to-use subtitles for a video file:

# SRT subtitles with sentence merging (better readability)
./scripts/transcribe video.mp4 --format srt --merge-sentences -o subtitles.srt

# Burn subtitles directly into the video
./scripts/transcribe video.mp4 --format srt --burn-in video_subtitled.mp4

# Word-level SRT (karaoke-style), capped at 8 words per cue
./scripts/transcribe video.mp4 --format srt --word-timestamps --max-words-per-line 8 -o subs.srt

YouTube Batch Pipeline

Transcribe multiple YouTube videos at once:

# One-liner: transcribe a playlist video + output SRT
./scripts/transcribe "https://youtube.com/watch?v=abc123" --format srt -o subs.srt

# Batch from a text file of URLs (one per line)
cat urls.txt | xargs ./scripts/transcribe -o ./transcripts/

# Download audio first, then transcribe (for re-use without re-downloading)
./scripts/transcribe https://youtube.com/watch?v=abc123 --keep-temp

Noisy Audio Pipeline

Clean up poor-quality recordings before transcribing:

# Denoise + normalize, then transcribe
./scripts/transcribe interview.mp3 --denoise --normalize -o interview.txt

# Noisy batch with aggressive hallucination filtering
./scripts/transcribe *.mp3 --denoise --filter-hallucinations -o ./out/

Batch Recovery Pipeline

Process a large folder with retries — safe to re-run after failures:

# Retry each failed file up to 3 times, skip already-done
./scripts/transcribe ./recordings/ --skip-existing --retries 3 -o ./transcripts/

# Check what failed (printed in batch summary at the end)
# Re-run the same command — skips successes, retries failures

Server Mode (OpenAI-Compatible API)

speaches runs faster-whisper as an OpenAI-compatible /v1/audio/transcriptions endpoint — drop-in replacement for OpenAI Whisper API with streaming, Docker support, and live transcription.

Quick start (Docker)

docker run --gpus all -p 8000:8000 ghcr.io/speaches-ai/speaches:latest-cuda

Test it

# Transcribe a file via the API (same format as OpenAI)
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=Systran/faster-whisper-large-v3

Use with any OpenAI SDK

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000", api_key="none")
with open("audio.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="Systran/faster-whisper-large-v3", file=f)
print(result.text)

Useful when you want to expose transcription as a local API for other tools (Home Assistant, n8n, custom apps).

Common Mistakes

Mistake	Problem	Solution
Using CPU when GPU available	10-20x slower transcription	Check `nvidia-smi`; verify CUDA installation
Not specifying language	Wastes time auto-detecting on known content	Use `--language en` when you know the language
Using wrong model	Unnecessary slowness or poor accuracy	Default `distil-large-v3.5` is excellent; only use `large-v3` if accuracy issues
Ignoring distilled models	Missing 6x speedup with <1% accuracy loss	Try before reaching for standard models

Performance Notes

First run : Downloads model to ~/.cache/huggingface/ (one-time)
Batched inference : Enabled by default via BatchedInferencePipeline — ~3x faster than standard mode; VAD on by default
GPU : Automatically uses CUDA if available
Quantization : INT8 used on CPU for ~4x speedup with minimal accuracy loss
Performance stats : Every transcription shows audio duration, processing time, and realtime factor
Benchmark (RTX 3070, 21-min file): ~24s with batched inference (both distil-large-v3 and v3.5) vs ~69s without
--precise overhead : Adds ~5-10s for wav2vec2 model load + alignment (model cached for batch)
Diarization overhead : Adds ~10-30s depending on audio length (runs on GPU if available)
Memory :
- distil-large-v3: ~2GB RAM / ~1GB VRAM
- large-v3-turbo: ~4GB RAM / ~2GB VRAM
- tiny/base: <1GB RAM

Why faster-whisper?

Speed : ~4-6x faster than OpenAI's original Whisper
Accuracy : Identical (uses same model weights)
Efficiency : Lower memory usage via quantization
Production-ready : Stable C++ backend (CTranslate2)
Distilled models : ~6x faster with <1% accuracy loss
Subtitles : Native SRT/VTT/HTML output
Precise alignment : Automatic wav2vec2 refinement (~10ms word boundaries)
Diarization : Optional speaker identification via pyannote; --speaker-names maps to real names
URLs : Direct YouTube/URL input; --keep-temp preserves downloads for re-use
Custom models : Load local CTranslate2 dirs or HuggingFace repos; --model-dir controls cache
Quality control : --filter-hallucinations strips music/applause markers and duplicates
Parallel batch : --parallel N for multi-threaded batch processing

v1.5.0 New Features

Multi-format output:

--format srt,text — write multiple formats in one pass (e.g. SRT + plain text simultaneously)
Comma-separated list accepted: srt,vtt,json, srt,text, etc.
Requires -o <dir> when writing multiple formats; single format unchanged

Filler word removal:

--clean-filler — strip hesitation sounds (um, uh, er, ah, hmm, hm) and discourse markers (you know, I mean, you see) from transcript text; off by default
Conservative regex matching at word boundaries to avoid false positives
Segments that become empty after cleaning are dropped automatically

Stereo channel selection:

--channel left|right|mix — extract a specific stereo channel before transcribing (default: mix)
Useful for dual-track recordings (interviewer on left, interviewee on right)
Uses ffmpeg pan filter; falls back gracefully to full mix if ffmpeg not found

Character-based subtitle wrapping:

--max-chars-per-line N — split subtitle cues so each line fits within N characters
Works for SRT, VTT, ASS, and TTML formats; takes priority over --max-words-per-line
Requires word-level timestamps; falls back to full segment if no word data

Paragraph detection:

--detect-paragraphs — insert \n\n paragraph breaks in text output at natural boundaries
--paragraph-gap SEC — minimum silence gap for a paragraph (default: 3.0s)
Also detects paragraph breaks when the previous segment ends a sentence and gap ≥ 1.5s

Subtitle formats:

--format ass — Advanced SubStation Alpha (Aegisub, VLC, mpv, MPC-HC)
--format lrc — Timed lyrics format for music players
--format html — Confidence-colored HTML transcript (green/yellow/red per word)
--format ttml — W3C TTML 1.0 (DFXP) broadcast standard (Netflix, Amazon Prime, BBC)
--format csv — Spreadsheet-ready CSV with header row; RFC 4180 quoting; speaker column when diarized

Transcript tools:

--search TERM — Find all timestamps where a word/phrase appears; replaces normal output; -o to save
--search-fuzzy — Approximate/partial matching with --search
--detect-chapters — Auto-detect chapter breaks from silence gaps; --chapter-gap SEC (default 8s)
--chapters-file PATH — Write chapters to file instead of stdout; --chapter-format youtube|text|json
--export-speakers DIR — After --diarize, save each speaker's turns as separate WAV files via ffmpeg

Batch improvements:

ETA — [N/total] filename | ETA: Xm Ys shown before each file in sequential batch; no flag needed
--language-map "pat=lang,..." — Per-file language override; fnmatch glob patterns; @file.json form
--retries N — Retry failed files with exponential backoff; failed-file summary at end
--rss URL — Transcribe podcast RSS feeds; --rss-latest N for episode count
--skip-existing / --parallel N / --output-template / --stats-file /

Model & inference:

distil-large-v3.5 default (replaced distil-large-v3)
Auto-disables condition_on_previous_text for distil models (prevents repetition loops)
--condition-on-previous-text to override; --log-level for library debug output
--model-dir PATH — Custom HuggingFace cache dir; local CTranslate2 model support
--no-timestamps, --chunk-length, --length-penalty, --repetition-penalty, --no-repeat-ngram-size

Speaker & quality:

--speaker-names "Alice,Bob" — Replace SPEAKER_1/2 with real names (requires --diarize)
--filter-hallucinations — Remove music/applause markers, duplicates, "Thank you for watching"
--burn-in OUTPUT — Burn subtitles into video via ffmpeg
--keep-temp — Preserve URL-downloaded audio for re-processing

Setup:

setup.sh --check — System diagnostic: GPU, CUDA, Python, ffmpeg, pyannote, HuggingFace token (completes in ~12s)
ffmpeg no longer required for basic transcription (PyAV handles decoding); skill.json updated to reflect this (ffmpeg is now optionalBins)

Troubleshooting

"CUDA not available — using CPU" : Install PyTorch with CUDA (see GPU Support above) Setup fails : Make sure Python 3.10+ is installed Out of memory : Use smaller model, --compute-type int8, or --batch-size 4 Slow on CPU : Expected — use GPU for practical transcription Model download fails : Check ~/.cache/huggingface/ permissions Diarization model fails : Ensure HuggingFace token exists and model agreements accepted; or pass token directly with --hf-token hf_xxx URL download fails : Check yt-dlp is installed (pipx install yt-dlp) No audio files in batch : Check file extensions match supported formats Check installed version : Run ./scripts/transcribe --version Upgrade faster-whisper : Run ./setup.sh --update (upgrades in-place, no full reinstall) : Try : Tune with (lower) or : Run to upgrade faster-whisper to the latest version (includes Silero VAD V6).

References

Weekly Installs

828

Repository

theplasmak/fast…-whisper

GitHub Stars

First Seen

Jan 30, 2026

Security Audits

Gen Agent Trust HubFail SocketPass SnykFail

Installed on

opencode780

gemini-cli775

codex774

cursor765

github-copilot762

kimi-cli753

头脑风暴技能：AI协作设计流程，将创意转化为完整规范与实施计划

75,000 周安装

Only add --clip-timestamps if the user wants a specific time range

Only add --temperature 0.0 if the model is hallucinating on music/silence

Only add --vad-threshold if VAD is aggressively cutting speech or including noise

Only add --min-speakers/--max-speakers when you know the speaker count

Only add --hf-token if the token is not cached at ~/.cache/huggingface/token

Only add --max-words-per-line for subtitle readability on long segments

Only add --filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)

Only add --merge-sentences if the user asks for sentence-level subtitle cues

Only add --clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)

Only add --channel left|right if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel

Only add --max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over --max-words-per-line

Only add --detect-paragraphs if the user asks for paragraph breaks or structured text output; --paragraph-gap (default 3.0s) only if they want a custom gap

Only add --speaker-names "Alice,Bob" when the user provides real names to replace SPEAKER_1/2 — always requires --diarize

Only add --hotwords WORDS when the user names specific rare terms not well served by --initial-prompt; prefer --initial-prompt for general domain jargon

Only add --prefix TEXT when the user knows the exact words the audio starts with

Only add --detect-language-only when the user only wants to identify the language, not transcribe

Only add --stats-file PATH if the user asks for performance stats, RTF, or benchmark info

Only add --parallel N for large CPU batch jobs; GPU handles one file efficiently on its own — don't add for single files or small batches

Only add --retries N for unreliable inputs (URLs, network files) where transient failures are expected

Only add --burn-in OUTPUT only when user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input

Only add --keep-temp when the user may re-process the same URL to avoid re-downloading

Only add --output-template when user specifies a custom naming pattern in batch mode

Multi-format output (--format srt,text): only when user explicitly wants multiple formats in one pass; always pair with -o <dir>

Any word-level feature auto-runs wav2vec2 alignment (~5-10s overhead)

--diarize adds ~20-30s on top of that

=== CHAPTERS (N) ===

Multi-format (--format srt,text) → requires -o <dir>; each format goes to a separate file; tell user all paths written

JSON format → useful for programmatic post-processing; not ideal to paste in full to user

Text/transcript → safe to show directly to user for short files; summarise for long ones

Stats output (--stats-file) → summarise key fields (duration, processing time, RTF) for the user rather than pasting raw JSON

Language detection (--detect-language-only) → print the result directly; it's a single line

ETA is printed automatically to stderr for batch jobs; no action needed

Diarization: additional ~1-2GB VRAM

OOM : Lower --batch-size (try 4) if you hit out-of-memory errors

Pre-convert to WAV (optional): ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav converts to 16kHz mono WAV before transcription. Benefit is minimal (~5%) for one-off use since PyAV decodes efficiently — most useful when re-processing the same file multiple times (research/experiments) or when a format causes PyAV decode issues. Note: --normalize and --denoise already perform this conversion automatically.

Silero VAD V6 : faster-whisper 1.2.1 upgraded to Silero VAD V6 (improved speech detection). Run ./setup.sh --update to get it.

Batched silence removal : faster-whisper 1.2.0+ automatically removes silence in BatchedInferencePipeline (used by default). Upgrade with ./setup.sh --update to get this if you installed before August 2024.

Subtitle burn-in : --burn-in overlays subtitles directly into video via ffmpeg

--clip-timestamps, --stream, --progress, --best-of, --patience, --max-new-tokens

--hotwords, --prefix, --revision, --suppress-tokens, --max-initial-timestamp

Hallucinations on silence/music

--temperature 0.0 --no-speech-threshold 0.8

VAD splits speech incorrectly

--vad-threshold 0.3

--min-silence-duration 300

Improve speech detection

./setup.sh --update

faster-whisper：本地语音转文本工具，速度提升4-6倍，支持多语言转录与字幕生成

🇨🇳中文介绍

Faster Whisper

使用场景

相关 Skills

快速参考

模型选择

模型表

标准模型（完整 Whisper）

蒸馏模型（约快 6 倍，WER 差异约 1%）

自定义和微调模型

加载本地 CTranslate2 模型

将 HuggingFace 模型转换为 CTranslate2

通过 HuggingFace 仓库名称加载模型（自动下载）

自定义模型缓存目录

安装

Linux / macOS / WSL2

平台支持

GPU 支持（重要！）

使用方法

选项

🇺🇸English

Faster Whisper

When to Use

Quick Reference

Model Selection

Model Table

Standard Models (Full Whisper)

Distilled Models (~6x Faster, ~1% WER difference)

Custom & Fine-tuned Models

Load a local CTranslate2 model

Convert a HuggingFace model to CTranslate2

Load a model by HuggingFace repo name (auto-downloads)

Custom model cache directory

Setup

Linux / macOS / WSL2

Platform Support

GPU Support (IMPORTANT!)

Usage

Options

Output Formats

Text (default)

JSON (--format json)

SRT (--format srt)

VTT (--format vtt)

TSV (--format tsv)

ASS/SSA (--format ass)

LRC (--format lrc)

Speaker Diarization

Precise Word Timestamps

URL & YouTube Input

Batch Processing

Workflows

Podcast Transcription Pipeline

Meeting Notes Pipeline

Video Subtitle Pipeline

YouTube Batch Pipeline

Noisy Audio Pipeline

Batch Recovery Pipeline

Server Mode (OpenAI-Compatible API)

Quick start (Docker)

Test it

Use with any OpenAI SDK

Common Mistakes

Performance Notes

Why faster-whisper?

v1.5.0 New Features

Troubleshooting

References

最新 Skills

JSON (`--format json`)

SRT (`--format srt`)

VTT (`--format vtt`)

TSV (`--format tsv`)

ASS/SSA (`--format ass`)

LRC (`--format lrc`)