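⚠️ Important prerequisite

Installing AI Skills requires a working proxy (VPN) connection with TUN mode enabled. This is critical: it directly determines whether the installation completes successfully.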

Audio transcription tool: timestamps, speaker identification, and a choice of local or cloud providers

audio-transcribe by agntswrm/agent-media

65 weekly installs

3 GitHub Stars


Install command

npx skills add https://github.com/agntswrm/agent-media --skill audio-transcribe

Tags: AI/Machine Learning, Command-line tools, Audio processing





Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

npx agent-media@latest audio transcribe --in <path> [options]

Inputs

Option	Required	Description
`--in`	Yes	Input audio file path or URL (supports mp3, wav, m4a, ogg)
`--diarize`	No	Enable speaker identification
`--language`	No	Language code (auto-detected if not provided)
`--speakers`	No	Number of speakers hint for diarization
`--out`	No	Output path, filename or directory (default: ./)
`--provider`	No	Provider to use (local, fal, replicate, runpod)
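
As a rough illustration of how the options above combine, the following Python sketch assembles the argument list for a transcribe call. The helper name `build_transcribe_args` is hypothetical and not part of agent-media; only the flags and provider names listed in the table are used.

```python
from typing import List, Optional

# Provider names as listed in the options table.
ALLOWED_PROVIDERS = {"local", "fal", "replicate", "runpod"}


def build_transcribe_args(
    audio_in: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    if provider is not None and provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    args = ["npx", "agent-media@latest", "audio", "transcribe", "--in", audio_in]
    if diarize:
        args.append("--diarize")
    if language:
        args += ["--language", language]
    if speakers is not None:
        args += ["--speakers", str(speakers)]
    if out:
        args += ["--out", out]
    if provider:
        args += ["--provider", provider]
    return args


if __name__ == "__main__":
    print(" ".join(build_transcribe_args("meeting.wav", diarize=True, speakers=2)))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with file paths that contain spaces.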

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
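
A minimal sketch of consuming this output in Python is shown below. It assumes the JSON has exactly the shape of the sample above, and that the `speaker` field may be absent when diarization is off (that second point is an assumption, not something the documentation states). The function names are illustrative only.

```python
import json
from collections import defaultdict
from typing import Dict, List

# The sample output from the documentation, used here as test data.
SAMPLE = json.loads("""
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
""")


def format_transcript(result: Dict) -> List[str]:
    """Render each segment as '[start-end] speaker: text'."""
    if not result.get("ok"):
        raise ValueError("transcription did not succeed")
    lines = []
    for seg in result["transcription"]["segments"]:
        # Assumption: segments carry no "speaker" key without diarization.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}")
    return lines


def speaking_time(result: Dict) -> Dict[str, float]:
    """Sum segment durations (in seconds) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in result["transcription"]["segments"]:
        totals[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
    return dict(totals)


if __name__ == "__main__":
    print("\n".join(format_transcript(SAMPLE)))
    print(speaking_time(SAMPLE))
```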

Examples

Basic transcription (auto-detect language):

npx agent-media@latest audio transcribe --in interview.mp3

Transcription with speaker identification:

npx agent-media@latest audio transcribe --in meeting.wav --diarize

Transcription with a specific language and speaker count:

npx agent-media@latest audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use a specific provider:

npx agent-media@latest audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
npx agent-media@latest audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
npx agent-media@latest audio transcribe --in extracted_xxx.mp3
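
The two steps above could be chained from a script roughly as follows. This is a sketch under two assumptions that the documentation does not confirm: that `audio extract` prints a JSON object to stdout, and that this object contains `ok` and `output_path` fields like the transcribe output does. `run_cli` and `transcribe_video` are hypothetical helper names.

```python
import json
import subprocess
from typing import Callable, Dict, List


def run_cli(args: List[str]) -> Dict:
    """Run agent-media via npx and parse stdout as JSON (assumed output format)."""
    proc = subprocess.run(
        ["npx", "agent-media@latest", *args],
        capture_output=True,
        text=True,
    )
    return json.loads(proc.stdout)


def transcribe_video(
    video_path: str,
    run: Callable[[List[str]], Dict] = run_cli,
) -> Dict:
    """Extract audio from a video, then transcribe the extracted file."""
    # Step 1: extract audio from the video.
    extracted = run(["audio", "extract", "--in", video_path, "--format", "mp3"])
    if not extracted.get("ok"):
        raise RuntimeError("audio extraction failed")
    # Assumption: the extract step reports the generated file in "output_path".
    audio_path = extracted["output_path"]
    # Step 2: transcribe the extracted audio.
    result = run(["audio", "transcribe", "--in", audio_path])
    if not result.get("ok"):
        raise RuntimeError("transcription failed")
    return result
```

Passing the runner as a parameter keeps the pipeline testable without network access or API keys: a stub that returns canned dictionaries can stand in for the real CLI.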

Providers

local

Runs locally on CPU using Transformers.js, no API key required.

Uses the Moonshine model (5x faster than Whisper)
Models are downloaded on first use (~100 MB)
Does NOT support diarization; use fal or replicate for speaker identification
You may see a "mutex lock failed" error. It can be ignored: the output is correct if it contains "ok": true

npx agent-media@latest audio transcribe --in audio.mp3 --provider local
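
One way to apply the "check for "ok": true" advice in a script is sketched below. It assumes the error text may end up in the same captured stream as the JSON result, which the documentation does not specify; the helper names are hypothetical. The approach simply takes everything between the first `{` and the last `}`, so it would break if the surrounding noise itself contained braces.

```python
import json
from typing import Dict


def parse_cli_output(stdout: str) -> Dict:
    """Extract the JSON object from captured output, skipping surrounding noise."""
    start = stdout.find("{")
    end = stdout.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in output")
    # json.JSONDecodeError is a subclass of ValueError.
    return json.loads(stdout[start:end + 1])


def is_success(stdout: str) -> bool:
    """Return True only if the output contains a JSON object with "ok": true."""
    try:
        return parse_cli_output(stdout).get("ok") is True
    except ValueError:
        return False
```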

fal

Requires FAL_API_KEY
Uses wizper model for fast transcription (2x faster) when diarization is disabled
Uses whisper model when diarization is enabled (native support)

replicate

Requires REPLICATE_API_TOKEN
Uses whisper-diarization model with Whisper Large V3 Turbo
Native diarization support with word-level timestamps

runpod

Requires RUNPOD_API_KEY
Uses pruna/whisper-v3-large model (Whisper Large V3)
Does NOT support diarization (speaker identification) - use fal or replicate for diarization

npx agent-media@latest audio transcribe --in audio.mp3 --provider runpod
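
The provider notes above can be condensed into a small selection rule, sketched here in Python. The preference order (fal before replicate, and local whenever diarization is not needed) is an assumption made for illustration, not a recommendation from the project; `choose_provider` is a hypothetical helper. Only the environment variable names come from the documentation.

```python
import os
from typing import Mapping, Optional

# Providers documented as supporting diarization, with the key each requires.
DIARIZATION_PROVIDERS = {
    "fal": "FAL_API_KEY",
    "replicate": "REPLICATE_API_TOKEN",
}


def choose_provider(diarize: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a provider name based on diarization needs and available keys."""
    env = os.environ if env is None else env
    if not diarize:
        # local needs no API key, so it is the assumed default here.
        return "local"
    for name, key in DIARIZATION_PROVIDERS.items():
        if env.get(key):
            return name
    raise RuntimeError("diarization needs FAL_API_KEY or REPLICATE_API_TOKEN")


if __name__ == "__main__":
    print(choose_provider(diarize=False))
```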

Repository: agntswrm/agent-media

First Seen: Jan 20, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)

Installed on: opencode (23), gemini-cli (20), claude-code (20), codex (19), cursor (19), openclaw (17)
