speech-to-text by elevenlabs/skills
```shell
npx skills add https://github.com/elevenlabs/skills --skill speech-to-text
```
Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.
Setup: See the Installation Guide. For JavaScript, use `@elevenlabs/*` packages only.
```python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

print(result.text)
```
```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});

console.log(result.text);
```
```shell
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" -F "file=@audio.mp3" -F "model_id=scribe_v2"
```
| Model ID | Description | Best For |
|---|---|---|
| `scribe_v2` | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| `scribe_v2_realtime` | Low latency (~150ms) | Live transcription, voice agents |
Word-level timestamps include type classification and speaker identification:
```python
result = client.speech_to_text.convert(
    file=audio_file, model_id="scribe_v2", timestamps_granularity="word"
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")
```
Identify WHO said WHAT - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:
```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
```
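Since diarization labels every word individually, turning the output into readable speaker turns takes a small grouping pass. A sketch using plain dicts in place of the SDK's word objects (`group_by_speaker` is a hypothetical helper, not part of the SDK):

```python
def group_by_speaker(words):
    """Collapse word-level speaker labels into (speaker_id, utterance) turns."""
    turns = []
    for w in words:
        if turns and turns[-1][0] == w["speaker_id"]:
            turns[-1][1].append(w["text"])  # same speaker: extend current turn
        else:
            turns.append([w["speaker_id"], [w["text"]]])  # new speaker: new turn
    return [(speaker, " ".join(parts)) for speaker, parts in turns]

# Plain dicts standing in for result.words entries:
words = [
    {"speaker_id": "speaker_0", "text": "Hello"},
    {"speaker_id": "speaker_0", "text": "there"},
    {"speaker_id": "speaker_1", "text": "Hi"},
]
print(group_by_speaker(words))  # [('speaker_0', 'Hello there'), ('speaker_1', 'Hi')]
```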
Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):
```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"]
)
```
Automatic detection with optional language hint:
```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng"  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")
```
Audio: MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
Video: MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP
Limits: Up to 3GB file size, 10 hours duration
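Before uploading, it can be worth screening files against these limits client-side. A minimal sketch; `check_upload` is a hypothetical helper, the extension list covers only the audio formats above, and the 10-hour duration limit is not checked since that would require decoding the file:

```python
# Audio subset of the supported formats listed above (video formats omitted).
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus"}
MAX_BYTES = 3 * 1024**3  # 3 GB upload limit

def check_upload(filename, size_bytes):
    """Return (ok, reason) for a candidate upload based on extension and size."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in AUDIO_EXTS:
        return False, f"unsupported extension: {ext or '(none)'}"
    if size_bytes > MAX_BYTES:
        return False, "file exceeds the 3GB limit"
    return True, "ok"

print(check_upload("talk.mp3", 1024))  # (True, 'ok')
print(check_upload("talk.xyz", 1024))  # (False, 'unsupported extension: .xyz')
```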
```json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {"text": "The", "start": 0.0, "end": 0.15, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing", "speaker_id": "speaker_0"}
  ]
}
```
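The `words` array maps naturally onto subtitle formats such as SRT, one of the batch model's stated use cases. A rough sketch that renders a single cue from word entries (real subtitle generation would chunk long transcripts into many cues; these helpers are illustrative, not SDK functions):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt_cue(words, index=1):
    """Render one SRT cue from word entries, skipping spacing/audio_event items."""
    spoken = [w for w in words if w["type"] == "word"]
    text = " ".join(w["text"] for w in spoken)
    start, end = spoken[0]["start"], spoken[-1]["end"]
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

words = [
    {"text": "The", "start": 0.0, "end": 0.15, "type": "word"},
    {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing"},
    {"text": "full", "start": 0.16, "end": 0.5, "type": "word"},
]
print(words_to_srt_cue(words))
```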
Word types:
- `word` - An actual spoken word
- `spacing` - Whitespace between words (useful for precise timing)
- `audio_event` - Non-speech sounds the model detected (laughter, applause, music, etc.)

Basic error handling:

```python
try:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
except Exception as e:
    print(f"Transcription failed: {e}")
```
Common errors:
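Transient failures (network blips, rate limiting) are often worth retrying with exponential backoff. A generic wrapper sketch, not part of the SDK, shown here with a stand-in flaky function:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for a transcription call that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(with_retries(flaky, attempts=3, base_delay=0))  # ok
```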
Monitor usage via the `request-id` response header:

```python
response = client.speech_to_text.convert.with_raw_response(file=audio_file, model_id="scribe_v2")
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")
```
For live transcription with ultra-low latency (~150ms), use the real-time API. It produces two types of transcripts: partial transcripts (live interim results) and committed transcripts (final results after a commit).
A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.
```python
import asyncio

from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")
        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())
```
```jsx
import { useState } from "react";
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");
  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });
  const start = async () => {
    // Get a token from your backend (never expose the API key to the client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());
    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };
  return <button onClick={start}>Start Recording</button>;
}
```
| Strategy | Description |
|---|---|
| Manual | You call commit() when ready - use for file processing or when you control the audio segments |
| VAD | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |
```javascript
// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});
```

```javascript
// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});
```
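The two VAD knobs interact: the threshold decides which audio frames count as silence, and the silence-threshold seconds decide how long silence must persist before an auto-commit fires. A toy simulation of that interaction (this is not the SDK's implementation):

```python
def vad_commit_points(frame_probs, frame_secs=0.1, threshold=0.4, silence_secs=1.5):
    """Return frame indices where sustained silence would trigger an auto-commit.

    frame_probs: per-frame speech probabilities; a frame below `threshold`
    counts as silence, and `silence_secs` of consecutive silence commits.
    """
    needed = int(round(silence_secs / frame_secs))  # silent frames before commit
    commits, silent = [], 0
    for i, p in enumerate(frame_probs):
        silent = silent + 1 if p < threshold else 0  # speech resets the run
        if silent >= needed:
            commits.append(i)
            silent = 0  # start counting a fresh silence run
    return commits

# 2s of speech then 2s of silence (0.1s frames): commit fires after 1.5s of silence
probs = [0.9] * 20 + [0.1] * 20
print(vad_commit_points(probs))  # [34]
```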
| Event | Description |
|---|---|
| `partial_transcript` | Live interim results |
| `committed_transcript` | Final results after commit |
| `committed_transcript_with_timestamps` | Final results with word timing |
| `error` | Error occurred |
See the real-time API reference for complete documentation.
Weekly Installs: 1.8K
Repository: github.com/elevenlabs/skills
GitHub Stars: 142
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: codex (1.4K), gemini-cli (1.4K), opencode (1.4K), github-copilot (1.3K), kimi-cli (1.2K), amp (1.2K)