gemini-live-api-dev by google-gemini/gemini-skills
```shell
npx skills add https://github.com/google-gemini/gemini-skills --skill gemini-live-api-dev
```
The Live API enables low-latency, real-time voice and video interactions with Gemini over WebSockets. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses.
Key capabilities:

- Bidirectional streaming of audio, video, and text over a single WebSocket session
- Native audio output with affective dialog and proactive audio
- Input and output transcription
- Interruption handling (the model stops when the user barges in)
> [!NOTE]
> The Live API currently only supports WebSockets. For WebRTC support or simplified integration, use a partner integration.
Models:

- `gemini-2.5-flash-native-audio-preview-12-2025` — Native audio output, affective dialog, proactive audio, thinking. 128k context window. This is the recommended model for all Live API use cases.

> [!WARNING]
> The following Live API models are deprecated and will be shut down. Migrate to `gemini-2.5-flash-native-audio-preview-12-2025`.

- `gemini-live-2.5-flash-preview` — Released June 17, 2025. Shutdown: December 9, 2025.
- `gemini-2.0-flash-live-001` — Released April 9, 2025. Shutdown: December 9, 2025.
SDKs:

- Python: `google-genai` — `pip install google-genai`
- JavaScript/TypeScript: `@google/genai` — `npm install @google/genai`

> [!WARNING]
> The legacy SDKs `google-generativeai` (Python) and `@google/generative-ai` (JS) are deprecated. Use the new SDKs above.
To streamline real-time audio/video app development, use a third-party integration that supports the Gemini Live API over WebRTC or WebSockets:
Real-time audio input uses raw PCM at 16 kHz (`audio/pcm;rate=16000`).

> [!IMPORTANT]
> Use `send_realtime_input`/`sendRealtimeInput` for all real-time user input (audio, video, and text). Use `send_client_content`/`sendClientContent` only for incremental conversation history updates (appending prior turns to context), not for sending new user messages.

> [!WARNING]
> Do not use `media` in `sendRealtimeInput`. Use the specific keys: `audio` for audio data, `video` for images/video frames, and `text` for text input.
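Most capture libraries deliver mono float samples in [-1.0, 1.0], while the wire format above is raw little-endian 16-bit PCM. A minimal sketch of that conversion; `floats_to_pcm16` is a hypothetical helper, not part of the SDK:

```python
import struct

def floats_to_pcm16(samples):
    """Convert mono float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)

# One second of silence at 16 kHz is 16000 samples -> 32000 bytes.
chunk = floats_to_pcm16([0.0] * 16000)
```

The resulting bytes can be passed directly as the `data` of the audio `Blob` shown below.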
```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
```
```javascript
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: 'YOUR_API_KEY' });
```
```python
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=[types.Modality.AUDIO],
    system_instruction=types.Content(
        parts=[types.Part(text="You are a helpful assistant.")]
    )
)

async with client.aio.live.connect(
    model="gemini-2.5-flash-native-audio-preview-12-2025", config=config
) as session:
    pass  # Session is now active
```
```javascript
const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: ['audio'],
    systemInstruction: { parts: [{ text: 'You are a helpful assistant.' }] }
  },
  callbacks: {
    onopen: () => console.log('Connected'),
    onmessage: (response) => console.log('Message:', response),
    onerror: (error) => console.error('Error:', error),
    onclose: () => console.log('Closed')
  }
});
```
```python
await session.send_realtime_input(text="Hello, how are you?")
```

```javascript
session.sendRealtimeInput({ text: 'Hello, how are you?' });
```
```python
# chunk: raw 16-bit PCM bytes at 16 kHz
await session.send_realtime_input(
    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
)
```

```javascript
session.sendRealtimeInput({
  audio: { data: chunk.toString('base64'), mimeType: 'audio/pcm;rate=16000' }
});
```
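The `chunk` variable in the snippets above is assumed to be a short slice of raw PCM bytes. A sketch of slicing a captured buffer into fixed-duration chunks for streaming; the 50 ms chunk size and the helper name are illustrative, not from the SDK:

```python
CHUNK_MS = 50                # assumed chunk duration; tune for latency
SAMPLE_RATE = 16000          # Live API input rate
BYTES_PER_SAMPLE = 2         # 16-bit PCM
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def iter_pcm_chunks(pcm: bytes):
    """Yield fixed-size chunks of a raw PCM byte stream, preserving sample alignment."""
    for offset in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[offset:offset + CHUNK_BYTES]
```

Because `CHUNK_BYTES` is a multiple of the sample width, chunk boundaries never split a 16-bit sample.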
```python
# frame: raw JPEG-encoded bytes
await session.send_realtime_input(
    video=types.Blob(data=frame, mime_type="image/jpeg")
)
```

```javascript
session.sendRealtimeInput({
  video: { data: frame.toString('base64'), mimeType: 'image/jpeg' }
});
```
```python
async for response in session.receive():
    content = response.server_content
    if content:
        # Audio
        if content.model_turn:
            for part in content.model_turn.parts:
                if part.inline_data:
                    audio_data = part.inline_data.data
        # Transcription
        if content.input_transcription:
            print(f"User: {content.input_transcription.text}")
        if content.output_transcription:
            print(f"Gemini: {content.output_transcription.text}")
        # Interruption
        if content.interrupted is True:
            pass  # Stop playback, clear audio queue
```
```javascript
// Inside the onmessage callback
const content = response.serverContent;
if (content?.modelTurn?.parts) {
  for (const part of content.modelTurn.parts) {
    if (part.inlineData) {
      const audioData = part.inlineData.data; // Base64 encoded
    }
  }
}
if (content?.inputTranscription) console.log('User:', content.inputTranscription.text);
if (content?.outputTranscription) console.log('Gemini:', content.outputTranscription.text);
if (content?.interrupted) { /* Stop playback, clear audio queue */ }
```
Key rules:

- `response_modalities` supports TEXT or AUDIO per session, not both.
- Use `send_realtime_input` for all real-time user input (audio, video, text). Reserve `send_client_content` only for injecting conversation history.
- Send `audioStreamEnd` when the mic is paused to flush cached audio.

For detailed API documentation, fetch from the official docs index:
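These rules imply some client-side bookkeeping: transcription text arrives in fragments, and an `interrupted` signal means any queued-but-unplayed audio should be discarded. A minimal sketch operating on plain dicts shaped like the JS server messages above; `LiveSessionState` is illustrative, not an SDK class:

```python
class LiveSessionState:
    """Accumulates transcripts and manages the playback queue across server messages."""

    def __init__(self):
        self.user_text = ""
        self.model_text = ""
        self.playback_queue = []   # audio chunks received but not yet played

    def on_server_content(self, content):
        # Transcriptions arrive as incremental fragments; append them.
        if content.get("inputTranscription"):
            self.user_text += content["inputTranscription"]["text"]
        if content.get("outputTranscription"):
            self.model_text += content["outputTranscription"]["text"]
        # Queue any audio parts for playback.
        for part in (content.get("modelTurn") or {}).get("parts", []):
            if "inlineData" in part:
                self.playback_queue.append(part["inlineData"]["data"])
        # User barged in: drop audio the model produced but we haven't played.
        if content.get("interrupted"):
            self.playback_queue.clear()
```

The same shape works for the Python SDK's attribute-style objects by swapping `content.get(...)` for attribute access.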
`llms.txt` URL: https://ai.google.dev/gemini-api/docs/llms.txt
This index contains links to all documentation pages in `.md.txt` format. Use web fetch tools to:

- Fetch `llms.txt` to discover available documentation pages
- Fetch individual pages (e.g. https://ai.google.dev/gemini-api/docs/live-session.md.txt)

> [!IMPORTANT]
> These are not all the documentation pages. Use the `llms.txt` index to discover the full set of available pages.
The Live API supports more than 70 languages, including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Hindi, Arabic, and Russian. Native audio models automatically detect and switch languages.
- Weekly Installs: 657
- GitHub Stars: 2.3K
- First Seen: Mar 3, 2026
- Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
- Installed on: gemini-cli (607), codex (600), cursor (599), opencode (597), kimi-cli (596), github-copilot (596)