azure-ai-voicelive-py by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill azure-ai-voicelive-py
Build real-time voice AI applications with bidirectional WebSocket communication.
pip install azure-ai-voicelive aiohttp azure-identity
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
DefaultAzureCredential (preferred):

import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
API Key:

import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
import asyncio
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
The VoiceLiveConnection exposes these resources:

| Resource | Purpose | Key Methods |
|---|---|---|
| conn.session | Session configuration | update(session=...) |
| conn.response | Model responses | create(), cancel() |
| conn.input_audio_buffer | Audio input | append(), commit(), clear() |
| conn.output_audio_buffer | Audio output | clear() |
| conn.conversation | Conversation state | item.create(), item.delete(), item.truncate() |
| conn.transcription_session | Transcription config | update(session=...) |
from azure.ai.voicelive.models import RequestSession, FunctionTool
await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
import base64
# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
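The snippet above sends a single chunk. For a longer capture you would split the PCM buffer into fixed-size pieces and append each one; a minimal sketch (the 4800-byte default, 100 ms of 24 kHz 16-bit mono, is this sketch's choice — the service accepts arbitrary chunk boundaries):

```python
import base64

def pcm_to_b64_chunks(pcm: bytes, chunk_bytes: int = 4800):
    """Yield base64-encoded slices of a raw PCM buffer, chunk_bytes at a time."""
    for i in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[i:i + chunk_bytes]).decode()

# Usage inside the connection:
# for b64 in pcm_to_b64_chunks(buffer):
#     await conn.input_audio_buffer.append(audio=b64)
```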
async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio complete")
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")
        # Function calls
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
await conn.session.update(session={"turn_detection": None})
# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit() # End of user turn
await conn.response.create() # Trigger response
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| shimmer | Clear, professional |
| sage | Calm, authoritative |
| coral | Friendly, upbeat |
| ash | Deep, measured |
| ballad | Expressive |
| verse | Storytelling |
Azure voices: Use AzureStandardVoice, AzureCustomVoice, or AzurePersonalVoice models.
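The OpenAI-style voices above are selected with a plain string; Azure voices go through the SDK's voice models instead. As a sketch only — the "azure-standard" type string, the "name" field, and the en-US-AvaNeural voice name are assumptions not confirmed by this page — the session payload might look like:

```python
# Hypothetical session payload selecting an Azure neural voice.
# Field names are assumptions based on the AzureStandardVoice model named above.
azure_voice_session = {
    "voice": {"type": "azure-standard", "name": "en-US-AvaNeural"},
    "modalities": ["text", "audio"],
}

# await conn.session.update(session=azure_voice_session)
```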
| Format | Sample Rate | Use Case |
|---|---|---|
| pcm16 | 24kHz | Default, high quality |
| pcm16-8000hz | 8kHz | Telephony |
| pcm16-16000hz | 16kHz | Voice assistants |
| g711_ulaw | 8kHz | Telephony (US) |
| g711_alaw | 8kHz | Telephony (EU) |
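These formats imply very different data rates, which matters when sizing audio chunks and playback buffers. A small helper derived from the table (PCM16 is 2 bytes per sample, G.711 is 1 byte per sample; mono assumed throughout):

```python
# Bytes per second for each format in the table above (mono audio).
BYTES_PER_SECOND = {
    "pcm16": 24000 * 2,         # 16-bit samples at 24 kHz
    "pcm16-8000hz": 8000 * 2,
    "pcm16-16000hz": 16000 * 2,
    "g711_ulaw": 8000,          # G.711 encodes one byte per sample at 8 kHz
    "g711_alaw": 8000,
}

def chunk_size(fmt: str, ms: int) -> int:
    """Bytes needed to hold `ms` milliseconds of audio in `fmt`."""
    return BYTES_PER_SECOND[fmt] * ms // 1000

print(chunk_size("pcm16", 100))  # 4800 bytes per 100 ms chunk
```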
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}
# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"} # English optimized
{"type": "azure_semantic_vad_multilingual"}
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed
try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
Use this skill to execute the workflows and operations described in the overview.
Weekly Installs: 51
Repository: sickn33/antigravity-awesome-skills
GitHub Stars: 27.6K
First Seen: Feb 17, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (50), codex (50), gemini-cli (49), github-copilot (49), amp (49), cline (49)