speech-to-text by martinholovsky/claude-skills-generator
npx skills add https://github.com/martinholovsky/claude-skills-generator --skill speech-to-text
文件组织:拆分结构。详细实现请参见
references/。
风险等级 : 中等 - 处理音频输入,存在潜在的隐私问题,资源密集
您是语音转文本系统专家,在 Faster Whisper、音频处理和转录优化方面拥有深厚的专业知识。您的专长涵盖模型选择、音频预处理、实时转录以及语音数据的隐私保护。
您擅长:
主要使用场景 :
在实现语音转文本时,您将:
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
Faster Whisper
| 使用场景 | 版本 | 备注 |
|---|---|---|
| 生产环境 | faster-whisper>=1.0.0 | 经过 CTranslate2 优化 |
| 最低要求 | faster-whisper>=0.9.0 | 稳定的 API |
支持库
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10 # 语音活动检测
pydub>=0.25.0 # 音频处理
structlog>=23.0
| 模型 | 大小 | 速度 | 准确度 | 使用场景 |
|---|---|---|---|---|
| tiny | 39MB | 最快 | 低 | 测试 |
| base | 74MB | 快 | 中等 | 快速响应 |
| small | 244MB | 中等 | 好 | 一般用途 |
| medium | 769MB | 慢 | 更好 | 复杂音频 |
| large-v3 | 1.5GB | 最慢 | 最佳 | 最高准确度 |
# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf


class TestSTTEngine:
    """Functional tests for the secure speech-to-text engine."""

    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        # One second of silence at 16 kHz is enough to exercise the API.
        wav_path = tmp_path / "test.wav"
        sf.write(wav_path, np.zeros(16000, dtype=np.float32), 16000)
        result = engine.transcribe(str(wav_path))
        assert isinstance(result, str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        # Privacy requirement: the source audio must not survive transcription.
        wav_path = tmp_path / "test.wav"
        sf.write(wav_path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(wav_path))
        assert not wav_path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        # Files above the 50 MB cap must be refused before any processing.
        oversized = tmp_path / "large.wav"
        oversized.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(oversized))
class TestSTTPerformance:
    """Latency and memory regression tests for the STT engine."""

    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        # One second of quiet noise keeps transcription work minimal.
        wav_path = tmp_path / "short.wav"
        sf.write(wav_path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
        start = time.perf_counter()
        engine.transcribe(str(wav_path))
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert elapsed_ms < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        baseline = tracemalloc.get_traced_memory()[0]
        # Ten consecutive transcriptions must not accumulate buffers.
        for i in range(10):
            wav_path = tmp_path / f"test_{i}.wav"
            sf.write(wav_path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(wav_path))
        growth_mb = (tracemalloc.get_traced_memory()[0] - baseline) / 1024 / 1024
        tracemalloc.stop()
        assert growth_mb < 50, f"Memory grew {growth_mb:.1f}MB"
# jarvis/stt/engine.py
from faster_whisper import WhisperModel


class SecureSTTEngine:
    """Minimal transcription engine: just enough to make the tests pass."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe an audio file and return the joined segment text."""
        segments, _ = self.model.transcribe(audio_path)
        parts = [segment.text for segment in segments]
        return " ".join(parts).strip()
从模式 1 中添加验证、安全性、清理和优化。
# 运行所有语音转文本测试
pytest tests/test_stt_engine.py -v --tb=short
# 运行覆盖率测试
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing
# 仅运行性能测试
pytest tests/test_stt_engine.py -k "performance" -v
# GOOD - Stream audio chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    """Buffer incoming audio; transcribe once >= 0.5 s has accumulated.

    Returns the transcription text, or None while still buffering.
    """
    self.buffer.append(chunk)
    buffered_seconds = sum(len(c) for c in self.buffer) / sr
    if buffered_seconds < 0.5:
        return None
    audio = np.concatenate(self.buffer)
    segments, _ = self.model.transcribe(audio, vad_filter=True)
    # Clear only after a successful transcription (matches original flow).
    self.buffer = []
    return " ".join(segment.text for segment in segments)
# 不良实践 - 等待完整音频
result = model.transcribe(audio_path) # 用户等待整个录音完成
# GOOD - Filter silence before transcription
import webrtcvad

vad = webrtcvad.Vad(2)


def extract_speech(audio, sr=16000):
    """Return only the voiced frames of *audio*, as detected by WebRTC VAD.

    Args:
        audio: float waveform, assumed in [-1.0, 1.0] (scaled to int16 for VAD).
        sr: sample rate in Hz; must be one webrtcvad supports
            (8000/16000/32000/48000) -- TODO confirm at call sites.

    Returns:
        Float array of concatenated speech frames. Empty array when no
        speech is detected (the original crashed here: np.concatenate
        raises ValueError on an empty list).
    """
    # VAD operates on 16-bit PCM, so convert once up front.
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30 ms frames, as webrtcvad requires
    voiced = [
        audio[i:i + frame_size]
        for i in range(0, len(audio_int16), frame_size)
        # A trailing partial frame is dropped: the VAD needs full frames.
        if len(audio_int16[i:i + frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i + frame_size].tobytes(), sr)
    ]
    if not voiced:
        return np.zeros(0, dtype=audio.dtype)
    return np.concatenate(voiced)
# 不良实践 - 处理包含静音的整个音频
model.transcribe(audio_path) # 在静音上浪费计算资源
# 良好实践 - CPU 上的量化模型
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")
# 良好实践 - GPU 上的 Float16 模型
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")
# 不良实践 - 不必要的全精度模型
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
# GOOD - Process multiple files in parallel
from concurrent.futures import ThreadPoolExecutor


def transcribe_batch(engine, paths):
    """Transcribe *paths* concurrently with up to four worker threads.

    Results are returned in input order, one transcription per path.
    """
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(engine.transcribe, paths)
        return list(results)
# 不良实践 - 顺序处理
results = [engine.transcribe(p) for p in paths] # 每个都阻塞
# GOOD - Fixed-size ring buffer
class RingBuffer:
    """Fixed-capacity audio ring buffer backed by one pre-allocated array.

    Unlike an ever-growing list of chunks, memory use is constant: new
    samples overwrite the oldest ones once the write cursor wraps.
    """

    def __init__(self, max_samples):
        # Single allocation; append() only ever writes into this array.
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0  # next write position

    def append(self, audio):
        """Copy *audio* into the buffer at the write cursor, wrapping.

        Fixes over the naive version: empty input is a no-op, and input
        longer than the capacity keeps only the newest samples (the naive
        slice assignments raised ValueError in both cases).
        """
        n = len(audio)
        capacity = len(self.buffer)
        if n == 0:
            # Nothing to write; naive code hit a slice-length mismatch here.
            return
        if n > capacity:
            # Only the most recent `capacity` samples can survive anyway.
            self.buffer[:] = audio[n - capacity:]
            self.idx = 0
            return
        end = (self.idx + n) % capacity
        if end > self.idx:
            # Contiguous write, no wrap.
            self.buffer[self.idx:end] = audio
        else:
            # Wrapped write: fill the tail of the array, then the front.
            split = capacity - self.idx
            self.buffer[self.idx:] = audio[:split]
            self.buffer[:end] = audio[split:]
        self.idx = end
# 不良实践 - 无限制的列表增长
chunks = []
chunks.append(audio) # 随着时间的推移导致内存泄漏
from faster_whisper import WhisperModel
from pathlib import Path
import tempfile, os, structlog

logger = structlog.get_logger()


class ValidationError(Exception):
    """Raised when an audio file fails pre-transcription validation.

    The original code raised this name without ever defining it, which
    turned every validation failure into a NameError.
    """


class SecureSTTEngine:
    """Speech-to-text engine with input validation and post-use cleanup."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        # Whitelist model sizes so a typo cannot trigger an arbitrary download.
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        # Private scratch directory (0o700: owner-only) for intermediate audio.
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe *audio_path* and delete the file afterwards.

        Raises:
            ValidationError: if the file is missing, too large, or has an
                unsupported extension.
        """
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")
        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            # Log metadata only -- never the transcription content.
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            # Privacy: the source audio must not outlive the transcription.
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        """Cheap pre-checks: existence, 50 MB size cap, known audio suffix."""
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        """Remove the scratch directory and anything left inside it."""
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)
class PrivacyAwareSTT:
    """STT wrapper that strips PII from transcriptions before returning them."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe *audio_path*, redact PII, and log metadata only."""
        raw_text = self.engine.transcribe(audio_path)
        redacted = self._remove_pii(raw_text)
        was_filtered = redacted != raw_text
        # Content never reaches the logs -- only counts and flags.
        logger.info("stt.transcribed_private",
                    word_count=len(redacted.split()),
                    had_pii=was_filtered)
        return {
            "text": redacted,
            "privacy_filtered": was_filtered
        }

    def _remove_pii(self, text: str) -> str:
        """Replace phone/email/SSN/card patterns with bracketed placeholders.

        Order matters and is preserved: phone, email, SSN, card.
        """
        import re
        substitutions = [
            (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]'),   # phone numbers
            (r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]'),        # email addresses
            (r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]'),       # social security numbers
            (r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]'),  # card numbers
        ]
        for pattern, placeholder in substitutions:
            text = re.sub(pattern, placeholder, text)
        return text
隐私问题 : 音频包含敏感对话,语音生物特征属于个人身份信息,转录可能泄露数据。
必需的缓解措施 :
# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    """Transcribe the file, deleting it even when transcription fails."""
    try:
        return engine.transcribe(audio_path)
    finally:
        # missing_ok: an already-removed file is not an error.
        Path(audio_path).unlink(missing_ok=True)
# Validate before processing
class ValidationError(Exception):
    """Raised when an audio file fails validation. (Previously this name
    was raised without ever being defined, so every failure surfaced as
    a NameError.)"""


def validate_audio(path: str) -> bool:
    """Check extension and the 50 MB size cap before any processing.

    Raises:
        ValidationError: on an unsupported format or an oversized file.
    """
    p = Path(path)
    # Check the suffix first: it needs no filesystem access, so a bad
    # format is reported cleanly even when the path does not exist
    # (the original called p.stat() first and raised FileNotFoundError).
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    return True
# BAD - Audio file persists
def transcribe(path):
    # Anti-pattern: the input file is left on disk after transcription.
    return model.transcribe(path)  # file remains
# GOOD - Delete after use
def transcribe(path):
    """Transcribe *path*, removing the audio file no matter what."""
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()
# 不良实践 - 记录敏感内容
logger.info(f"Transcribed: {text}")
# 良好实践 - 仅记录元数据
logger.info("stt.complete", word_count=len(text.split()))
pytest tests/test_stt_engine.py -v
pytest --cov=jarvis.stt
您的目标是创建具备以下特点的语音转文本系统:
您了解语音数据需要特殊的隐私保护。始终在处理后删除音频,绝不记录转录内容,并从输出中过滤个人身份信息。
关键提醒 :
每周安装次数
140
代码仓库
GitHub 星标数
29
首次出现
Jan 20, 2026
安全审计
安装于
opencode115
gemini-cli114
codex112
cursor110
github-copilot108
amp97
File Organization : Split structure. See
references/for detailed implementations.
Risk Level : MEDIUM - Processes audio input, potential privacy concerns, resource-intensive
You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.
You excel at:
Primary Use Cases :
When implementing STT, you will:
Faster Whisper
| Use Case | Version | Notes |
|---|---|---|
| Production | faster-whisper>=1.0.0 | CTranslate2 optimized |
| Minimum | faster-whisper>=0.9.0 | Stable API |
Supporting Libraries
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10 # Voice activity detection
pydub>=0.25.0 # Audio processing
structlog>=23.0
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39MB | Fastest | Low | Testing |
| base | 74MB | Fast | Medium | Quick responses |
| small | 244MB | Medium | Good | General use |
| medium | 769MB | Slow | Better | Complex audio |
| large-v3 | 1.5GB | Slowest | Best | Maximum accuracy |
# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf


class TestSTTEngine:
    """Functional tests for the secure speech-to-text engine."""

    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        # One second of silence at 16 kHz is enough to exercise the API.
        wav_path = tmp_path / "test.wav"
        sf.write(wav_path, np.zeros(16000, dtype=np.float32), 16000)
        result = engine.transcribe(str(wav_path))
        assert isinstance(result, str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        # Privacy requirement: the source audio must not survive transcription.
        wav_path = tmp_path / "test.wav"
        sf.write(wav_path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(wav_path))
        assert not wav_path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        # Files above the 50 MB cap must be refused before any processing.
        oversized = tmp_path / "large.wav"
        oversized.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(oversized))
class TestSTTPerformance:
    """Latency and memory regression tests for the STT engine."""

    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        # One second of quiet noise keeps transcription work minimal.
        wav_path = tmp_path / "short.wav"
        sf.write(wav_path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
        start = time.perf_counter()
        engine.transcribe(str(wav_path))
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert elapsed_ms < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        baseline = tracemalloc.get_traced_memory()[0]
        # Ten consecutive transcriptions must not accumulate buffers.
        for i in range(10):
            wav_path = tmp_path / f"test_{i}.wav"
            sf.write(wav_path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(wav_path))
        growth_mb = (tracemalloc.get_traced_memory()[0] - baseline) / 1024 / 1024
        tracemalloc.stop()
        assert growth_mb < 50, f"Memory grew {growth_mb:.1f}MB"
# jarvis/stt/engine.py
from faster_whisper import WhisperModel


class SecureSTTEngine:
    """Minimal transcription engine: just enough to make the tests pass."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe an audio file and return the joined segment text."""
        segments, _ = self.model.transcribe(audio_path)
        parts = [segment.text for segment in segments]
        return " ".join(parts).strip()
Add validation, security, cleanup, and optimizations from Pattern 1.
# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short
# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing
# Run performance tests only
pytest tests/test_stt_engine.py -k "performance" -v
# GOOD - Stream audio chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    """Buffer incoming audio; transcribe once >= 0.5 s has accumulated.

    Returns the transcription text, or None while still buffering.
    """
    self.buffer.append(chunk)
    buffered_seconds = sum(len(c) for c in self.buffer) / sr
    if buffered_seconds < 0.5:
        return None
    audio = np.concatenate(self.buffer)
    segments, _ = self.model.transcribe(audio, vad_filter=True)
    # Clear only after a successful transcription (matches original flow).
    self.buffer = []
    return " ".join(segment.text for segment in segments)
# BAD - Wait for complete audio
result = model.transcribe(audio_path) # User waits for entire recording
# GOOD - Filter silence before transcription
import webrtcvad

vad = webrtcvad.Vad(2)


def extract_speech(audio, sr=16000):
    """Return only the voiced frames of *audio*, as detected by WebRTC VAD.

    Args:
        audio: float waveform, assumed in [-1.0, 1.0] (scaled to int16 for VAD).
        sr: sample rate in Hz; must be one webrtcvad supports
            (8000/16000/32000/48000) -- TODO confirm at call sites.

    Returns:
        Float array of concatenated speech frames. Empty array when no
        speech is detected (the original crashed here: np.concatenate
        raises ValueError on an empty list).
    """
    # VAD operates on 16-bit PCM, so convert once up front.
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30 ms frames, as webrtcvad requires
    voiced = [
        audio[i:i + frame_size]
        for i in range(0, len(audio_int16), frame_size)
        # A trailing partial frame is dropped: the VAD needs full frames.
        if len(audio_int16[i:i + frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i + frame_size].tobytes(), sr)
    ]
    if not voiced:
        return np.zeros(0, dtype=audio.dtype)
    return np.concatenate(voiced)
# BAD - Process entire audio including silence
model.transcribe(audio_path) # Wastes compute on silence
# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")
# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")
# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
# GOOD - Process multiple files in parallel
from concurrent.futures import ThreadPoolExecutor


def transcribe_batch(engine, paths):
    """Transcribe *paths* concurrently with up to four worker threads.

    Results are returned in input order, one transcription per path.
    """
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(engine.transcribe, paths)
        return list(results)
# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths] # Blocks on each
# GOOD - Fixed-size ring buffer
class RingBuffer:
    """Fixed-capacity audio ring buffer backed by one pre-allocated array.

    Unlike an ever-growing list of chunks, memory use is constant: new
    samples overwrite the oldest ones once the write cursor wraps.
    """

    def __init__(self, max_samples):
        # Single allocation; append() only ever writes into this array.
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0  # next write position

    def append(self, audio):
        """Copy *audio* into the buffer at the write cursor, wrapping.

        Fixes over the naive version: empty input is a no-op, and input
        longer than the capacity keeps only the newest samples (the naive
        slice assignments raised ValueError in both cases).
        """
        n = len(audio)
        capacity = len(self.buffer)
        if n == 0:
            # Nothing to write; naive code hit a slice-length mismatch here.
            return
        if n > capacity:
            # Only the most recent `capacity` samples can survive anyway.
            self.buffer[:] = audio[n - capacity:]
            self.idx = 0
            return
        end = (self.idx + n) % capacity
        if end > self.idx:
            # Contiguous write, no wrap.
            self.buffer[self.idx:end] = audio
        else:
            # Wrapped write: fill the tail of the array, then the front.
            split = capacity - self.idx
            self.buffer[self.idx:] = audio[:split]
            self.buffer[:end] = audio[split:]
        self.idx = end
# BAD - Unbounded list growth
chunks = []
chunks.append(audio) # Memory leak over time
from faster_whisper import WhisperModel
from pathlib import Path
import tempfile, os, structlog

logger = structlog.get_logger()


class ValidationError(Exception):
    """Raised when an audio file fails pre-transcription validation.

    The original code raised this name without ever defining it, which
    turned every validation failure into a NameError.
    """


class SecureSTTEngine:
    """Speech-to-text engine with input validation and post-use cleanup."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        # Whitelist model sizes so a typo cannot trigger an arbitrary download.
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        # Private scratch directory (0o700: owner-only) for intermediate audio.
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe *audio_path* and delete the file afterwards.

        Raises:
            ValidationError: if the file is missing, too large, or has an
                unsupported extension.
        """
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")
        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            # Log metadata only -- never the transcription content.
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            # Privacy: the source audio must not outlive the transcription.
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        """Cheap pre-checks: existence, 50 MB size cap, known audio suffix."""
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        """Remove the scratch directory and anything left inside it."""
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)
class PrivacyAwareSTT:
    """STT wrapper that strips PII from transcriptions before returning them."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe *audio_path*, redact PII, and log metadata only."""
        raw_text = self.engine.transcribe(audio_path)
        redacted = self._remove_pii(raw_text)
        was_filtered = redacted != raw_text
        # Content never reaches the logs -- only counts and flags.
        logger.info("stt.transcribed_private",
                    word_count=len(redacted.split()),
                    had_pii=was_filtered)
        return {
            "text": redacted,
            "privacy_filtered": was_filtered
        }

    def _remove_pii(self, text: str) -> str:
        """Replace phone/email/SSN/card patterns with bracketed placeholders.

        Order matters and is preserved: phone, email, SSN, card.
        """
        import re
        substitutions = [
            (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]'),   # phone numbers
            (r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]'),        # email addresses
            (r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]'),       # social security numbers
            (r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]'),  # card numbers
        ]
        for pattern, placeholder in substitutions:
            text = re.sub(pattern, placeholder, text)
        return text
Privacy Concerns : Audio contains sensitive conversations, voice biometrics are PII, transcriptions may leak data.
Required Mitigations :
# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    """Transcribe the file, deleting it even when transcription fails."""
    try:
        return engine.transcribe(audio_path)
    finally:
        # missing_ok: an already-removed file is not an error.
        Path(audio_path).unlink(missing_ok=True)
# Validate before processing
class ValidationError(Exception):
    """Raised when an audio file fails validation. (Previously this name
    was raised without ever being defined, so every failure surfaced as
    a NameError.)"""


def validate_audio(path: str) -> bool:
    """Check extension and the 50 MB size cap before any processing.

    Raises:
        ValidationError: on an unsupported format or an oversized file.
    """
    p = Path(path)
    # Check the suffix first: it needs no filesystem access, so a bad
    # format is reported cleanly even when the path does not exist
    # (the original called p.stat() first and raised FileNotFoundError).
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    return True
# BAD - Audio persists
def transcribe(path):
    # Anti-pattern: the input file is left on disk after transcription.
    return model.transcribe(path)  # file remains
# GOOD - Delete after use
def transcribe(path):
    """Transcribe *path*, removing the audio file no matter what."""
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()
# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")
# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))
pytest tests/test_stt_engine.py -v
pytest --cov=jarvis.stt
Your goal is to create STT systems that are:
You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.
Critical Reminders :
Weekly Installs
140
Repository
GitHub Stars
29
First Seen
Jan 20, 2026
Security Audits
Gen Agent Trust Hub: Fail · Socket: Pass · Snyk: Pass
Installed on
opencode115
gemini-cli114
codex112
cursor110
github-copilot108
amp97
AI 代码实施计划编写技能 | 自动化开发任务分解与 TDD 流程规划工具
48,300 周安装