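Google Gemini Multimodal AI Skill: Audio Transcription, Image Analysis, Video Processing, Document Extraction, and Generation

ai-multimodal by mrgoonie/claudekit-skills

242 weekly installs

1,900 GitHub Stars

Install command

npx skills add https://github.com/mrgoonie/claudekit-skills --skill ai-multimodal

AI/Machine Learning, Content Creation, Automation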

AI Multimodal Processing Skill

Process audio, images, video, and documents, and generate images, using Google Gemini's multimodal API. The skill provides a unified interface for multimedia content understanding and generation.

Core Capabilities

Audio Processing

Transcription with timestamps (up to 9.5 hours)
Audio summarization and analysis
Speech understanding and speaker identification
Music and environmental sound analysis
Text-to-speech generation with controllable voice

Image Understanding

Image captioning and description
Object detection with bounding boxes (2.0+)
Pixel-level segmentation (2.5+)
Visual question answering
Multi-image comparison (up to 3,600 images)
OCR and text extraction

Video Analysis

Scene detection and summarization
Video Q&A with temporal understanding
Transcription with visual descriptions
YouTube URL support
Long video processing (up to 6 hours)
Frame-level analysis

Document Extraction

Native PDF vision processing (up to 1,000 pages)
Table and form extraction
Chart and diagram analysis
Multi-page document understanding
Structured data output (JSON schema)
Format conversion (PDF to HTML/JSON)

Image Generation

Text-to-image generation
Image editing and modification
Multi-image composition (up to 3 images)
Iterative refinement
Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
Controllable style and quality

Capability Matrix

Task	Audio	Image	Video	Document	Generation
Transcription	✓	-	✓	-	-
Summarization	✓	✓	✓	✓	-
Q&A	✓	✓	✓	✓	-
Object Detection	-	✓	✓	-	-
Text Extraction	-	✓	-	✓	-
Structured Output	✓	✓	✓	✓	-
Generation	TTS	-	-	-	✓
Timestamps	✓	-	✓	-	-
Segmentation	-	✓	-	-	-

Model Selection Guide

Gemini 2.5 Series (Recommended)

gemini-2.5-pro : Highest quality, all features, 1M-2M context
gemini-2.5-flash : Best balance, all features, 1M-2M context
gemini-2.5-flash-lite : Lightweight, segmentation support
gemini-2.5-flash-image : Image generation only

Gemini 2.0 Series

gemini-2.0-flash : Fast processing, object detection
gemini-2.0-flash-lite : Lightweight option

Feature Requirements

Segmentation : Requires 2.5+ models
Object Detection : Requires 2.0+ models
Multi-video : Requires 2.5+ models
Image Generation : Requires flash-image model

Context Windows

2M tokens : ~6 hours video (low-res) or ~2 hours (default)
1M tokens : ~3 hours video (low-res) or ~1 hour (default)
Audio : 32 tokens/second (1 min = 1,920 tokens)
PDF : 258 tokens/page (fixed)
Image : 258-1,548 tokens based on size
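As a rough illustration of how these rates turn into a token budget, the sketch below multiplies durations and page counts by the per-unit figures listed above. The names `estimate_tokens` and `fits_in_context` are hypothetical helpers, not part of the skill's scripts. Images are left out because their cost depends on size, and the video rates are approximate.

```python
# Per-unit token rates quoted in this guide (video rates are approximate).
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0, video_seconds=0, pdf_pages=0, video_resolution="default"):
    """Return an approximate input-token count for a mix of media."""
    video_rate = VIDEO_TOKENS_PER_SECOND[video_resolution]
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + video_seconds * video_rate
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits_in_context(tokens, context_window=1_000_000):
    """Check an estimate against a context window size (hypothetical helper)."""
    return tokens <= context_window


if __name__ == "__main__":
    print(estimate_tokens(audio_seconds=60))  # 1920, matching the audio figure above
    fifty_minutes_of_video = estimate_tokens(video_seconds=50 * 60)
    print(fifty_minutes_of_video, fits_in_context(fifty_minutes_of_video))
```

An estimate like this is only a planning aid. The prompt text and any model output also count against the context window.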

Quick Start

Prerequisites

API Key Setup : Supports both Google AI Studio and Vertex AI.

The skill checks for GEMINI_API_KEY in this order:

Process environment: export GEMINI_API_KEY="your-key"
Project root: .env
.claude/.env
.claude/skills/.env
.claude/skills/ai-multimodal/.env
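The lookup order above can be pictured with a small standard-library sketch. This is not the skill's actual implementation (which may rely on python-dotenv). The function names and the simple KEY=VALUE parsing are assumptions made for illustration.

```python
import os
from pathlib import Path

# Candidate .env locations, in the lookup order listed above,
# relative to the project root.
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_key_from_env_file(path, name="GEMINI_API_KEY"):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().removeprefix("export ").strip() == name:
            # Drop surrounding quotes; treat an empty value as missing.
            return value.strip().strip("\"'") or None
    return None


def find_api_key(project_root=".", environ=None):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    for relative in ENV_FILE_CANDIDATES:
        value = read_key_from_env_file(Path(project_root) / relative)
        if value:
            return value
    return None
```

The first match wins, so a key exported in the shell takes precedence over any `.env` file.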

Get API key : https://aistudio.google.com/apikey

For Vertex AI :

export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional

Install SDK :

pip install google-genai python-dotenv pillow

Common Patterns

Transcribe Audio :

python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash

Analyze Image :

python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Process Video :

python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Extract from PDF :

python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json

Generate Image :

python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
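When the same `gemini_batch_process.py` invocation has to be repeated for many inputs, it can help to assemble the argument list in code. The sketch below is a hypothetical wrapper that uses only flags appearing in the examples above. It builds one command per file, because whether `--files` accepts several paths at once is not shown here.

```python
import shlex


def build_batch_command(file_path, task, prompt=None, output=None, model=None):
    """Assemble an argv list for scripts/gemini_batch_process.py (one file per call)."""
    command = [
        "python", "scripts/gemini_batch_process.py",
        "--files", file_path,
        "--task", task,
    ]
    if prompt is not None:
        command += ["--prompt", prompt]
    if output is not None:
        command += ["--output", output]
    if model is not None:
        command += ["--model", model]
    return command


if __name__ == "__main__":
    # File names are placeholders for illustration.
    for name in ["intro.mp3", "interview.mp3"]:
        argv = build_batch_command(name, "transcribe", model="gemini-2.5-flash")
        # Print a shell-safe rendering; pass argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Passing the list to `subprocess.run` rather than a formatted string avoids quoting problems with prompts that contain spaces.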

Optimize Media :

# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85

Convert Documents to Markdown :

# Convert DOCX to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20

Supported Formats

Audio

WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
Max 9.5 hours per request
Auto-downsampled to 16 Kbps mono

Images

PNG, JPEG, WEBP, HEIC, HEIF
Max 3,600 images per request
Resolution: ≤384px = 258 tokens, larger = tiled

Video

MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
Max 6 hours (low-res) or 2 hours (default)
YouTube URLs supported (public only)

Documents

PDF only for vision processing
Max 1,000 pages
TXT, HTML, Markdown supported (text-only)

Size Limits

Inline : <20MB total request
File API : 2GB per file, 20GB project quota
Retention : 48 hours auto-delete
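A minimal sketch of how these limits might drive the choice between inline data and the File API follows. It assumes 1 MB = 1024 × 1024 bytes and looks only at the file size, while the inline limit applies to the whole request, including the prompt. The function names are hypothetical.

```python
import os

MB = 1024 * 1024
GB = 1024 * MB
INLINE_LIMIT_BYTES = 20 * MB   # "<20MB total request"
FILE_API_LIMIT_BYTES = 2 * GB  # "2GB per file"


def choose_upload_method(size_bytes):
    """Return "inline", "file_api", or "too_large" for a payload size in bytes."""
    if size_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    return "too_large"


def choose_upload_method_for_path(path):
    """Convenience wrapper that reads the size from disk."""
    return choose_upload_method(os.path.getsize(path))
```

A file reported as "too_large" would need to be compressed or split before upload, for example with `media_optimizer.py`.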

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

references/audio-processing.md - Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation

Image Understanding

references/vision-understanding.md - Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison

Video Analysis

references/video-analysis.md - Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization

Document Extraction

references/document-extraction.md - PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling

Image Generation

references/image-generation.md - Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings

Cost Optimization

Token Costs

Pricing (per 1M tokens) :

Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

Token Rates :

Audio: 32 tokens/second (1 min = 1,920 tokens)
Video: ~300 tokens/second (default) or ~100 (low-res)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size

TTS Pricing :

Flash TTS: $10/1M tokens
Pro TTS: $20/1M tokens

Best Practices

Use gemini-2.5-flash for most tasks (best price/performance)
Use File API for files >20MB or repeated queries
Optimize media before upload (see media_optimizer.py)
Process specific segments instead of full videos
Use lower FPS for static content
Implement context caching for repeated queries
Batch process multiple files in parallel
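To make "process specific segments instead of full videos" concrete, the sketch below plans consecutive time windows over a long recording. `plan_segments` is a hypothetical helper. How each window is then cut or passed to the API is outside what this guide shows.

```python
def plan_segments(duration_seconds, max_segment_seconds):
    """Split [0, duration) into consecutive (start, end) windows.

    Each window is at most `max_segment_seconds` long; the last one may be shorter.
    """
    if duration_seconds < 0 or max_segment_seconds <= 0:
        raise ValueError("duration must be >= 0 and segment length must be > 0")
    segments = []
    start = 0
    while start < duration_seconds:
        end = min(start + max_segment_seconds, duration_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A 90-minute video in 30-minute windows.
    print(plan_segments(90 * 60, 30 * 60))  # [(0, 1800), (1800, 3600), (3600, 5400)]
```

Each window can then be analyzed on its own, which keeps per-request token counts lower and allows the windows to be processed in parallel.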

Rate Limits

Free Tier :

10-15 RPM (requests per minute)
1M-4M TPM (tokens per minute)
1,500 RPD (requests per day)
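One way to stay under a requests-per-minute cap on the client side is a sliding-window limiter. The class below is an illustrative sketch, not part of the skill's scripts. The default of 10 requests per 60 seconds is taken from the lower end of the free-tier range above, and the `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Allow at most `max_requests` calls per `window_seconds` (sliding window)."""

    def __init__(self, max_requests=10, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._timestamps = deque()

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
                self._timestamps.popleft()
            if len(self._timestamps) < self.max_requests:
                self._timestamps.append(now)
                return
            # Wait until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._timestamps[0]))
```

Calling `limiter.acquire()` before each API request spaces requests out. It does not track the tokens-per-minute or requests-per-day limits.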

YouTube Limits :

Free tier: 8 hours/day
Paid tier: No length limits
Public videos only

Storage Limits :

20GB per project
2GB per file
48-hour retention

Error Handling

Common errors and solutions:

400 : Invalid format/size - validate before upload
401 : Invalid API key - check configuration
403 : Permission denied - verify API key restrictions
404 : File not found - ensure file uploaded and active
429 : Rate limit exceeded - implement exponential backoff
500 : Server error - retry with backoff
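The retry advice for 429 and 500 can be sketched as exponential backoff with jitter. `ApiError` here is a hypothetical stand-in for whatever exception the SDK raises, and the delay parameters are illustrative defaults rather than recommended values.

```python
import random
import time

RETRYABLE_STATUS_CODES = {429, 500}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call `request()`; retry on 429/500 with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except ApiError as error:
            is_last_attempt = attempt == max_attempts - 1
            if error.status_code not in RETRYABLE_STATUS_CODES or is_last_attempt:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Errors such as 400, 401, 403, and 404 are re-raised immediately, since retrying does not fix an invalid request, key, or file reference.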

Scripts Overview

All scripts support unified API key detection and error handling:

gemini_batch_process.py : Batch process multiple media files

Supports all modalities (audio, image, video, PDF)
Progress tracking and error recovery
Output formats: JSON, Markdown, CSV
Rate limiting and retry logic
Dry-run mode

media_optimizer.py : Prepare media for Gemini API

Compress videos/audio for size limits
Resize images appropriately
Split long videos into chunks
Format conversion
Quality vs size optimization

document_converter.py : Convert documents to PDF

Convert DOCX, XLSX, PPTX to PDF
Extract page ranges
Optimize PDFs for Gemini
Extract images from PDFs
Batch conversion support

Run any script with --help for detailed usage.

Resources

Weekly Installs: 242
Repository: mrgoonie/claude…t-skills
GitHub Stars: 1.9K
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (201), gemini-cli (198), codex (186), cursor (186), claude-code (175), github-copilot (165)


