ai-multimodal by mrgoonie/claudekit-skills
npx skills add https://github.com/mrgoonie/claudekit-skills --skill ai-multimodal
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
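| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |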
API Key Setup : Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
1. export GEMINI_API_KEY="your-key"
2. .env
3. .claude/.env
4. .claude/skills/.env
5. .claude/skills/ai-multimodal/.env
Get API key : https://aistudio.google.com/apikey
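The lookup order above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function names `find_gemini_api_key` and `read_env_file` are hypothetical, the `.env` parser handles only simple `KEY=VALUE` lines, and the paths are assumed to be relative to the project root.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# .env locations in the lookup order listed above (assumed relative to the project root).
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_gemini_api_key(
    project_root: str = ".",
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None if there is none."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:  # 1. the process environment wins
        return key
    for rel in ENV_FILE_CANDIDATES:  # 2-5. then each .env file in order
        candidate = Path(project_root) / rel
        if candidate.is_file():
            key = read_env_file(candidate).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Passing `environ` explicitly makes the precedence easy to check without touching the real environment.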
For Vertex AI :
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
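As a rough sketch of how these variables might be turned into client settings, the helper below builds keyword arguments that match the `vertexai`, `project`, `location`, and `api_key` parameters of `genai.Client` in the google-genai SDK. The helper name `gemini_client_kwargs` is hypothetical. Treating only the literal string `true` as enabled and defaulting the location to `us-central1` are both assumptions, not behavior confirmed by the skill.

```python
import os
from typing import Any, Dict, Mapping, Optional


def gemini_client_kwargs(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for a Gemini client from environment variables."""
    environ = os.environ if environ is None else environ
    # Assumption: only the string "true" (any case) switches Vertex AI on.
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if not use_vertex:
        # Google AI Studio path: authenticate with the API key.
        return {"api_key": environ.get("GEMINI_API_KEY")}
    project = environ.get("VERTEX_PROJECT_ID")
    if not project:
        raise ValueError("VERTEX_PROJECT_ID must be set when GEMINI_USE_VERTEX=true")
    return {
        "vertexai": True,
        "project": project,
        # Assumption: VERTEX_LOCATION is optional and falls back to us-central1.
        "location": environ.get("VERTEX_LOCATION", "us-central1"),
    }
```

With the SDK installed, the result could be passed along as `genai.Client(**gemini_client_kwargs())`.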
Install SDK :
pip install google-genai python-dotenv pillow
Transcribe Audio :
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image :
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video :
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF :
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
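When the extracted result is consumed by other code, it can help to confirm that it actually parses as JSON. The sketch below assumes, without confirmation from the skill, that the saved text may be either bare JSON or JSON wrapped in a Markdown code fence. The helper name `parse_json_payload` is hypothetical.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_payload(text: str) -> Any:
    """Parse JSON from text that may be wrapped in a Markdown code fence."""
    match = _FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    return json.loads(payload.strip())
```

A caller might read the output file with `Path(...).read_text()` and pass the contents to this function, handling `ValueError` if the model returned something that is not JSON.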
Generate Image :
python scripts/gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
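To drive the commands above from Python, one option is to assemble the argument list programmatically. The sketch below uses only the flags shown in the examples. The function name `build_batch_command` is hypothetical, and passing several paths after a single `--files` flag is an assumption; check `--help` for the script's actual argument handling.

```python
from typing import List, Optional, Sequence


def build_batch_command(
    task: str,
    files: Sequence[str] = (),
    prompt: Optional[str] = None,
    output: Optional[str] = None,
    model: Optional[str] = None,
    fmt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for scripts/gemini_batch_process.py."""
    cmd = ["python", "scripts/gemini_batch_process.py"]
    if files:
        # Assumption: multiple paths may follow one --files flag.
        cmd += ["--files", *files]
    cmd += ["--task", task]
    optional_flags = [
        ("--prompt", prompt),
        ("--output", output),
        ("--model", model),
        ("--format", fmt),
        ("--aspect-ratio", aspect_ratio),
    ]
    for flag, value in optional_flags:
        if value is not None:  # only emit flags the caller actually set
            cmd += [flag, value]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with prompts that contain spaces.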
Optimize Media :
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
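For intuition about what a target size implies, the arithmetic below converts a size budget and clip duration into a video bitrate. This is a generic back-of-the-envelope calculation, not a description of how `media_optimizer.py` works. The function name, the 128 kbps audio allowance, and the use of decimal megabytes are all assumptions.

```python
def target_video_bitrate_kbps(
    target_size_mb: float,
    duration_s: float,
    audio_kbps: float = 128.0,
) -> float:
    """Estimate the video bitrate (kbit/s) that fits a clip into target_size_mb."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # 1 MB = 8,000 kilobits when using decimal units (1 MB = 1,000,000 bytes).
    total_kbps = target_size_mb * 8 * 1000 / duration_s
    video_kbps = total_kbps - audio_kbps  # reserve part of the budget for audio
    if video_kbps <= 0:
        raise ValueError("target size is too small for this duration and audio bitrate")
    return video_kbps
```

For example, a 10-minute (600 s) clip with a 100 MB budget leaves about 1333 kbit/s in total, or roughly 1205 kbit/s for video after the assumed audio allowance. A value like this is what one would pass to an encoder's bitrate setting, such as ffmpeg's `-b:v`.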
Convert Documents to Markdown :
# Convert to Markdown
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
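A page specification such as `1-20` is one-based and inclusive, while most PDF libraries index pages from zero. The sketch below shows one way to make that conversion. The function name `parse_page_range` is hypothetical, and supporting a single page like `5` alongside ranges is an assumption about the accepted syntax.

```python
def parse_page_range(spec: str) -> range:
    """Convert a one-based, inclusive page spec ("1-20" or "5") to zero-based indices."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start  # a lone number means a single page
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    # Shift the start down by one; the inclusive end becomes range()'s exclusive stop.
    return range(start - 1, end)
```

Here `parse_page_range("1-20")` covers indices 0 through 19, which is 20 pages in total.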
For detailed implementation guidance, see:
references/audio-processing.md - Transcription, analysis, TTS
references/vision-understanding.md - Captioning, detection, OCR
references/video-analysis.md - Scene detection, temporal understanding
references/document-extraction.md - PDF processing, structured output
references/image-generation.md - Text-to-image, editing
Input Pricing :
Token Rates :
TTS Pricing :
gemini-2.5-flash for most tasks (best price/performance)
media_optimizer.py)
Free Tier :
YouTube Limits :
Storage Limits :
Common errors and solutions:
All scripts support unified API key detection and error handling:
gemini_batch_process.py : Batch process multiple media files
media_optimizer.py : Prepare media for Gemini API
document_converter.py : Convert documents to Markdown
Run any script with --help for detailed usage.
Weekly Installs : 242
GitHub Stars : 1.9K
First Seen : Jan 22, 2026
Security Audits : Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on : opencode (201), gemini-cli (198), codex (186), cursor (186), claude-code (175), github-copilot (165)