multimodal-llm by yonatangross/orchestkit
npx skills add https://github.com/yonatangross/orchestkit --skill multimodal-llm
Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |
Total: 9 rules across 3 categories (Vision, Audio, Video Generation)
Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.
| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
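The base64 and multi-image patterns named in this rule can be sketched as a small payload builder. This is a minimal sketch, not the content of `rules/vision-image-analysis.md`: the helper names `image_block` and `comparison_content` are hypothetical, and the block shape simply mirrors the image content block used in this document's quick-start request. Resizing is left out because it needs an imaging library; in practice it would happen before the bytes are encoded.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(path: str) -> dict:
    """Build one base64 image content block from a local file (hypothetical helper)."""
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


def comparison_content(paths: list[str], question: str) -> list[dict]:
    """Place every image first, then a single text block carrying the question."""
    return [image_block(p) for p in paths] + [{"type": "text", "text": question}]
```

The returned list would be passed as the `content` of a single user message, so the model sees all images before the question.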
Extract structured data from documents, charts, and PDFs using vision models.
| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
Choose the right vision provider based on accuracy, cost, and context window needs.
| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |
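Token cost comparisons come down to simple arithmetic, sketched below. The $0.15 per million tokens rate is the Gemini 2.5 Flash figure quoted in this document's recommendation table; the request volume and tokens per request are made-up numbers for illustration, and the sketch assumes one flat rate rather than separate input and output pricing.

```python
def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


FLASH_RATE = 0.15  # USD per million tokens, as quoted in this document
daily_tokens = 2_000 * 1_500  # assumed: 2,000 requests at about 1,500 tokens each

print(f"${estimate_cost_usd(daily_tokens, FLASH_RATE):.2f} per day")
```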
Convert audio to text with speaker diarization, timestamps, and sentiment analysis.
| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
Generate natural speech from text with voice selection and expressive cues.
| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |
Select the right audio/voice provider for real-time, transcription, or TTS use cases.
| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |
Choose the right video generation provider based on use case, duration, and budget.
| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |
Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.
| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
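Video generation jobs are asynchronous, so a client typically submits a task and then polls until it finishes. The loop below is a provider-agnostic sketch: `fetch_status` is a hypothetical callable that would wrap whichever REST or SDK call returns the task state, and the `"succeeded"` / `"failed"` status strings and `error` key are assumptions rather than any provider's real field values.

```python
import time
from typing import Callable


def poll_task(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "succeeded":
            return task
        if status == "failed":
            raise RuntimeError(f"video task failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video task did not finish in time")
        sleep(interval_s)
        # Back off gradually so long renders do not hammer the API
        interval_s = min(interval_s * 1.5, max_interval_s)
```

Where a provider supports webhook callbacks, the same terminal-state handling can run in the callback handler instead, with polling kept as a fallback.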
Generate multi-scene videos with consistent characters using storyboarding and character elements.
| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |
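Before calling any video API, it can help to plan a storyboard as plain data and validate it locally. The sketch below is not the Kling 3.0 request format: `Shot` and `Storyboard` are hypothetical planning structures, and treating the "6-shot storyboards" figure above as a hard cap is an assumption. The point it illustrates is identity binding, where every shot may only reference characters that were declared once with a reference image.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # assumed cap, taken from the "6-shot storyboards" pattern above


@dataclass
class Shot:
    prompt: str
    duration_s: float
    characters: list[str] = field(default_factory=list)


@dataclass
class Storyboard:
    characters: dict[str, str]  # character name -> reference image path
    shots: list[Shot] = field(default_factory=list)

    def add_shot(self, shot: Shot) -> None:
        """Append a shot, rejecting overflow and undeclared characters."""
        if len(self.shots) >= MAX_SHOTS:
            raise ValueError(f"storyboard is limited to {MAX_SHOTS} shots")
        unknown = [c for c in shot.characters if c not in self.characters]
        if unknown:
            raise ValueError(f"undefined characters: {unknown}")
        self.shots.append(shot)
```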
| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
```python
import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# Base64-encode the image so it can be sent inline with the request
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # upper bound on the length of the description
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"},
    ]}],
)
```
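A response that runs into `max_tokens` is cut off silently unless the caller checks for it. The helper below is a small sketch of that check: `response_text` is a hypothetical name, and it relies on the `stop_reason` field and the typed content blocks of a Messages API response. The `SimpleNamespace` object is only a stand-in so the helper can be tried without a network call.

```python
from types import SimpleNamespace


def response_text(response) -> str:
    """Join the text blocks of a response, failing loudly if it was truncated."""
    if response.stop_reason == "max_tokens":
        raise RuntimeError("response hit max_tokens; raise the limit or narrow the prompt")
    return "".join(block.text for block in response.content if block.type == "text")


# Stand-in shaped like a response object (illustrative data only)
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="A red bicycle.")],
)
print(response_text(fake))
```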
Common mistakes:
- Omitting `max_tokens` on vision requests (responses truncated)
- Using `high` detail level for simple yes/no classification

Related skills:
- ork:rag-retrieval - Multimodal RAG with image + text retrieval
- ork:llm-integration - General LLM function calling patterns
- streaming-api-patterns - WebSocket patterns for real-time audio
- ork:demo-producer - Terminal demo videos (VHS, asciinema), not AI video gen

Weekly installs: 94
Repository: yonatangross/orchestkit
GitHub stars: 134
First seen: Feb 14, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: gemini-cli (91), codex (90), opencode (90), github-copilot (90), cursor (89), amp (87)
| Decision | Recommendation |
|---|---|
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |