Claude Vision and Multimodal Skill: Image Analysis, Document Processing, OCR Text Extraction, and Multimodal Understanding

vision-multimodal by lobbi-docs/claude

107 weekly installs

9 GitHub Stars

Install command

npx skills add https://github.com/lobbi-docs/claude --skill vision-multimodal

AI/Machine Learning, Automation, Computer Vision

Vision & Multimodal Skill

Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.

When to Use This Skill

Image analysis and description
Document/PDF processing
Screenshot analysis
OCR-like text extraction
Visual comparison
Chart and diagram interpretation

Supported Formats

Format	Status	Best For
JPEG	✓	Photos, natural scenes
PNG	✓	Screenshots, UI, text
GIF	✓	Animated (first frame)
WebP	✓	Modern, compressed
PDF	✓	Documents (sent as a document content block)

Image Size Guidelines

Minimum: 200 pixels (smaller = reduced accuracy)
Optimal: 1000x1000 pixels
Maximum: 8000x8000 pixels
Token cost: ~(width × height) / 750
Tip: Resize so the long edge is at most 1568px; larger images are downscaled by the API anyway, so resizing first mainly cuts upload size and latency
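
The size guidelines above can be turned into a quick pre-flight calculation. The following is a minimal sketch, not an official formula: the helper names are hypothetical, and the default divisor of 750 pixels per token is an assumption based on Anthropic's published approximation, so pass a different `pixels_per_token` if your model's documentation gives another figure.

```python
import math


def estimate_image_tokens(width, height, pixels_per_token=750):
    """Rough token estimate for an image of the given pixel size.

    pixels_per_token is an assumed approximation, not an exact billing rule.
    """
    return math.ceil(width * height / pixels_per_token)


def fit_within(width, height, max_dimension=1568):
    """Scale (width, height) down so the long edge is at most max_dimension.

    Images that already fit are returned unchanged (never upscaled).
    """
    ratio = min(max_dimension / width, max_dimension / height)
    if ratio >= 1:
        return width, height
    return max(1, int(width * ratio)), max(1, int(height * ratio))


# Example: a 3136x1568 screenshot is halved to fit the 1568px long edge
new_width, new_height = fit_within(3136, 1568)
print(new_width, new_height, estimate_image_tokens(new_width, new_height))
```

Keeping the estimate separate from the resize logic makes it easy to compare the cost of several candidate sizes before touching the image itself.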

Core Patterns

Pattern 1: Single Image Analysis

import anthropic
import base64

client = anthropic.Anthropic()

# Load and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }]
)
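
Pattern 1 stops at the API call. As a hedged sketch of reading the reply, the helper below joins the text blocks of a response; `collect_text` is a hypothetical name, and the `SimpleNamespace` objects only stand in for the SDK's content blocks (which expose `type` and `text` attributes) so the example runs without an API key.

```python
from types import SimpleNamespace


def collect_text(content_blocks):
    """Join the text of every block whose type is "text", skipping other block types."""
    return "".join(
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    )


# Stand-in for response.content from client.messages.create(...)
fake_content = [
    SimpleNamespace(type="text", text="A red bicycle "),
    SimpleNamespace(type="text", text="leaning on a wall."),
]
print(collect_text(fake_content))
```

With a real response, the call would be `collect_text(response.content)`.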

Pattern 2: Image from URL

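import base64
import httpx

# Fetch the image and base64-encode it
image_url = "https://example.com/image.jpg"
http_response = httpx.get(image_url, follow_redirects=True)
http_response.raise_for_status()  # fail early on 4xx/5xx instead of encoding an error page
image_data = base64.standard_b64encode(http_response.content).decode("utf-8")

# Then pass image_data in a base64 image block, as in Pattern 1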
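
As an alternative to downloading and encoding the file yourself, the Messages API also accepts an image source of type `url`. The sketch below only builds the message payload; `image_url_block` is a hypothetical helper name, and it assumes the URL is publicly reachable by the API.

```python
def image_url_block(url):
    """Build an image content block that references a URL instead of base64 data."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Expected an http(s) URL, got {url!r}")
    return {"type": "image", "source": {"type": "url", "url": url}}


messages = [{
    "role": "user",
    "content": [
        image_url_block("https://example.com/image.jpg"),
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
```

Fetching the image yourself, as in the pattern above, remains useful when the image sits behind authentication or needs resizing first.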

Pattern 3: Multiple Images

# Compare multiple images (up to 100 per request)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
        {"type": "text", "text": "Compare these two images and list the differences."}
    ]
}]
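
The snippet above assumes `image1` and `image2` already hold base64 strings of JPEG data. A minimal sketch of building such a content list from raw bytes follows; the helper names and the extension-to-media-type table are illustrative assumptions, and labelling each image with a short text block is a convention that tends to make comparisons easier to reference, not a requirement.

```python
import base64
import os

# Assumed mapping from file extension to the media types listed under Supported Formats
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(filename, raw_bytes):
    """Build a base64 image block, choosing media_type from the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image type: {filename!r}")
    data = base64.standard_b64encode(raw_bytes).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[extension], "data": data},
    }


def comparison_content(images, question):
    """images is a list of (filename, raw_bytes); each image gets an 'Image N:' label."""
    content = []
    for index, (filename, raw_bytes) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(image_block(filename, raw_bytes))
    content.append({"type": "text", "text": question})
    return content
```

The result can be passed as `messages=[{"role": "user", "content": comparison_content(...)}]`.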

Pattern 4: Few-Shot with Images

# Teach by example
messages = [
    # Example 1
    {"role": "user", "content": [
        {"type": "image", "source": {...}},
        {"type": "text", "text": "Classify this image."}
    ]},
    {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

    # Example 2
    {"role": "user", "content": [
        {"type": "image", "source": {...}},
        {"type": "text", "text": "Classify this image."}
    ]},
    {"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

    # Target image
    {"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
        {"type": "text", "text": "Classify this image."}
    ]}
]
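
Writing the alternating user/assistant turns by hand gets repetitive once there are more than a couple of examples. The sketch below generates the same structure from a list of pairs; `few_shot_messages` is a hypothetical helper, and the image blocks passed in are assumed to be already-built content blocks like those in the earlier patterns.

```python
def few_shot_messages(examples, target_block, instruction="Classify this image."):
    """Build alternating user/assistant turns from (image_block, expected_answer) pairs,
    ending with a user turn that contains the target image."""
    messages = []
    for example_block, expected_answer in examples:
        messages.append({
            "role": "user",
            "content": [example_block, {"type": "text", "text": instruction}],
        })
        messages.append({"role": "assistant", "content": expected_answer})
    messages.append({
        "role": "user",
        "content": [target_block, {"type": "text", "text": instruction}],
    })
    return messages
```

Using the same instruction text in every turn keeps the examples and the target query consistent, which is the point of the few-shot pattern.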

Pattern 5: PDF Processing

# Send the PDF inline as a base64-encoded document block
with open("document.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {"type": "text", "text": "Summarize this document."}
        ]
    }]
)

Prompt Engineering for Vision

Strategy 1: Role Assignment

prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""

Strategy 2: Step-by-Step Thinking

prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""

Strategy 3: Structured Output

prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""
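
A structured-output prompt is only useful if the reply is parsed and checked. The sketch below assumes the model may wrap the JSON in extra prose, so it extracts the outermost braces before parsing; `parse_receipt_json` and `REQUIRED_KEYS` are hypothetical names that mirror the fields in the prompt above.

```python
import json

# Keys requested by the receipt prompt
REQUIRED_KEYS = {"vendor", "date", "items", "total"}


def parse_receipt_json(reply_text):
    """Extract the first-to-last brace span from the reply and validate its keys."""
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in reply")
    data = json.loads(reply_text[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data


# Illustrative reply text, not real model output
sample_reply = (
    "Here is the receipt data:\n"
    '{"vendor": "Corner Cafe", "date": "2024-03-01", '
    '"items": [{"name": "Tea", "price": 3.5}, {"name": "Scone", "price": 4.0}], '
    '"total": 7.5}\n'
    "Let me know if you need anything else."
)
receipt = parse_receipt_json(sample_reply)
print(receipt["vendor"], receipt["total"])
```

Raising on missing keys rather than filling defaults makes extraction failures visible early, which matters when totals feed into downstream bookkeeping.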

Image Optimization

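import base64
import io

from PIL import Image

def optimize_for_claude(image_path, max_dimension=1568):
    """Resize so the long edge is at most max_dimension; return a base64-encoded JPEG."""
    with Image.open(image_path) as img:
        # Scale down only; images that already fit are never upscaled
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (max(1, int(img.width * ratio)), max(1, int(img.height * ratio)))
            img = img.resize(new_size, Image.LANCZOS)

        # JPEG cannot store alpha or palette data, so convert RGBA/P/LA images first
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")

        # Encode as JPEG in memory; send the result with media_type "image/jpeg"
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")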

Common Use Cases

Text Extraction (OCR-like)

prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""

Table Extraction

prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""
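
Once the model returns a markdown table, it usually needs to become structured data. The following is a minimal sketch of that step; `parse_markdown_table` is a hypothetical helper, it assumes every table row starts with a pipe, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(text):
    """Convert a pipe-delimited markdown table into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |:-----|------:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Illustrative reply text, not real model output
sample_table = """Here is the table:

| Item  | Price |
|:------|------:|
| Tea   | 3.50  |
| Scone | 4.00  |
"""
print(parse_markdown_table(sample_table))
```

Cell values stay as strings here; converting prices or dates is left to the caller because the right type depends on the table.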

Chart Analysis

prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""

Best Practices

DO:

Use high-quality images (≥1000px)
Resize large images to save tokens
Provide context about what to look for
Use few-shot examples for consistent output

DON'T:

Send images smaller than 200px
Expect perfect OCR for handwriting
Send very large images (>8000px)
Ignore token costs for multiple images
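
The size-related items in the two lists above can be checked before an image is sent. This is a minimal sketch using the 200px and 8000px thresholds from this document; `check_image_size` is a hypothetical helper, and it takes plain pixel dimensions so it does not depend on any imaging library.

```python
def check_image_size(width, height, min_edge=200, max_edge=8000):
    """Return a list of warnings for dimensions outside the guideline range."""
    warnings = []
    if min(width, height) < min_edge:
        warnings.append(
            f"Short edge is {min(width, height)}px; below {min_edge}px accuracy may drop."
        )
    if max(width, height) > max_edge:
        warnings.append(
            f"Long edge is {max(width, height)}px; above {max_edge}px the image should be resized."
        )
    return warnings


for message in check_image_size(150, 9000):
    print(message)
```

An empty list means the image falls inside the recommended range.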

Limitations

Cannot identify specific individuals
May struggle with very small text
Animated GIFs: only first frame analyzed
Some specialized symbols may be misread
