DeepSeek-OCR: A Vision-Language Model for Images and PDFs with Contexts Optical Compression and Multi-Mode OCR

deepseek-ocr by aradotso/trending-skills

Install command

npx skills add https://github.com/aradotso/trending-skills --skill deepseek-ocr

DeepSeek-OCR

Skill by ara.so — Daily 2026 Skills collection.

DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.

Installation

Prerequisites

CUDA 11.8+, PyTorch 2.6.0
Python 3.12.9 (via conda recommended)

Setup

git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118

# Download vllm-0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
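After the setup steps, it can help to confirm that the pinned versions actually ended up in the active environment. The following is a minimal sketch using only the standard library; the `EXPECTED` table and the helper names `installed_versions` and `report` are illustrative and not part of the DeepSeek-OCR repository.

```python
import sys
from importlib import metadata

# Pins copied from the setup steps above; adjust them if you install other versions.
EXPECTED = {
    "torch": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Return human-readable lines comparing installed versions to the pins."""
    v = sys.version_info
    lines = [f"python {v.major}.{v.minor}.{v.micro}"]
    for name, have in installed_versions(expected).items():
        want = expected[name]
        if have is None:
            status = "missing"
        elif have.split("+")[0] == want:
            # Local build tags such as "+cu118" are ignored for the comparison.
            status = "ok"
        else:
            status = f"differs from pinned {want}"
        lines.append(f"{name} {have or '-'}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A "differs" line is not necessarily an error (for example when using the nightly vLLM alternative); it only flags a departure from the pins listed above.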

Alternative: upstream vLLM (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Model Download

The model is available on Hugging Face as deepseek-ai/DeepSeek-OCR. To fetch the weights ahead of time:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")

Inference: vLLM (Recommended for Production)

Single Image

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params
)

print(outputs[0].outputs[0].text)

Batch Images

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")}
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)

for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)

PDF Processing (via vLLM scripts)

cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py: set INPUT_PATH, OUTPUT_PATH, model path, etc.
python run_dpsk_ocr_pdf.py   # ~2500 tokens/s on A100-40G

Benchmark Evaluation

cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py

Inference: HuggingFace Transformers

import os
import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# Document to markdown
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)

Transformers Script

cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py

Prompt Reference

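Use Case	Prompt
Document → Markdown	`<image>\n<|grounding|>Convert the document to markdown.`
General OCR	`<image>\n<|grounding|>OCR this image.`
Free OCR (no layout)	`<image>\nFree OCR.`
Parse figure/chart	`<image>\nParse the figure.`
General description	`<image>\nDescribe this image in detail.`
Grounded REC	`<image>\nLocate <|ref|>{target}<|/ref|> in the image.`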

PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image":         "<image>\n<|grounding|>OCR this image. ",
    "free_ocr":          "<image>\nFree OCR. ",
    "parse_figure":      "<image>\nParse the figure. ",
    "describe":          "<image>\nDescribe this image in detail. ",
    "rec":               "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}
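The `rec` entry above contains a `{target}` placeholder, so it has to be filled in before use. A small helper along these lines keeps that in one place; `build_prompt` is a hypothetical name introduced here for illustration, and the dictionary simply repeats the prompts listed above.

```python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image":         "<image>\n<|grounding|>OCR this image. ",
    "free_ocr":          "<image>\nFree OCR. ",
    "parse_figure":      "<image>\nParse the figure. ",
    "describe":          "<image>\nDescribe this image in detail. ",
    "rec":               "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}


def build_prompt(mode, target=None):
    """Return the prompt for `mode`, filling in {target} when the template needs it."""
    template = PROMPTS[mode]  # raises KeyError for an unknown mode
    if "{target}" in template:
        if not target:
            raise ValueError(f"mode {mode!r} needs a target string")
        # str.replace avoids surprises if the target itself contains braces.
        return template.replace("{target}", target)
    if target is not None:
        raise ValueError(f"mode {mode!r} does not take a target")
    return template


print(repr(build_prompt("rec", target="Total Amount")))
```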

Supported Resolutions

Mode	Resolution	Vision Tokens
Tiny	512×512	64
Small	640×640	100
Base	1024×1024	256
Large	1280×1280	400
Gundam (dynamic)	n×640×640 + 1×1024×1024	variable
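The token counts in the table are consistent with a simple rule: each native mode yields (side / 64)² vision tokens. The sketch below encodes that observation so the "variable" Gundam count can be estimated for a given number of tiles. The formula is inferred from the table rather than taken from the model code, and the real pipeline may add a few separator tokens, so treat the result as an approximation.

```python
def native_vision_tokens(side):
    """Vision tokens for a square native-resolution input, per the table above."""
    if side <= 0 or side % 64:
        raise ValueError("side must be a positive multiple of 64")
    return (side // 64) ** 2


def gundam_vision_tokens(n_tiles):
    """Approximate tokens for Gundam mode: n tiles of 640x640 plus one 1024x1024 view."""
    if n_tiles < 0:
        raise ValueError("n_tiles must be non-negative")
    return n_tiles * native_vision_tokens(640) + native_vision_tokens(1024)


for side in (512, 640, 1024, 1280):
    print(side, native_vision_tokens(side))
print("gundam, 4 tiles:", gundam_vision_tokens(4))
```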

# Transformers: control resolution via infer() params
# (model, tokenizer, and prompt are defined as in the Transformers example above)
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # 512, 640, 1024, or 1280
    image_size=640,   # tile size for dynamic (Gundam) mode
    crop_mode=True,   # True = Gundam dynamic resolution
)
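To avoid retyping the three resolution parameters, the modes from the table can be kept in a lookup. This is a sketch: the Gundam row matches the call above, while setting `image_size` equal to `base_size` with `crop_mode=False` for the native modes is an assumption based on the upstream README, so verify it against your checkout. `RESOLUTION_PRESETS` and `resolution_kwargs` are hypothetical names.

```python
# Assumed presets: native modes use one square view; Gundam tiles at 640 with a 1024 base view.
RESOLUTION_PRESETS = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),
}


def resolution_kwargs(mode):
    """Return a fresh dict of infer() keyword arguments for the named mode."""
    try:
        return dict(RESOLUTION_PRESETS[mode.lower()])
    except KeyError:
        known = ", ".join(sorted(RESOLUTION_PRESETS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {known}") from None


print(resolution_kwargs("small"))
```

With a loaded model, the result can be splatted into the call, for example `model.infer(tokenizer, prompt=prompt, image_file="image.jpg", **resolution_kwargs("small"))`.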

Configuration (vLLM)

Edit DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py:

# Key config fields (illustrative; check config.py in your checkout for the actual names)
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"   # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1                   # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90

Common Patterns

Process a Directory of Images

from pathlib import Path
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    image_files = list(Path(image_dir).glob("*.png")) + list(Path(image_dir).glob("*.jpg"))
    
    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]
    
    outputs = llm.generate(inputs, sampling_params)
    
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text, encoding="utf-8")
        print(f"Saved: {out_file}")

batch_ocr("/data/scans/", "/data/results/")
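The two glob patterns in `batch_ocr` are case-sensitive on Linux and skip `.jpeg` files, and glob order is not guaranteed. If that matters for your scans, a sorted, case-insensitive listing is one option. This is a sketch; `list_images` and `IMAGE_SUFFIXES` are hypothetical names, and the suffix set is an assumption you may want to extend.

```python
from pathlib import Path

# Assumed set of image suffixes; compared in lowercase.
IMAGE_SUFFIXES = frozenset({".png", ".jpg", ".jpeg"})


def list_images(image_dir, suffixes=IMAGE_SUFFIXES):
    """Return image files directly inside image_dir, sorted by path."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

Inside `batch_ocr`, `image_files = list_images(image_dir)` would replace the two `glob` calls.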

Convert PDF Pages to Markdown

import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def pdf_to_markdown(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})
    
    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]

pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)
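Because the prompt above includes `<|grounding|>`, the returned text may carry layout tags in addition to the Markdown itself. The sketch below strips them, assuming each tag has the form `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` on its own line before the content it describes. That layout is an assumption about the model output, and `strip_grounding_tags` is a hypothetical helper, so inspect a real page before relying on it.

```python
import re

# Assumed tag layout: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
_GROUNDING_TAG = re.compile(
    r"<\|ref\|>.*?<\|/ref\|><\|det\|>.*?<\|/det\|>\n?",
    re.DOTALL,
)


def strip_grounding_tags(markdown):
    """Remove grounding tags (and the newline that follows each) from model output."""
    return _GROUNDING_TAG.sub("", markdown)


sample = (
    "<|ref|>title<|/ref|><|det|>[[10, 20, 300, 60]]<|/det|>\n"
    "# Report\n\n"
    "<|ref|>text<|/ref|><|det|>[[10, 80, 300, 200]]<|/det|>\n"
    "Body."
)
print(strip_grounding_tags(sample))
```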

Grounded Text Location (REC)

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # Returns bounding box / location info
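To use the location in code, the raw output text has to be parsed. The sketch below assumes the text contains `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` tags with coordinates normalized to a 0 to 999 grid, and rescales them to pixels. Both the tag format and the 999 scale are assumptions about the model output; `Box` and `parse_boxes` are hypothetical names.

```python
import ast
import re
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


_REF_DET = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL)


def parse_boxes(text, width, height, scale=999):
    """Extract labelled boxes from model text and rescale them to pixel coordinates.

    Assumes each <|det|> payload is a Python-style list of [x1, y1, x2, y2] lists
    on a 0..scale grid; payloads that do not parse as literals are skipped.
    """
    boxes = []
    for label, raw in _REF_DET.findall(text):
        try:
            coords = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            continue
        for x1, y1, x2, y2 in coords:
            boxes.append(Box(
                label,
                int(x1 * width // scale),
                int(y1 * height // scale),
                int(x2 * width // scale),
                int(y2 * height // scale),
            ))
    return boxes


sample = "<|ref|>Total Amount<|/ref|><|det|>[[100, 200, 500, 250]]<|/det|>"
print(parse_boxes(sample, width=1998, height=999))
```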

Troubleshooting

`transformers` version conflict with vLLM

When vLLM and Transformers are installed in the same environment, pip may report that vLLM 0.8.5 requires transformers>=4.51.1. According to the project docs, this message is safe to ignore.

Flash Attention build errors

# Ensure torch is installed before flash-attn
pip install flash-attn==2.7.3 --no-build-isolation

CUDA out of memory

Use a smaller resolution: base_size=512 or base_size=640
Set crop_mode=False to disable multi-crop dynamic resolution
Reduce the number of inputs passed to each vLLM generate() call
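For the last point, one way to limit how many decoded images are held at once is to feed the inputs in fixed-size chunks. This is a minimal sketch with a hypothetical `chunked` helper; the chunk size is something to tune for your GPU and host memory, and whether it resolves a given out-of-memory error depends on where the memory is being used.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# Intended use with the vLLM examples above (llm, inputs, sampling_params assumed defined):
#   outputs = []
#   for batch in chunked(inputs, 8):
#       outputs.extend(llm.generate(batch, sampling_params))
print(list(chunked(range(5), 2)))
```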

Model output is garbled / repetitive

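Make sure NGramPerReqLogitsProcessor is passed to LLM. It applies the ngram_size and window_size settings from extra_args to suppress repeated n-grams, and decoding tends to loop without it: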

from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
llm = LLM(..., logits_processors=[NGramPerReqLogitsProcessor])
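To make the role of `ngram_size`, `window_size`, and `whitelist_token_ids` more concrete, here is a toy version of no-repeat n-gram blocking over plain token lists. It is a sketch of the general technique, not vLLM's implementation; `banned_next_tokens` is a hypothetical function, and the exact windowing in the real processor may differ.

```python
def banned_next_tokens(tokens, ngram_size, window_size, whitelist=frozenset()):
    """Return token ids that would complete an n-gram already present in the
    last `window_size` tokens. Whitelisted ids are never banned."""
    if ngram_size < 2:
        raise ValueError("ngram_size must be at least 2")
    if len(tokens) < ngram_size:
        return set()
    # The last n-1 tokens form the prefix the next token would extend.
    prefix = tuple(tokens[-(ngram_size - 1):])
    start = max(0, len(tokens) - window_size)
    banned = set()
    for i in range(start, len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned - set(whitelist)


history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, ngram_size=3, window_size=10))
print(banned_next_tokens(history, ngram_size=3, window_size=10, whitelist={3}))
```

In this toy run the history ends in `1, 2`, and `1, 2, 3` already occurred, so token 3 is banned unless it is whitelisted. That is why table tokens such as `<td>` and `</td>`, which legitimately repeat, are placed on the whitelist.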

Tables not rendering correctly

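Add the table token IDs to whitelist_token_ids in the extra_args of SamplingParams, so that repeated table cell tags are not suppressed by the n-gram processor: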

whitelist_token_ids={128821, 128822}  # <td> and </td>

Multi-GPU inference

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

Key Files

DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│   ├── config.py                  # vLLM configuration
│   ├── run_dpsk_ocr_image.py      # Single image inference
│   ├── run_dpsk_ocr_pdf.py        # PDF batch inference
│   └── run_dpsk_ocr_eval_batch.py # Benchmark evaluation
└── DeepSeek-OCR-hf/
    └── run_dpsk_ocr.py            # HuggingFace Transformers inference
