ocr-document-processor by dkyazzentwatwa/chatgpt-skills
npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill ocr-document-processor
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
from scripts.ocr_processor import OCRProcessor
# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)
# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks']) # Text blocks with positions
from scripts.ocr_processor import OCRProcessor
# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()
# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text() # All pages
# Specific pages
text = processor.extract_text(pages=[1, 2, 3])
# Get detailed results
result = processor.extract_structured()
# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
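Because each block carries a bounding box, the structured result can be used to recover reading order. A minimal sketch, assuming each block is a dict with `text`, `bbox` (`[x, y, width, height]`), and `confidence` keys as documented; the sample data below stands in for a real `processor.extract_structured()` result.

```python
# Sort OCR blocks into top-to-bottom, left-to-right reading order.
# Illustrative stand-in for result["blocks"] from extract_structured().
blocks = [
    {"text": "Footer", "bbox": [40, 900, 500, 30], "confidence": 88.0},
    {"text": "Title", "bbox": [40, 50, 500, 60], "confidence": 97.5},
    {"text": "Body", "bbox": [40, 140, 500, 700], "confidence": 93.1},
]

# Sort by top edge (y), then left edge (x).
ordered = sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))
page_text = "\n".join(b["text"] for b in ordered)
print(page_text)  # Title, Body, Footer on separate lines
```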
# Export to Markdown
processor.export_markdown("output.md")
# Export to JSON
processor.export_json("output.json")
# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")
# Export to HTML
processor.export_html("output.html")
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')
# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')
# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')
| Code | Language | Code | Language |
|---|---|---|---|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
Preprocessing improves OCR accuracy on low-quality images.
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
deskew=True, # Fix rotation
denoise=True, # Remove noise
threshold=True, # Binarize image
contrast=1.5 # Enhance contrast
)
text = processor.extract_text()
| Option | Description | Default |
|---|---|---|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| sharpen | Sharpen factor (0 = none) | 0 |
| scale | Upscale factor for small text | 1.0 |
| remove_shadows | Remove shadow artifacts | False |
# Extract tables from document
tables = processor.extract_tables()
# Each table is a list of rows
for table in tables:
for row in table:
print(row)
# Export tables to CSV
processor.export_tables_csv("tables/")
# Export to JSON
processor.export_tables_json("tables.json")
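Since `extract_tables()` returns each table as a plain list of rows, you can also write the CSVs yourself when you need custom file naming. A sketch using the standard `csv` module; the inline `tables` data is an illustrative stand-in for real `extract_tables()` output.

```python
import csv
from pathlib import Path

# Illustrative stand-in for processor.extract_tables(): a list of
# tables, each a list of rows.
tables = [
    [["Item", "Qty", "Price"], ["Pen", "2", "1.50"], ["Pad", "1", "3.00"]],
]

out_dir = Path("tables")
out_dir.mkdir(exist_ok=True)
for i, table in enumerate(tables, start=1):
    # One CSV file per detected table.
    with open(out_dir / f"table_{i}.csv", "w", newline="") as f:
        csv.writer(f).writerows(table)
```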
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()
# Process specific pages
page_3 = processor.extract_text(pages=[3])
# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
print(f"Page {page_num}: {len(text)} characters")
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")
from scripts.ocr_processor import batch_ocr
# Process directory of images
results = batch_ocr(
input_dir="scans/",
output_dir="extracted/",
output_format="markdown",
lang="eng",
recursive=True
)
print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()
# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
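The parsed fields can be cross-checked against each other to catch OCR misreads before the data goes downstream. A sketch assuming the documented field names; the item dicts with `name`/`price` keys and the sample values are illustrative, not the skill's guaranteed schema.

```python
# Illustrative stand-in for processor.parse_receipt() output.
receipt = {
    "vendor": "Corner Deli",
    "date": "2024-03-01",
    "items": [{"name": "Sandwich", "price": 6.50},
              {"name": "Coffee", "price": 2.50}],
    "subtotal": 9.00,
    "tax": 0.72,
    "total": 9.72,
}

# Flag receipts whose numbers don't add up -- a common sign of a misread digit.
items_sum = sum(item["price"] for item in receipt["items"])
consistent = (
    abs(items_sum - receipt["subtotal"]) < 0.01
    and abs(receipt["subtotal"] + receipt["tax"] - receipt["total"]) < 0.01
)
print("ok" if consistent else "needs review")  # ok
```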
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()
# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
processor = OCRProcessor("document.png")
# Configure OCR settings
processor.config.update({
'psm': 3, # Page segmentation mode
'oem': 3, # OCR engine mode
'dpi': 300, # DPI for processing
'timeout': 30, # Timeout in seconds
'min_confidence': 60, # Minimum word confidence
})
| Mode | Description |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
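Under the hood, Tesseract receives these settings as command-line style flags, which pytesseract passes through its `config` argument to `image_to_string`. A sketch of how a settings dict like the one above might translate into such a flag string; the `tesseract_flags` helper is illustrative, not part of the skill's API.

```python
# Build a Tesseract flag string from psm/oem/dpi settings.
# pytesseract accepts strings like this via image_to_string(..., config=...).
def tesseract_flags(settings):
    parts = []
    for key in ("psm", "oem", "dpi"):
        if key in settings:
            parts.append(f"--{key} {settings[key]}")
    return " ".join(parts)

flags = tesseract_flags({"psm": 7, "oem": 3, "dpi": 300})
print(flags)  # --psm 7 --oem 3 --dpi 300
```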
# Get confidence scores
result = processor.extract_structured()
# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")
# Per-word confidence
for word in result['words']:
print(f"{word['text']}: {word['confidence']}%")
# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
processor.export_markdown("output.md")
The output is the extracted text formatted as Markdown.
processor.export_json("output.json")
Output structure:
{
"source": "document.pdf",
"pages": 5,
"language": "eng",
"confidence": 92.5,
"text": "Full extracted text...",
"blocks": [
{
"type": "paragraph",
"text": "Block text...",
"bbox": [x, y, width, height],
"confidence": 95.2
}
],
"tables": [...]
}
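The exported JSON can be post-processed with the standard `json` module, for example to keep only high-confidence blocks. A sketch using an inline sample that follows the structure shown above; in practice you would `json.load` the file written by `export_json`.

```python
import json

# Inline sample following the documented export structure.
exported = json.loads("""
{
  "source": "document.pdf",
  "confidence": 92.5,
  "blocks": [
    {"type": "paragraph", "text": "Clear paragraph.",
     "bbox": [10, 10, 400, 60], "confidence": 95.2},
    {"type": "paragraph", "text": "Smudged line.",
     "bbox": [10, 80, 400, 20], "confidence": 41.0}
  ]
}
""")

# Keep only blocks the engine was reasonably sure about.
reliable = [b["text"] for b in exported["blocks"] if b["confidence"] >= 80]
print(reliable)  # ['Clear paragraph.']
```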
processor.export_html("output.html")
Creates a styled HTML rendering of the extracted content.
# Basic extraction
python ocr_processor.py image.png -o output.txt
# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown
# Specify language
python ocr_processor.py german.png --lang deu
# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch
# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
from scripts.ocr_processor import OCRProcessor, OCRError
try:
processor = OCRProcessor("document.png")
text = processor.extract_text()
except OCRError as e:
print(f"OCR failed: {e}")
except FileNotFoundError:
print("File not found")
pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
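Note that pytesseract is only a wrapper: the Tesseract engine itself, plus any extra language packs (e.g. `deu`, `chi_sim`), must be installed separately on the system. Typical commands, depending on platform:

```shell
# Debian/Ubuntu
sudo apt-get install tesseract-ocr tesseract-ocr-deu
# macOS (Homebrew)
brew install tesseract
```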
Weekly installs: 2.1K
GitHub stars: 36
First seen: Jan 24, 2026
Security audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (2.0K), codex (1.9K), gemini-cli (1.9K), github-copilot (1.9K), cursor (1.9K), kimi-cli (1.8K)