ocr-document-processor by dkyazzentwatwa/chatgpt-skills
npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill ocr-document-processor
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
from scripts.ocr_processor import OCRProcessor
# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)
# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks']) # Text blocks with positions
from scripts.ocr_processor import OCRProcessor
# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()
# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text() # All pages
# Specific pages
text = processor.extract_text(pages=[1, 2, 3])
# Get detailed results
result = processor.extract_structured()
# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
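Because each block carries a bounding box, the structured result can be used to recover reading order. A minimal sketch, assuming each block is a dict with `text`, `bbox` (`[x, y, width, height]`), and `confidence` keys as documented; the sample data below stands in for a real `processor.extract_structured()` result.

```python
# Sort OCR blocks into top-to-bottom, left-to-right reading order.
# Illustrative stand-in for result["blocks"] from extract_structured().
blocks = [
    {"text": "Footer", "bbox": [40, 900, 500, 30], "confidence": 88.0},
    {"text": "Title", "bbox": [40, 50, 500, 60], "confidence": 97.5},
    {"text": "Body", "bbox": [40, 140, 500, 700], "confidence": 93.1},
]

# Sort by top edge (y), then left edge (x).
ordered = sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))
page_text = "\n".join(b["text"] for b in ordered)
print(page_text)  # Title, Body, Footer on separate lines
```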
# Export to Markdown
processor.export_markdown("output.md")
# Export to JSON
processor.export_json("output.json")
# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")
# Export to HTML
processor.export_html("output.html")
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')
# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')
# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')
| Code | Language | Code | Language |
|---|---|---|---|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
Preprocessing improves OCR accuracy on low-quality images.
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
deskew=True, # Fix rotation
denoise=True, # Remove noise
threshold=True, # Binarize image
contrast=1.5 # Enhance contrast
)
text = processor.extract_text()
| Option | Description | Default |
|---|---|---|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| sharpen | Sharpen factor (0 = none) | 0 |
| scale | Upscale factor for small text | 1.0 |
| remove_shadows | Remove shadow artifacts | False |
# Extract tables from document
tables = processor.extract_tables()
# Each table is a list of rows
for table in tables:
for row in table:
print(row)
# Export tables to CSV
processor.export_tables_csv("tables/")
# Export to JSON
processor.export_tables_json("tables.json")
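Since `extract_tables()` returns each table as a plain list of rows, you can also write the CSVs yourself when you need custom file naming. A sketch using the standard `csv` module; the inline `tables` data is an illustrative stand-in for real `extract_tables()` output.

```python
import csv
from pathlib import Path

# Illustrative stand-in for processor.extract_tables(): a list of
# tables, each a list of rows.
tables = [
    [["Item", "Qty", "Price"], ["Pen", "2", "1.50"], ["Pad", "1", "3.00"]],
]

out_dir = Path("tables")
out_dir.mkdir(exist_ok=True)
for i, table in enumerate(tables, start=1):
    # One CSV file per detected table.
    with open(out_dir / f"table_{i}.csv", "w", newline="") as f:
        csv.writer(f).writerows(table)
```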
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()
# Process specific pages
page_3 = processor.extract_text(pages=[3])
# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
print(f"Page {page_num}: {len(text)} characters")
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")
from scripts.ocr_processor import batch_ocr
# Process directory of images
results = batch_ocr(
input_dir="scans/",
output_dir="extracted/",
output_format="markdown",
lang="eng",
recursive=True
)
print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()
# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
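The parsed fields can be cross-checked against each other to catch OCR misreads before the data goes downstream. A sketch assuming the documented field names; the item dicts with `name`/`price` keys and the sample values are illustrative, not the skill's guaranteed schema.

```python
# Illustrative stand-in for processor.parse_receipt() output.
receipt = {
    "vendor": "Corner Deli",
    "date": "2024-03-01",
    "items": [{"name": "Sandwich", "price": 6.50},
              {"name": "Coffee", "price": 2.50}],
    "subtotal": 9.00,
    "tax": 0.72,
    "total": 9.72,
}

# Flag receipts whose numbers don't add up -- a common sign of a misread digit.
items_sum = sum(item["price"] for item in receipt["items"])
consistent = (
    abs(items_sum - receipt["subtotal"]) < 0.01
    and abs(receipt["subtotal"] + receipt["tax"] - receipt["total"]) < 0.01
)
print("ok" if consistent else "needs review")  # ok
```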
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()
# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
processor = OCRProcessor("document.png")
# Configure OCR settings
processor.config.update({
'psm': 3, # Page segmentation mode
'oem': 3, # OCR engine mode
'dpi': 300, # DPI for processing
'timeout': 30, # Timeout in seconds
'min_confidence': 60, # Minimum word confidence
})
| Mode | Description |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
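Under the hood, Tesseract receives these settings as command-line style flags, which pytesseract passes through its `config` argument to `image_to_string`. A sketch of how a settings dict like the one above might translate into such a flag string; the `tesseract_flags` helper is illustrative, not part of the skill's API.

```python
# Build a Tesseract flag string from psm/oem/dpi settings.
# pytesseract accepts strings like this via image_to_string(..., config=...).
def tesseract_flags(settings):
    parts = []
    for key in ("psm", "oem", "dpi"):
        if key in settings:
            parts.append(f"--{key} {settings[key]}")
    return " ".join(parts)

flags = tesseract_flags({"psm": 7, "oem": 3, "dpi": 300})
print(flags)  # --psm 7 --oem 3 --dpi 300
```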
# Get confidence scores
result = processor.extract_structured()
# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")
# Per-word confidence
for word in result['words']:
print(f"{word['text']}: {word['confidence']}%")
# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
processor.export_markdown("output.md")
The output is the extracted text formatted as Markdown.
processor.export_json("output.json")
Output structure:
{
"source": "document.pdf",
"pages": 5,
"language": "eng",
"confidence": 92.5,
"text": "Full extracted text...",
"blocks": [
{
"type": "paragraph",
"text": "Block text...",
"bbox": [x, y, width, height],
"confidence": 95.2
}
],
"tables": [...]
}
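The exported JSON can be post-processed with the standard `json` module, for example to keep only high-confidence blocks. A sketch using an inline sample that follows the structure shown above; in practice you would `json.load` the file written by `export_json`.

```python
import json

# Inline sample following the documented export structure.
exported = json.loads("""
{
  "source": "document.pdf",
  "confidence": 92.5,
  "blocks": [
    {"type": "paragraph", "text": "Clear paragraph.",
     "bbox": [10, 10, 400, 60], "confidence": 95.2},
    {"type": "paragraph", "text": "Smudged line.",
     "bbox": [10, 80, 400, 20], "confidence": 41.0}
  ]
}
""")

# Keep only blocks the engine was reasonably sure about.
reliable = [b["text"] for b in exported["blocks"] if b["confidence"] >= 80]
print(reliable)  # ['Clear paragraph.']
```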
processor.export_html("output.html")
Creates a styled HTML rendering of the extracted content.
# Basic extraction
python ocr_processor.py image.png -o output.txt
# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown
# Specify language
python ocr_processor.py german.png --lang deu
# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch
# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
from scripts.ocr_processor import OCRProcessor, OCRError
try:
processor = OCRProcessor("document.png")
text = processor.extract_text()
except OCRError as e:
print(f"OCR failed: {e}")
except FileNotFoundError:
print("File not found")
pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
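Note that pytesseract is only a wrapper: the Tesseract engine itself, plus any extra language packs (e.g. `deu`, `chi_sim`), must be installed separately on the system. Typical commands, depending on platform:

```shell
# Debian/Ubuntu
sudo apt-get install tesseract-ocr tesseract-ocr-deu
# macOS (Homebrew)
brew install tesseract
```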
Weekly installs: 2.1K
GitHub stars: 36
First seen: Jan 24, 2026
Security audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (2.0K), codex (1.9K), gemini-cli (1.9K), github-copilot (1.9K), cursor (1.9K), kimi-cli (1.8K)