PDF OCR Extraction Tool - Convert scanned documents to searchable text, with batch processing and multi-language support

PDF OCR Extraction by claude-office-skills/skills

5 GitHub Stars


Install command

npx skills add https://github.com/claude-office-skills/skills --skill 'PDF OCR Extraction'

AI/Machine Learning, Automation, Data Processing




🇺🇸English

PDF OCR Extraction

Extract text from scanned documents and image-based PDFs using OCR technology.

Overview

This skill helps you:

Extract text from scanned documents
Make image PDFs searchable
Digitize paper documents
Process handwritten text (limited)
Batch process multiple documents

How to Use

Basic OCR

"Extract text from this scanned PDF"
"OCR this document image"
"Make this PDF searchable"

With Options

"Extract text from pages 1-10, English language"
"OCR this document, preserve layout"
"Extract and output as structured data"

Document Types

OCR Quality by Document Type

| Document Type | Expected Quality | Tips |
|---------------|------------------|------|
| Typed documents | ⭐⭐⭐⭐⭐ 95%+ | Best results |
| Printed books | ⭐⭐⭐⭐ 90%+ | Watch for aged or yellowed pages |
| Forms | ⭐⭐⭐⭐ 85%+ | Checkboxes may need manual review |
| Tables/Data | ⭐⭐⭐ 80%+ | Structure may need fixing |
| Handwritten (neat) | ⭐⭐ 60-80% | Variable results |
| Handwritten (cursive) | ⭐ 30-60% | Often needs manual review |
| Mixed content | ⭐⭐⭐ 75%+ | Depends on complexity |

Output Formats

Plain Text Extraction

## OCR Result: [Document Name]

**Pages Processed**: [X]
**Language**: [Detected/Specified]
**Confidence**: [X]%

---

[Extracted text content here]

---

### Notes
- [Any issues or uncertainties]
- [Characters that may be incorrect]

Structured Extraction

## OCR Extraction: [Document Name]

### Document Info
| Field | Value |
|-------|-------|
| Title | [Extracted or inferred] |
| Date | [If found] |
| Author | [If found] |

### Content by Section

#### [Header 1]
[Content under this header]

#### [Header 2]
[Content under this header]

### Tables Found
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| [Data] | [Data] | [Data] |

### Uncertain Text
| Page | Original | Confidence | Possible |
|------|----------|------------|----------|
| 3 | "teh" | 70% | "the" |
| 5 | "l0ve" | 65% | "love" |
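The "Uncertain Text" table above could be filled from word-level confidence scores. The sketch below assumes the OCR engine returns parallel `text` and `conf` lists, a shape modeled on what pytesseract's `image_to_data` produces with `Output.DICT`. The function name `find_uncertain_words` and the 80% cutoff are illustrative assumptions, not part of the skill.

```python
def find_uncertain_words(ocr_data, threshold=80.0):
    """Return (word, confidence) pairs whose confidence is below threshold.

    ocr_data is assumed to hold parallel "text" and "conf" lists.
    """
    uncertain = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # confidences may arrive as strings or numbers
        word = text.strip()
        # Skip layout-only entries: empty text or a confidence of -1.
        if not word or conf < 0:
            continue
        if conf < threshold:
            uncertain.append((word, conf))
    return uncertain


# Hand-made sample data standing in for real OCR output.
sample = {"text": ["", "teh", "quick", "l0ve"], "conf": [-1, 70, 96, 65]}
print(find_uncertain_words(sample))
```

Each returned pair can then become one row of the table, with the "Possible" column supplied by a spell checker or a human reviewer.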

Searchable PDF Output

## OCR to Searchable PDF

**Source**: [filename.pdf]
**Output**: [filename_searchable.pdf]

### Processing Summary
| Metric | Value |
|--------|-------|
| Pages | [X] |
| Words extracted | [Y] |
| Average confidence | [Z]% |
| Processing time | [T] seconds |

### Quality Report
- [X] pages with 95%+ confidence
- [Y] pages with 80-94% confidence
- [Z] pages with <80% confidence (review recommended)

### Searchability
✅ Document is now text-searchable
✅ Original images preserved
✅ Text layer added behind images
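The three buckets in the Quality Report can be computed from per-page confidence values. This is a minimal sketch that assumes confidences are percentages from 0 to 100; the function name `quality_report` and the bucket labels are hypothetical.

```python
def quality_report(page_confidences):
    """Count pages in the 95%+, 80-94%, and <80% confidence buckets."""
    report = {"high": 0, "medium": 0, "low": 0}
    for conf in page_confidences:
        if conf >= 95:
            report["high"] += 1
        elif conf >= 80:
            report["medium"] += 1
        else:
            report["low"] += 1  # these pages are the ones to review
    return report


# Illustrative per-page confidences, not real measurements.
pages = [98.2, 96.0, 88.5, 79.9, 95.0]
print(quality_report(pages))
```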

Pre-Processing Tips

Image Quality Checklist

Before OCR, ensure:

Resolution: 300 DPI minimum (600 for small text)
Contrast: Clear black text on white background
Alignment: Document is straight (not skewed)
Completeness: No cut-off edges
Cleanliness: No stains, marks, or shadows
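The resolution item in the checklist can be checked before OCR when the scan's pixel dimensions and the physical page size are known. The sketch below assumes both are available; `effective_dpi` and `meets_resolution` are hypothetical helper names.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical dots per inch."""
    return min(width_px / width_in, height_px / height_in)


def meets_resolution(width_px, height_px, width_in, height_in, small_text=False):
    """Apply the checklist: 300 DPI minimum, 600 DPI for small text."""
    required = 600 if small_text else 300
    return effective_dpi(width_px, height_px, width_in, height_in) >= required


# A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 pixels.
print(meets_resolution(2550, 3300, 8.5, 11))
print(meets_resolution(2550, 3300, 8.5, 11, small_text=True))
```

Taking the minimum of the two axes is a conservative choice: a scan stretched along one axis is judged by its weaker dimension.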

Common Pre-Processing Steps

| Issue | Solution |
|-------|----------|
| Low resolution | Upscale image first |
| Skewed/rotated | Auto-deskew |
| Poor contrast | Adjust levels/threshold |
| Noise/specks | Apply noise reduction |
| Shadows | Flatten lighting |
| Color document | Convert to grayscale |
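Two of the steps above, grayscale conversion and thresholding, are simple enough to sketch in plain Python on a tiny image held as nested lists of RGB tuples. This is for illustration only: a real pipeline would normally use an imaging library, and the fixed threshold of 128 is an assumption. The function names are hypothetical.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) pixels to luma values with BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Map each gray value to white (255) or black (0) at a fixed threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray_rows]


# A 2 x 2 sample: light background pixels on the left, dark ink on the right.
image = [
    [(250, 250, 250), (30, 30, 30)],
    [(200, 180, 160), (90, 60, 40)],
]
gray = to_grayscale(image)
print(binarize(gray))
```

A single global threshold works poorly on pages with shadows or uneven lighting, which is why the table lists lighting correction as its own step.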

Language Support

Supported Languages

Excellent: English, Spanish, French, German, Italian
Good: Chinese (Simplified/Traditional), Japanese, Korean
Moderate: Arabic, Hebrew (RTL support), Hindi
Basic: Many others with varying quality

Multi-Language Documents

"OCR this document, detect language automatically"
"Extract text, primary: English, secondary: Chinese"
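If Tesseract is the engine behind a request such as "primary: English, secondary: Chinese", the languages are passed as codes joined with `+`, for example `eng+chi_sim`. The sketch below builds that string. The mapping covers only the languages named above, and `build_lang_arg` is a hypothetical helper, not part of the skill.

```python
# Tesseract language codes for the languages listed in this section.
TESSERACT_CODES = {
    "english": "eng",
    "spanish": "spa",
    "french": "fra",
    "german": "deu",
    "italian": "ita",
    "chinese (simplified)": "chi_sim",
    "chinese (traditional)": "chi_tra",
    "japanese": "jpn",
    "korean": "kor",
    "arabic": "ara",
    "hebrew": "heb",
    "hindi": "hin",
}


def build_lang_arg(primary, *secondary):
    """Join language codes with '+', primary first, skipping duplicates."""
    codes = []
    for name in (primary, *secondary):
        code = TESSERACT_CODES[name.lower()]  # raises KeyError if unknown
        if code not in codes:
            codes.append(code)
    return "+".join(codes)


print(build_lang_arg("English", "Chinese (Simplified)"))
```

The resulting string could be passed as the `lang` argument of `pytesseract.image_to_string`, assuming the corresponding language data files are installed.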

Handling Specific Content

Forms and Checkboxes

## Form Extraction: [Form Name]

### Field Values
| Field | Value | Confidence |
|-------|-------|------------|
| Name | John Smith | 98% |
| Date | 01/15/2026 | 95% |
| Address | 123 Main St | 92% |

### Checkboxes
| Question | Checked |
|----------|---------|
| Option A | ☑️ Yes |
| Option B | ☐ No |
| Option C | ☑️ Yes |

### Signature
[Signature detected on page X - cannot extract text]

Tables

## Table Extraction

### Table 1 (Page 2)
| Header A | Header B | Header C |
|----------|----------|----------|
| Value 1 | Value 2 | Value 3 |
| Value 4 | Value 5 | Value 6 |

**Table confidence**: 85%
**Note**: Column 3 may have alignment issues

Handwritten Text

## Handwritten Text Extraction

**Legibility Assessment**: [Good/Fair/Poor]
**Recommended**: Manual review

### Extracted Text (Confidence: 65%)
[Extracted text with uncertain words marked]

### Uncertain Words
| Original | Best Guess | Alternatives |
|----------|------------|--------------|
| [image] | "meeting" | "meeting", "meaning" |
| [image] | "Tuesday" | "Tuesday", "Thursday" |

⚠️ **Low confidence extraction - please verify manually**

Batch Processing

Batch OCR Job

## Batch OCR Processing

**Folder**: [Path]
**Total Documents**: [X]
**Status**: [In Progress/Complete]

### Results
| File | Pages | Confidence | Status |
|------|-------|------------|--------|
| doc1.pdf | 5 | 96% | ✅ Complete |
| doc2.pdf | 12 | 88% | ✅ Complete |
| doc3.pdf | 3 | 72% | ⚠️ Review |
| doc4.pdf | 8 | - | ❌ Failed |

### Issues
- doc3.pdf: Pages 2-3 have handwriting
- doc4.pdf: File corrupted

### Summary
- Successful: [X]
- Need Review: [Y]
- Failed: [Z]
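The summary counts in the batch report can be derived from per-file results. This sketch assumes each result is a `(filename, confidence)` pair, with `None` marking a file that could not be processed. The 80% review cutoff is an assumption chosen to match the Quality Report buckets, and the function names are hypothetical.

```python
def classify(confidence, review_below=80.0):
    """Map a file's average confidence to a batch status label."""
    if confidence is None:
        return "Failed"
    return "Review" if confidence < review_below else "Complete"


def summarize(results):
    """Count files per status for the batch summary."""
    summary = {"Complete": 0, "Review": 0, "Failed": 0}
    for _filename, confidence in results:
        summary[classify(confidence)] += 1
    return summary


# The four files from the example results table above.
batch = [
    ("doc1.pdf", 96.0),
    ("doc2.pdf", 88.0),
    ("doc3.pdf", 72.0),
    ("doc4.pdf", None),
]
print(summarize(batch))
```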

Tool Recommendations

Cloud Services

Google Cloud Vision (excellent accuracy)
Amazon Textract (good for forms)
Azure Computer Vision (balanced)
Adobe Acrobat (integrated)

Desktop Software

ABBYY FineReader (best accuracy)
Adobe Acrobat Pro (reliable)
Readiris (good value)
Tesseract (free, open source)

Programming Libraries

pytesseract (Python + Tesseract)
EasyOCR (Python, multi-language)
PaddleOCR (Python, good for Asian languages)

Limitations

Cannot guarantee 100% accuracy
Handwritten text has low accuracy
Very small text may not extract well
Decorative fonts are problematic
Background images reduce quality
Cannot read text in complex graphics
Processing time increases with pages

Security Audits

Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
