PDF OCR Extraction by claude-office-skills/skills
npx skills add https://github.com/claude-office-skills/skills --skill 'PDF OCR Extraction'
Extract text from scanned documents and image-based PDFs using OCR technology.
This skill responds to requests such as:
- "Extract text from this scanned PDF"
- "OCR this document image"
- "Make this PDF searchable"
- "Extract text from pages 1-10, English language"
- "OCR this document, preserve layout"
- "Extract and output as structured data"
| Document Type | Expected Quality | Tips |
|---|---|---|
| Typed documents | ⭐⭐⭐⭐⭐ 95%+ | Best results |
| Printed books | ⭐⭐⭐⭐ 90%+ | Watch for age-related degradation (yellowing, faded ink) |
| Forms | ⭐⭐⭐⭐ 85%+ | Checkboxes may need manual review |
| Tables/Data | ⭐⭐⭐ 80%+ | Structure may need fixing |
| Handwritten (neat) | ⭐⭐ 60-80% | Variable results |
| Handwritten (cursive) | ⭐ 30-60% | Often needs manual review |
| Mixed content | ⭐⭐⭐ 75%+ | Depends on complexity |
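The table above can double as a lookup for deciding up front whether a document is likely to need a human pass. The sketch below is only an illustration: the names `EXPECTED_ACCURACY` and `plan_review` are hypothetical, the numbers are the lower bounds from the table, and the 80% review cutoff is an assumption borrowed from the quality report further down.

```python
# Lower bound of the expected accuracy for each document type, taken from the table.
EXPECTED_ACCURACY = {
    "typed documents": 95,
    "printed books": 90,
    "forms": 85,
    "tables/data": 80,
    "handwritten (neat)": 60,
    "handwritten (cursive)": 30,
    "mixed content": 75,
}


def plan_review(doc_type: str, review_below: int = 80) -> dict:
    """Return the expected minimum accuracy and whether manual review is advisable."""
    floor = EXPECTED_ACCURACY[doc_type.strip().lower()]
    # Assumption: anything expected to fall under `review_below` gets a human check.
    return {"expected_min_accuracy": floor, "manual_review": floor < review_below}
```

For example, `plan_review("Handwritten (cursive)")` flags the document for manual review, while `plan_review("Forms")` does not.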
## OCR Result: [Document Name]
**Pages Processed**: [X]
**Language**: [Detected/Specified]
**Confidence**: [X]%
---
[Extracted text content here]
---
### Notes
- [Any issues or uncertainties]
- [Characters that may be incorrect]
## OCR Extraction: [Document Name]
### Document Info
| Field | Value |
|-------|-------|
| Title | [Extracted or inferred] |
| Date | [If found] |
| Author | [If found] |
### Content by Section
#### [Header 1]
[Content under this header]
#### [Header 2]
[Content under this header]
### Tables Found
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| [Data] | [Data] | [Data] |
### Uncertain Text
| Page | Original | Confidence | Possible |
|------|----------|------------|----------|
| 3 | "teh" | 70% | "the" |
| 5 | "l0ve" | 65% | "love" |
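One way the "Uncertain Text" table could be produced is by filtering recognized words on their confidence score. This is a minimal sketch, assuming the OCR engine reports a per-word confidence as a percentage and a suggested correction. The function name `uncertain_text_rows` and the 80% threshold are hypothetical.

```python
def uncertain_text_rows(words, threshold=80.0):
    """Build the Uncertain Text table from (page, text, confidence, suggestion) tuples.

    Only words whose confidence is below `threshold` are listed.
    """
    lines = [
        "| Page | Original | Confidence | Possible |",
        "|------|----------|------------|----------|",
    ]
    for page, text, confidence, suggestion in words:
        if confidence < threshold:
            lines.append(f'| {page} | "{text}" | {confidence:.0f}% | "{suggestion}" |')
    return "\n".join(lines)
```

Called with the two sample words from the template plus a high-confidence word, it emits only the two low-confidence rows.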
## OCR to Searchable PDF
**Source**: [filename.pdf]
**Output**: [filename_searchable.pdf]
### Processing Summary
| Metric | Value |
|--------|-------|
| Pages | [X] |
| Words extracted | [Y] |
| Average confidence | [Z]% |
| Processing time | [T] seconds |
### Quality Report
- [X] pages with 95%+ confidence
- [Y] pages with 80-94% confidence
- [Z] pages with <80% confidence (review recommended)
### Searchability
✅ Document is now text-searchable
✅ Original images preserved
✅ Text layer added behind images
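The three buckets in the Quality Report can be computed directly from per-page confidence scores. The sketch below assumes those scores are available as percentages. `quality_report` is a hypothetical helper name, not part of the skill.

```python
def quality_report(page_confidences):
    """Group per-page confidence percentages into the three report buckets."""
    high = sum(1 for c in page_confidences if c >= 95)
    medium = sum(1 for c in page_confidences if 80 <= c < 95)
    low = sum(1 for c in page_confidences if c < 80)
    return [
        f"- {high} pages with 95%+ confidence",
        f"- {medium} pages with 80-94% confidence",
        f"- {low} pages with <80% confidence (review recommended)",
    ]
```

The middle bucket uses `80 <= c < 95` so that fractional scores such as 94.5 are not dropped between buckets.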
Before running OCR, check for these common issues and correct them:
| Issue | Solution |
|---|---|
| Low resolution | Upscale image first |
| Skewed/rotated | Auto-deskew |
| Poor contrast | Adjust levels/threshold |
| Noise/specks | Apply noise reduction |
| Shadows | Flatten lighting |
| Color document | Convert to grayscale |
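Two of the fixes in the table, grayscale conversion and thresholding, are simple enough to sketch in plain Python. The example below is illustrative only: it treats an image as nested lists of pixel values, uses the standard BT.601 luma weights, and picks the threshold with Otsu's method, which is one common choice and not necessarily what the skill uses. A real pipeline would normally rely on an imaging library.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 luminance values (BT.601 weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def otsu_threshold(gray_rows):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for row in gray_rows:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray_rows, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]
```

A typical call chain would be `binarize(gray, otsu_threshold(gray))` with `gray = to_grayscale(pixels)`.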
"OCR this document, detect language automatically"
"Extract text, primary: English, secondary: Chinese"
## Form Extraction: [Form Name]
### Field Values
| Field | Value | Confidence |
|-------|-------|------------|
| Name | John Smith | 98% |
| Date | 01/15/2026 | 95% |
| Address | 123 Main St | 92% |
### Checkboxes
| Question | Checked |
|----------|---------|
| Option A | ☑️ Yes |
| Option B | ☐ No |
| Option C | ☑️ Yes |
### Signature
[Signature detected on page X - cannot extract text]
## Table Extraction
### Table 1 (Page 2)
| Header A | Header B | Header C |
|----------|----------|----------|
| Value 1 | Value 2 | Value 3 |
| Value 4 | Value 5 | Value 6 |
**Table confidence**: 85%
**Note**: Column 3 may have alignment issues
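Alignment problems like the one noted for column 3 often show up as rows with too few or too many cells. The following sketch shows one possible way to normalize such rows before rendering them as a Markdown table. `to_markdown_table` is a hypothetical helper, and folding overflow cells into the last column is a design assumption made here to avoid dropping data.

```python
def to_markdown_table(headers, rows):
    """Render rows as a Markdown table, padding short rows and folding overflow cells."""
    width = len(headers)
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    for row in rows:
        cells = [str(cell) for cell in row]
        if len(cells) > width:
            # Keep extra cells by merging them into the last column.
            cells = cells[: width - 1] + [" ".join(cells[width - 1:])]
        # Pad short rows with empty cells so every row has `width` columns.
        cells += [""] * (width - len(cells))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Rows repaired this way still deserve a visual check against the source page, since the sketch cannot tell which cell was actually missing.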
## Handwritten Text Extraction
**Legibility Assessment**: [Good/Fair/Poor]
**Recommended**: Manual review
### Extracted Text (Confidence: 65%)
[Extracted text with uncertain words marked]
### Uncertain Words
| Original | Best Guess | Alternatives |
|----------|------------|--------------|
| [image] | "meeting" | "meeting", "meaning" |
| [image] | "Tuesday" | "Tuesday", "Thursday" |
⚠️ **Low confidence extraction - please verify manually**
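"Extracted text with uncertain words marked" could be produced by tagging each low-confidence word inline. This is a small sketch under assumptions: the `[word?]` marker format, the function name `mark_uncertain`, and the 80% threshold are all hypothetical choices, not something the skill prescribes.

```python
def mark_uncertain(words, threshold=80.0):
    """Join (text, confidence_percent) pairs, wrapping low-confidence words as [text?]."""
    return " ".join(
        f"[{text}?]" if confidence < threshold else text
        for text, confidence in words
    )
```

For instance, `mark_uncertain([("Team", 92), ("meeting", 65)])` returns `"Team [meeting?]"`, leaving the flagged word easy to spot during manual review.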
## Batch OCR Processing
**Folder**: [Path]
**Total Documents**: [X]
**Status**: [In Progress/Complete]
### Results
| File | Pages | Confidence | Status |
|------|-------|------------|--------|
| doc1.pdf | 5 | 96% | ✅ Complete |
| doc2.pdf | 12 | 88% | ✅ Complete |
| doc3.pdf | 3 | 72% | ⚠️ Review |
| doc4.pdf | 8 | - | ❌ Failed |
### Issues
- doc3.pdf: Pages 2-3 have handwriting
- doc4.pdf: File corrupted
### Summary
- Successful: [X]
- Need Review: [Y]
- Failed: [Z]
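The batch summary counts follow from the per-file results. A minimal sketch, assuming a failed file is represented by a confidence of `None` and that files under 80% confidence are routed to review (an assumed cutoff, consistent with the sample table where 72% is marked for review and 88% is not). `summarize_batch` is a hypothetical name.

```python
def summarize_batch(results, review_below=80.0):
    """Tally (filename, confidence) pairs into the three summary counters.

    A confidence of None means the file could not be processed.
    """
    summary = {"Successful": 0, "Need Review": 0, "Failed": 0}
    for _filename, confidence in results:
        if confidence is None:
            summary["Failed"] += 1
        elif confidence < review_below:
            summary["Need Review"] += 1
        else:
            summary["Successful"] += 1
    return summary
```

Applied to the four sample files in the results table, this yields two successful documents, one needing review, and one failure.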