PDF Processing Pro - Production-ready Python PDF processing toolkit with support for forms, table extraction, and OCR

PDF Processing Pro by nilecui/skillsbase

21 GitHub Stars

Install command

npx skills add https://github.com/nilecui/skillsbase --skill 'PDF Processing Pro'


PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

Quick start

Extract text from PDF

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Analyze PDF form (using included script)

python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions

Fill PDF form with validation

python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting

Extract tables from PDF

python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection

Features

✅ Production-ready scripts

All scripts include:

Error handling: Graceful failures with detailed error messages
Validation: Input validation and type checking
Logging: Configurable logging with timestamps
Type hints: Full type annotations for IDE support
CLI interface: --help flag for all scripts
Exit codes: Proper exit codes for automation
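To make the list above concrete, here is a minimal sketch of how a script with these properties could be laid out. It is not the source of any included script: the names `build_parser`, `main`, and the `pdf_script` logger are hypothetical, and only the exit codes are taken from this document's Error handling section.

```python
import argparse
import logging
import sys
from pathlib import Path
from typing import Optional, Sequence

# Exit codes as listed in this document's "Error handling" section.
EXIT_OK = 0
EXIT_FILE_NOT_FOUND = 1
EXIT_INVALID_INPUT = 2

log = logging.getLogger("pdf_script")  # hypothetical logger name


def build_parser() -> argparse.ArgumentParser:
    # argparse provides --help automatically.
    parser = argparse.ArgumentParser(description="Hypothetical PDF script skeleton")
    parser.add_argument("input", type=Path, help="path to the input PDF")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",  # timestamped log lines
    )
    # Validate inputs before doing any work, and fail with a specific code.
    if not args.input.is_file():
        log.error("File not found: %s", args.input)
        return EXIT_FILE_NOT_FOUND
    if args.input.suffix.lower() != ".pdf":
        log.error("Expected a .pdf file, got: %s", args.input.name)
        return EXIT_INVALID_INPUT
    log.info("Would process %s here", args.input)
    return EXIT_OK
```

When run as a script, the entry point would be `raise SystemExit(main())`. Note that argparse itself exits with status 2 on usage errors, which happens to line up with the "Invalid input" code.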

✅ Comprehensive workflows

PDF Forms: Complete form processing pipeline
Table Extraction: Advanced table detection and extraction
OCR Processing: Scanned PDF text extraction
Batch Operations: Process multiple PDFs efficiently
Validation: Pre- and post-processing validation

Advanced topics

PDF Form Processing

For complete form workflows including:

Field analysis and detection
Dynamic form filling
Validation rules
Multi-page forms
Checkbox and radio button handling

See FORMS.md

Table Extraction

For complex table extraction:

Multi-page tables
Merged cells
Nested tables
Custom table detection
Export to CSV/Excel

See TABLES.md

OCR Processing

For scanned PDFs and image-based documents:

Tesseract integration
Language support
Image preprocessing
Confidence scoring
Batch OCR

See OCR.md

Included scripts

Form processing

analyze_form.py - Extract form field information

python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]

fill_form.py - Fill PDF forms with data

python scripts/fill_form.py input.pdf data.json output.pdf [--validate]

validate_form.py - Validate form data before filling

python scripts/validate_form.py data.json schema.json
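The format of `schema.json` is not described here, so the following is only an illustration of the kind of check such a validation step might perform. It assumes a hypothetical schema shape, a mapping from field name to `{"type": ..., "required": ...}`, which may differ from what `analyze_form.py` actually writes.

```python
from typing import Any, Dict, List

# Hypothetical schema shape (an assumption, not the toolkit's real format):
#   {"name": {"type": "text", "required": True},
#    "agree": {"type": "checkbox", "required": False}}
Schema = Dict[str, Dict[str, Any]]


def validate_submission(data: Dict[str, Any], schema: Schema) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for name, spec in schema.items():
        if name not in data:
            if spec.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        value = data[name]
        if spec.get("type") == "checkbox" and not isinstance(value, bool):
            errors.append(f"{name}: expected true/false, got {value!r}")
        elif spec.get("type") == "text" and not isinstance(value, str):
            errors.append(f"{name}: expected text, got {value!r}")
    # Fields that the form does not define are reported as well.
    for name in data:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a submission at once.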

Table extraction

extract_tables.py - Extract tables to CSV/Excel

python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
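For custom scripts, the CSV step can be sketched directly. The helper below is hypothetical and not part of `extract_tables.py`; it assumes tables in the shape that pdfplumber's `page.extract_tables()` returns, a list of tables where each table is a list of rows and each cell is a string or `None`.

```python
import csv
from typing import IO, List, Optional

Table = List[List[Optional[str]]]


def write_tables_csv(tables: List[Table], out: IO[str]) -> int:
    """Write every row of every table to `out` as CSV; return the row count."""
    writer = csv.writer(out)
    rows = 0
    for table in tables:
        for row in table:
            # Empty cells arrive as None; write them as empty strings.
            writer.writerow(["" if cell is None else cell for cell in row])
            rows += 1
    return rows
```

With pdfplumber, `tables` could be collected by calling `page.extract_tables()` on each page of an opened PDF. When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.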

Text extraction

extract_text.py - Extract text with formatting preservation

python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]

Utilities

merge_pdfs.py - Merge multiple PDFs

python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf

split_pdf.py - Split PDF into individual pages

python scripts/split_pdf.py input.pdf --output-dir pages/

validate_pdf.py - Validate PDF integrity

python scripts/validate_pdf.py input.pdf
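What `validate_pdf.py` checks is not specified here. As a rough idea of a cheap first-pass integrity test, the hypothetical helper below looks for the `%PDF-` header near the start of the file and the `%%EOF` marker near the end. It does not parse the cross-reference table or page tree, so passing it does not prove a file is readable.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Cheap sanity check: '%PDF-' near the start and '%%EOF' near the end."""
    size = path.stat().st_size
    with path.open("rb") as f:
        head = f.read(1024)
        # Only the last 1 KiB is read, so large files are not loaded whole.
        f.seek(max(0, size - 1024))
        tail = f.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```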

Common workflows

Workflow 1: Process form submissions

# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json

# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json

# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf

# 4. Validate output
python scripts/validate_pdf.py completed.pdf
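The four steps above can also be driven from Python so that the pipeline stops at the first failing step. This is a sketch under the assumption that each script signals failure through a non-zero exit code, as described in the Error handling section; `run_steps` and `form_pipeline` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Step failed ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            print(result.stderr, file=sys.stderr)
            return result.returncode
    return 0


def form_pipeline(template: str, submission: str, output: str) -> List[List[str]]:
    """Build the four commands of Workflow 1 for the given file names."""
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "scripts/analyze_form.py", template, "--output", "schema.json"],
        [py, "scripts/validate_form.py", submission, "schema.json"],
        [py, "scripts/fill_form.py", template, submission, output],
        [py, "scripts/validate_pdf.py", output],
    ]
```

A caller would then use `run_steps(form_pipeline("template.pdf", "submission.json", "completed.pdf"))` and pass the result on as its own exit code.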

Workflow 2: Extract data from reports

# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv

# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt

Workflow 3: Batch processing

import subprocess
import sys
from pathlib import Path

# Extracted text goes to processed/<name>.txt; create the directory up front
out_dir = Path("processed")
out_dir.mkdir(exist_ok=True)

# Process all PDFs in directory
for pdf_file in sorted(Path("invoices").glob("*.pdf")):
    output_file = out_dir / pdf_file.with_suffix(".txt").name

    result = subprocess.run([
        sys.executable, "scripts/extract_text.py",
        str(pdf_file),
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str

    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr.strip()}")
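If a script has no built-in parallel mode, the batch loop above can be run concurrently from the calling side. The sketch below uses a thread pool, which is enough here because the actual work happens in child processes; `build_command` and `run_all` are hypothetical helpers, and the `extract_text.py` arguments follow the usage shown in this document.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path) -> List[str]:
    """Command line for extracting one PDF to <out_dir>/<name>.txt."""
    target = out_dir / pdf.with_suffix(".txt").name
    return [sys.executable, "scripts/extract_text.py", str(pdf), "--output", str(target)]


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run the commands concurrently; exit codes come back in input order."""
    def run(cmd: List[str]) -> int:
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

`max_workers` bounds how many child processes run at once, which also bounds peak memory use when the PDFs are large.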

Error handling

All scripts follow consistent error patterns:

# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error

# Example usage in automation
import subprocess

result = subprocess.run(
    ["python", "scripts/fill_form.py", "input.pdf", "data.json", "output.pdf"]
)

if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
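In longer automation code, the numeric codes listed above are easier to work with as named constants. The following is one possible wrapper; the `ExitCode` enum and `describe` helper are illustrative names that do not come from the toolkit.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as documented for the included scripts."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe(returncode: int) -> str:
    """Turn a return code into a short message, tolerating unknown codes."""
    try:
        return ExitCode(returncode).name.replace("_", " ").capitalize()
    except ValueError:
        # For example, a negative code when the process was killed by a signal.
        return f"Unknown exit code {returncode}"
```

Because `IntEnum` members compare equal to plain integers, `result.returncode == ExitCode.VALIDATION_ERROR` works without conversion.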

Dependencies

All scripts require:

pip install pdfplumber pypdf pillow pytesseract pandas

Optional for OCR:

# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases

Performance tips

Use batch processing for multiple PDFs
Enable multiprocessing with the --parallel flag (where supported)
Cache extracted data to avoid re-processing
Validate inputs early to fail fast
Use streaming for large PDFs (>50MB)
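The caching tip can be sketched as a small content-addressed cache: results are stored under the SHA-256 digest of the PDF's bytes, so a renamed but identical file is a cache hit and a modified file is a miss. The helper names and the `extract` callback are assumptions for illustration; any function that takes a path and returns text would fit.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_extract(pdf: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return cached text for `pdf` if present, else call `extract` and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{file_digest(pdf)}.txt"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    text = extract(pdf)
    entry.write_text(text, encoding="utf-8")
    return text
```

Hashing in chunks keeps memory use flat even for the large files mentioned in the last tip.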

Best practices

Always validate inputs before processing
Use try-except in custom scripts
Log all operations for debugging
Test with sample PDFs before production
Set timeouts for long-running operations
Check exit codes in automation
Back up originals before modification
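Two of these practices, backing up before modification and setting timeouts, take only a few lines with the standard library. The helpers below are a sketch with hypothetical names; the `.bak` suffix is an arbitrary choice.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def backup(path: Path) -> Path:
    """Copy `path` to `<name>.bak` next to it (metadata preserved); return the copy."""
    copy = path.with_name(path.name + ".bak")
    shutil.copy2(path, copy)
    return copy


def run_with_timeout(cmd: List[str], seconds: float) -> Optional[int]:
    """Return the exit code, or None if the command exceeded the timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return result.returncode
```

Returning `None` on timeout keeps it distinct from every documented exit code, so callers can treat "took too long" separately from "failed".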

Troubleshooting

Common issues

"Module not found" errors:

pip install -r requirements.txt

Tesseract not found:

# Install tesseract system package (see Dependencies)

Memory errors with large PDFs:

# Process page by page instead of loading entire PDF
import pdfplumber

with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process page immediately

Permission errors:

chmod +x scripts/*.py

Getting help

All scripts support --help:

python scripts/analyze_form.py --help
python scripts/extract_tables.py --help

For detailed documentation on specific topics, see:

FORMS.md - Complete form processing guide
TABLES.md - Advanced table extraction
OCR.md - Scanned PDF processing
