PDF Processing Pro by nilecui/skillsbase
npx skills add https://github.com/nilecui/skillsbase --skill 'PDF Processing Pro'
Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
All scripts include:
--help flag for all scripts
For complete form workflows, see FORMS.md.
For complex table extraction, see TABLES.md.
For scanned PDFs and image-based documents, see OCR.md.
analyze_form.py - Extract form field information
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
fill_form.py - Fill PDF forms with data
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
validate_form.py - Validate form data before filling
python scripts/validate_form.py data.json schema.json
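The page does not show the schema format that validate_form.py expects, so the following is only a sketch of what a pre-fill check of this kind could look like. It assumes a hypothetical schema that maps each field name to a dict with a `required` key; the function name `find_problems` and the schema layout are illustrative and are not taken from the script.

```python
# Hypothetical schema layout: {"field_name": {"type": "text", "required": True}}
# This layout is an assumption for illustration; the real schema.json may differ.

def find_problems(data, schema):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    # Every required field must be present and non-empty.
    for name, spec in schema.items():
        if spec.get("required") and data.get(name) in (None, ""):
            problems.append(f"missing required field: {name}")
    # Every submitted field must exist in the form.
    for name in data:
        if name not in schema:
            problems.append(f"unknown field: {name}")
    return problems


schema = {
    "name": {"type": "text", "required": True},
    "email": {"type": "text", "required": False},
}
print(find_problems({"name": "Ada", "phone": "555"}, schema))
```

Collecting all problems before reporting, rather than stopping at the first one, lets a caller fix a submission in a single pass.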
extract_tables.py - Extract tables to CSV/Excel
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
extract_text.py - Extract text with formatting preservation
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
merge_pdfs.py - Merge multiple PDFs
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
split_pdf.py - Split PDF into individual pages
python scripts/split_pdf.py input.pdf --output-dir pages/
validate_pdf.py - Validate PDF integrity
python scripts/validate_pdf.py input.pdf
# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json
# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
# 4. Validate output
python scripts/validate_pdf.py completed.pdf
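The four steps above could be chained from Python so that a failure stops the pipeline early. This is a minimal sketch, assuming the scripts signal failure through a non-zero exit code as described in the error-handling notes; `run_steps` and `FORM_WORKFLOW` are hypothetical names introduced here.

```python
import subprocess
import sys


def run_steps(steps, runner=subprocess.run):
    """Run commands in order; return (command, exit code) for the first failure, else None."""
    for command in steps:
        result = runner(command)
        if result.returncode != 0:
            return command, result.returncode
    return None


# The same four commands as above, using the current interpreter.
PYTHON = sys.executable
FORM_WORKFLOW = [
    [PYTHON, "scripts/analyze_form.py", "template.pdf", "--output", "schema.json"],
    [PYTHON, "scripts/validate_form.py", "submission.json", "schema.json"],
    [PYTHON, "scripts/fill_form.py", "template.pdf", "submission.json", "completed.pdf"],
    [PYTHON, "scripts/validate_pdf.py", "completed.pdf"],
]

# Usage: failure = run_steps(FORM_WORKFLOW); a value of None means every step succeeded.
```

The `runner` parameter exists only so the control flow can be exercised without the real scripts or PDFs.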
# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
import glob
from pathlib import Path
import subprocess
# Process all PDFs in directory
for pdf_file in glob.glob("invoices/*.pdf"):
    # Write the extracted text to processed/<name>.txt
    output_file = Path("processed") / (Path(pdf_file).stem + ".txt")
    result = subprocess.run([
        "python", "scripts/extract_text.py",
        pdf_file,
        "--output", str(output_file)
    ], capture_output=True, text=True)  # text=True decodes stderr to str
    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr}")
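For larger batches, the same loop could be run concurrently. The sketch below is one possible approach using only the standard library; the helper names (`build_command`, `process_one`, `process_all`) and the choice to write `<stem>.txt` files are assumptions made here, not part of the toolkit.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(pdf_file, out_dir="processed"):
    """Build the extract_text.py call for one PDF, writing <stem>.txt into out_dir."""
    output_file = Path(out_dir) / (Path(pdf_file).stem + ".txt")
    return [sys.executable, "scripts/extract_text.py",
            str(pdf_file), "--output", str(output_file)]


def process_one(pdf_file):
    """Run the script for one file and return (file, exit code, stderr text)."""
    result = subprocess.run(build_command(pdf_file), capture_output=True, text=True)
    return pdf_file, result.returncode, result.stderr.strip()


def process_all(pdf_files, max_workers=4):
    """Process files concurrently and return the results in input order."""
    # Threads are enough here because the heavy work happens in child processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, pdf_files))
```

A caller would pass something like `glob.glob("invoices/*.pdf")` to `process_all`. Whether the script creates the output directory itself is not stated, so creating `processed/` beforehand is the safer assumption.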
All scripts follow consistent error patterns:
# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error
# Example usage in automation
result = subprocess.run(["python", "scripts/fill_form.py", ...])
if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
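In automation code, the exit codes listed above can be given names so that callers do not compare against bare integers. This is a small sketch; the `ExitCode` enum and `describe` helper are hypothetical conveniences, and only the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as documented for the scripts."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe(returncode):
    """Turn a process return code into a short human-readable label."""
    try:
        return ExitCode(returncode).name.replace("_", " ").capitalize()
    except ValueError:
        # Anything outside the documented range, e.g. a process killed by a signal.
        return f"Unknown exit code {returncode}"


print(describe(4))
```

Because `ExitCode` is an `IntEnum`, a comparison such as `result.returncode == ExitCode.VALIDATION_ERROR` works directly with the value returned by `subprocess.run`.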
All scripts require:
pip install pdfplumber pypdf pillow pytesseract pandas
Optional for OCR:
# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
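Before running the scripts, it can help to confirm that the dependencies above are actually available. The following sketch uses only the standard library; the function names are hypothetical, and the check covers only whether a module can be located and whether a `tesseract` executable is on the PATH, not version compatibility.

```python
import importlib.util
import shutil

# Import names can differ from pip names: pillow is imported as PIL.
REQUIRED_MODULES = ["pdfplumber", "pypdf", "PIL", "pytesseract", "pandas"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_available():
    """Report whether a tesseract executable is on the PATH (needed only for OCR)."""
    return shutil.which("tesseract") is not None


print("Missing modules:", missing_modules() or "none")
print("tesseract on PATH:", tesseract_available())
```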
--parallel flag (where supported)
"Module not found" errors:
pip install -r requirements.txt
Tesseract not found:
# Install tesseract system package (see Dependencies)
Memory errors with large PDFs:
# Process page by page instead of loading entire PDF
with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process page immediately
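One way to "process the page immediately" is to write each page's text to disk as soon as it is extracted, so the full document text is never held in memory at once. The helper below is a sketch with a hypothetical name (`write_page_text`); it accepts any iterable of objects that have an `extract_text()` method, such as `pdf.pages` from pdfplumber.

```python
def write_page_text(pages, out):
    """Write each page's text to `out` as soon as it is extracted; return the page count."""
    count = 0
    for count, page in enumerate(pages, start=1):
        # Image-only pages may yield no text, so fall back to an empty string.
        text = page.extract_text() or ""
        out.write(f"--- page {count} ---\n{text}\n")
    return count


# Usage with pdfplumber (needs the package installed and a real file):
# import pdfplumber
# with pdfplumber.open("large.pdf") as pdf, open("large.txt", "w", encoding="utf-8") as out:
#     write_page_text(pdf.pages, out)
```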
Permission errors:
chmod +x scripts/*.py
All scripts support --help:
python scripts/analyze_form.py --help
python scripts/extract_tables.py --help
For detailed documentation on specific topics, see FORMS.md, TABLES.md, and OCR.md.