table-extractor: Extract tables from PDFs and images, with OCR-based conversion to CSV/Excel/JSON

table-extractor by dkyazzentwatwa/chatgpt-skills

93 weekly installs

36 GitHub Stars

Install command

npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill table-extractor

Tags: Automation, Data Processing, Computer Vision

Table Extractor

Extract tables from PDFs and images into structured data formats.

Features

PDF Tables: Extract tables from digital (text-based) PDFs
Image Tables: OCR-based extraction from images
Multiple Tables: Extract every table in a document
Format Export: CSV, Excel, and JSON output
Table Detection: Automatic detection of table boundaries
Column Alignment: Automatic column detection
Multi-Page: Process entire PDF documents

Quick Start

from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)

CLI Usage

# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
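
The `--pages 1-3` flag takes a range, while `load_pdf` in the Python API takes a list of page numbers. A minimal sketch of bridging the two is shown below. `parse_page_spec` is a hypothetical helper, not part of the skill. The comma-separated form and the inclusive range ends are assumptions. Whether the API counts pages from 0 or 1 is not stated on this page, so the numbers are returned exactly as written.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page spec such as "1-3,5" into [1, 2, 3, 5].

    Hypothetical helper: numbers are returned as written, so any
    0-based/1-based adjustment is left to the caller.
    """
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))  # inclusive on both ends (assumed)
        else:
            pages.append(int(part))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(pages))


print(parse_page_spec("1-3"))      # [1, 2, 3]
print(parse_page_spec("1-3,5,2"))  # [1, 2, 3, 5]
```

The resulting list could then be passed as `extractor.load_pdf("document.pdf", pages=...)`, after checking which page-numbering convention the library expects.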

API Reference

TableExtractor Class

from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    """Signature summary only; method bodies are omitted."""

    def __init__(self) -> None: ...

    # Loading (each loader returns the extractor itself, so calls can be chained)
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> 'TableExtractor': ...
    def load_image(self, filepath: str) -> 'TableExtractor': ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor': ...
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor': ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...

Supported Formats

Input

PDF documents (text-based and scanned)
Images: PNG, JPEG, TIFF, BMP
Screenshots with tables
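
Because PDFs and images use different loaders, a caller handling mixed inputs needs to pick one per file. The sketch below does that by file extension, using only the methods listed in the API Reference. `extract_from_path` and `IMAGE_SUFFIXES` are hypothetical names, and the extension list is an assumption based on the formats above. The extractor is passed in as an argument so the function makes no assumption about how it was constructed.

```python
from pathlib import Path
from typing import Any, List

# Assumed extensions for the image formats listed above.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def extract_from_path(extractor: Any, path: str, ocr_lang: str = "eng") -> List[Any]:
    """Load `path` with the matching loader and return a list of tables."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        extractor.load_pdf(path)
        return extractor.extract_all()
    if suffix in IMAGE_SUFFIXES:
        # Mirrors the scanned-document example, which enables OCR
        # before loading an image.
        extractor.set_ocr(enabled=True, lang=ocr_lang)
        extractor.load_image(path)
        # extract_table() returns one table; wrap it so both branches
        # return a list.
        return [extractor.extract_table()]
    raise ValueError(f"unsupported input type: {path!r}")
```

With the skill installed, this would be called as `extract_from_path(TableExtractor(), "scan.png")`. A scanned PDF would still need OCR switched on separately, since the extension alone cannot tell a scanned PDF from a text-based one.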

Output

CSV (one file per table)
Excel (multiple sheets)
JSON (array of tables)
Pandas DataFrame
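
The exact JSON schema written by `to_json` is not documented on this page. As a rough illustration of what "array of tables" could look like, the sketch below serializes each table as a list of row dictionaries, which is the shape pandas produces with `DataFrame.to_dict(orient="records")`. The `index`, `rows`, and `data` keys, the `tables_to_json` name, and the sample values are all made up for illustration.

```python
import json
from typing import Any, Dict, List

Row = Dict[str, Any]


def tables_to_json(tables: List[List[Row]]) -> str:
    """Serialize tables (each a list of row dicts) as one JSON array."""
    payload = [
        {"index": i, "rows": len(rows), "data": rows}
        for i, rows in enumerate(tables)
    ]
    # ensure_ascii=False keeps non-ASCII cell text readable in the output.
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Made-up sample data standing in for two extracted tables.
tables = [
    [{"region": "North", "revenue": 120}, {"region": "South", "revenue": 95}],
    [{"item": "Widget", "qty": 3}],
]
print(tables_to_json(tables))
```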

Table Detection

# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)
# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
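
Detection results in this shape can be filtered before any extraction runs, for example to skip tiny regions that are unlikely to be real tables. The sketch below assumes the dictionary keys shown above. `select_tables`, `bbox_area`, the thresholds, and the sample coordinates are hypothetical.

```python
from typing import Dict, List, Sequence


def bbox_area(bbox: Sequence[float]) -> float:
    """Area of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1) * abs(y2 - y1)


def select_tables(tables_info: List[Dict], min_rows: int = 2, min_cols: int = 2) -> List[int]:
    """Return indexes of tables meeting the size thresholds, largest bbox first."""
    kept = [
        t for t in tables_info
        if t["rows"] >= min_rows and t["cols"] >= min_cols
    ]
    kept.sort(key=lambda t: bbox_area(t["bbox"]), reverse=True)
    return [t["index"] for t in kept]


# Made-up detection output in the documented shape.
sample = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 8, "cols": 3, "bbox": (50, 450, 350, 650)},
    {"index": 2, "rows": 1, "cols": 1, "bbox": (400, 700, 420, 710)},
]
print(select_tables(sample))  # [0, 1]
```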

Example Workflows

PDF Report Tables

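from table_extractor import TableExtractor

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables (a list of pandas DataFrames)
tables = extractor.extract_all()

# Export each table to its own CSV file, without the DataFrame index column
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)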

Scanned Document

from table_extractor import TableExtractor

extractor = TableExtractor()
# Enable OCR before loading the scanned image
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)

Dependencies

pdfplumber>=0.10.0
pillow>=10.0.0
pandas>=2.0.0
pytesseract>=0.3.10 (for OCR)
opencv-python>=4.8.0
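
Several of these packages are imported under a different name than they are installed under (`pillow` as `PIL`, `opencv-python` as `cv2`). In addition, `pytesseract` is only a wrapper and needs the Tesseract executable installed separately. A small environment check along these lines may help when OCR fails silently. `check_environment` and `IMPORT_NAMES` are hypothetical names, and the check does not verify version numbers.

```python
import importlib.util
import shutil
from typing import Dict

# pip package name -> module name used in `import` statements
IMPORT_NAMES = {
    "pdfplumber": "pdfplumber",
    "pillow": "PIL",
    "pandas": "pandas",
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
}


def check_environment() -> Dict[str, bool]:
    """Report which packages are importable and whether `tesseract` is on PATH."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in IMPORT_NAMES.items()
    }
    # The OCR engine itself is a separate, non-Python install.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


for name, ok in check_environment().items():
    print(f"{name:20} {'ok' if ok else 'missing'}")
```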

Weekly Installs: 93
Repository: dkyazzentwatwa/…t-skills
GitHub Stars: 36
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (78), gemini-cli (76), codex (75), cursor (73), github-copilot (69), amp (64)