PDF表格提取工具 - 使用Camelot高精度提取复杂表格，支持合并单元格和无边框表格

table-extractor by claude-office-skills/skills

283 周安装量

26 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/claude-office-skills/skills --skill table-extractor

自动化数据分析数据处理

🇨🇳中文介绍

表格提取技能

概述

此技能使用 camelot（PDF表格提取的黄金标准）实现从PDF文档中精确提取表格。能够高精度处理包含合并单元格的复杂表格、无边框表格以及多页布局。

使用方法

提供包含表格的PDF文件
可选指定页面或表格检测方法
我将以pandas DataFrame形式提取表格

示例提示：

"从该PDF中提取所有表格"
"获取此报告第5页的表格"
"从此文档中提取无边框表格"
"将PDF表格转换为Excel格式"

领域知识

camelot 基础

import camelot

# 从PDF提取表格
tables = camelot.read_pdf('document.pdf')

# 访问结果
print(f"Found {len(tables)} tables")

# 获取第一个表格作为DataFrame
df = tables[0].df
print(df)

提取方法

方法	使用场景	描述
`lattice`	有边框表格	通过线条/边框检测表格

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

相关 Skills

FlyClaw：零登录航班聚合查询工具，Python实现多源航班信息与价格搜索

4,000,000 周安装

Azure RBAC 权限管理工具：查找最小角色、创建自定义角色与自动化分配

104,600 周安装

Azure Data Explorer (Kusto) 查询技能：KQL数据分析、日志遥测与时间序列处理

102,600 周安装

专业SEO审计工具：全面网站诊断、技术SEO优化与页面分析指南

59,900 周安装

# Lattice方法（默认）- 用于有可见边框的表格
tables = camelot.read_pdf('document.pdf', flavor='lattice')

# Stream方法 - 用于无边框表格
tables = camelot.read_pdf('document.pdf', flavor='stream')

# 单页
tables = camelot.read_pdf('document.pdf', pages='1')

# 多页
tables = camelot.read_pdf('document.pdf', pages='1,3,5')

# 页面范围
tables = camelot.read_pdf('document.pdf', pages='1-5')

# 所有页面
tables = camelot.read_pdf('document.pdf', pages='all')

tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,              # 线条检测灵敏度
    copy_text=['h', 'v'],       # 跨合并单元格复制文本
    shift_text=['l', 't'],      # 文本对齐偏移
    split_text=True,            # 在换行处拆分文本
    flag_size=True,             # 标记上标/下标
    strip_text='\n',            # 要剥离的字符
    process_background=False,   # 处理背景线条
)

tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,               # 边缘容差
    row_tol=10,                 # 行容差
    column_tol=0,               # 列容差
    strip_text='\n',            # 要剥离的字符
)

# 从特定区域提取（x1, y1, x2, y2）
# 坐标从左下角开始，以PDF点为单位（72点 = 1英寸）
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # 一个区域
)

# 多个区域
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)

# 手动指定列位置（用于stream方法）
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # 列分隔符的X位置
)

import camelot

tables = camelot.read_pdf('document.pdf')

for i, table in enumerate(tables):
    # 访问DataFrame
    df = table.df
    
    # 表格元数据
    print(f"Table {i+1}:")
    print(f"  Page: {table.page}")
    print(f"  Accuracy: {table.accuracy}")
    print(f"  Whitespace: {table.whitespace}")
    print(f"  Order: {table.order}")
    print(f"  Shape: {df.shape}")
    
    # 解析报告
    report = table.parsing_report
    print(f"  Report: {report}")

import camelot

tables = camelot.read_pdf('document.pdf')

# 导出为CSV
tables[0].to_csv('table.csv')

# 导出为Excel
tables[0].to_excel('table.xlsx')

# 导出为JSON
tables[0].to_json('table.json')

# 导出为HTML
tables[0].to_html('table.html')

# 导出所有表格
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')

import camelot

# 启用可视化调试
tables = camelot.read_pdf('document.pdf')

# 绘制检测到的表格区域
camelot.plot(tables[0], kind='contour').show()

# 绘制表格上的文本
camelot.plot(tables[0], kind='text').show()

# 绘制检测到的线条（仅限lattice）
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()

# 保存绘图
fig = camelot.plot(tables[0])
fig.savefig('debug.png')

import camelot
import pandas as pd

def extract_multipage_table(pdf_path, pages='all'):
    """提取并合并跨越多页的表格。"""
    
    tables = camelot.read_pdf(pdf_path, pages=pages)
    
    # 按相似结构（列）分组表格
    table_groups = {}
    
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    
    # 合并相似表格
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # 合并并去重标题行
            combined_df = pd.concat(dfs, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    
    return combined

尝试两种方法：有边框用lattice，无边框用stream
检查准确度分数：通常90%以上为良好
使用可视化调试：理解提取结果
指定区域：对于包含多种表格类型的PDF
处理表头：第一行通常需要特殊处理

import camelot
from pathlib import Path
import pandas as pd

def batch_extract_tables(input_dir, output_dir):
    """从目录中的所有PDF提取表格。"""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    results = []
    
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            
            for i, table in enumerate(tables):
                # 跳过低准确度表格
                if table.accuracy < 80:
                    continue
                
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    
    return results

自动检测表格方法

import camelot

def smart_extract_tables(pdf_path, pages='1'):
    """尝试两种方法并返回最佳结果。"""
    
    # 先尝试lattice
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    # 尝试stream
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    # 比较并返回最佳
    results = []
    
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    
    return results

示例1：财务报表提取

import camelot
import pandas as pd

def extract_financial_tables(pdf_path):
    """从年度报告中提取财务报表。"""
    
    # 提取所有表格
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        
        # 识别表格类型
        if 'revenue' in text or 'sales' in text:
            if 'operating income' in text or 'net income' in text:
                financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("Income Statement found:")
    print(financials['income_statement'])

示例2：科研数据提取

import camelot
import pandas as pd

def extract_research_data(pdf_path, pages='all'):
    """从研究论文中提取数据表格。"""
    
    # 尝试lattice处理有边框表格
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    if not tables or all(t.accuracy < 70 for t in tables):
        # 回退到stream处理无边框表格
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    extracted_data = []
    
    for table in tables:
        df = table.df
        
        # 清理DataFrame
        # 如果第一行看起来像表头，则将其设为表头
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    print(f"Table {i+1} (Page {item['page']}, Accuracy: {item['accuracy']}%):")
    print(item['data'].head())

示例3：发票行项目

import camelot

def extract_invoice_items(pdf_path):
    """从发票中提取行项目。"""
    
    # 通常发票有边框表格
    tables = camelot.read_pdf(pdf_path, flavor='lattice')
    
    line_items = []
    
    for table in tables:
        df = table.df
        
        # 查找典型的发票列
        header_text = ' '.join(df.iloc[0].astype(str)).lower()
        
        if any(term in header_text for term in ['quantity', 'qty', 'amount', 'price', 'description']):
            # 这看起来像行项目表格
            df.columns = df.iloc[0]
            df = df[1:]
            
            for _, row in df.iterrows():
                item = {}
                for col in df.columns:
                    col_lower = str(col).lower()
                    value = row[col]
                    
                    if 'desc' in col_lower or 'item' in col_lower:
                        item['description'] = value
                    elif 'qty' in col_lower or 'quantity' in col_lower:
                        item['quantity'] = value
                    elif 'price' in col_lower or 'rate' in col_lower:
                        item['unit_price'] = value
                    elif 'amount' in col_lower or 'total' in col_lower:
                        item['amount'] = value
                
                if item:
                    line_items.append(item)
    
    return line_items

items = extract_invoice_items('invoice.pdf')
for item in items:
    print(item)

示例4：表格比较

import camelot
import pandas as pd

def compare_pdf_tables(pdf1_path, pdf2_path):
    """比较两个PDF版本之间的表格。"""
    
    tables1 = camelot.read_pdf(pdf1_path)
    tables2 = camelot.read_pdf(pdf2_path)
    
    comparisons = []
    
    # 按形状和位置匹配表格
    for t1 in tables1:
        best_match = None
        best_score = 0
        
        for t2 in tables2:
            if t1.df.shape == t2.df.shape:
                # 计算相似度
                try:
                    similarity = (t1.df == t2.df).mean().mean()
                    if similarity > best_score:
                        best_score = similarity
                        best_match = t2
                except:
                    pass
        
        if best_match:
            comparisons.append({
                'page1': t1.page,
                'page2': best_match.page,
                'similarity': best_score,
                'identical': best_score == 1.0,
                'diff': pd.DataFrame(t1.df != best_match.df)
            })
    
    return comparisons

comparison = compare_pdf_tables('report_v1.pdf', 'report_v2.pdf')

不支持加密的PDF
基于图像的PDF需要OCR预处理
非常复杂的合并单元格可能需要调整
旋转的表格需要预处理
大型PDF可能需要逐页处理

pip install camelot-py[cv]

# 额外依赖
# macOS
brew install ghostscript tcl-tk

# Ubuntu
apt-get install ghostscript python3-tk

🇺🇸English

Table Extractor Skill

Overview

This skill enables precise extraction of tables from PDF documents using camelot - the gold standard for PDF table extraction. Handle complex tables with merged cells, borderless tables, and multi-page layouts with high accuracy.

How to Use

Provide the PDF containing tables
Optionally specify pages or table detection method
I'll extract tables as pandas DataFrames

Example prompts:

"Extract all tables from this PDF"
"Get the table on page 5 of this report"
"Extract borderless tables from this document"
"Convert PDF tables to Excel format"

Domain Knowledge

camelot Fundamentals

import camelot

# Extract tables from PDF
tables = camelot.read_pdf('document.pdf')

# Access results
print(f"Found {len(tables)} tables")

# Get first table as DataFrame
df = tables[0].df
print(df)

Extraction Methods

Method	Use Case	Description
`lattice`	Bordered tables	Detects table by lines/borders
`stream`	Borderless tables	Uses text positioning

# Lattice method (default) - for tables with visible borders
tables = camelot.read_pdf('document.pdf', flavor='lattice')

# Stream method - for borderless tables
tables = camelot.read_pdf('document.pdf', flavor='stream')

Page Selection

# Single page
tables = camelot.read_pdf('document.pdf', pages='1')

# Multiple pages
tables = camelot.read_pdf('document.pdf', pages='1,3,5')

# Page range
tables = camelot.read_pdf('document.pdf', pages='1-5')

# All pages
tables = camelot.read_pdf('document.pdf', pages='all')

Advanced Options

Lattice Options

tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,              # Line detection sensitivity
    copy_text=['h', 'v'],       # Copy text across merged cells
    shift_text=['l', 't'],      # Shift text alignment
    split_text=True,            # Split text at newlines
    flag_size=True,             # Flag super/subscripts
    strip_text='\n',            # Characters to strip
    process_background=False,   # Process background lines
)

Stream Options

tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,               # Edge tolerance
    row_tol=10,                 # Row tolerance
    column_tol=0,               # Column tolerance
    strip_text='\n',            # Characters to strip
)

Table Area Specification

# Extract from specific area (x1, y1, x2, y2)
# Coordinates from bottom-left, in PDF points (72 points = 1 inch)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # One area
)

# Multiple areas
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)

Column Specification

# Manually specify column positions (for stream method)
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # X positions of column separators
)

Working with Results

import camelot

tables = camelot.read_pdf('document.pdf')

for i, table in enumerate(tables):
    # Access DataFrame
    df = table.df
    
    # Table metadata
    print(f"Table {i+1}:")
    print(f"  Page: {table.page}")
    print(f"  Accuracy: {table.accuracy}")
    print(f"  Whitespace: {table.whitespace}")
    print(f"  Order: {table.order}")
    print(f"  Shape: {df.shape}")
    
    # Parsing report
    report = table.parsing_report
    print(f"  Report: {report}")

Export Options

import camelot

tables = camelot.read_pdf('document.pdf')

# Export to CSV
tables[0].to_csv('table.csv')

# Export to Excel
tables[0].to_excel('table.xlsx')

# Export to JSON
tables[0].to_json('table.json')

# Export to HTML
tables[0].to_html('table.html')

# Export all tables
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')

Visual Debugging

import camelot

# Enable visual debugging
tables = camelot.read_pdf('document.pdf')

# Plot detected table areas
camelot.plot(tables[0], kind='contour').show()

# Plot text on table
camelot.plot(tables[0], kind='text').show()

# Plot detected lines (lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()

# Save plot
fig = camelot.plot(tables[0])
fig.savefig('debug.png')

Handling Multi-page Tables

import camelot
import pandas as pd

def extract_multipage_table(pdf_path, pages='all'):
    """Extract and combine tables that span multiple pages."""
    
    tables = camelot.read_pdf(pdf_path, pages=pages)
    
    # Group tables by similar structure (columns)
    table_groups = {}
    
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    
    # Combine similar tables
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # Combine and deduplicate header rows
            combined_df = pd.concat(dfs, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    
    return combined

Best Practices

Try Both Methods : Lattice for bordered, stream for borderless
Check Accuracy Score : Above 90% is usually good
Use Visual Debugging : Understand extraction results
Specify Areas : For PDFs with multiple table types
Handle Headers : First row often needs special treatment

Common Patterns

Batch Table Extraction

import camelot
from pathlib import Path
import pandas as pd

def batch_extract_tables(input_dir, output_dir):
    """Extract tables from all PDFs in directory."""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    results = []
    
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            
            for i, table in enumerate(tables):
                # Skip low accuracy tables
                if table.accuracy < 80:
                    continue
                
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    
    return results

Auto-detect Table Method

import camelot

def smart_extract_tables(pdf_path, pages='1'):
    """Try both methods and return best results."""
    
    # Try lattice first
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    # Try stream
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    # Compare and return best
    results = []
    
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    
    return results

Examples

Example 1: Financial Statement Extraction

import camelot
import pandas as pd

def extract_financial_tables(pdf_path):
    """Extract financial tables from annual report."""
    
    # Extract all tables
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        
        # Identify table type
        if 'revenue' in text or 'sales' in text:
            if 'operating income' in text or 'net income' in text:
                financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("Income Statement found:")
    print(financials['income_statement'])

Example 2: Scientific Data Extraction

import camelot
import pandas as pd

def extract_research_data(pdf_path, pages='all'):
    """Extract data tables from research paper."""
    
    # Try lattice for bordered tables
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    if not tables or all(t.accuracy < 70 for t in tables):
        # Fall back to stream for borderless
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    extracted_data = []
    
    for table in tables:
        df = table.df
        
        # Clean up the DataFrame
        # Set first row as header if it looks like one
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    print(f"Table {i+1} (Page {item['page']}, Accuracy: {item['accuracy']}%):")
    print(item['data'].head())

Example 3: Invoice Line Items

import camelot

def extract_invoice_items(pdf_path):
    """Extract line items from invoice."""
    
    # Usually invoices have bordered tables
    tables = camelot.read_pdf(pdf_path, flavor='lattice')
    
    line_items = []
    
    for table in tables:
        df = table.df
        
        # Look for table with typical invoice columns
        header_text = ' '.join(df.iloc[0].astype(str)).lower()
        
        if any(term in header_text for term in ['quantity', 'qty', 'amount', 'price', 'description']):
            # This looks like a line items table
            df.columns = df.iloc[0]
            df = df[1:]
            
            for _, row in df.iterrows():
                item = {}
                for col in df.columns:
                    col_lower = str(col).lower()
                    value = row[col]
                    
                    if 'desc' in col_lower or 'item' in col_lower:
                        item['description'] = value
                    elif 'qty' in col_lower or 'quantity' in col_lower:
                        item['quantity'] = value
                    elif 'price' in col_lower or 'rate' in col_lower:
                        item['unit_price'] = value
                    elif 'amount' in col_lower or 'total' in col_lower:
                        item['amount'] = value
                
                if item:
                    line_items.append(item)
    
    return line_items

items = extract_invoice_items('invoice.pdf')
for item in items:
    print(item)

Example 4: Table Comparison

import camelot
import pandas as pd

def compare_pdf_tables(pdf1_path, pdf2_path):
    """Compare tables between two PDF versions."""
    
    tables1 = camelot.read_pdf(pdf1_path)
    tables2 = camelot.read_pdf(pdf2_path)
    
    comparisons = []
    
    # Match tables by shape and position
    for t1 in tables1:
        best_match = None
        best_score = 0
        
        for t2 in tables2:
            if t1.df.shape == t2.df.shape:
                # Calculate similarity
                try:
                    similarity = (t1.df == t2.df).mean().mean()
                    if similarity > best_score:
                        best_score = similarity
                        best_match = t2
                except:
                    pass
        
        if best_match:
            comparisons.append({
                'page1': t1.page,
                'page2': best_match.page,
                'similarity': best_score,
                'identical': best_score == 1.0,
                'diff': pd.DataFrame(t1.df != best_match.df)
            })
    
    return comparisons

comparison = compare_pdf_tables('report_v1.pdf', 'report_v2.pdf')

Limitations

Encrypted PDFs not supported
Image-based PDFs need OCR preprocessing
Very complex merged cells may need tuning
Rotated tables require preprocessing
Large PDFs may need page-by-page processing

Installation

pip install camelot-py[cv]

# Additional dependencies
# macOS
brew install ghostscript tcl-tk

# Ubuntu
apt-get install ghostscript python3-tk

Resources

Weekly Installs

Repository

claude-office-s…s/skills

GitHub Stars

First Seen

1 day ago

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

claude-code18

opencode4

gemini-cli4

github-copilot4

codex4

amp4

Python PDF处理教程：合并拆分、提取文本表格、创建PDF文件

55,400 周安装