doc-parser 文档解析技能：使用 IBM docling 库解析 PDF、Word、图像，提取表格和结构化数据

doc-parser by claude-office-skills/skills

410 周安装量

25 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/claude-office-skills/skills --skill doc-parser

办公软件自动化数据处理

🇨🇳中文介绍

文档解析技能

概述

此技能使用 docling - IBM 先进的文档理解库，实现高级文档解析功能。可解析复杂的 PDF、Word 文档和图像，同时保留结构、提取表格和图形，并处理多栏布局。

使用方法

提供要解析的文档
指定要提取的内容（文本、表格、图形等）
我将解析文档并返回结构化数据

示例提示：

"解析此 PDF 并提取所有表格"
"将此学术论文转换为结构化 Markdown"
"从此文档中提取图形和标题"
"解析此报告，保留文档结构"

领域知识

docling 基础

from docling.document_converter import DocumentConverter

# 初始化转换器
converter = DocumentConverter()

# 转换文档
result = converter.convert("document.pdf")

# 访问解析内容
doc = result.document
print(doc.export_to_markdown())

支持格式

格式	扩展名	备注
PDF	.pdf	原生和扫描文档
Word	.docx

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

相关 Skills

FlyClaw：零登录航班聚合查询工具，Python实现多源航班信息与价格搜索

4,000,000 周安装

Gmail过滤器创建教程 - 使用Google Workspace CLI自动分类邮件与添加标签

6,500 周安装

Google Slides 演示文稿创建与共享自动化教程 - 使用 Google Workspace CLI

6,500 周安装

Google Workspace CLI 教程：通过邮件自动发送云端硬盘文件链接，提升团队协作效率

6,300 周安装

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# 配置管道
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# 使用选项创建转换器
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    pdf_backend_options=pipeline_options
)

result = converter.convert("document.pdf")

from docling.document_converter import DocumentConverter
import pandas as pd

def extract_tables(doc_path):
    """从文档中提取所有表格。"""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    tables = []
    
    for element in doc.iterate_items():
        if element.type == "table":
            # 获取表格数据
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })
    
    return tables

# 使用示例
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"第 {i+1} 个表格，位于第 {table['page']} 页:")
    print(table['dataframe'])

def extract_figures(doc_path, output_dir):
    """提取带标题的图形。"""
    import os
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    figures = []
    os.makedirs(output_dir, exist_ok=True)
    
    for element in doc.iterate_items():
        if element.type == "picture":
            figure_info = {
                'caption': element.caption if hasattr(element, 'caption') else None,
                'page': element.prov[0].page_no if element.prov else None,
            }
            
            # 如果可用则保存图像
            if hasattr(element, 'image'):
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                element.image.save(img_path)
                figure_info['path'] = img_path
            
            figures.append(figure_info)
    
    return figures

from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """解析多栏布局的文档。"""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    # docling 自动处理栏检测
    # 文本按阅读顺序返回
    
    structured_content = []
    
    for element in doc.iterate_items():
        content_item = {
            'type': element.type,
            'text': element.text if hasattr(element, 'text') else None,
            'level': element.level if hasattr(element, 'level') else None,
        }
        
        # 如果可用则添加边界框
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no
        
        structured_content.append(content_item)
    
    return structured_content

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown 导出
markdown = doc.export_to_markdown()
with open("output.md", "w") as f:
    f.write(markdown)

# 纯文本
text = doc.export_to_text()

# JSON/字典格式
json_doc = doc.export_to_dict()

# HTML 格式（如果支持）
# html = doc.export_to_html()

from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def batch_parse(input_dir, output_dir, max_workers=4):
    """并行解析多个文档。"""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    converter = DocumentConverter()
    
    def process_single(doc_path):
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()
            
            out_file = output_path / f"{doc_path.stem}.md"
            with open(out_file, 'w') as f:
                f.write(md)
            
            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}
    
    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))
    
    return results

def parse_academic_paper(pdf_path):
    """解析学术论文结构。"""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }
    
    current_section = None
    
    for element in doc.iterate_items():
        text = element.text if hasattr(element, 'text') else ''
        
        if element.type == 'title':
            paper['title'] = text
        
        elif element.type == 'heading':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'
        
        elif element.type == 'paragraph':
            if current_section == 'abstract':
                paper['abstract'] = text
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'
        
        elif element.type == 'table':
            paper['tables'].append({
                'caption': element.caption if hasattr(element, 'caption') else None,
                'data': element.export_to_dataframe() if hasattr(element, 'export_to_dataframe') else None
            })
    
    return paper

def parse_business_report(doc_path):
    """将业务报告解析为结构化格式。"""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }
    
    # 解析文档结构
    for element in doc.iterate_items():
        # 基于文档结构实现解析逻辑
        pass
    
    return report

from docling.document_converter import DocumentConverter

def parse_financial_report(pdf_path):
    """从财务报告中提取结构化数据。"""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'notes': []
    }
    
    # 提取表格
    tables = []
    for element in doc.iterate_items():
        if element.type == 'table':
            table_df = element.export_to_dataframe()
            
            # 识别表格类型
            if 'revenue' in str(table_df).lower() or 'income' in str(table_df).lower():
                financial_data['income_statement'] = table_df
            elif 'asset' in str(table_df).lower() or 'liabilities' in str(table_df).lower():
                financial_data['balance_sheet'] = table_df
            elif 'cash' in str(table_df).lower():
                financial_data['cash_flow'] = table_df
            else:
                tables.append(table_df)
    
    # 提取备注的 Markdown
    financial_data['markdown'] = doc.export_to_markdown()
    
    return financial_data

report = parse_financial_report('annual_report.pdf')
print("损益表:")
print(report['income_statement'])

from docling.document_converter import DocumentConverter

def parse_technical_docs(doc_path):
    """解析技术文档。"""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }
    
    current_section = None
    
    for element in doc.iterate_items():
        if element.type == 'title':
            documentation['title'] = element.text
        
        elif element.type == 'heading':
            current_section = {
                'title': element.text,
                'level': element.level if hasattr(element, 'level') else 1,
                'content': []
            }
            documentation['sections'].append(current_section)
        
        elif element.type == 'code':
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)
        
        elif element.type == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption if hasattr(element, 'caption') else None
            })
    
    return documentation

docs = parse_technical_docs('api_documentation.pdf')
print(f"标题: {docs['title']}")
print(f"章节数: {len(docs['sections'])}")

from docling.document_converter import DocumentConverter

def analyze_contract(pdf_path):
    """解析合同文档以提取关键条款。"""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }
    
    import re
    
    # 提取日期
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)
    
    # 提取金额
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)
    
    # 将章节解析为条款
    for element in doc.iterate_items():
        if element.type == 'heading':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.type == 'paragraph' and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'
    
    return contract

contract_data = analyze_contract('agreement.pdf')
print(f"关键日期: {contract_data['dates']}")
print(f"金额: {contract_data['amounts']}")

🇺🇸English

Document Parser Skill

Overview

This skill enables advanced document parsing using docling - IBM's state-of-the-art document understanding library. Parse complex PDFs, Word documents, and images while preserving structure, extracting tables, figures, and handling multi-column layouts.

How to Use

Provide the document to parse
Specify what you want to extract (text, tables, figures, etc.)
I'll parse it and return structured data

Example prompts:

"Parse this PDF and extract all tables"
"Convert this academic paper to structured markdown"
"Extract figures and captions from this document"
"Parse this report preserving the document structure"

Domain Knowledge

docling Fundamentals

from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert document
result = converter.convert("document.pdf")

# Access parsed content
doc = result.document
print(doc.export_to_markdown())

Supported Formats

Format	Extension	Notes
PDF	.pdf	Native and scanned
Word	.docx	Full structure preserved
PowerPoint	.pptx	Slides as sections
Images	.png, .jpg	OCR + layout analysis
HTML	.html	Structure preserved

Basic Usage

from docling.document_converter import DocumentConverter

# Create converter
converter = DocumentConverter()

# Convert single document
result = converter.convert("report.pdf")

# Access document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()

Advanced Configuration

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create converter with options
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    pdf_backend_options=pipeline_options
)

result = converter.convert("document.pdf")

Document Structure

# Document hierarchy
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through content
for element in doc.iterate_items():
    print(f"Type: {element.type}")
    print(f"Text: {element.text}")
    
    if element.type == "table":
        print(f"Rows: {len(element.data.table_cells)}")

Extracting Tables

from docling.document_converter import DocumentConverter
import pandas as pd

def extract_tables(doc_path):
    """Extract all tables from document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    tables = []
    
    for element in doc.iterate_items():
        if element.type == "table":
            # Get table data
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })
    
    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])

Extracting Figures

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    import os
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    figures = []
    os.makedirs(output_dir, exist_ok=True)
    
    for element in doc.iterate_items():
        if element.type == "picture":
            figure_info = {
                'caption': element.caption if hasattr(element, 'caption') else None,
                'page': element.prov[0].page_no if element.prov else None,
            }
            
            # Save image if available
            if hasattr(element, 'image'):
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                element.image.save(img_path)
                figure_info['path'] = img_path
            
            figures.append(figure_info)
    
    return figures

Handling Multi-column Layouts

from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse document with multi-column layout."""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    # docling automatically handles column detection
    # Text is returned in reading order
    
    structured_content = []
    
    for element in doc.iterate_items():
        content_item = {
            'type': element.type,
            'text': element.text if hasattr(element, 'text') else None,
            'level': element.level if hasattr(element, 'level') else None,
        }
        
        # Add bounding box if available
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no
        
        structured_content.append(content_item)
    
    return structured_content

Export Formats

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown export
markdown = doc.export_to_markdown()
with open("output.md", "w") as f:
    f.write(markdown)

# Plain text
text = doc.export_to_text()

# JSON/dict format
json_doc = doc.export_to_dict()

# HTML format (if supported)
# html = doc.export_to_html()

Batch Processing

from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def batch_parse(input_dir, output_dir, max_workers=4):
    """Parse multiple documents in parallel."""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    converter = DocumentConverter()
    
    def process_single(doc_path):
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()
            
            out_file = output_path / f"{doc_path.stem}.md"
            with open(out_file, 'w') as f:
                f.write(md)
            
            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}
    
    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))
    
    return results

Best Practices

Use Appropriate Pipeline : Configure for your document type
Handle Large Documents : Process in chunks if needed
Verify Table Extraction : Complex tables may need review
Check OCR Quality : Enable OCR for scanned documents
Cache Results : Store parsed documents for reuse

Common Patterns

Academic Paper Parser

def parse_academic_paper(pdf_path):
    """Parse academic paper structure."""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }
    
    current_section = None
    
    for element in doc.iterate_items():
        text = element.text if hasattr(element, 'text') else ''
        
        if element.type == 'title':
            paper['title'] = text
        
        elif element.type == 'heading':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'
        
        elif element.type == 'paragraph':
            if current_section == 'abstract':
                paper['abstract'] = text
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'
        
        elif element.type == 'table':
            paper['tables'].append({
                'caption': element.caption if hasattr(element, 'caption') else None,
                'data': element.export_to_dataframe() if hasattr(element, 'export_to_dataframe') else None
            })
    
    return paper

Report to Structured Data

def parse_business_report(doc_path):
    """Parse business report into structured format."""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }
    
    # Parse document structure
    for element in doc.iterate_items():
        # Implement parsing logic based on document structure
        pass
    
    return report

Examples

Example 1: Parse Financial Report

from docling.document_converter import DocumentConverter

def parse_financial_report(pdf_path):
    """Extract structured data from financial report."""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'notes': []
    }
    
    # Extract tables
    tables = []
    for element in doc.iterate_items():
        if element.type == 'table':
            table_df = element.export_to_dataframe()
            
            # Identify table type
            if 'revenue' in str(table_df).lower() or 'income' in str(table_df).lower():
                financial_data['income_statement'] = table_df
            elif 'asset' in str(table_df).lower() or 'liabilities' in str(table_df).lower():
                financial_data['balance_sheet'] = table_df
            elif 'cash' in str(table_df).lower():
                financial_data['cash_flow'] = table_df
            else:
                tables.append(table_df)
    
    # Extract markdown for notes
    financial_data['markdown'] = doc.export_to_markdown()
    
    return financial_data

report = parse_financial_report('annual_report.pdf')
print("Income Statement:")
print(report['income_statement'])

Example 2: Technical Documentation Parser

from docling.document_converter import DocumentConverter

def parse_technical_docs(doc_path):
    """Parse technical documentation."""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }
    
    current_section = None
    
    for element in doc.iterate_items():
        if element.type == 'title':
            documentation['title'] = element.text
        
        elif element.type == 'heading':
            current_section = {
                'title': element.text,
                'level': element.level if hasattr(element, 'level') else 1,
                'content': []
            }
            documentation['sections'].append(current_section)
        
        elif element.type == 'code':
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)
        
        elif element.type == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption if hasattr(element, 'caption') else None
            })
    
    return documentation

docs = parse_technical_docs('api_documentation.pdf')
print(f"Title: {docs['title']}")
print(f"Sections: {len(docs['sections'])}")

Example 3: Contract Analysis

from docling.document_converter import DocumentConverter

def analyze_contract(pdf_path):
    """Parse contract document for key clauses."""
    
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document
    
    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }
    
    import re
    
    # Extract dates
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)
    
    # Extract monetary amounts
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)
    
    # Parse sections as clauses
    for element in doc.iterate_items():
        if element.type == 'heading':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.type == 'paragraph' and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'
    
    return contract

contract_data = analyze_contract('agreement.pdf')
print(f"Key dates: {contract_data['dates']}")
print(f"Amounts: {contract_data['amounts']}")

Limitations

Very large documents may require chunking
Handwritten content needs OCR preprocessing
Complex nested tables may need manual review
Some PDF types (encrypted) not supported
GPU recommended for best performance

Installation

pip install docling

# For full functionality
pip install docling[all]

# For OCR support
pip install docling[ocr]

Resources

Weekly Installs

Repository

claude-office-s…s/skills

GitHub Stars

First Seen

6 days ago

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

claude-code44

opencode25

gemini-cli25

github-copilot25

amp25

cline25

doc-parser 文档解析技能：使用 IBM docling 库解析 PDF、Word、图像，提取表格和结构化数据

🇨🇳中文介绍

文档解析技能

概述

使用方法

领域知识

docling 基础

支持格式

相关 Skills

基本用法

高级配置

文档结构

提取表格

提取图形

处理多栏布局

导出格式

批量处理

最佳实践

常见模式

学术论文解析器

报告转结构化数据

示例

示例 1：解析财务报告

示例 2：技术文档解析器

示例 3：合同分析

限制

安装

资源