Kreuzberg：高性能文档智能库，支持91+格式文本/表格/元数据提取与OCR

kreuzberg by kreuzberg-dev/kreuzberg

152 周安装量

7,100 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/kreuzberg-dev/kreuzberg --skill kreuzberg

文档生成数据处理计算机视觉

🇨🇳中文介绍

Kreuzberg 文档提取

Kreuzberg 是一个高性能的文档智能库，采用 Rust 核心，并为 Python、Node.js/TypeScript、Ruby、Go、Java、C#、PHP 和 Elixir 提供原生绑定。它可以从 91 种以上的文件格式中提取文本、表格、元数据和图像，包括 PDF、Office 文档、图像（带 OCR）、HTML、电子邮件、归档文件和学术格式。

在编写以下代码时使用此技能：

从文档中提取文本或元数据
对扫描文档或图像执行 OCR
批量处理多个文件
配置提取选项（输出格式、分块、OCR、语言检测）
实现自定义插件（后处理器、验证器、OCR 后端）

安装

Python

pip install kreuzberg
# 可选的 OCR 后端：
pip install kreuzberg[easyocr]    # EasyOCR
pip install kreuzberg[paddleocr]  # PaddleOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (同步 + 批量处理所需), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# 从 GitHub releases 下载，或者：
cargo install kreuzberg-cli

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

Rust (同步) — 需要 `tokio-runtime` 特性

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown

所有语言都使用相同的配置结构，并遵循各自语言的命名约定。

from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)

import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);

use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]



# CLI: 自动在当前/父目录中发现 kreuzberg.toml
kreuzberg extract doc.pdf
# 或者显式指定：
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'

from kreuzberg import batch_extract_files, batch_extract_files_sync

# 异步
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# 同步
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")

import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);

Rust — 需要 `tokio-runtime` 特性

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;

kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown

对于图像和扫描的 PDF，OCR 会自动运行。Tesseract 是默认的后端（原生绑定，无需外部安装）。

Tesseract (默认)：内置原生绑定。支持所有 Tesseract 语言。
EasyOCR (仅 Python)：pip install kreuzberg[easyocr]。传递 easyocr_kwargs={"gpu": True}。
PaddleOCR (仅 Python)：pip install kreuzberg[paddleocr]。传递 paddleocr_kwargs={"use_angle_cls": True}。
Guten (仅 Node.js)：通过 GutenOcrBackend 内置的 OCR 后端。

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # 英语
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # 多语言
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # 所有已安装的语言

config = ExtractionConfig(force_ocr=True)  # 即使文本可提取也执行 OCR

ExtractionResult 字段

字段	Python	Node.js	Rust	描述
文本内容	`result.content`	`result.content`	`result.content`	提取的文本 (str/String)
MIME 类型	`result.mime_type`	`result.mimeType`	`result.mime_type`	输入文档的 MIME 类型
元数据	`result.metadata`	`result.metadata`	`result.metadata`	文档元数据 (dict/object/HashMap)
表格	`result.tables`	`result.tables`	`result.tables`	提取的表格，包含单元格和 markdown 格式
语言	`result.detected_languages`	`result.detectedLanguages`	`result.detected_languages`	检测到的语言（如果启用）
分块	`result.chunks`	`result.chunks`	`result.chunks`	文本块（如果启用分块）
图像	`result.images`	`result.images`	`result.images`	提取的图像（如果启用）
元素	`result.elements`	`result.elements`	`result.elements`	语义元素（如果使用 element_based 格式）
页面	`result.pages`	`result.pages`	`result.pages`	每页内容（如果启用页面提取）
关键词	`result.keywords`	`result.keywords`	`result.keywords`	提取的关键词（如果启用）

from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")

import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
}

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}

Python ChunkingConfig 字段：使用 max_chars 和 max_overlap，而不是 max_characters 或 overlap。
Rust extract_file 签名：第三个参数是 &ExtractionConfig（引用），而不是 Option。使用 &ExtractionConfig::default() 获取默认值。
Rust 特性门控：extract_file_sync、batch_extract_file 和 batch_extract_file_sync 都需要在 Cargo.toml 中设置 features = ["tokio-runtime"]。
Rust 异步上下文：extract_file 是异步的。使用 #[tokio::main] 或在异步上下文中调用。
CLI --format 与 --output-format：--format 控制 CLI 输出（text/json）。--output-format 控制内容格式（plain/markdown/djot/html）。
Node.js extractFile 签名：extractFile(path, mimeType?, config?) — mimeType 是第二个参数（传递 null 以跳过）。
Python detect_mime_type：用于从字节检测的函数是 detect_mime_type(data)。对于路径，使用 detect_mime_type_from_path(path)。
配置文件字段名：在 TOML/YAML/JSON 配置文件中使用 snake_case（例如，max_chars、max_overlap、pdf_options）。

支持的格式（摘要）

类别	扩展名
PDF	`.pdf`
Word	`.docx`, `.odt`
电子表格	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods`
演示文稿	`.pptx`, `.ppt`, `.ppsx`
电子书	`.epub`, `.fb2`
图像	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, , , , , , ,
标记语言	`.html`, `.htm`, `.xhtml`, `.xml`
数据	`.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`
文本	`.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`
电子邮件	`.eml`, `.msg`
归档文件	`.zip`, `.tar`, `.tgz`, `.gz`, `.7z`
学术	`.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, , , ,

完整格式参考（包含 MIME 类型）请参阅 references/supported-formats.md。

特定主题的详细参考文件：

Python API 参考 — 所有函数、配置类、插件协议、精确签名
Node.js API 参考 — 所有函数、TypeScript 接口、工作池 API
Rust API 参考 — 所有带特性门控的函数、结构体、Cargo.toml 示例
CLI 参考 — 所有命令、标志、配置优先级、退出代码
配置参考 — TOML/YAML/JSON 格式、自动发现、环境变量、完整模式
支持的格式 — 所有 85 种以上的格式，包含文件扩展名和 MIME 类型
高级功能 — 插件、嵌入、MCP 服务器、API 服务器、安全限制
其他语言绑定 — Go、Ruby、Java、C#、PHP、Elixir、WASM、Docker

🇺🇸English

Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

Extracts text or metadata from documents
Performs OCR on scanned documents or images
Batch-processes multiple files
Configures extraction options (output format, chunking, OCR, language detection)
Implements custom plugins (post-processors, validators, OCR backends)

Installation

Python

pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR
pip install kreuzberg[paddleocr]  # PaddleOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# Download from GitHub releases, or:
cargo install kreuzberg-cli

Quick Start

Python (Async)

from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)       # extracted text
print(result.metadata)      # document metadata
print(result.tables)        # extracted tables

Python (Sync)

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)

Node.js

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);

Node.js (Sync)

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

Rust (Async)

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

Rust (Sync) — requires `tokio-runtime` feature

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

CLI

kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown

Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

Python (snake_case)

from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)

Node.js (camelCase)

import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);

Rust (snake_case)

use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

Config File (TOML)

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]



# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'

Batch Processing

Python

from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")

Node.js

import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);

Rust — requires `tokio-runtime` feature

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;

CLI

kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown

OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

Backends

Tesseract (default): Built-in native binding. All Tesseract languages supported.
EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
PaddleOCR (Python only): pip install kreuzberg[paddleocr]. Pass paddleocr_kwargs={"use_angle_cls": True}.
Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.

Language Codes

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed

Force OCR

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable

ExtractionResult Fields

Field	Python	Node.js	Rust	Description
Text content	`result.content`	`result.content`	`result.content`	Extracted text (str/String)
MIME type	`result.mime_type`	`result.mimeType`	`result.mime_type`	Input document MIME type

Error Handling

Python

from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")

Node.js

import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}

Common Pitfalls

Python ChunkingConfig fields : Use max_chars and max_overlap, NOT max_characters or overlap.
Rust extract_file signature : Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
Rust feature gates : extract_file_sync, batch_extract_file, and batch_extract_file_sync all require in Cargo.toml.

Supported Formats (Summary)

Category	Extensions
PDF	`.pdf`
Word	`.docx`, `.odt`
Spreadsheets	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, , ,

See references/supported-formats.md for the complete format reference with MIME types.

Additional Resources

Detailed reference files for specific topics:

Python API Reference — All functions, config classes, plugin protocols, exact signatures
Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs
Rust API Reference — All functions with feature gates, structs, Cargo.toml examples
CLI Reference — All commands, flags, config precedence, exit codes
Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
Supported Formats — All 85+ formats with file extensions and MIME types
Advanced Features — Plugins, embeddings, MCP server, API server, security limits
Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev GitHub: https://github.com/kreuzberg-dev/kreuzberg

Weekly Installs

152

Repository

kreuzberg-dev/kreuzberg

GitHub Stars

7.1K

First Seen

Feb 8, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykFail

Installed on

opencode146

codex145

github-copilot145

kimi-cli144

gemini-cli144

amp144

Python PDF 提取技能：使用 pdfplumber 库精确提取文本、表格和元数据

883 周安装

features = ["tokio-runtime"]

Rust async context : extract_file is async. Use #[tokio::main] or call from an async context.

CLI --format vs --output-format : --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).

Node.js extractFile signature : extractFile(path, mimeType?, config?) — mimeType is the second arg (pass null to skip).

Python detect_mime_type : The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).

Config file field names : Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).

Kreuzberg：高性能文档智能库，支持91+格式文本/表格/元数据提取与OCR

🇨🇳中文介绍

Kreuzberg 文档提取

安装

Python

Node.js

Rust

CLI

相关 Skills

快速开始

Python (异步)

Python (同步)

Node.js

Node.js (同步)

Rust (异步)

Rust (同步) — 需要 tokio-runtime 特性

CLI

配置

Python (snake_case)

Node.js (camelCase)

Rust (snake_case)

配置文件 (TOML)

批量处理

Python

Node.js

Rust — 需要 tokio-runtime 特性

CLI

OCR

后端

语言代码

强制 OCR

ExtractionResult 字段

错误处理

Python

Node.js

Rust

常见陷阱

支持的格式（摘要）

其他资源

🇺🇸English

Kreuzberg Document Extraction

Installation

Python

Node.js

Rust

CLI

Quick Start

Python (Async)

Python (Sync)

Node.js

Node.js (Sync)

Rust (Async)

Rust (Sync) — requires tokio-runtime feature

CLI

Configuration

Python (snake_case)

Node.js (camelCase)

Rust (snake_case)

Config File (TOML)

Batch Processing

Python

Node.js

Rust — requires tokio-runtime feature

CLI

OCR

Backends

Language Codes

Force OCR

ExtractionResult Fields

Error Handling

Python

Node.js

Rust

Common Pitfalls

Supported Formats (Summary)

Additional Resources

最新 Skills

Rust (同步) — 需要 `tokio-runtime` 特性

Rust — 需要 `tokio-runtime` 特性

Rust (Sync) — requires `tokio-runtime` feature

Rust — requires `tokio-runtime` feature