kreuzberg by kreuzberg-dev/kreuzberg
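npx skills add https://github.com/kreuzberg-dev/kreuzberg --skill kreuzberg
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats, including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that: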
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
pip install kreuzberg[paddleocr] # PaddleOCR
npm install @kreuzberg/node
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
# Download from GitHub releases, or:
cargo install kreuzberg-cli
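import asyncio
from kreuzberg import extract_file
async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("document.pdf")
    print(result.content)   # extracted text
    print(result.metadata)  # document metadata
    print(result.tables)    # extracted tables
asyncio.run(main())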
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
// Requires the tokio-runtime feature
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
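The commands above use two different flags, --format and --output-format, which are easy to mix up when the CLI is driven from a script. The following is a minimal sketch that only assembles the argument list. The helper name build_extract_command is hypothetical and not part of Kreuzberg, and passing both flags in a single call is an assumption; only the extract subcommand and the two flags shown above are taken from this document.

```python
from typing import List, Optional


def build_extract_command(
    path: str,
    cli_format: Optional[str] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Assemble argv for `kreuzberg extract` (hypothetical helper)."""
    command = ["kreuzberg", "extract", path]
    if cli_format is not None:
        # --format: how the CLI prints its result (for example json)
        command += ["--format", cli_format]
    if output_format is not None:
        # --output-format: format of the extracted content (for example markdown)
        command += ["--output-format", output_format]
    return command


print(build_extract_command("document.pdf", cli_format="json", output_format="markdown"))
```

Once the CLI is installed, the returned list could be handed to subprocess.run(command, capture_output=True, text=True).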
All languages share the same configuration structure, with field names following each language's naming conventions.
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig, extract_file,
)
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)
# Call from within an async function
result = await extract_file("document.pdf", config=config)
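To make the max_chars and max_overlap settings above more concrete, here is a minimal sketch of an overlapping character window. It is not Kreuzberg's chunking algorithm, which this document does not describe; chunk_text is a hypothetical helper that only illustrates how a size limit and an overlap interact.

```python
from typing import List


def chunk_text(text: str, max_chars: int, max_overlap: int) -> List[str]:
    """Split text into windows of at most max_chars that share max_overlap characters."""
    if max_overlap >= max_chars:
        raise ValueError("max_overlap must be smaller than max_chars")
    step = max_chars - max_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sizes = [len(chunk) for chunk in chunk_text("a" * 2500, max_chars=1000, max_overlap=200)]
print(sizes)
```

With these illustrative numbers each window starts 800 characters after the previous one, so neighbouring chunks share 200 characters.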
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
  ocr: { backend: 'tesseract', language: 'eng' },
  pdfOptions: { passwords: ['secret123'] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
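Comparing the Python and Node.js configurations above, the field names differ only in casing (pdf_options and pdfOptions, max_chars and maxChars, output_format and outputFormat). The sketch below shows that mapping for plain dictionaries. Both to_camel_case and camelize_keys are hypothetical helpers, not part of either binding, and the Rust example uses different field names that this mapping does not cover.

```python
from typing import Any


def to_camel_case(name: str) -> str:
    """Convert a snake_case name such as max_chars to camelCase (maxChars)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value: Any) -> Any:
    """Recursively rename dictionary keys from snake_case to camelCase."""
    if isinstance(value, dict):
        return {to_camel_case(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value  # scalars are returned unchanged


python_style = {
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
print(camelize_keys(python_style))
```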
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in the current or parent directories
kreuzberg extract doc.pdf
# Or specify it explicitly:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async (call from within an async function)
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
    print(f"{len(result.content)} chars extracted")
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
// Requires the tokio-runtime feature
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external installation required).
- EasyOCR: pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
- PaddleOCR: pip install kreuzberg[paddleocr]. Pass paddleocr_kwargs={"use_angle_cls": True}.
- GutenOcrBackend: a built-in OCR backend.
config = ExtractionConfig(ocr=OcrConfig(language="eng"))      # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))  # Multiple languages
config = ExtractionConfig(ocr=OcrConfig(language="all"))      # All installed languages
config = ExtractionConfig(force_ocr=True)  # Run OCR even if text is extractable
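The multi-language example above joins language codes with a plus sign (eng+deu). When the codes come from user input or a settings file, a small helper can build that string. This is a sketch only: build_ocr_language is a hypothetical name, and it does not check whether the codes are actually installed.

```python
from typing import Iterable, List


def build_ocr_language(codes: Iterable[str]) -> str:
    """Join language codes with '+', dropping blanks and duplicates while keeping order."""
    unique: List[str] = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(build_ocr_language(["eng", "deu"]))
```

The resulting string is what would be passed as language in OcrConfig, as in the configuration lines above.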
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells and markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking is enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if the element_based format is used) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction is enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
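Several fields in the table above are populated only when the matching feature is enabled, so code that reads them should tolerate missing or empty values. The sketch below summarizes a result defensively. It uses a stand-in object instead of a real extraction result so that it runs without Kreuzberg installed; summarize_result is a hypothetical helper, and whether a disabled field is None, empty, or absent is an assumption this document does not settle.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Python attribute names taken from the table above
OPTIONAL_FIELDS = (
    "tables", "detected_languages", "chunks",
    "images", "elements", "pages", "keywords",
)


def summarize_result(result: Any) -> Dict[str, Any]:
    """Count what a result contains, treating missing or None fields as empty."""
    summary: Dict[str, Any] = {
        "mime_type": getattr(result, "mime_type", None),
        "characters": len(getattr(result, "content", "") or ""),
    }
    for name in OPTIONAL_FIELDS:
        value = getattr(result, name, None)
        summary[name] = len(value) if value else 0
    return summary


# Stand-in for an extraction result (not a real Kreuzberg object)
fake_result = SimpleNamespace(
    content="Hello world",
    mime_type="application/pdf",
    tables=[],
    chunks=["Hello", "world"],
    keywords=None,
)
print(summarize_result(fake_result))
```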
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)
try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
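The except clauses above list KreuzbergError last, which suggests it is the common base class of the more specific errors. That is an assumption here, not something this document states. The sketch below shows why the order matters, using stand-in exception classes and fake extractors defined locally so that it runs without Kreuzberg installed; safe_extract is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Callable, Optional, Tuple


# Stand-in classes mirroring the names above (assumed hierarchy, not imported from kreuzberg)
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def safe_extract(extract: Callable[[str], object], path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (content, None) on success or (None, message) on failure."""
    try:
        return extract(path).content, None
    except ParsingError as exc:    # specific handlers come first
        return None, f"parse error: {exc}"
    except OCRError as exc:
        return None, f"OCR error: {exc}"
    except KreuzbergError as exc:  # base class last, as a catch-all
        return None, f"extraction error: {exc}"


def fake_ok(path: str) -> SimpleNamespace:
    # Pretends to extract text successfully
    return SimpleNamespace(content=f"text of {path}")


def fake_ocr_failure(path: str) -> SimpleNamespace:
    # Pretends that OCR failed
    raise OCRError("no text layer")


print(safe_extract(fake_ok, "a.pdf"))
print(safe_extract(fake_ocr_failure, "scan.pdf"))
```

If the base class were caught first, the specific handlers below it would never run, which is why the original example keeps KreuzbergError at the end.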
import {
  extractFile, KreuzbergError, ParsingError,
  OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';
try {
  const result = await extractFile('file.pdf');
} catch (e) {
  if (e instanceof ParsingError) { /* ... */ }
  else if (e instanceof OcrError) { /* ... */ }
  else if (e instanceof ValidationError) { /* ... */ }
  else if (e instanceof KreuzbergError) { /* ... */ }
}
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
- Python's ChunkingConfig takes max_chars and max_overlap, not max_characters or overlap.
- Rust's extract_file takes &ExtractionConfig (a reference), not an Option. Use &ExtractionConfig::default() for defaults.
- In Rust, extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
- Rust's extract_file is async. Use #[tokio::main] or call it from an async context.
- On the CLI, --format controls the CLI output (text/json), while --output-format controls the content format (plain/markdown/djot/html).
- In Node.js, extractFile(path, mimeType?, config?) takes mimeType as its second argument (pass null to skip it).
- To detect a MIME type from data, use detect_mime_type(data). For paths, use detect_mime_type_from_path(path).
- Config file fields use snake_case (max_chars, max_overlap, pdf_options).

| Category | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .odt |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods |
| Presentations | .pptx, .ppt, .ppsx |
| eBooks | .epub, .fb2 |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif, .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm, .svg |
| Markup | .html, .htm, .xhtml, .xml |
| Data | .json, .yaml, .yml, .toml, .csv, .tsv |
| Text | .txt, .md, .markdown, .djot, .rst, .org, .rtf |
| Email | .eml, .msg |
| Archives | .zip, .tar, .tgz, .gz, .7z |
| Academic | .bib, .biblatex, .ris, .nbib, .enw, .csl, .tex, .latex, .typ, .jats, .ipynb, .docbook, .opml, .pod, .mdoc, .troff |
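A table of extensions like the one above can be used to pre-filter a file list before a batch run. The following is a minimal sketch: select_supported is a hypothetical helper, the extension set is only a small subset copied from the table, and comparing the lowercased suffix is this sketch's own choice rather than documented Kreuzberg behaviour.

```python
from pathlib import Path
from typing import FrozenSet, Iterable, List

# Small illustrative subset of the extensions in the table above
SUPPORTED_EXTENSIONS: FrozenSet[str] = frozenset({
    ".pdf", ".docx", ".odt", ".xlsx", ".pptx",
    ".html", ".eml", ".txt", ".md",
})


def select_supported(
    paths: Iterable[str],
    extensions: FrozenSet[str] = SUPPORTED_EXTENSIONS,
) -> List[str]:
    """Keep only the paths whose lowercased suffix is in the given extension set."""
    return [str(path) for path in paths if Path(path).suffix.lower() in extensions]


print(select_supported(["report.PDF", "notes.docx", "setup.exe", "README"]))
```

The filtered list could then be passed to batch_extract_files_sync as in the batch example earlier.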
See references/supported-formats.md for the complete format reference, including MIME types.
Detailed reference files for specific topics:
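Weekly installs: 152
GitHub stars: 7.1K
First seen: Feb 8, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Fail)
Installed on: opencode (146), codex (145), github-copilot (145), kimi-cli (144), gemini-cli (144), amp (144)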
Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg