LiteParse local document parsing tool: fast parsing of PDF, DOCX, PPTX, XLSX, and images with no cloud dependencies

liteparse by run-llama/llamaparse-agent-skills

667 weekly installs

31 GitHub Stars

Install command

npx skills add https://github.com/run-llama/llamaparse-agent-skills --skill liteparse

🇨🇳 Chinese Introduction

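LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, with no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with: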

I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.

Then wait for the user's input.

Step 0 — Install LiteParse (if needed)

If liteparse is not yet installed, install it globally:

npm i -g @llamaindex/liteparse

Verify installation:

lit --version

Step 1 — Produce the CLI Command or Script

Parse a Single File

# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, for text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300

Batch Parse a Directory

lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive

Generate Page Screenshots

Screenshots are useful for LLM agents that need to see the visual layout.

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots

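Step 3 — Key Options Reference

OCR Options

Option	Description
(default)	Tesseract.js — zero setup, built-in
`--ocr-language fra`	Set OCR language (ISO code)
`--ocr-server-url <url>`	Use an external HTTP OCR server (EasyOCR, PaddleOCR, custom)
`--no-ocr`	Disable OCR entirely

Output Options

Option	Description
`--format json`	Structured JSON with bounding boxes
`--format text`	Plain text (default)
`-o <file>`	Save output to a file

Performance / Quality Options

Option	Description
`--dpi <n>`	Rendering DPI (default: 150; use 300 for high quality)
`--max-pages <n>`	Limit the number of pages parsed
`--target-pages <pages>`	Parse specific pages (e.g. `"1-5,10"`)
`--no-precise-bbox`	Disable precise bounding boxes (faster)
`--skip-diagonal-text`	Ignore rotated/diagonal text
`--preserve-small-text`	Keep very small text that would otherwise be dropped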

Step 4 — Using a Config File

For repeated use with consistent options, generate a liteparse.config.json:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}

For an HTTP OCR server:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json

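Step 5 — HTTP OCR Server API (Advanced)

If the user wants to plug in a custom OCR backend, the server must implement:

Endpoint: POST /ocr
Accepts: file (multipart) and language (string) parameters
Returns:

{ "results": [ { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 } ] }

Ready-to-use wrappers for EasyOCR and PaddleOCR are available in the LiteParse repo.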

Supported Input Formats

Category	Formats
PDF	`.pdf`
Word	`.doc`, `.docx`, `.docm`, `.odt`, `.rtf`
PowerPoint	`.ppt`, `.pptx`, `.pptm`, `.odp`
Spreadsheets	`.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`
Images	`.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`

Office documents require LibreOffice; images require ImageMagick. LiteParse automatically converts these formats to PDF before parsing.

🇺🇸English

LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:

I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.

Then wait for the user's input.

Step 0 — Install LiteParse (if needed)

If liteparse is not yet installed, install it globally:

npm i -g @llamaindex/liteparse

Verify installation:

lit --version

For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

For image parsing, ImageMagick is required:

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

Step 1 — Produce the CLI Command or Script

Parse a Single File

# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
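
To make the DPI trade-off concrete, the sketch below works through the arithmetic: a page rendered at a given DPI is roughly (width in inches × DPI) by (height in inches × DPI) pixels. The helper name `renderedPixelSize` and the US Letter page size are illustrative assumptions, not part of LiteParse; 150 is the default DPI listed in the options reference.

```typescript
// Illustrative helper (not part of LiteParse): estimate the pixel size of a
// page rendered at a given DPI. Page dimensions are given in inches.
interface PixelSize {
  width: number;
  height: number;
}

function renderedPixelSize(widthIn: number, heightIn: number, dpi: number): PixelSize {
  if (!(dpi > 0)) {
    throw new Error(`dpi must be positive, got ${dpi}`);
  }
  return {
    width: Math.round(widthIn * dpi),
    height: Math.round(heightIn * dpi),
  };
}

// US Letter (8.5 x 11 in) at the documented default and at --dpi 300.
const atDefault = renderedPixelSize(8.5, 11, 150); // 1275 x 1650
const atHigh = renderedPixelSize(8.5, 11, 300);    // 2550 x 3300
```

Doubling the DPI quadruples the number of pixels per page, so a run at 300 DPI will likely take noticeably more time and memory than one at the default.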

Batch Parse a Directory

lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive

Generate Page Screenshots

Screenshots are useful for LLM agents that need to see visual layout.

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
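
When the commands above are generated from a script rather than typed by hand, it can help to build the argument list from a typed description of the job. The following is a minimal sketch under that assumption: `LitJob` and `buildLitArgs` are hypothetical names, and only flags that appear in this document are emitted.

```typescript
// Hypothetical helper: turns a typed job description into the argument list
// for the `lit` CLI. Only flags shown in this document are emitted.
type LitJob =
  | {
      kind: "parse";
      file: string;
      format?: "json" | "text";
      output?: string;
      targetPages?: string;
      dpi?: number;
      noOcr?: boolean;
      ocrServerUrl?: string;
    }
  | {
      kind: "batch-parse";
      inputDir: string;
      outputDir: string;
      extension?: string;
      recursive?: boolean;
    }
  | {
      kind: "screenshot";
      file: string;
      outputDir: string;
      pages?: string;
      dpi?: number;
      format?: string; // only "png" appears in this document
    };

function buildLitArgs(job: LitJob): string[] {
  switch (job.kind) {
    case "parse": {
      const args = ["parse", job.file];
      if (job.format !== undefined) args.push("--format", job.format);
      if (job.output !== undefined) args.push("-o", job.output);
      if (job.targetPages !== undefined) args.push("--target-pages", job.targetPages);
      if (job.dpi !== undefined) args.push("--dpi", String(job.dpi));
      if (job.noOcr === true) args.push("--no-ocr");
      if (job.ocrServerUrl !== undefined) args.push("--ocr-server-url", job.ocrServerUrl);
      return args;
    }
    case "batch-parse": {
      const args = ["batch-parse", job.inputDir, job.outputDir];
      if (job.extension !== undefined) args.push("--extension", job.extension);
      if (job.recursive === true) args.push("--recursive");
      return args;
    }
    case "screenshot": {
      const args = ["screenshot", job.file];
      if (job.pages !== undefined) args.push("--pages", job.pages);
      if (job.dpi !== undefined) args.push("--dpi", String(job.dpi));
      if (job.format !== undefined) args.push("--format", job.format);
      args.push("-o", job.outputDir);
      return args;
    }
  }
}

const parseJobArgs = buildLitArgs({
  kind: "parse",
  file: "document.pdf",
  format: "json",
  output: "output.json",
});
// parseJobArgs: ["parse", "document.pdf", "--format", "json", "-o", "output.json"]
```

Passing the resulting array to a process spawner such as Node's `execFile("lit", parseJobArgs)` avoids shell quoting problems with file names that contain spaces.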

Step 3 — Key Options Reference

OCR Options

Option	Description
(default)	Tesseract.js — zero setup, built-in
`--ocr-language fra`	Set OCR language (ISO code)
`--ocr-server-url <url>`	Use external HTTP OCR server (EasyOCR, PaddleOCR, custom)
`--no-ocr`	Disable OCR entirely

Output Options

Option	Description
`--format json`	Structured JSON with bounding boxes
`--format text`	Plain text (default)
`-o <file>`	Save output to file

Performance / Quality Options

Option	Description
`--dpi <n>`	Rendering DPI (default: 150; use 300 for high quality)
`--max-pages <n>`	Limit pages parsed
`--target-pages <pages>`	Parse specific pages (e.g. `"1-5,10"`)
`--no-precise-bbox`	Disable precise bounding boxes (faster)
`--skip-diagonal-text`	Ignore rotated/diagonal text
`--preserve-small-text`	Keep very small text that would otherwise be dropped
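
Page specifications such as `"1-5,10,15-20"` can be checked before they are handed to `--target-pages` or `--pages`. The sketch below expands a spec into a sorted list of page numbers. It assumes pages are 1-based and ranges are inclusive; `expandPageSpec` is a hypothetical helper, and LiteParse's own parser may accept or reject inputs differently.

```typescript
// Illustrative parser for page specs such as "1-5,10,15-20".
// Assumption: pages are 1-based and ranges are inclusive.
function expandPageSpec(spec: string): number[] {
  const pages: number[] = [];
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (match === null) {
      throw new Error(`Invalid page spec segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      // Skip pages already listed by an earlier, overlapping segment.
      if (pages.indexOf(page) === -1) pages.push(page);
    }
  }
  return pages.sort((a, b) => a - b);
}

const expanded = expandPageSpec("1-3,10"); // [1, 2, 3, 10]
```

Rejecting reversed ranges such as `"5-1"` up front gives a clearer error than whatever the CLI would report for the same input.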

Step 4 — Using a Config File

For repeated use with consistent options, generate a liteparse.config.json:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}

For an HTTP OCR server:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json
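
If the config file is generated by a script, a typed object can catch mistakes before the file is written. This is a minimal sketch: the field names mirror the keys shown above, while the `LiteParseConfig` interface, the `validateConfig` helper, and its validation rules are illustrative assumptions rather than documented LiteParse behaviour.

```typescript
// Field names mirror the liteparse.config.json keys shown above; the
// validation rules are illustrative assumptions, not LiteParse behaviour.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  skipDiagonalText?: boolean;
  preserveVerySmallText?: boolean;
}

// Returns a list of human-readable problems; an empty list means no issue
// was found by these (assumed) checks.
function validateConfig(config: LiteParseConfig): string[] {
  const problems: string[] = [];
  if (config.dpi !== undefined && !(config.dpi > 0)) {
    problems.push("dpi must be a positive number");
  }
  if (
    config.maxPages !== undefined &&
    (Math.floor(config.maxPages) !== config.maxPages || config.maxPages < 1)
  ) {
    problems.push("maxPages must be a positive integer");
  }
  if (config.ocrServerUrl !== undefined && !/^https?:\/\//.test(config.ocrServerUrl)) {
    problems.push("ocrServerUrl must start with http:// or https://");
  }
  if (config.ocrEnabled === false && config.ocrServerUrl !== undefined) {
    problems.push("ocrServerUrl is set but ocrEnabled is false");
  }
  return problems;
}

const ocrServerConfig: LiteParseConfig = {
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
  outputFormat: "json",
};
const configProblems = validateConfig(ocrServerConfig);

// Text that could be written to liteparse.config.json.
const configFileContents = JSON.stringify(ocrServerConfig, null, 2);
```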

Step 5 — HTTP OCR Server API (Advanced)

If the user wants to plug in a custom OCR backend, the server must implement:

Endpoint: POST /ocr
Accepts: file (multipart) and language (string) parameters
Returns:

{ "results": [ { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 } ] }

Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
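
When writing a custom backend, it can be useful to check that its responses match the shape above before pointing LiteParse at it. The following sketch is a set of type guards for that shape. Treating `bbox` as exactly four finite numbers and `confidence` as a value between 0 and 1 are assumptions drawn from the example, and the coordinates in the sample are placeholders.

```typescript
// Shape taken from the response example above. Treating bbox as four finite
// numbers and confidence as a value in [0, 1] is an assumption.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number];
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isFiniteNumber(value: unknown): value is number {
  return typeof value === "number" && isFinite(value);
}

function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { text?: unknown; bbox?: unknown; confidence?: unknown };
  const bbox = candidate.bbox;
  return (
    typeof candidate.text === "string" &&
    Array.isArray(bbox) &&
    bbox.length === 4 &&
    bbox.every((n) => isFiniteNumber(n)) &&
    isFiniteNumber(candidate.confidence) &&
    candidate.confidence >= 0 &&
    candidate.confidence <= 1
  );
}

function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as { results?: unknown }).results;
  return Array.isArray(results) && results.every((r) => isOcrResult(r));
}

// Placeholder coordinates stand in for x1, y1, x2, y2.
const sampleResponse: unknown = JSON.parse(
  '{ "results": [ { "text": "Hello", "bbox": [10, 20, 110, 40], "confidence": 0.98 } ] }'
);
const sampleIsValid = isOcrResponse(sampleResponse); // true
```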

Supported Input Formats

Category	Formats
PDF	`.pdf`
Word	`.doc`, `.docx`, `.docm`, `.odt`, `.rtf`
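PowerPoint	`.ppt`, `.pptx`, `.pptm`, `.odp`
Spreadsheets	`.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`
Images	`.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`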

Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
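
A script can use the dependency rule above to warn the user before a parse fails. The sketch below maps a file name to the external tool it would need. The extension lists are copied from the supported-formats table in this document; treating every Word, PowerPoint, and spreadsheet extension (including `.csv` and `.tsv`) as an Office document is an assumption, and `requiredTool` is a hypothetical helper.

```typescript
// Extension lists copied from the supported-formats table in this document.
// The function itself is an illustrative helper, not part of LiteParse.
type ExternalTool = "LibreOffice" | "ImageMagick" | null;

const OFFICE_EXTENSIONS = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];
const IMAGE_EXTENSIONS = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Returns the external tool needed for the file, or null for PDFs, which
// need no conversion step.
function requiredTool(fileName: string): ExternalTool {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) {
    throw new Error(`No file extension in "${fileName}"`);
  }
  const ext = fileName.slice(dot).toLowerCase();
  if (ext === ".pdf") return null;
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "LibreOffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "ImageMagick";
  throw new Error(`Unsupported extension: ${ext}`);
}

const toolForReport = requiredTool("report.DOCX"); // "LibreOffice"
const toolForScan = requiredTool("scan.png");      // "ImageMagick"
```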
