MarkItDown: Microsoft's official file-to-Markdown tool, supporting conversion of PDF, Word, Excel, audio, web pages, and other formats

markitdown-skill by julianobarbosa/claude-code-skills

76 weekly installs

44 GitHub Stars

GitHub

Install command

npx skills add https://github.com/julianobarbosa/claude-code-skills --skill markitdown-skill

AI/Machine Learning, File Management, Data Processing


MarkItDown Skill

Microsoft's Python utility for converting various file formats to Markdown for LLM and text analysis pipelines.

Overview

MarkItDown converts documents while preserving structure (headings, lists, tables, links). It's optimized for LLM consumption rather than human-readable output.

Supported Formats

Category	Formats
Documents	PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX, XLS)
Media	Images (EXIF + OCR), Audio (WAV, MP3 transcription)
Web	HTML, YouTube URLs, Wikipedia, RSS/Atom feeds
Data	CSV, JSON, XML, Jupyter notebooks (.ipynb)
Archives	ZIP (iterates contents), EPub
Email	Outlook MSG files
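Before handing files to a converter, it can help to filter out paths whose extension is clearly outside the table above. The following is a minimal sketch: `SUPPORTED_EXTENSIONS` and `looks_supported` are hypothetical names, the set is transcribed from the table rather than exported by MarkItDown, and image extensions are left out because the table does not name them.

```python
from pathlib import Path

# Assumption: this set is copied by hand from the format table above.
# It is not an official list provided by the MarkItDown package.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",   # documents
    ".wav", ".mp3",                               # audio
    ".html", ".csv", ".json", ".xml", ".ipynb",  # web and data
    ".zip", ".epub", ".msg",                      # archives and email
}


def looks_supported(path: str) -> bool:
    """Return True if the file extension appears in the table above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the final suffix, so `archive.tar.gz` is treated as `.gz` and rejected.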

Quick Start

Installation

# Full installation (recommended)
pip install 'markitdown[all]'

# Minimal with specific formats
pip install 'markitdown[pdf,docx,pptx]'

# Using uv
uv pip install 'markitdown[all]'

Optional Dependencies

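Extra	Description
`[all]`	All optional dependencies
`[pdf]`	PDF file support
`[docx]`	Word documents
`[pptx]`	PowerPoint presentations
`[xlsx]`	Excel spreadsheets
`[xls]`	Legacy Excel files
`[outlook]`	Outlook MSG files
`[az-doc-intel]`	Azure Document Intelligence
`[audio-transcription]`	WAV/MP3 transcription
`[youtube-transcription]`	YouTube video transcripts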
Command-Line Usage

# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown > output.md

# With Azure Document Intelligence
markitdown document.pdf -o output.md -d -e "<endpoint>"

Python API

from markitdown import MarkItDown

# Basic conversion
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail"
)
result = md.convert("image.jpg")
print(result.text_content)

# With Azure Document Intelligence
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("complex-document.pdf")
print(result.text_content)

Common Use Cases

Batch Convert Directory

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(exist_ok=True)

for file in input_dir.glob("*"):
    if file.is_file():
        try:
            result = md.convert(str(file))
            output_file = output_dir / f"{file.stem}.md"
            output_file.write_text(result.text_content, encoding="utf-8")
            print(f"Converted: {file.name}")
        except Exception as e:
            print(f"Failed: {file.name} - {e}")
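One thing to watch in the loop above: naming outputs by `file.stem` means `report.pdf` and `report.docx` both write to `report.md`. A possible workaround, sketched here with a hypothetical `output_path_for` helper, keeps the original extension in the output name.

```python
from pathlib import Path


def output_path_for(source: Path, output_dir: Path) -> Path:
    """Map report.pdf to <output_dir>/report.pdf.md so same-stem files stay distinct."""
    return output_dir / f"{source.name}.md"
```

In the batch loop, `output_file = output_path_for(file, output_dir)` would replace the `file.stem` line.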

Process for LLM Context

from markitdown import MarkItDown

def prepare_for_llm(file_path: str) -> str:
    """Convert document to LLM-ready markdown."""
    md = MarkItDown()
    result = md.convert(file_path)

    # Add source reference
    content = f"# Source: {file_path}\n\n{result.text_content}"
    return content

# Use with your LLM
context = prepare_for_llm("report.pdf")
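Converted documents can exceed a model's context window, so the text is often split before use. The following sketch splits on blank lines and falls back to hard cuts for oversized paragraphs. `split_markdown` is a hypothetical helper, and the budget is measured in characters, not tokens, which is only a rough proxy.

```python
def split_markdown(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at blank lines where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split any paragraph that cannot fit into a single chunk.
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else f"{current}\n\n{piece}"
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The piece does not fit; close the current chunk and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the model separately, for example `for chunk in split_markdown(context, 8000): ...`, where 8000 is an arbitrary illustrative budget.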

Extract YouTube Transcript

# CLI
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" > transcript.md



# Python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)

Image OCR with AI Description

from markitdown import MarkItDown
from openai import OpenAI

# Initialize with LLM support
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# Convert image with AI description
result = md.convert("screenshot.png")
print(result.text_content)

Convert Jupyter Notebook

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.text_content)  # Code cells, outputs, markdown

Extract Wikipedia Content

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://en.wikipedia.org/wiki/Python")
print(result.text_content)  # Main article content only

Parse RSS Feed

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/feed.xml")
print(result.text_content)  # Feed entries as markdown

Plugin System

MarkItDown supports third-party plugins for extended functionality.

# List installed plugins
markitdown --list-plugins

# Enable plugins during conversion
markitdown --use-plugins document.pdf



# Enable plugins in Python
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")

Search GitHub for #markitdown-plugin to find available plugins.

MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop.

# Install MCP server
pip install markitdown-mcp

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-mcp
pip install -e .

See markitdown-mcp for configuration details.
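As a rough illustration of what client configuration might look like, the sketch below prints an `mcpServers` entry of the kind MCP desktop clients commonly read. It assumes that installing `markitdown-mcp` provides a `markitdown-mcp` executable on `PATH` and that the client uses this JSON layout; both are assumptions to verify against the markitdown-mcp documentation. `claude_desktop_entry` is a hypothetical helper.

```python
import json


def claude_desktop_entry(command: str = "markitdown-mcp") -> dict:
    """Build a config fragment that registers the server under the name 'markitdown'."""
    # Assumption: the client expects an "mcpServers" map of name -> launch command.
    return {"mcpServers": {"markitdown": {"command": command, "args": []}}}


print(json.dumps(claude_desktop_entry(), indent=2))
```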

Docker Usage

# Build image
docker build -t markitdown:latest .

# Convert file
docker run --rm -i markitdown:latest < document.pdf > output.md

Troubleshooting

Issue	Solution
Missing dependencies	Install with `pip install 'markitdown[all]'`
PDF extraction fails	Try Azure Document Intelligence for complex PDFs
Image text not extracted	Ensure OCR dependencies are installed, or use LLM mode
Large file timeout	Process in chunks or use streaming
Plugin not found	Run `markitdown --list-plugins` to verify installation

Common Errors

# ModuleNotFoundError for specific format
pip install 'markitdown[pdf]'  # Install missing dependency

# Azure authentication
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="<endpoint>"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="<key>"
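Both errors above can be detected before a conversion starts. The sketch below uses only the standard library; `missing_modules` and `missing_env_vars` are hypothetical helpers. The module names to check are up to the caller, since this document does not list which modules each extra installs.

```python
import os
from importlib.util import find_spec
from typing import Mapping


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


def missing_env_vars(names: list[str], environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables that are unset or empty."""
    return [name for name in names if not environ.get(name)]


AZURE_VARS = ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT", "AZURE_DOCUMENT_INTELLIGENCE_KEY"]
```

For example, `missing_modules(["markitdown"])` returns `["markitdown"]` when the package is not installed, and `missing_env_vars(AZURE_VARS)` names whichever of the two Azure variables still needs to be exported.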

Requirements

Python >= 3.10
Virtual environment recommended
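The version requirement can be checked at the top of a script so that failures are explicit. A minimal sketch, with `MIN_PYTHON` and `python_is_supported` as hypothetical names:

```python
import sys

MIN_PYTHON = (3, 10)  # from the requirement stated above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter version meets the minimum requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```

A script could call `python_is_supported()` at startup and exit with a clear message when it returns `False`.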

Create virtual environment

python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

Install

pip install 'markitdown[all]'

References

references/cli-reference.md - Complete CLI options
references/api-reference.md - Python API details
references/examples.md - Extended examples
references/advanced-features.md - Custom converters, URI handling
GitHub: https://github.com/microsoft/markitdown
PyPI: https://pypi.org/project/markitdown/

Repository: julianobarbosa/…e-skills
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: cursor (67), opencode (67), gemini-cli (65), codex (61), github-copilot (57), claude-code (54)


