document-processing by dirnbauer/webconsulting-skills
npx skills add https://github.com/dirnbauer/webconsulting-skills --skill document-processing来源: 此技能改编自 Anthropic's Skills 中为 Claude Code 和 AI 代理设计的文档处理技能(pdf、docx、pptx、xlsx)。
创建、编辑和分析办公文档,包括 PDF、Word 文档、PowerPoint 演示文稿和 Excel 电子表格。
| 任务 | 文档类型 | 最佳工具 |
|---|---|---|
| 提取文本 | pdfplumber, pdftotext | |
| 合并/拆分 | pypdf, qpdf | |
| 填写表单 |
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
pdf-lib (JS), pypdf |
| 创建新文档 | reportlab |
| OCR 扫描文档 | pytesseract + pdf2image |
| 提取文本 | DOCX | pandoc, markitdown |
| 创建新文档 | DOCX | docx-js (JS) |
| 编辑现有文档 | DOCX | OOXML (解包/编辑/打包) |
| 提取文本 | PPTX | markitdown |
| 创建新文档 | PPTX | html2pptx, PptxGenJS |
| 编辑现有文档 | PPTX | OOXML (解包/编辑/打包) |
| 数据分析 | XLSX | pandas |
| 公式/格式化 | XLSX | openpyxl |
import pdfplumber
# 提取文本并保留布局
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
import pdfplumber
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table:
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
# 合并所有表格
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # 顺时针旋转 90 度
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)
# 需要安装:pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# 将 PDF 转换为图像
images = convert_from_path('scanned.pdf')
# 对每一页进行 OCR
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# 添加内容
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# 第 2 页
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
# 提取文本 (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt # 保留布局
# 合并 PDF (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# 拆分页面
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# 旋转页面
qpdf input.pdf output.pdf --rotate=+90:1
# 移除密码
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
# 提取图像
pdfimages -j input.pdf output_prefix
# 使用 pandoc 转换为 markdown
pandoc document.docx -o output.md
# 保留修订跟踪
pandoc --track-changes=all document.docx -o output.md
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';
const doc = new Document({
sections: [{
properties: {},
children: [
new Paragraph({
text: "Document Title",
heading: HeadingLevel.HEADING_1,
}),
new Paragraph({
children: [
new TextRun("This is a "),
new TextRun({
text: "bold",
bold: true,
}),
new TextRun(" word in a paragraph."),
],
}),
new Paragraph({
text: "This is another paragraph.",
}),
],
}],
});
// 导出到文件
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';
const table = new Table({
rows: [
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Header 1")] }),
new TableCell({ children: [new Paragraph("Header 2")] }),
new TableCell({ children: [new Paragraph("Header 3")] }),
],
}),
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Cell 1")] }),
new TableCell({ children: [new Paragraph("Cell 2")] }),
new TableCell({ children: [new Paragraph("Cell 3")] }),
],
}),
],
});
const doc = new Document({
sections: [{
children: [
new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
table,
],
}],
});
对于复杂编辑,使用原始 OOXML:
解包文档:
python ooxml/scripts/unpack.py document.docx unpacked/
编辑 XML 文件(主要是 word/document.xml)
验证并打包:
python ooxml/scripts/validate.py unpacked/ --original document.docx
python ooxml/scripts/pack.py unpacked/ output.docx
对于带修订跟踪的文档审阅:
# 1. 获取当前状态
pandoc --track-changes=all document.docx -o current.md
# 2. 解包
python ooxml/scripts/unpack.py document.docx unpacked/
# 3. 使用修订跟踪模式进行编辑
# 使用 <w:ins> 表示插入,<w:del> 表示删除
# 4. 打包最终文档
python ooxml/scripts/pack.py unpacked/ reviewed.docx
python -m markitdown presentation.pptx
import PptxGenJS from 'pptxgenjs';
const pptx = new PptxGenJS();
// 幻灯片 1 - 标题
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
x: 1, y: 2, w: 8, h: 1.5,
fontSize: 36,
bold: true,
color: "363636",
align: "center",
});
slide1.addText("Subtitle goes here", {
x: 1, y: 3.5, w: 8, h: 0.5,
fontSize: 18,
color: "666666",
align: "center",
});
// 幻灯片 2 - 内容
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
x: 0.5, y: 0.5, w: 9, h: 0.8,
fontSize: 28,
bold: true,
});
slide2.addText([
{ text: "• First important point\n", options: { bullet: true } },
{ text: "• Second important point\n", options: { bullet: true } },
{ text: "• Third important point\n", options: { bullet: true } },
], {
x: 0.5, y: 1.5, w: 9, h: 3,
fontSize: 18,
});
// 幻灯片 3 - 图表
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
{ name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
{ name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
x: 1, y: 1, w: 8, h: 4,
showLegend: true,
legendPos: "b",
});
// 保存
pptx.writeFile("output.pptx");
# 1. 解包
python ooxml/scripts/unpack.py presentation.pptx unpacked/
# 2. 关键文件:
# - ppt/slides/slide1.xml, slide2.xml 等
# - ppt/notesSlides/ 用于演讲者备注
# - ppt/theme/ 用于样式
# 3. 验证并打包
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx
python ooxml/scripts/pack.py unpacked/ output.pptx
# 创建所有幻灯片的可视化概览
python scripts/thumbnail.py presentation.pptx --cols 4
# 首先转换为 PDF
soffice --headless --convert-to pdf presentation.pptx
# 然后将 PDF 转换为图像
pdftoppm -jpeg -r 150 presentation.pdf slide
# 创建 slide-1.jpg, slide-2.jpg 等
import pandas as pd
# 读取 Excel
df = pd.read_excel('file.xlsx') # 默认:第一个工作表
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # 所有工作表作为字典
# 分析
df.head() # 预览数据
df.info() # 列信息
df.describe() # 统计信息
# 筛选和转换
filtered = df[df['Sales'] > 1000]
grouped = df.groupby('Category')['Revenue'].sum()
# 写入 Excel
df.to_excel('output.xlsx', index=False)
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# 添加数据
sheet['A1'] = 'Product'
sheet['B1'] = 'Price'
sheet['C1'] = 'Quantity'
sheet['D1'] = 'Total'
# 表头格式化
for cell in ['A1', 'B1', 'C1', 'D1']:
sheet[cell].font = Font(bold=True, color='FFFFFF')
sheet[cell].fill = PatternFill('solid', start_color='4472C4')
sheet[cell].alignment = Alignment(horizontal='center')
# 添加数据行
data = [
('Widget A', 10.00, 5),
('Widget B', 15.00, 3),
('Widget C', 20.00, 8),
]
for row_idx, (product, price, qty) in enumerate(data, start=2):
sheet[f'A{row_idx}'] = product
sheet[f'B{row_idx}'] = price
sheet[f'C{row_idx}'] = qty
# 公式 - 不是硬编码的值!
sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
# 在底部添加求和公式
last_row = len(data) + 2
sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
# 列宽
sheet.column_dimensions['A'].width = 15
sheet.column_dimensions['B'].width = 10
sheet.column_dimensions['C'].width = 10
sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
from openpyxl.styles import Font
# 行业标准颜色
BLUE = Font(color='0000FF') # 硬编码输入
BLACK = Font(color='000000') # 公式
GREEN = Font(color='008000') # 来自其他工作表的链接
RED = Font(color='FF0000') # 外部链接
# 应用到单元格
sheet['B5'].font = BLUE # 用户输入
sheet['B6'].font = BLACK # 公式
# 带千位分隔符的货币
sheet['B5'].number_format = '$#,##0'
# 带一位小数的百分比
sheet['B6'].number_format = '0.0%'
# 零显示为破折号
sheet['B7'].number_format = '$#,##0;($#,##0);"-"'
# 倍数
sheet['B8'].number_format = '0.0x'
# ❌ 错误 - 硬编码计算值
total = df['Sales'].sum()
sheet['B10'] = total # 硬编码 5000
# ✅ 正确 - 使用 Excel 公式
sheet['B10'] = '=SUM(B2:B9)'
# ❌ 错误 - 在 Python 中计算
growth = (current - previous) / previous
sheet['C5'] = growth
# ✅ 正确 - Excel 公式
sheet['C5'] = '=(C4-C2)/C2'
from openpyxl import load_workbook
# 加载并保留公式
wb = load_workbook('existing.xlsx')
sheet = wb.active
# 修改单元格
sheet['A1'] = 'New Value'
sheet.insert_rows(2)
sheet.delete_cols(3)
# 添加新工作表
new_sheet = wb.create_sheet('Analysis')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
创建/修改带公式的 Excel 文件后:
# 使用 LibreOffice 重新计算所有公式
python recalc.py output.xlsx
按需安装:
# PDF
pip install pypdf pdfplumber reportlab pytesseract pdf2image
# DOCX
npm install -g docx
pip install "markitdown[docx]"
# PPTX
npm install -g pptxgenjs
pip install "markitdown[pptx]"
# XLSX
pip install pandas openpyxl
# 命令行工具
sudo apt-get install poppler-utils qpdf libreoffice pandoc
| 我想要... | 命令/代码 |
|---|---|
| 提取 PDF 文本 | pdfplumber.open(f).pages[0].extract_text() |
| 合并 PDF | pypdf.PdfWriter() + 循环 |
| 拆分 PDF | 每个页面一个 PdfWriter() |
| OCR 扫描的 PDF | pdf2image → pytesseract |
| 转换 DOCX 为 MD | pandoc doc.docx -o doc.md |
| 创建 DOCX | docx-js (JavaScript) |
| 提取 PPTX 文本 | python -m markitdown pres.pptx |
| 创建 PPTX | PptxGenJS (JavaScript) |
| 分析 Excel | pandas.read_excel() |
| 带公式的 Excel | openpyxl |
此技能改编自 Anthropic's Skills。
原始仓库:
版权 (c) Anthropic - MIT 许可证
由 webconsulting.at 为此技能集合改编
每周安装次数
67
仓库
GitHub 星标数
13
首次出现
2026年2月7日
安全审计
安装于
gemini-cli61
opencode61
github-copilot60
codex60
cursor60
kimi-cli58
Source: This skill is adapted from Anthropic's Skills document processing skills (pdf, docx, pptx, xlsx) for Claude Code and AI agents.
Create, edit, and analyze office documents including PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets.
| Task | Document Type | Best Tool |
|---|---|---|
| Extract text | pdfplumber, pdftotext | |
| Merge/split | pypdf, qpdf | |
| Fill forms | pdf-lib (JS), pypdf | |
| Create new | reportlab | |
| OCR scanned | pytesseract + pdf2image | |
| Extract text | DOCX | pandoc, markitdown |
| Create new | DOCX | docx-js (JS) |
| Edit existing | DOCX | OOXML (unpack/edit/pack) |
| Extract text | PPTX | markitdown |
| Create new | PPTX | html2pptx, PptxGenJS |
| Edit existing | PPTX | OOXML (unpack/edit/pack) |
| Data analysis | XLSX | pandas |
| Formulas/formatting | XLSX | openpyxl |
import pdfplumber
# Extract text with layout preservation
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
import pdfplumber
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table:
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
# Combine all tables
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
# Extract text (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt # Preserve layout
# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
# Extract images
pdfimages -j input.pdf output_prefix
# Convert to markdown with pandoc
pandoc document.docx -o output.md
# With tracked changes preserved
pandoc --track-changes=all document.docx -o output.md
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';
const doc = new Document({
sections: [{
properties: {},
children: [
new Paragraph({
text: "Document Title",
heading: HeadingLevel.HEADING_1,
}),
new Paragraph({
children: [
new TextRun("This is a "),
new TextRun({
text: "bold",
bold: true,
}),
new TextRun(" word in a paragraph."),
],
}),
new Paragraph({
text: "This is another paragraph.",
}),
],
}],
});
// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';
const table = new Table({
rows: [
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Header 1")] }),
new TableCell({ children: [new Paragraph("Header 2")] }),
new TableCell({ children: [new Paragraph("Header 3")] }),
],
}),
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Cell 1")] }),
new TableCell({ children: [new Paragraph("Cell 2")] }),
new TableCell({ children: [new Paragraph("Cell 3")] }),
],
}),
],
});
const doc = new Document({
sections: [{
children: [
new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
table,
],
}],
});
For complex edits, work with raw OOXML:
Unpack the document:
python ooxml/scripts/unpack.py document.docx unpacked/
Edit XML files (primarily word/document.xml)
Validate and pack:
python ooxml/scripts/validate.py unpacked/ --original document.docx
python ooxml/scripts/pack.py unpacked/ output.docx
For document review with track changes:
# 1. Get current state
pandoc --track-changes=all document.docx -o current.md
# 2. Unpack
python ooxml/scripts/unpack.py document.docx unpacked/
# 3. Edit using tracked change patterns
# Use <w:ins> for insertions, <w:del> for deletions
# 4. Pack final document
python ooxml/scripts/pack.py unpacked/ reviewed.docx
python -m markitdown presentation.pptx
import PptxGenJS from 'pptxgenjs';
const pptx = new PptxGenJS();
// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
x: 1, y: 2, w: 8, h: 1.5,
fontSize: 36,
bold: true,
color: "363636",
align: "center",
});
slide1.addText("Subtitle goes here", {
x: 1, y: 3.5, w: 8, h: 0.5,
fontSize: 18,
color: "666666",
align: "center",
});
// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
x: 0.5, y: 0.5, w: 9, h: 0.8,
fontSize: 28,
bold: true,
});
slide2.addText([
{ text: "• First important point\n", options: { bullet: true } },
{ text: "• Second important point\n", options: { bullet: true } },
{ text: "• Third important point\n", options: { bullet: true } },
], {
x: 0.5, y: 1.5, w: 9, h: 3,
fontSize: 18,
});
// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
{ name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
{ name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
x: 1, y: 1, w: 8, h: 4,
showLegend: true,
legendPos: "b",
});
// Save
pptx.writeFile("output.pptx");
# 1. Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/
# 2. Key files:
# - ppt/slides/slide1.xml, slide2.xml, etc.
# - ppt/notesSlides/ for speaker notes
# - ppt/theme/ for styling
# 3. Validate and pack
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx
python ooxml/scripts/pack.py unpacked/ output.pptx
# Create visual overview of all slides
python scripts/thumbnail.py presentation.pptx --cols 4
# Convert to PDF first
soffice --headless --convert-to pdf presentation.pptx
# Then PDF to images
pdftoppm -jpeg -r 150 presentation.pdf slide
# Creates slide-1.jpg, slide-2.jpg, etc.
import pandas as pd
# Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
# Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
# Filter and transform
filtered = df[df['Sales'] > 1000]
grouped = df.groupby('Category')['Revenue'].sum()
# Write Excel
df.to_excel('output.xlsx', index=False)
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Product'
sheet['B1'] = 'Price'
sheet['C1'] = 'Quantity'
sheet['D1'] = 'Total'
# Header formatting
for cell in ['A1', 'B1', 'C1', 'D1']:
sheet[cell].font = Font(bold=True, color='FFFFFF')
sheet[cell].fill = PatternFill('solid', start_color='4472C4')
sheet[cell].alignment = Alignment(horizontal='center')
# Add data rows
data = [
('Widget A', 10.00, 5),
('Widget B', 15.00, 3),
('Widget C', 20.00, 8),
]
for row_idx, (product, price, qty) in enumerate(data, start=2):
sheet[f'A{row_idx}'] = product
sheet[f'B{row_idx}'] = price
sheet[f'C{row_idx}'] = qty
# FORMULA - not hardcoded value!
sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
# Add sum formula at bottom
last_row = len(data) + 2
sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
# Column width
sheet.column_dimensions['A'].width = 15
sheet.column_dimensions['B'].width = 10
sheet.column_dimensions['C'].width = 10
sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
from openpyxl.styles import Font
# Industry-standard colors
BLUE = Font(color='0000FF') # Hardcoded inputs
BLACK = Font(color='000000') # Formulas
GREEN = Font(color='008000') # Links from other sheets
RED = Font(color='FF0000') # External links
# Apply to cells
sheet['B5'].font = BLUE # User input
sheet['B6'].font = BLACK # Formula
# Currency with thousands separator
sheet['B5'].number_format = '$#,##0'
# Percentage with one decimal
sheet['B6'].number_format = '0.0%'
# Zeros as dashes
sheet['B7'].number_format = '$#,##0;($#,##0);"-"'
# Multiples
sheet['B8'].number_format = '0.0x'
# ❌ WRONG - Hardcoding calculated values
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
# ✅ CORRECT - Use Excel formulas
sheet['B10'] = '=SUM(B2:B9)'
# ❌ WRONG - Computing in Python
growth = (current - previous) / previous
sheet['C5'] = growth
# ✅ CORRECT - Excel formula
sheet['C5'] = '=(C4-C2)/C2'
from openpyxl import load_workbook
# Load with formulas preserved
wb = load_workbook('existing.xlsx')
sheet = wb.active
# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2)
sheet.delete_cols(3)
# Add new sheet
new_sheet = wb.create_sheet('Analysis')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
After creating/modifying Excel files with formulas:
# Recalculate all formulas using LibreOffice
python recalc.py output.xlsx
Install as needed:
# PDF
pip install pypdf pdfplumber reportlab pytesseract pdf2image
# DOCX
npm install -g docx
pip install "markitdown[docx]"
# PPTX
npm install -g pptxgenjs
pip install "markitdown[pptx]"
# XLSX
pip install pandas openpyxl
# Command line tools
sudo apt-get install poppler-utils qpdf libreoffice pandoc
| I want to... | Command/Code |
|---|---|
| Extract PDF text | pdfplumber.open(f).pages[0].extract_text() |
| Merge PDFs | pypdf.PdfWriter() + loop |
| Split PDF | One PdfWriter() per page |
| OCR scanned PDF | pdf2image → pytesseract |
| Convert DOCX to MD | pandoc doc.docx -o doc.md |
| Create DOCX |
This skill is adapted from Anthropic's Skills.
Original repositories:
Copyright (c) Anthropic - MIT License
Adapted by webconsulting.at for this skill collection
Weekly Installs
67
Repository
GitHub Stars
13
First Seen
Feb 7, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
gemini-cli61
opencode61
github-copilot60
codex60
cursor60
kimi-cli58
Skills CLI 使用指南:AI Agent 技能包管理器安装与管理教程
43,100 周安装
docx-js (JavaScript) |
| Extract PPTX text | python -m markitdown pres.pptx |
| Create PPTX | PptxGenJS (JavaScript) |
| Analyze Excel | pandas.read_excel() |
| Excel with formulas | openpyxl |