data-analysis by bytedance/deer-flow
npx skills add https://github.com/bytedance/deer-flow --skill data-analysis
This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.
When a user uploads data files and requests analysis, identify:

- The uploaded file paths under /mnt/user-data/uploads/
- The output folder under /mnt/user-data

First, inspect the uploaded file to understand its schema:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action inspect
This returns the schema of each table (sheet) in the file. Based on the schema, construct SQL queries to answer the user's questions.
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action query \
--sql "SELECT category, COUNT(*) as count, AVG(amount) as avg_amount FROM Sheet1 GROUP BY category ORDER BY count DESC"
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action summary \
--table Sheet1
This returns for each numeric column: count, mean, std, min, 25%, 50%, 75%, max, null_count. For string columns: count, unique, top value, frequency, null_count.
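The numeric statistics above can be reproduced with Python's standard library. A minimal sketch for one numeric column, assuming the summary uses sample standard deviation and inclusive quartiles (pandas-`describe`-style conventions; the column values here are invented for illustration):

```python
import statistics

def numeric_summary(values):
    """Compute count, mean, std, min, quartiles, max, and null_count
    for a list of values that may contain None."""
    non_null = [v for v in values if v is not None]
    # Inclusive quartiles: 25%, 50%, 75% cut points of the observed data
    q = statistics.quantiles(non_null, n=4, method="inclusive")
    return {
        "count": len(non_null),
        "mean": statistics.mean(non_null),
        "std": statistics.stdev(non_null),   # sample std (ddof=1)
        "min": min(non_null),
        "25%": q[0], "50%": q[1], "75%": q[2],
        "max": max(non_null),
        "null_count": len(values) - len(non_null),
    }

print(numeric_summary([10, 20, 30, 40, None]))
```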
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action query \
--sql "SELECT * FROM Sheet1 WHERE amount > 1000" \
--output-file /mnt/user-data/outputs/filtered-results.csv
Supported output formats (auto-detected from extension):
- .csv — Comma-separated values
- .json — JSON array of records
- .md — Markdown table

| Parameter | Required | Description |
|---|---|---|
| --files | Yes | Space-separated paths to Excel/CSV files |
| --action | Yes | One of: inspect, query, summary |
| --sql | For query | SQL query to execute |
| --table | For summary | Table/sheet name to summarize |
| --output-file | No | Path to export results (CSV/JSON/MD) |
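Extension-based export like the one --output-file performs can be sketched as a small dispatcher. This is a hypothetical illustration of the behavior described above, not the script's actual code:

```python
import csv
import json
import os

def export_rows(rows, columns, output_file):
    """Write query results in a format chosen from the file extension,
    mirroring the .csv / .json / .md auto-detection described above."""
    ext = os.path.splitext(output_file)[1].lower()
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        if ext == ".csv":
            writer = csv.writer(f)
            writer.writerow(columns)
            writer.writerows(rows)
        elif ext == ".json":
            # JSON array of records: one object per row
            json.dump([dict(zip(columns, r)) for r in rows], f, indent=2)
        elif ext == ".md":
            f.write("| " + " | ".join(columns) + " |\n")
            f.write("|" + "---|" * len(columns) + "\n")
            for r in rows:
                f.write("| " + " | ".join(str(v) for v in r) + " |\n")
        else:
            raise ValueError(f"Unsupported extension: {ext}")
```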
> [!NOTE]
> Do NOT read the Python file; just call it with the parameters.
Table naming:

- Excel sheets load as tables named after the sheet (Sheet1, Sales, Revenue)
- CSV files load as tables named after the file (data.csv → data)
- Quote table names that start with a digit, e.g. "2024_Sales"

-- Row count
SELECT COUNT(*) FROM Sheet1
-- Distinct values in a column
SELECT DISTINCT category FROM Sheet1
-- Value distribution
SELECT category, COUNT(*) as cnt FROM Sheet1 GROUP BY category ORDER BY cnt DESC
-- Date range
SELECT MIN(date_col), MAX(date_col) FROM Sheet1
-- Revenue by category and month
SELECT category, DATE_TRUNC('month', order_date) as month,
SUM(revenue) as total_revenue
FROM Sales
GROUP BY category, month
ORDER BY month, total_revenue DESC
-- Top 10 customers by spend
SELECT customer_name, SUM(amount) as total_spend
FROM Orders GROUP BY customer_name
ORDER BY total_spend DESC LIMIT 10
-- Join sales with customer info from different files
SELECT s.order_id, s.amount, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
WHERE s.amount > 500
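Cross-file joins work because each uploaded file loads as its own table in the same session; the join itself is plain SQL. A self-contained check of the query above using Python's bundled sqlite3 as a stand-in for DuckDB (sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Two tables standing in for two separate uploaded files
con.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, customer_id INTEGER)")
con.execute("CREATE TABLE customers (id INTEGER, customer_name TEXT, region TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 700, 10), (2, 300, 10), (3, 900, 20)])
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(10, "Acme", "West"), (20, "Globex", "East")])

rows = con.execute("""
    SELECT s.order_id, s.amount, c.customer_name, c.region
    FROM sales s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.amount > 500
    ORDER BY s.order_id
""").fetchall()
# Only the two orders above 500 survive the filter
```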
-- Running total and rank
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) as running_total,
RANK() OVER (ORDER BY amount DESC) as amount_rank
FROM Sales
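Window functions like these are standard SQL, so the running-total pattern can be verified against any engine that supports them. A small demonstration with Python's bundled sqlite3 (requires SQLite ≥ 3.25, which ships with modern Python builds; table contents are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales (order_date TEXT, amount REAL)")
con.executemany("INSERT INTO Sales VALUES (?, ?)",
                [("2024-01-01", 100), ("2024-01-02", 50), ("2024-01-03", 200)])

rows = con.execute("""
    SELECT order_date, amount,
           SUM(amount) OVER (ORDER BY order_date) AS running_total,
           RANK() OVER (ORDER BY amount DESC) AS amount_rank
    FROM Sales
""").fetchall()
# running_total accumulates in date order; amount_rank ranks by size
```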
-- Pivot: monthly revenue by category
SELECT category,
SUM(CASE WHEN MONTH(date) = 1 THEN revenue END) as Jan,
SUM(CASE WHEN MONTH(date) = 2 THEN revenue END) as Feb,
SUM(CASE WHEN MONTH(date) = 3 THEN revenue END) as Mar
FROM Sales
GROUP BY category
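The conditional-aggregation pivot above is portable SQL apart from MONTH(), which is DuckDB syntax. The same pattern can be exercised with Python's bundled sqlite3 by swapping MONTH(date) for strftime('%m', date); sample data is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales (category TEXT, date TEXT, revenue REAL)")
con.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    ("A", "2024-01-05", 10),
    ("A", "2024-02-10", 20),
    ("B", "2024-01-15", 5),
])

rows = con.execute("""
    SELECT category,
           SUM(CASE WHEN strftime('%m', date) = '01' THEN revenue END) AS Jan,
           SUM(CASE WHEN strftime('%m', date) = '02' THEN revenue END) AS Feb
    FROM Sales
    GROUP BY category
    ORDER BY category
""").fetchall()
# Categories with no sales in a month get NULL for that column
```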
User uploads sales_2024.xlsx (with sheets: Orders, Products, Customers) and asks: "Analyze my sales data — show top products by revenue and monthly trends."
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action inspect
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action query \
--sql "SELECT p.product_name, SUM(o.quantity * o.unit_price) as total_revenue, SUM(o.quantity) as total_units FROM Orders o JOIN Products p ON o.product_id = p.id GROUP BY p.product_name ORDER BY total_revenue DESC LIMIT 10"
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action query \
--sql "SELECT DATE_TRUNC('month', order_date) as month, SUM(quantity * unit_price) as revenue FROM Orders GROUP BY month ORDER BY month" \
--output-file /mnt/user-data/outputs/monthly-trends.csv
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action summary \
--table Orders
Present results to the user with clear explanations of findings, trends, and actionable insights.
User uploads orders.csv and customers.xlsx and asks: "Which region has the highest average order value?"
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/orders.csv /mnt/user-data/uploads/customers.xlsx \
--action query \
--sql "SELECT c.region, AVG(o.amount) as avg_order_value, COUNT(*) as order_count FROM orders o JOIN Customers c ON o.customer_id = c.id GROUP BY c.region ORDER BY avg_order_value DESC"
After analysis:

- Share any exported files with the user via the present_files tool

The script automatically caches loaded data to avoid re-parsing files on every call:

- Loaded tables are stored in a persistent DuckDB database under /mnt/user-data/workspace/.data-analysis-cache/

This is especially useful when running multiple queries against the same data files (inspect → query → summary).
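One way such a cache can stay valid is by keying it on the input files' identity. The script's actual invalidation strategy is not documented here; this is a hypothetical sketch of a mtime/size-based cache key:

```python
import hashlib
import os

def cache_key(paths):
    """Derive a stable cache identity from file paths, sizes, and mtimes,
    so a cached database is reused only while the inputs are unchanged."""
    h = hashlib.sha256()
    for p in sorted(paths):  # order-independent across calls
        st = os.stat(p)
        h.update(f"{p}:{st.st_size}:{st.st_mtime_ns}".encode())
    return h.hexdigest()[:16]
```

Any edit to an input file changes its size or mtime, which changes the key and forces a fresh parse.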
Tips:

- DuckDB supports rich date/time functions (DATE_TRUNC, EXTRACT, etc.)
- Quote column names that contain spaces, e.g. "Column Name"