explore-data by anthropics/knowledge-work-plugins
npx skills add https://github.com/anthropics/knowledge-work-plugins --skill explore-data
If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.
/explore-data <table_name or file>
If a data warehouse MCP server is connected:
If a file is provided (CSV, Excel, Parquet, JSON):
If neither:
Before analyzing any data, understand its structure:
Table-level questions:
Column classification — categorize each column as one of:
Run the following profiling checks:
Table-level metrics:
All columns:
Numeric columns (metrics):
min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
String columns (dimensions, text):
min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
Date/timestamp columns:
min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
Boolean columns:
true count, false count, null count
true rate
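The checks above can be sketched for an uploaded CSV using only the Python standard library. This is a minimal illustration, not the skill's actual implementation: the column names and sample rows are made up, and only the numeric and string checks are shown.

```python
import csv
import io
import statistics

def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def profile_numeric(values):
    """Null count, range, central tendency, spread, zero/negative counts."""
    nums = [float(v) for v in values if v not in ("", None)]
    return {
        "nulls": len(values) - len(nums),
        "min": min(nums), "max": max(nums),
        "mean": statistics.mean(nums),
        "median": statistics.median(nums),
        "stdev": statistics.stdev(nums) if len(nums) > 1 else 0.0,
        "p95": percentile(nums, 95),
        "zeros": sum(1 for n in nums if n == 0),
        "negatives": sum(1 for n in nums if n < 0),
    }

def profile_string(values):
    """Empty-string count, length stats, whitespace issues, cardinality."""
    non_empty = [v for v in values if v]
    lengths = [len(v) for v in non_empty]
    return {
        "empty": len(values) - len(non_empty),
        "min_len": min(lengths), "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "whitespace": sum(1 for v in non_empty if v != v.strip()),
        "distinct": len(set(non_empty)),
    }

# Illustrative sample; in practice, read the uploaded file instead.
sample = "user_id,revenue\nusr_a,29.99\nusr_b,0\nusr_c,149.00\nusr_d,\n"
rows = list(csv.DictReader(io.StringIO(sample)))
revenue = profile_numeric([r["revenue"] for r in rows])
user_id = profile_string([r["user_id"] for r in rows])
```

Date/timestamp and boolean columns follow the same pattern: parse, count nulls, then compute the type-specific aggregates listed above.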
Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).
Apply the quality assessment framework below. Flag potential problems:
After profiling individual columns:
Based on the column profile, recommend:
Suggest 3-5 specific analyses the user could run next:
## Data Profile: [table_name]
### Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22
### Column Details
[summary table]
### Data Quality Issues
[flagged issues with severity]
### Recommended Explorations
[numbered list of suggested follow-up analyses]
Rate each column:
Look for:
Red flags that suggest accuracy issues:
For numeric columns, characterize the distribution:
For time series data, look for:
Identify natural segments by:
Between numeric columns:
When documenting a dataset for team use:
## Table: [schema.table_name]
**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]
### Key Columns
| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |
### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)
### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]
### Common Query Patterns
- [Typical use cases for this table]
When connected to a data warehouse, use these patterns to discover schema:
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;
-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
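When the data arrives as a file rather than a live warehouse, the same discovery pattern can be sketched against SQLite via the standard-library `sqlite3` module (`sqlite_master` plays the role of `information_schema.tables`). The table names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for an uploaded .db file
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, event_type TEXT);
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO events (event_type) VALUES ('click'), ('view');
""")

# Discover tables -- SQLite's equivalent of information_schema.tables.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Per-table row counts, the "general pattern" above. Interpolating the
# table name is safe here only because it comes from sqlite_master.
counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in tables}
```
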
When exploring an unfamiliar data environment: