explore-data by anthropics/knowledge-work-plugins
npx skills add https://github.com/anthropics/knowledge-work-plugins --skill explore-data
If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.
/explore-data <table_name or file>
If a data warehouse MCP server is connected:
If a file is provided (CSV, Excel, Parquet, JSON):
If neither:
Before analyzing any data, understand its structure:
Table-level questions:
Column classification — categorize each column as one of:
Run the following profiling checks:
Table-level metrics:
All columns:
Numeric columns (metrics):
min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
String columns (dimensions, text):
min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
Date/timestamp columns:
min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
Boolean columns:
true count, false count, null count
true rate
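The checks above can be sketched for an uploaded CSV using only the Python standard library. This is a minimal illustration, not the skill's actual implementation: the column names and sample rows are made up, and only the numeric and string checks are shown.

```python
import csv
import io
import statistics

def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def profile_numeric(values):
    """Null count, range, central tendency, spread, zero/negative counts."""
    nums = [float(v) for v in values if v not in ("", None)]
    return {
        "nulls": len(values) - len(nums),
        "min": min(nums), "max": max(nums),
        "mean": statistics.mean(nums),
        "median": statistics.median(nums),
        "stdev": statistics.stdev(nums) if len(nums) > 1 else 0.0,
        "p95": percentile(nums, 95),
        "zeros": sum(1 for n in nums if n == 0),
        "negatives": sum(1 for n in nums if n < 0),
    }

def profile_string(values):
    """Empty-string count, length stats, whitespace issues, cardinality."""
    non_empty = [v for v in values if v]
    lengths = [len(v) for v in non_empty]
    return {
        "empty": len(values) - len(non_empty),
        "min_len": min(lengths), "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "whitespace": sum(1 for v in non_empty if v != v.strip()),
        "distinct": len(set(non_empty)),
    }

# Illustrative sample; in practice, read the uploaded file instead.
sample = "user_id,revenue\nusr_a,29.99\nusr_b,0\nusr_c,149.00\nusr_d,\n"
rows = list(csv.DictReader(io.StringIO(sample)))
revenue = profile_numeric([r["revenue"] for r in rows])
user_id = profile_string([r["user_id"] for r in rows])
```

Date/timestamp and boolean columns follow the same pattern: parse, count nulls, then compute the type-specific aggregates listed above.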
Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).
Apply the quality assessment framework below. Flag potential problems:
After profiling individual columns:
Based on the column profile, recommend:
Suggest 3-5 specific analyses the user could run next:
## Data Profile: [table_name]
### Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22
### Column Details
[summary table]
### Data Quality Issues
[flagged issues with severity]
### Recommended Explorations
[numbered list of suggested follow-up analyses]
Rate each column:
Look for:
Red flags that suggest accuracy issues:
For numeric columns, characterize the distribution:
For time series data, look for:
Identify natural segments by:
Between numeric columns:
When documenting a dataset for team use:
## Table: [schema.table_name]
**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]
### Key Columns
| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |
### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)
### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]
### Common Query Patterns
- [Typical use cases for this table]
When connected to a data warehouse, use these patterns to discover schema:
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;
-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
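When the data arrives as a file rather than a live warehouse, the same discovery pattern can be sketched against SQLite via the standard-library `sqlite3` module (`sqlite_master` plays the role of `information_schema.tables`). The table names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for an uploaded .db file
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, event_type TEXT);
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO events (event_type) VALUES ('click'), ('view');
""")

# Discover tables -- SQLite's equivalent of information_schema.tables.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Per-table row counts, the "general pattern" above. Interpolating the
# table name is safe here only because it comes from sqlite_master.
counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in tables}
```
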
When exploring an unfamiliar data environment: