Data Exploration Skill: A Guide to Systematically Profiling Datasets, Assessing Quality, and Discovering Patterns

data-exploration by anthropics/knowledge-work-plugins

226 weekly installs

10,700 GitHub Stars

Install command

npx skills add https://github.com/anthropics/knowledge-work-plugins --skill data-exploration

Tags: quality management, data analysis, data processing

Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:

Table-level questions:

How many rows and columns?
What is the grain (one row per what)?
What is the primary key? Is it unique?
When was the data last updated?
How far back does the data go?
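As a rough illustration of the table-level questions, the sketch below counts rows and columns and tests whether a candidate key is unique. The function name `table_overview` and the sample `orders` rows are hypothetical, and rows are assumed to be plain dictionaries already loaded into memory.

```python
from collections import Counter

def table_overview(rows, key_columns):
    """Return row/column counts and whether the candidate key is unique."""
    columns = sorted({name for row in rows for name in row})
    key_counts = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    duplicate_keys = [key for key, n in key_counts.items() if n > 1]
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        "key_is_unique": not duplicate_keys,
        "duplicate_keys": duplicate_keys,
    }

# Hypothetical sample: the intended grain is one row per order line.
orders = [
    {"order_id": 1, "line_no": 1, "sku": "A"},
    {"order_id": 1, "line_no": 2, "sku": "B"},
    {"order_id": 2, "line_no": 1, "sku": "A"},
    {"order_id": 2, "line_no": 1, "sku": "C"},  # repeats (order_id, line_no)
]
overview = table_overview(orders, ["order_id", "line_no"])
```

A non-empty `duplicate_keys` list usually means either the assumed grain is wrong or the table contains duplicates.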

Column classification: Categorize each column as one of:

Identifier: Unique keys, foreign keys, entity IDs
Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
Metric: Quantitative values for measurement (revenue, count, duration, score)
Temporal: Dates and timestamps (created_at, updated_at, event_date)
Text: Free-form text fields (description, notes, name)
Boolean: True/false flags
Structural: JSON, arrays, nested structures
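A first-pass classification can be automated with simple heuristics, though the result should always be reviewed by hand. The sketch below is one possible approach, not a rule from this guide: the `_id` naming convention and the 0.5 distinct-ratio cutoff between dimension and text are assumptions chosen for illustration.

```python
from datetime import date, datetime

def classify_column(name, values):
    """Guess a column category from its name and non-null sample values."""
    sample = [v for v in values if v is not None]
    if not sample:
        return "unknown"
    # bool must be tested before numbers because bool is a subclass of int.
    if all(isinstance(v, bool) for v in sample):
        return "boolean"
    if all(isinstance(v, (dict, list)) for v in sample):
        return "structural"
    if all(isinstance(v, (date, datetime)) for v in sample):
        return "temporal"
    if name == "id" or name.endswith("_id"):  # assumed naming convention
        return "identifier"
    if all(isinstance(v, (int, float)) for v in sample):
        return "metric"
    # Remaining values are assumed to be strings: low cardinality suggests
    # a dimension, high cardinality suggests free text.
    distinct_ratio = len(set(sample)) / len(sample)
    return "dimension" if distinct_ratio <= 0.5 else "text"
```

For example, `classify_column("status", ["a", "b", "a", "a"])` yields `"dimension"` because only half of the sampled values are distinct.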

Phase 2: Column-Level Profiling

For each column, compute:

All columns:

Null count and null rate
Distinct count and cardinality ratio (distinct / total)
Most common values (top 5-10 with frequencies)
Least common values (bottom 5 to spot anomalies)
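The checks that apply to every column fit in one small function. This is a minimal sketch assuming the column has been pulled into a Python list with `None` for nulls; `profile_column` and the sample `status` values are hypothetical.

```python
from collections import Counter

def profile_column(values, top_n=5):
    """Null rate, cardinality, and most/least common values for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    ranked = Counter(non_null).most_common()  # sorted by frequency, descending
    null_count = total - len(non_null)
    return {
        "null_count": null_count,
        "null_rate": null_count / total if total else 0.0,
        "distinct_count": len(ranked),
        "cardinality_ratio": len(ranked) / total if total else 0.0,
        "most_common": ranked[:top_n],
        "least_common": ranked[::-1][:top_n],
    }

status = ["active", "active", "churned", None, "active", "ACTIVE"]
profile = profile_column(status)
```

Here the least common values surface `"ACTIVE"`, a casing variant that would be easy to miss in the top values alone.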

Numeric columns (metrics):

min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
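The numeric statistics above can be computed with Python's standard `statistics` module. The sketch below assumes at least two non-null values and uses the "inclusive" interpolation method for percentiles, which is one of several reasonable conventions; the `revenue` sample is made up.

```python
import statistics

def profile_numeric(values):
    """Summary statistics and percentiles for a numeric column."""
    data = sorted(v for v in values if v is not None)
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    percentiles = {f"p{p}": cuts[p - 1] for p in (1, 5, 25, 50, 75, 95, 99)}
    return {
        "min": data[0],
        "max": data[-1],
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "zero_count": sum(1 for v in data if v == 0),
        "negative_count": sum(1 for v in data if v < 0),
        **percentiles,
    }

revenue = [0, 0, 12.5, 20.0, 29.99, 35.0, 149.0, -5.0, None]
stats = profile_numeric(revenue)
```

Comparing `p99` with `max`, and `p1` with `min`, is a quick way to see whether a few extreme values dominate the range.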

String columns (dimensions, text):

min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
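A possible implementation of the string checks is sketched below. It assumes the column holds at least one non-null string; the optional `pattern` argument is a regular expression chosen by the analyst, and the two-letter country code pattern in the example is only an illustration.

```python
import re
from collections import Counter

def profile_strings(values, pattern=None):
    """Length, emptiness, whitespace, case, and format checks for strings."""
    strings = [v for v in values if v is not None]
    lengths = [len(s) for s in strings]

    def case_of(s):
        if s.isupper():
            return "upper"
        if s.islower():
            return "lower"
        return "mixed"  # also covers strings without any letters

    profile = {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "empty_count": sum(1 for s in strings if s == ""),
        "padded_count": sum(1 for s in strings if s != s.strip()),
        "case_counts": Counter(case_of(s) for s in strings if s),
    }
    if pattern is not None:
        matches = sum(1 for s in strings if re.fullmatch(pattern, s))
        profile["pattern_match_rate"] = matches / len(strings)
    return profile

codes = ["US", "us", " US", "", "United States", None]
string_profile = profile_strings(codes, pattern=r"[A-Z]{2}")
```

A low `pattern_match_rate` on a column that is supposed to follow a fixed format is a strong hint that cleaning is needed before grouping on it.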

Date/timestamp columns:

min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
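For date columns, the checks above might look like the following sketch. It assumes daily granularity, so a "gap" is any calendar day between the first and last date with no rows; `today` is passed in explicitly so the result is reproducible, and the sample dates are invented.

```python
from collections import Counter
from datetime import date, timedelta

def profile_dates(values, today):
    """Range, nulls, future dates, monthly distribution, and missing days."""
    present = sorted(v for v in values if v is not None)
    observed = set(present)
    first, last = present[0], present[-1]
    all_days = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return {
        "min_date": first,
        "max_date": last,
        "null_count": len(values) - len(present),
        "future_count": sum(1 for d in present if d > today),
        "by_month": Counter(d.strftime("%Y-%m") for d in present),
        "missing_days": [d for d in all_days if d not in observed],
    }

event_dates = [date(2024, 1, 30), date(2024, 1, 31), date(2024, 2, 2),
               None, date(2024, 2, 3)]
date_profile = profile_dates(event_dates, today=date(2024, 2, 2))
```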

Boolean columns:

true count, false count, null count
true rate

Phase 3: Relationship Discovery

After profiling individual columns:

Foreign key candidates: ID columns that might link to other tables
Hierarchies: Columns that form natural drill-down paths (country > state > city)
Correlations: Numeric columns that move together
Derived columns: Columns that appear to be computed from others
Redundant columns: Columns with identical or near-identical information
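One way to test a foreign key candidate is to measure how many of its distinct values appear among the keys of the suspected parent table. The sketch below assumes both columns fit in memory; `foreign_key_coverage` and the sample IDs are hypothetical.

```python
def foreign_key_coverage(child_values, parent_keys):
    """Share of distinct non-null child values that exist among parent keys."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    return len(child & set(parent_keys)) / len(child)

users_id = ["u1", "u2", "u3"]
events_user_id = ["u1", "u1", "u2", None, "u9"]
coverage = foreign_key_coverage(events_user_id, users_id)
```

Coverage near 1.0 supports the foreign key hypothesis; the values that fall outside the parent set (here `"u9"`) are the orphans to investigate.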

Quality Assessment Framework

Completeness Score

Rate each column:

Complete (>99% non-null): Green
Mostly complete (95-99%): Yellow -- investigate the nulls
Incomplete (80-95%): Orange -- understand why and whether it matters
Sparse (<80%): Red -- may not be usable without imputation
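The rating scale translates directly into code. In this sketch the boundary values are an assumption, since the bands above do not say where exactly 95% or 80% belong: 95% is treated as yellow and 80% as orange.

```python
def completeness_rating(values):
    """Return (non-null percentage, color rating) for one column."""
    if not values:
        raise ValueError("cannot rate an empty column")
    non_null_pct = 100.0 * sum(v is not None for v in values) / len(values)
    if non_null_pct > 99:
        rating = "green"
    elif non_null_pct >= 95:   # assumed: exactly 95% counts as yellow
        rating = "yellow"
    elif non_null_pct >= 80:   # assumed: exactly 80% counts as orange
        rating = "orange"
    else:
        rating = "red"
    return non_null_pct, rating
```

For instance, a column with 97 values and 3 nulls rates as `(97.0, "yellow")`.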

Consistency Checks

Look for:

Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
Type inconsistency: Numbers stored as strings, dates in various formats
Referential integrity: Foreign keys that don't match any parent record
Business rule violations: Negative quantities, end dates before start dates, percentages > 100
Cross-column consistency: Status = "completed" but completed_at is null
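Business rule and cross-column checks can be expressed as named predicates applied to every row. The following is a sketch under stated assumptions: the field names (`quantity`, `start_date`, `end_date`, `status`, `completed_at`) are hypothetical, and the date fields are assumed to be non-null.

```python
from datetime import date

RULES = {
    "negative_quantity": lambda r: r["quantity"] is not None and r["quantity"] < 0,
    "end_before_start": lambda r: r["end_date"] < r["start_date"],
    "completed_without_timestamp":
        lambda r: r["status"] == "completed" and r["completed_at"] is None,
}

def find_violations(rows):
    """Return (row_index, rule_name) pairs for every broken rule."""
    return [
        (i, name)
        for i, row in enumerate(rows)
        for name, is_broken in RULES.items()
        if is_broken(row)
    ]

tasks = [
    {"quantity": 3, "start_date": date(2024, 1, 1), "end_date": date(2024, 1, 5),
     "status": "completed", "completed_at": date(2024, 1, 5)},
    {"quantity": -2, "start_date": date(2024, 1, 1), "end_date": date(2024, 1, 2),
     "status": "open", "completed_at": None},
    {"quantity": 1, "start_date": date(2024, 2, 10), "end_date": date(2024, 2, 1),
     "status": "completed", "completed_at": None},
]
violations = find_violations(tasks)
```

Keeping the rules in a dictionary makes it easy to report violation counts per rule and to add new rules as the business logic becomes clearer.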

Accuracy Indicators

Red flags that suggest accuracy issues:

Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
Default values: Suspiciously high frequency of a single value
Stale data: updated_at shows no recent changes in an active system
Impossible values: Ages > 150, dates in the far future, negative durations
Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
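Several of these red flags can be quantified per column. The sketch below counts placeholder values, measures how dominant the single most frequent value is, and computes the share of integers ending in 0 or 5. The placeholder set is copied from the list above; what counts as "suspiciously high" is left to the analyst, and the `ages` sample is invented.

```python
from collections import Counter

PLACEHOLDERS = {0, -1, 999999, "N/A", "TBD", "test", "xxx"}

def accuracy_flags(values):
    """Placeholder count, dominant value share, and round-number share."""
    data = [v for v in values if v is not None]
    top_value, top_count = Counter(data).most_common(1)[0]
    integers = [v for v in data if isinstance(v, int) and not isinstance(v, bool)]
    round_count = sum(1 for v in integers if v % 5 == 0)  # ends in 0 or 5
    return {
        "placeholder_count": sum(1 for v in data if v in PLACEHOLDERS),
        "top_value": top_value,
        "top_value_share": top_count / len(data),
        "round_share": round_count / len(integers) if integers else 0.0,
    }

ages = [30, 35, 40, 45, 999999, 0, 25, 50, 40, None]
flags = accuracy_flags(ages)
```

In this sample nearly every age ends in 0 or 5, which would suggest the values were estimated rather than recorded.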

Timeliness Assessment

When was the table last updated?
What is the expected update frequency?
Is there a lag between event time and load time?
Are there gaps in the time series?
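Freshness and load lag can be measured when each row carries both an event timestamp and a load timestamp. This sketch assumes hypothetical fields named `event_time` and `loaded_at`, and an `expected_interval` supplied by whoever knows the intended update frequency.

```python
import statistics
from datetime import datetime, timedelta

def timeliness(rows, now, expected_interval):
    """Last load time, staleness flag, and event-to-load lag in seconds."""
    last_load = max(r["loaded_at"] for r in rows)
    lags = [(r["loaded_at"] - r["event_time"]).total_seconds() for r in rows]
    return {
        "last_update": last_load,
        "is_stale": now - last_load > expected_interval,
        "median_lag_seconds": statistics.median(lags),
        "max_lag_seconds": max(lags),
    }

loads = [
    {"event_time": datetime(2024, 1, 15, 10, 0), "loaded_at": datetime(2024, 1, 15, 10, 5)},
    {"event_time": datetime(2024, 1, 15, 11, 0), "loaded_at": datetime(2024, 1, 15, 11, 10)},
    {"event_time": datetime(2024, 1, 15, 12, 0), "loaded_at": datetime(2024, 1, 15, 14, 0)},
]
report = timeliness(loads, now=datetime(2024, 1, 16, 15, 0),
                    expected_interval=timedelta(days=1))
```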

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:

Normal: Mean and median are close, bell-shaped
Skewed right: Long tail of high values (common for revenue, session duration)
Skewed left: Long tail of low values (less common)
Bimodal: Two peaks (suggests two distinct populations)
Power law: Few very large values, many small ones (common for user activity)
Uniform: Roughly equal frequency across range (often synthetic or random)
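Skew direction can be estimated numerically before plotting anything. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) and labels the shape. The 0.5 cutoff is a common rule of thumb used here as an assumption, and the function does not detect bimodal, power-law, or uniform shapes, which still need a histogram.

```python
import statistics

def skewness(values):
    """Moment-based skewness; assumes the values are not all identical."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_shape(values, threshold=0.5):
    """Label a distribution by the sign and size of its skewness."""
    g1 = skewness(values)
    if g1 > threshold:
        return "skewed right"
    if g1 < -threshold:
        return "skewed left"
    return "roughly symmetric"

session_seconds = [10, 12, 11, 13, 12, 400]  # one very long session
shape = describe_shape(session_seconds)
```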

Temporal Patterns

For time series data, look for:

Trend: Sustained upward or downward movement
Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
Day-of-week effects: Weekday vs. weekend differences
Holiday effects: Drops or spikes around known holidays
Change points: Sudden shifts in level or trend
Anomalies: Individual data points that break the pattern
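Day-of-week effects are among the easiest temporal patterns to check: average the metric per weekday and compare. The sketch below assumes a daily series of `(date, value)` pairs; the two-week sample with lower weekend values is synthetic.

```python
import statistics
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_of_week_means(series):
    """Mean value per weekday for a list of (date, value) pairs."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[DAY_NAMES[day.weekday()]].append(value)
    return {name: statistics.fmean(buckets[name])
            for name in DAY_NAMES if name in buckets}

# Synthetic two-week series starting on Monday 2024-01-01.
start = date(2024, 1, 1)
days = [start + timedelta(days=i) for i in range(14)]
daily_orders = [(d, 40 if d.weekday() >= 5 else 100) for d in days]
weekday_means = day_of_week_means(daily_orders)
```

The same bucketing idea extends to month-of-year or hour-of-day for other seasonal cycles.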

Segmentation Discovery

Identify natural segments by:

Finding categorical columns with 3-20 distinct values
Comparing metric distributions across segment values
Looking for segments with significantly different behavior
Testing whether segments are homogeneous or contain sub-segments
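The first two steps can be sketched as follows: pick string columns whose distinct count falls in the 3-20 range, then compare a metric across the values of one candidate. The function names and the sample rows are hypothetical, and the median is used as the comparison statistic only because it is robust to outliers.

```python
import statistics
from collections import defaultdict

def segment_candidates(rows, min_distinct=3, max_distinct=20):
    """String columns whose distinct count suits segmentation."""
    candidates = []
    for column in rows[0]:
        distinct = {row[column] for row in rows if row[column] is not None}
        is_categorical = all(isinstance(v, str) for v in distinct)
        if is_categorical and min_distinct <= len(distinct) <= max_distinct:
            candidates.append(column)
    return candidates

def metric_by_segment(rows, segment_column, metric_column):
    """Row count and median of a metric for each segment value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_column]].append(row[metric_column])
    return {segment: {"n": len(vals), "median": statistics.median(vals)}
            for segment, vals in groups.items()}

accounts = [
    {"plan": "free", "country": "US", "revenue": 0.0},
    {"plan": "free", "country": "US", "revenue": 5.0},
    {"plan": "pro", "country": "DE", "revenue": 30.0},
    {"plan": "pro", "country": "US", "revenue": 50.0},
    {"plan": "enterprise", "country": "DE", "revenue": 400.0},
    {"plan": "enterprise", "country": "US", "revenue": 600.0},
]
candidates = segment_candidates(accounts)
by_plan = metric_by_segment(accounts, "plan", "revenue")
```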

Correlation Exploration

Between numeric columns:

Compute correlation matrix for all metric pairs
Flag strong correlations (|r| > 0.7) for investigation
Note: Correlation does not imply causation -- flag this explicitly
Check for non-linear relationships (e.g., quadratic, logarithmic)
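A pairwise Pearson correlation scan with the 0.7 threshold might look like the sketch below. It assumes equal-length, non-constant columns with no nulls. Pearson's r only captures linear association, so the non-linear check above still needs a scatter plot or a rank-based measure; the sample columns are invented.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strong_pairs(columns, threshold=0.7):
    """Column pairs whose |r| exceeds the threshold, flagged for review."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

metrics = {
    "sessions": [1, 2, 3, 4, 5],
    "pageviews": [2, 4, 6, 8, 10],
    "support_tickets": [3, 1, 4, 1, 5],
}
flagged = strong_pairs(metrics)
```

A flagged pair is a prompt for investigation only; a perfect correlation like the one between `sessions` and `pageviews` here more often indicates a derived or redundant column than a causal link.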

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

```markdown
## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]
```

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
-- Filter on the schema as well, so same-named tables in other schemas are excluded
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```
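The per-table row count pattern is easy to script. The sketch below uses an in-memory SQLite database as a stand-in so that it runs anywhere; against PostgreSQL the table names would instead come from `information_schema.tables`, with the same loop over `SELECT COUNT(*)`. The `users` and `events` tables are hypothetical.

```python
import sqlite3

def row_counts(conn):
    """Run SELECT COUNT(*) once per table and collect the results."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names cannot break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO events (user_id) VALUES (?)", [(1,), (1,), (2,)])
counts = row_counts(conn)
```

On very large tables an exact `COUNT(*)` can be slow, so an approximate count from the warehouse's own statistics may be preferable for a first look.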

Lineage and Dependencies

When exploring an unfamiliar data environment:

Start with the "output" tables (what reports or dashboards consume)
Trace upstream: What tables feed into them?
Identify raw/staging/mart layers
Map the transformation chain from raw data to analytical tables
Note where data is enriched, filtered, or aggregated
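If table dependencies are available as a mapping, tracing upstream from an output table is a breadth-first walk. The sketch below assumes a hypothetical `dependencies` dictionary in which each table lists the tables it reads from; in practice this mapping might come from a transformation tool's metadata or from parsing view definitions.

```python
from collections import deque

def upstream_tables(dependencies, output_table):
    """List every table upstream of output_table, nearest layers first."""
    seen, order = set(), []
    queue = deque(dependencies.get(output_table, []))
    while queue:
        table = queue.popleft()
        if table in seen:
            continue  # already visited via another path
        seen.add(table)
        order.append(table)
        queue.extend(dependencies.get(table, []))
    return order

dependencies = {
    "mart.revenue_dashboard": ["staging.orders", "staging.users"],
    "staging.orders": ["raw.orders"],
    "staging.users": ["raw.users"],
}
lineage = upstream_tables(dependencies, "mart.revenue_dashboard")
```

Because the walk is breadth-first, the result lists staging tables before raw tables, which mirrors the mart, staging, raw layering described above.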

Weekly Installs: 198
Repository: anthropics/know…-plugins
GitHub Stars: 8.9K
First Seen: Jan 31, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (174), codex (163), gemini-cli (160), github-copilot (155), claude-code (154), amp (143)
