profiling-tables by astronomer/agents
npx skills add https://github.com/astronomer/agents --skill profiling-tables生成一份全面的表格概况,供新团队成员理解数据使用。
查询列元数据:
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION
如果表名未完全限定,请先在 INFORMATION_SCHEMA.TABLES 中搜索以定位它。
通过 run_sql 运行:
SELECT
COUNT(*) as total_rows,
COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>
对于每一列,根据数据类型收集适当的统计信息:
SELECT
MIN(column_name) as min_val,
MAX(column_name) as max_val,
AVG(column_name) as avg_val,
STDDEV(column_name) as std_dev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
SELECT
MIN(LEN(column_name)) as min_length,
MAX(LEN(column_name)) as max_length,
AVG(LEN(column_name)) as avg_length,
SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
SELECT
MIN(column_name) as earliest,
MAX(column_name) as latest,
DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>
对于看起来像分类/维度键的列:
SELECT
column_name,
COUNT(*) as frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20
这揭示了:
获取代表性行:
SELECT *
FROM <table>
LIMIT 10
如果表很大并且希望获得多样性,可以从不同的时间段或类别中进行抽样。
总结各个维度的质量:
提供结构化的概况:
用 2-3 句话描述此表包含的内容、使用者以及其新鲜度。
| 列名 | 类型 | 空值% | 唯一值数 | 描述 |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
列出发现的任何数据质量问题。
针对此数据的常见问题,提供 3-5 个有用的查询。
每周安装量
375
仓库
GitHub 星标数
269
首次出现时间
2026 年 1 月 23 日
安全审计
安装于
opencode275
codex269
github-copilot269
cursor267
gemini-cli256
amp236
Generate a comprehensive profile of a table that a new team member could use to understand the data.
Query column metadata:
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION
If the table name isn't fully qualified, search INFORMATION_SCHEMA.TABLES to locate it first.
Run via run_sql:
SELECT
COUNT(*) as total_rows,
COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>
For each column, gather appropriate statistics based on data type:
SELECT
MIN(column_name) as min_val,
MAX(column_name) as max_val,
AVG(column_name) as avg_val,
STDDEV(column_name) as std_dev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
SELECT
MIN(LEN(column_name)) as min_length,
MAX(LEN(column_name)) as max_length,
AVG(LEN(column_name)) as avg_length,
SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
SELECT
MIN(column_name) as earliest,
MAX(column_name) as latest,
DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>
For columns that look like categorical/dimension keys:
SELECT
column_name,
COUNT(*) as frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20
This reveals:
Get representative rows:
SELECT *
FROM <table>
LIMIT 10
If the table is large and you want variety, sample from different time periods or categories.
Summarize quality across dimensions:
Provide a structured profile:
2-3 sentences describing what this table contains, who uses it, and how fresh it is.
| Column | Type | Nulls% | Distinct | Description |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
List any data quality concerns discovered.
3-5 useful queries for common questions about this data.
Weekly Installs
375
Repository
GitHub Stars
269
First Seen
Jan 23, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
opencode275
codex269
github-copilot269
cursor267
gemini-cli256
amp236
Python PDF处理教程:合并拆分、提取文本表格、创建PDF文件
54,200 周安装
营销心理学实战指南:心智模型与行为设计,提升转化率与用户体验
349 周安装
Spring Boot @JsonTest JSON序列化单元测试指南:JacksonTester使用与最佳实践
349 周安装
scikit-learn 机器学习教程:Python 分类、回归、聚类、降维与模型评估实战指南
350 周安装
技术文档撰写专家 | AI辅助生成用户指南、API文档、架构文档、教程
350 周安装
Gemini AI 图像生成技能 - 使用 Google AI 生成和编辑网站图片,支持 4K 分辨率和文本渲染
350 周安装
Next.js 16 AI 助手技能包 - 提升开发效率的智能工具集合
350 周安装