warehouse-init by astronomer/agents
npx skills add https://github.com/astronomer/agents --skill warehouse-init
Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts: ../analyzing-data/scripts/ — All CLI commands below are relative to the analyzing-data skill's directory. Before running any scripts/cli.py command, cd to ../analyzing-data/ relative to this file.
Output: .astro/warehouse.md - a version-controllable, team-shareable reference.
Read the warehouse configuration:
cat ~/.astro/agents/warehouse.yml
Get the list of databases to discover (e.g., databases: [HQ, ANALYTICS, RAW]).
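The configured database list looks like `databases: [HQ, ANALYTICS, RAW]`. A minimal sketch of pulling that list out of the config text (hypothetical helper; a real implementation would use a proper YAML parser):

```python
import re

def parse_databases(config_text):
    """Extract the database list from a line like 'databases: [HQ, ANALYTICS, RAW]'."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", config_text)
    if not match:
        return []
    # Split the bracketed list on commas and strip whitespace.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

# Hypothetical warehouse.yml fragment:
config = "warehouse:\n  databases: [HQ, ANALYTICS, RAW]\n"
print(parse_databases(config))  # → ['HQ', 'ANALYTICS', 'RAW']
```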
Launch a subagent to find business context in code:
Task(
subagent_type="Explore",
prompt="""
Search for data model documentation in the codebase:
1. dbt models: **/models/**/*.yml, **/schema.yml
- Extract table descriptions, column descriptions
- Note primary keys and tests
2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
- Parse frontmatter for: description, primary_key, tests
- Note schema mappings
3. AGENTS.md or CLAUDE.md files with data layer documentation
Return a mapping of:
table_name -> {description, primary_key, important_columns, layer}
"""
)
Launch one subagent per database using the Task tool:
For each database in configured_databases:
Task(
subagent_type="general-purpose",
prompt="""
Discover all metadata for database {DATABASE}.
Use the CLI to run SQL queries:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py exec "df = run_sql('...')"
uv run scripts/cli.py exec "print(df)"
1. Query schemas:
SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA
2. Query tables with row counts:
SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME
3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'X'
Return a structured summary:
- Database name
- List of schemas with table counts
- For each table: name, row_count, key columns
- Flag any tables with >100M rows as "large"
"""
)
Run all subagents in parallel (single message with multiple Task calls).
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
uv run scripts/cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run scripts/cli.py exec "print(df)"
Group related values into families by common prefix/suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).
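The grouping step can be sketched in Python (assumes CamelCase values like ExportCSV; the function name and the case-boundary heuristic are illustrative, and real data may need underscore or dash delimiters instead):

```python
import re
from collections import defaultdict

def group_by_prefix(values, min_family=2):
    """Group values that share a leading CamelCase word into '<Prefix>*' families."""
    families = defaultdict(list)
    for value in values:
        # Take the first capitalized word (e.g. 'Export' from 'ExportCSV').
        m = re.match(r"[A-Z][a-z]+", value)
        prefix = m.group(0) if m else value
        families[prefix].append(value)
    # Only keep groups with enough members to count as a family.
    return {f"{p}*": vs for p, vs in families.items() if len(vs) >= min_family}

values = ["ExportCSV", "ExportJSON", "ExportParquet", "ImportCSV"]
print(group_by_prefix(values))  # → {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```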
Combine warehouse metadata + codebase context:
Write the file to:
.astro/warehouse.md (default - project-specific, version-controllable)
~/.astro/agents/warehouse.md (if --global flag)
# Warehouse Schema
> Generated by `/data:warehouse-init` on {DATE}. Edit freely to add business context.
## Quick Reference
| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
<!-- Add your concept mappings here -->
## Categorical Columns
When filtering on these columns, explore value families first (values often have variants):
| Table | Column | Value Families |
|-------|--------|----------------|
| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |
<!-- Populated by /data:warehouse-init from actual warehouse data -->
## Data Layer Hierarchy
Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`
| Layer | Prefix | Purpose |
|-------|--------|---------|
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |
## {DATABASE} Database
### {SCHEMA} Schema
#### {TABLE_NAME}
{DESCRIPTION from code if found}
| Column | Type | Description |
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |
- **Rows:** {ROW_COUNT}
- **Key column:** {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - **⚠️ WARNING:** Large table - always add date filters}
## Relationships
{Inferred relationships based on column names like *_ID}
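The *_ID inference can be sketched as a naming heuristic like the following (hypothetical helper; it only guesses, so inferred links should be confirmed against real keys in the warehouse):

```python
def infer_relationships(columns_by_table):
    """Guess foreign-key links: a column FOO_ID in one table likely points
    at a table whose name starts with FOO."""
    links = []
    tables = {t.upper(): t for t in columns_by_table}
    for table, cols in columns_by_table.items():
        for col in cols:
            if col.upper().endswith("_ID"):
                stem = col.upper()[:-3]  # e.g. 'ORG' from 'ORG_ID'
                for other_upper, other in tables.items():
                    if other != table and other_upper.startswith(stem):
                        links.append((table, col, other))
    return links

schema = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORG_ID"],
}
print(infer_relationships(schema))  # → [('DEPLOYMENTS', 'ORG_ID', 'ORGANIZATIONS')]
```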
| Option | Effect |
|---|---|
| /data:warehouse-init | Generate .astro/warehouse.md |
| /data:warehouse-init --refresh | Regenerate, preserving user edits |
| /data:warehouse-init --database HQ | Only discover the specified database |
| /data:warehouse-init --global | Write to ~/.astro/agents/ instead |
After generating warehouse.md, populate the concept cache:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py concept import -p .astro/warehouse.md
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
- Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
- No, skip - Use warehouse.md and cache only
If user chooses Yes:
Check whether .claude/CLAUDE.md or CLAUDE.md exists; if neither does, create .claude/CLAUDE.md with just the Quick Reference.
Quick Reference section to add:
## Data Warehouse Quick Reference
When querying the warehouse, use these table mappings:
| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}
**Large tables (always filter by date):** {list tables with >100M rows}
> Auto-generated by `/data:warehouse-init`. Run `/data:warehouse-init --refresh` to update.
If yes: Append the Quick Reference section to .claude/CLAUDE.md or CLAUDE.md.
Tell the user:
Generated .astro/warehouse.md
Summary:
- {N} databases, {N} schemas, {N} tables
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup
Next steps:
1. Edit .astro/warehouse.md to add business context
2. Commit to version control
3. Run /data:warehouse-init --refresh when schema changes
When --refresh is specified:
Mark tables that no longer exist with a <!-- REMOVED --> comment rather than deleting their entries.
The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
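The expiry rule amounts to a timestamp comparison; a sketch of that rule (illustrative only - the actual cache lives inside the analyzing-data CLI):

```python
import time

TTL_SECONDS = 7 * 24 * 3600  # 7-day default TTL

def is_expired(cached_at, now=None, ttl=TTL_SECONDS):
    """Return True when a cache entry written at `cached_at` (epoch seconds)
    has outlived the TTL."""
    now = time.time() if now is None else now
    return now - cached_at > ttl

# An entry cached 8 days ago is stale; one cached a day ago is not.
print(is_expired(cached_at=0, now=8 * 24 * 3600))  # → True
print(is_expired(cached_at=0, now=24 * 3600))      # → False
```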
Run /data:warehouse-init --refresh when:
Watch for these indicators:
If you suspect cache issues:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
| Pattern | Source | What We Extract |
|---|---|---|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| AGENTS.md, CLAUDE.md | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |
User: /data:warehouse-init
Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT
→ Searching codebase for data documentation...
Found: AGENTS.md with data layer hierarchy
Found: 45 SQL files with YAML frontmatter in dags/declarative/
→ Launching parallel warehouse discovery...
[Database: HQ] Discovering schemas...
[Database: PRODUCT] Discovering schemas...
→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables
→ Merging warehouse metadata with code context...
Enriched 45 tables with descriptions from code
→ Generated .astro/warehouse.md
Summary:
- 2 databases
- 30 schemas
- 401 tables
- 45 tables enriched with code descriptions
- 8 large tables flagged (>100M rows)
Next steps:
1. Review .astro/warehouse.md
2. Add concept mappings to Quick Reference
3. Commit to version control
4. Run /data:warehouse-init --refresh when schema changes
First seen: Feb 13, 2026