warehouse-init by astronomer/agents
npx skills add https://github.com/astronomer/agents --skill warehouse-init
Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts: ../analyzing-data/scripts/ — All CLI commands below are relative to the analyzing-data skill's directory. Before running any scripts/cli.py command, cd to ../analyzing-data/ relative to this file.
Output: .astro/warehouse.md - a version-controllable, team-shareable reference.
Read the warehouse configuration:
cat ~/.astro/agents/warehouse.yml
Get the list of databases to discover (e.g., databases: [HQ, ANALYTICS, RAW]).
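The configured database list looks like `databases: [HQ, ANALYTICS, RAW]`. A minimal sketch of pulling that list out of the config text (hypothetical helper; a real implementation would use a proper YAML parser):

```python
import re

def parse_databases(config_text):
    """Extract the database list from a line like 'databases: [HQ, ANALYTICS, RAW]'."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", config_text)
    if not match:
        return []
    # Split the bracketed list on commas and strip whitespace.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

# Hypothetical warehouse.yml fragment:
config = "warehouse:\n  databases: [HQ, ANALYTICS, RAW]\n"
print(parse_databases(config))  # → ['HQ', 'ANALYTICS', 'RAW']
```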
Launch a subagent to find business context in code:
Task(
subagent_type="Explore",
prompt="""
Search for data model documentation in the codebase:
1. dbt models: **/models/**/*.yml, **/schema.yml
- Extract table descriptions, column descriptions
- Note primary keys and tests
2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
- Parse frontmatter for: description, primary_key, tests
- Note schema mappings
3. AGENTS.md or CLAUDE.md files with data layer documentation
Return a mapping of:
table_name -> {description, primary_key, important_columns, layer}
"""
)
Launch one subagent per database using the Task tool:
For each database in configured_databases:
Task(
subagent_type="general-purpose",
prompt="""
Discover all metadata for database {DATABASE}.
Use the CLI to run SQL queries:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py exec "df = run_sql('...')"
uv run scripts/cli.py exec "print(df)"
1. Query schemas:
SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA
2. Query tables with row counts:
SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME
3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'X'
Return a structured summary:
- Database name
- List of schemas with table counts
- For each table: name, row_count, key columns
- Flag any tables with >100M rows as "large"
"""
)
Run all subagents in parallel (single message with multiple Task calls).
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
uv run scripts/cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run scripts/cli.py exec "print(df)"
Group related values into families by common prefix/suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).
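The grouping step can be sketched in Python (assumes CamelCase values like ExportCSV; the function name and the case-boundary heuristic are illustrative, and real data may need underscore or dash delimiters instead):

```python
import re
from collections import defaultdict

def group_by_prefix(values, min_family=2):
    """Group values that share a leading CamelCase word into '<Prefix>*' families."""
    families = defaultdict(list)
    for value in values:
        # Take the first capitalized word (e.g. 'Export' from 'ExportCSV').
        m = re.match(r"[A-Z][a-z]+", value)
        prefix = m.group(0) if m else value
        families[prefix].append(value)
    # Only keep groups with enough members to count as a family.
    return {f"{p}*": vs for p, vs in families.items() if len(vs) >= min_family}

values = ["ExportCSV", "ExportJSON", "ExportParquet", "ImportCSV"]
print(group_by_prefix(values))  # → {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```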
Combine warehouse metadata + codebase context:
Write the file to:
.astro/warehouse.md (default - project-specific, version-controllable)
~/.astro/agents/warehouse.md (if --global flag)
# Warehouse Schema
> Generated by `/data:warehouse-init` on {DATE}. Edit freely to add business context.
## Quick Reference
| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
<!-- Add your concept mappings here -->
## Categorical Columns
When filtering on these columns, explore value families first (values often have variants):
| Table | Column | Value Families |
|-------|--------|----------------|
| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |
<!-- Populated by /data:warehouse-init from actual warehouse data -->
## Data Layer Hierarchy
Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`
| Layer | Prefix | Purpose |
|-------|--------|---------|
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |
## {DATABASE} Database
### {SCHEMA} Schema
#### {TABLE_NAME}
{DESCRIPTION from code if found}
| Column | Type | Description |
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |
- **Rows:** {ROW_COUNT}
- **Key column:** {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - **⚠️ WARNING:** Large table - always add date filters}
## Relationships
{Inferred relationships based on column names like *_ID}
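The *_ID inference can be sketched as a naming heuristic like the following (hypothetical helper; it only guesses, so inferred links should be confirmed against real keys in the warehouse):

```python
def infer_relationships(columns_by_table):
    """Guess foreign-key links: a column FOO_ID in one table likely points
    at a table whose name starts with FOO."""
    links = []
    tables = {t.upper(): t for t in columns_by_table}
    for table, cols in columns_by_table.items():
        for col in cols:
            if col.upper().endswith("_ID"):
                stem = col.upper()[:-3]  # e.g. 'ORG' from 'ORG_ID'
                for other_upper, other in tables.items():
                    if other != table and other_upper.startswith(stem):
                        links.append((table, col, other))
    return links

schema = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORG_ID"],
}
print(infer_relationships(schema))  # → [('DEPLOYMENTS', 'ORG_ID', 'ORGANIZATIONS')]
```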
| Option | Effect |
|---|---|
| /data:warehouse-init | Generate .astro/warehouse.md |
| /data:warehouse-init --refresh | Regenerate, preserving user edits |
| /data:warehouse-init --database HQ | Only discover the specified database |
| /data:warehouse-init --global | Write to ~/.astro/agents/ instead |
After generating warehouse.md, populate the concept cache:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py concept import -p .astro/warehouse.md
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
- Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
- No, skip - Use warehouse.md and cache only
If user chooses Yes:
Check whether .claude/CLAUDE.md or CLAUDE.md exists; if neither does, create .claude/CLAUDE.md with just the Quick Reference.
Quick Reference section to add:
## Data Warehouse Quick Reference
When querying the warehouse, use these table mappings:
| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}
**Large tables (always filter by date):** {list tables with >100M rows}
> Auto-generated by `/data:warehouse-init`. Run `/data:warehouse-init --refresh` to update.
If yes: Append the Quick Reference section to .claude/CLAUDE.md or CLAUDE.md.
Tell the user:
Generated .astro/warehouse.md
Summary:
- {N} databases, {N} schemas, {N} tables
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup
Next steps:
1. Edit .astro/warehouse.md to add business context
2. Commit to version control
3. Run /data:warehouse-init --refresh when schema changes
When --refresh is specified:
Mark tables that no longer exist with a <!-- REMOVED --> comment rather than deleting their entries.
The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
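The expiry rule amounts to a timestamp comparison; a sketch of that rule (illustrative only - the actual cache lives inside the analyzing-data CLI):

```python
import time

TTL_SECONDS = 7 * 24 * 3600  # 7-day default TTL

def is_expired(cached_at, now=None, ttl=TTL_SECONDS):
    """Return True when a cache entry written at `cached_at` (epoch seconds)
    has outlived the TTL."""
    now = time.time() if now is None else now
    return now - cached_at > ttl

# An entry cached 8 days ago is stale; one cached a day ago is not.
print(is_expired(cached_at=0, now=8 * 24 * 3600))  # → True
print(is_expired(cached_at=0, now=24 * 3600))      # → False
```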
Run /data:warehouse-init --refresh when:
Watch for these indicators:
If you suspect cache issues:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
| Pattern | Source | What We Extract |
|---|---|---|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| AGENTS.md, CLAUDE.md | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |
User: /data:warehouse-init
Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT
→ Searching codebase for data documentation...
Found: AGENTS.md with data layer hierarchy
Found: 45 SQL files with YAML frontmatter in dags/declarative/
→ Launching parallel warehouse discovery...
[Database: HQ] Discovering schemas...
[Database: PRODUCT] Discovering schemas...
→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables
→ Merging warehouse metadata with code context...
Enriched 45 tables with descriptions from code
→ Generated .astro/warehouse.md
Summary:
- 2 databases
- 30 schemas
- 401 tables
- 45 tables enriched with code descriptions
- 8 large tables flagged (>100M rows)
Next steps:
1. Review .astro/warehouse.md
2. Add concept mappings to Quick Reference
3. Commit to version control
4. Run /data:warehouse-init --refresh when schema changes
First seen: Feb 13, 2026