Data Context Extractor: Extract Knowledge from Your Data Warehouse and Generate Tailored Analysis Skills

data-context-extractor by anthropics/knowledge-work-plugins

395 weekly installs

10,300 GitHub Stars

Install command

npx skills add https://github.com/anthropics/knowledge-work-plugins --skill data-context-extractor

Data Analysis, Data Processing, Knowledge Management

Data Context Extractor

A meta-skill that extracts company-specific data knowledge from analysts and generates tailored data analysis skills.

How It Works

This skill has two modes:

Bootstrap Mode: Create a new data analysis skill from scratch
Iteration Mode: Improve an existing skill by adding domain-specific reference files

Bootstrap Mode

Use when: User wants to create a new data context skill for their warehouse.

Phase 1: Database Connection & Discovery

Step 1: Identify the database type

Ask: "What data warehouse are you using?"

Common options:

BigQuery
Snowflake
PostgreSQL/Redshift
Databricks

Use ~~data warehouse tools (query and schema) to connect. If unclear, check available MCP tools in the current session.

Step 2: Explore the schema

Use ~~data warehouse schema tools to:

List available datasets/schemas
Identify the most important tables (ask user: "Which 3-5 tables do analysts query most often?")
Pull schema details for those key tables

Sample exploration queries by dialect:

-- BigQuery: List datasets
SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA

-- BigQuery: List tables in a dataset
SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES`

-- Snowflake: List schemas
SHOW SCHEMAS IN DATABASE my_database

-- Snowflake: List tables
SHOW TABLES IN SCHEMA my_schema
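
The samples above cover only BigQuery and Snowflake. As a rough sketch, the same lookup could be kept in one small table keyed by dialect. The PostgreSQL/Redshift and Databricks entries are additions for illustration and are not part of the original skill; `DISCOVERY_QUERIES`, `discovery_queries`, and the `my_schema` placeholder are hypothetical names, and Redshift is assumed to accept the PostgreSQL-style `information_schema` queries.

```python
# Hypothetical lookup of schema-discovery queries, keyed by dialect.
# BigQuery and Snowflake entries mirror the samples above; the
# PostgreSQL/Redshift and Databricks entries are assumed additions.
DISCOVERY_QUERIES = {
    "bigquery": {
        "list_schemas": "SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA",
        "list_tables": "SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES`",
    },
    "snowflake": {
        "list_schemas": "SHOW SCHEMAS IN DATABASE my_database",
        "list_tables": "SHOW TABLES IN SCHEMA my_schema",
    },
    "postgresql": {
        "list_schemas": "SELECT schema_name FROM information_schema.schemata",
        "list_tables": (
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'my_schema'"
        ),
    },
    "databricks": {
        "list_schemas": "SHOW SCHEMAS",
        "list_tables": "SHOW TABLES IN my_schema",
    },
}


def discovery_queries(dialect: str) -> dict[str, str]:
    """Return the discovery queries for a warehouse dialect (case-insensitive)."""
    key = dialect.strip().lower()
    if key == "redshift":
        # Assumption: Redshift is treated as PostgreSQL-compatible here.
        key = "postgresql"
    if key not in DISCOVERY_QUERIES:
        raise ValueError(f"No discovery queries recorded for dialect: {dialect!r}")
    return DISCOVERY_QUERIES[key]


print(discovery_queries("Snowflake")["list_tables"])
```

Placeholders such as `my_database` and `project.dataset` still need to be replaced with the user's real names before any query is run.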

Phase 2: Core Questions (Ask These)

After schema discovery, ask these questions conversationally (not all at once):

Entity Disambiguation (Critical)

"When people here say 'user' or 'customer', what exactly do they mean? Are there different types?"

Listen for:

Multiple entity types (user vs account vs organization)
Relationships between them (1:1, 1:many, many:many)
Which ID fields link them together

Primary Identifiers

"What's the main identifier for a [customer/user/account]? Are there multiple IDs for the same entity?"

Listen for:

Primary keys vs business keys
UUID vs integer IDs
Legacy ID systems

Key Metrics

"What are the 2-3 metrics people ask about most? How is each one calculated?"

Listen for:

Exact formulas (ARR = monthly_revenue × 12)
Which tables/columns feed each metric
Time period conventions (trailing 7 days, calendar month, etc.)
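
To illustrate why exact formulas and time-period conventions are worth writing down, here is a minimal sketch. It uses the ARR formula quoted above and assumes one possible reading of "trailing 7 days" (a window that includes the as-of date). The function names are hypothetical, and the convention a given company actually uses has to come from the analyst.

```python
from datetime import date, timedelta
from decimal import Decimal


def annual_recurring_revenue(monthly_revenue: Decimal) -> Decimal:
    """ARR = monthly_revenue × 12, the example formula from the list above."""
    return monthly_revenue * 12


def trailing_days(as_of: date, days: int = 7) -> tuple[date, date]:
    """Return (start, end) for a trailing window.

    Assumed convention: the window includes the as-of date, so a
    7-day window starts 6 days earlier. Other teams may exclude it.
    """
    return as_of - timedelta(days=days - 1), as_of


print(annual_recurring_revenue(Decimal("2500.00")))
print(trailing_days(date(2026, 1, 31)))
```

Recording the convention next to the formula means a generated skill does not have to guess whether the window is inclusive.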

Data Hygiene

"What should ALWAYS be filtered out of queries? (test data, fraud, internal users, etc.)"

Listen for:

Standard WHERE clauses to always include
Flag columns that indicate exclusions (is_test, is_internal, is_fraud)
Specific values to exclude (status = 'deleted')
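
A small sketch of what a "standard WHERE clause" looks like in practice, run against an in-memory SQLite table so that it is self-contained. The `users` table, its rows, and `STANDARD_FILTERS` are invented for illustration; the flag columns follow the examples named above.

```python
import sqlite3

# Hypothetical standard exclusions, built from the flag columns named above.
STANDARD_FILTERS = "is_test = 0 AND is_internal = 0 AND status <> 'deleted'"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, is_test INTEGER, is_internal INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, 0, 0, "active"),   # real user
        (2, 1, 0, "active"),   # test account
        (3, 0, 1, "active"),   # internal employee
        (4, 0, 0, "deleted"),  # soft-deleted row
    ],
)


def count_users(apply_filters: bool) -> int:
    """Count rows with or without the standard exclusions."""
    sql = "SELECT COUNT(*) FROM users"
    if apply_filters:
        sql += " WHERE " + STANDARD_FILTERS
    return conn.execute(sql).fetchone()[0]


print(count_users(apply_filters=False), count_users(apply_filters=True))
```

Keeping the clause in one named place is the point: every sample query in the generated skill can reuse it instead of restating the exclusions.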

Common Gotchas

"What mistakes do new analysts typically make with this data?"

Listen for:

Confusing column names
Timezone issues
NULL handling quirks
Historical vs current state tables
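
One NULL-handling quirk that often belongs in a gotchas list: under SQL's three-valued logic, `status <> 'deleted'` silently drops rows where `status` is NULL. The sketch below shows the effect with an in-memory SQLite table; the `accounts` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "active"), (2, "deleted"), (3, None)],  # row 3 has a NULL status
)

# Naive filter: NULL <> 'deleted' evaluates to NULL, so row 3 is excluded.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE status <> 'deleted'"
).fetchone()[0]

# NULL-aware filter: keep rows whose status is unknown as well.
null_safe_count = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE status IS NULL OR status <> 'deleted'"
).fetchone()[0]

print(naive_count, null_safe_count)
```

Whether NULL rows should be kept is a business decision. The skill only needs to record which behavior analysts expect.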

Phase 3: Generate the Skill

Create a skill with this structure:

[company]-data-analyst/
├── SKILL.md
└── references/
    ├── entities.md          # Entity definitions and relationships
    ├── metrics.md           # KPI calculations
    ├── tables/              # One file per domain
    │   ├── [domain1].md
    │   └── [domain2].md
    └── dashboards.json      # Optional: existing dashboards catalog
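
The layout above could be scaffolded with a few lines of standard-library Python. This is only a sketch: `scaffold_skill`, the company name `acme`, and the domain names are hypothetical, the files are created empty, and the optional `dashboards.json` is left out.

```python
import tempfile
from pathlib import Path


def scaffold_skill(root: Path, company: str, domains: list[str]) -> Path:
    """Create the empty directory layout shown above and return the skill directory."""
    skill_dir = root / f"{company}-data-analyst"
    tables_dir = skill_dir / "references" / "tables"
    tables_dir.mkdir(parents=True, exist_ok=True)

    # Top-level files; contents would come from the templates referenced below.
    (skill_dir / "SKILL.md").touch()
    (skill_dir / "references" / "entities.md").touch()
    (skill_dir / "references" / "metrics.md").touch()

    # One reference file per data domain.
    for domain in domains:
        (tables_dir / f"{domain}.md").touch()
    return skill_dir


with tempfile.TemporaryDirectory() as tmp:
    skill = scaffold_skill(Path(tmp), "acme", ["sales", "marketing"])
    for path in sorted(skill.rglob("*.md")):
        print(path.relative_to(skill).as_posix())
```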

SKILL.md Template: See references/skill-template.md

SQL Dialect Section: See references/sql-dialects.md and include the appropriate dialect notes.

Reference File Template: See references/domain-template.md

Phase 4: Package and Deliver

Create all files in the skill directory
Package as a zip file
Present to user with summary of what was captured
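
The packaging step can be sketched with `shutil.make_archive` from the Python standard library. The helper name `package_skill` and the `acme-data-analyst` directory are hypothetical; the original skill does not say which tool it uses to build the zip.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_skill(skill_dir: Path, out_dir: Path) -> Path:
    """Zip a skill directory so the archive contains the top-level folder itself."""
    archive = shutil.make_archive(
        base_name=str(out_dir / skill_dir.name),  # ".zip" is appended automatically
        format="zip",
        root_dir=skill_dir.parent,
        base_dir=skill_dir.name,
    )
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    skill_dir = work / "src" / "acme-data-analyst"
    (skill_dir / "references").mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("placeholder\n")
    (work / "out").mkdir()

    archive_path = package_skill(skill_dir, work / "out")
    with zipfile.ZipFile(archive_path) as zf:
        print(archive_path.name, zf.namelist())
```

Writing the archive to a directory outside the skill folder avoids accidentally zipping the archive into itself.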

Iteration Mode

Use when: User has an existing skill but needs to add more context.

Step 1: Load Existing Skill

Ask user to upload their existing skill (zip or folder), or locate it if already in the session.

Read the current SKILL.md and reference files to understand what's already documented.

Step 2: Identify the Gap

Ask: "What domain or topic needs more context? What queries are failing or producing wrong results?"

Common gaps:

A new data domain (marketing, finance, product, etc.)
Missing metric definitions
Undocumented table relationships
New terminology

Step 3: Targeted Discovery

For the identified domain:

1. Explore relevant tables: Use ~~data warehouse schema tools to find tables in that domain
2. Ask domain-specific questions:
   - "What tables are used for [domain] analysis?"
   - "What are the key metrics for [domain]?"
   - "Any special filters or gotchas for [domain] data?"
3. Generate new reference file: Create references/[domain].md using the domain template

Step 4: Update and Repackage

Add the new reference file
Update SKILL.md's "Knowledge Base Navigation" section to include the new domain
Repackage the skill
Present the updated skill to user
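
As a sketch of the navigation update, the function below appends one entry to a "Knowledge Base Navigation" section. The section's real layout is defined by references/skill-template.md, which is not shown here. The heading text `## Knowledge Base Navigation`, the bullet format, and the name `add_navigation_entry` are all assumptions.

```python
def add_navigation_entry(
    skill_md: str,
    domain: str,
    heading: str = "## Knowledge Base Navigation",  # assumed heading format
) -> str:
    """Append a bullet for `domain` to the navigation section, if not already present."""
    entry = f"- {domain}: references/{domain}.md"  # assumed bullet format
    lines = skill_md.splitlines()
    if entry in lines:
        return skill_md  # already linked; leave the file untouched
    if heading not in lines:
        raise ValueError(f"Heading not found: {heading!r}")

    start = lines.index(heading) + 1
    end = start
    # The section runs until the next heading or the end of the file.
    while end < len(lines) and not lines[end].startswith("#"):
        end += 1
    # Step back over trailing blank lines so the entry joins the existing list.
    insert_at = end
    while insert_at > start and lines[insert_at - 1].strip() == "":
        insert_at -= 1
    lines.insert(insert_at, entry)
    return "\n".join(lines) + "\n"


example = (
    "# Acme Data Analyst\n\n"
    "## Knowledge Base Navigation\n\n"
    "- sales: references/sales.md\n\n"
    "## SQL Dialect\n"
)
print(add_navigation_entry(example, "marketing"))
```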

Reference File Standards

Each reference file should include:

For Table Documentation

Location: Full table path
Description: What this table contains, when to use it
Primary Key: How to uniquely identify rows
Update Frequency: How often data refreshes
Key Columns: Table with column name, type, description, notes
Relationships: How this table joins to others
Sample Queries: 2-3 common query patterns

For Metrics Documentation

Metric Name: Human-readable name
Definition: Plain English explanation
Formula: Exact calculation with column references
Source Table(s): Where the data comes from
Caveats: Edge cases, exclusions, gotchas
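
One way to keep metric entries consistent is to collect the five fields in a small record and render them in a fixed order. The sketch below is an assumption about how that could look: `MetricDoc`, its plain-text output format, and the `analytics.subscriptions` table are hypothetical, and the skill's own template may lay entries out differently.

```python
from dataclasses import dataclass, field


@dataclass
class MetricDoc:
    """The five fields listed above for one metric."""
    name: str
    definition: str
    formula: str
    source_tables: list[str]
    caveats: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the entry as labeled lines, in the order given above."""
        caveats = "; ".join(self.caveats) if self.caveats else "None recorded"
        lines = [
            f"Metric Name: {self.name}",
            f"Definition: {self.definition}",
            f"Formula: {self.formula}",
            f"Source Table(s): {', '.join(self.source_tables)}",
            f"Caveats: {caveats}",
        ]
        return "\n".join(lines)


arr = MetricDoc(
    name="ARR",
    definition="Annualized recurring revenue",
    formula="monthly_revenue * 12",
    source_tables=["analytics.subscriptions"],  # hypothetical table
    caveats=["Exclude test and internal accounts"],
)
print(arr.render())
```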

For Entity Documentation

Entity Name: What it's called
Definition: What it represents in the business
Primary Table: Where to find this entity
ID Field(s): How to identify it
Relationships: How it relates to other entities
Common Filters: Standard exclusions (internal, test, etc.)

Quality Checklist

Before delivering a generated skill, verify:

SKILL.md has complete frontmatter (name, description)
Entity disambiguation section is clear
Key terminology is defined
Standard filters/exclusions are documented
At least 2-3 sample queries per domain
SQL uses correct dialect syntax
Reference files are linked from SKILL.md navigation section
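
Two of the checklist items (frontmatter completeness and reference-file links) lend themselves to a mechanical check. The sketch below assumes SKILL.md opens with a `---`-delimited frontmatter block containing `name:` and `description:` keys, and that a reference file counts as linked if its relative path appears anywhere in the text. `check_skill_md` is a hypothetical helper, and the remaining checklist items still need human review.

```python
import re


def check_skill_md(text: str, reference_files: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Frontmatter: a block delimited by '---' lines at the very top of the file.
    match = re.match(r"^---\n(.*?)\n---\n", text, flags=re.DOTALL)
    frontmatter = match.group(1) if match else ""
    for key in ("name", "description"):
        if not re.search(rf"^{key}:\s*\S", frontmatter, flags=re.MULTILINE):
            problems.append(f"frontmatter is missing '{key}'")

    # Navigation: every reference file path should be mentioned in SKILL.md.
    for ref in reference_files:
        if ref not in text:
            problems.append(f"{ref} is not linked from SKILL.md")
    return problems


sample = (
    "---\n"
    "name: acme-data-analyst\n"
    "description: Acme warehouse context\n"
    "---\n\n"
    "See references/entities.md for entity definitions.\n"
)
print(check_skill_md(sample, ["references/entities.md", "references/metrics.md"]))
```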

Weekly Installs: 156

Repository: anthropics/know…-plugins

GitHub Stars: 8.8K

First Seen: Jan 31, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)

Installed on: opencode (140), codex (132), gemini-cli (129), claude-code (124), github-copilot (124), amp (113)
