skill-judge：AI Agent Skill 评估工具，优化知识增量与专家思维模式

skill-judge by softaworks/agent-toolkit

571 周安装量

1,200 GitHub Stars

安装命令

npx skills add https://github.com/softaworks/agent-toolkit --skill skill-judge

AI/机器学习开发自动化

🇨🇳中文介绍

Skill Judge

根据官方规范及 17+ 官方示例总结的模式，评估 Agent Skills。

核心理念

什么是 Skill？

Skill 不是教程。Skill 是一种知识外化机制。

传统的 AI 知识被锁定在模型参数中。要教授新能力：

Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months

Skills 改变了这一点：

Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant

这是从“训练 AI”到“教育 AI”的范式转变——就像一个无需训练的热插拔 LoRA 适配器。你用自然语言编辑一个 Markdown 文件，模型的行为就会改变。

核心公式

好的 Skill = 专家独有知识 − Claude 已知知识

Skill 的价值由其知识增量衡量——即它提供的知识与模型已知知识之间的差距。

专家独有知识：决策树、权衡取舍、边缘情况、反模式、特定领域的思维框架——需要多年经验积累的东西
Claude 已知知识：基本概念、标准库用法、常见编程模式、通用最佳实践

当一个 Skill 解释“什么是 PDF”或“如何写 for 循环”时，它是在压缩 Claude 已经知道的知识。这是令牌浪费——上下文窗口是与系统提示、对话历史、其他 Skills 和用户请求共享的公共资源。

工具 vs Skill

概念	本质	功能	示例
工具	模型能够做什么	执行动作	bash, read_file, write_file, WebSearch
Skill	模型知道如何做	指导决策	PDF 处理, MCP 构建, 前端设计

Skill 中的三种知识类型

评估时，对每个部分进行分类：

类型	定义	处理方式
专家	Claude 确实不知道这个	必须保留——这是 Skill 的价值所在
激活	Claude 知道但可能想不到	如果简短可以保留——作为提醒
冗余	Claude 肯定知道这个	应该删除——浪费令牌

Skill 设计的艺术在于最大化专家内容，谨慎使用激活内容，并无情地消除冗余内容。

评估维度（总分 120 分）

D1: 知识增量（20 分）——核心维度

最重要的维度。Skill 是否添加了真正的专家知识？

分数	标准
0-5	解释 Claude 已知的基础知识（什么是 X，如何写代码，标准库教程）
6-10	混合：一些专家知识被明显的内容稀释
11-15	主要是专家知识，冗余内容极少
16-20	纯粹的知识增量——每个段落都物有所值

危险信号（立即得分 ≤5）：

“什么是 [基本概念]”部分
标准操作的分步教程
解释如何使用常见库
通用最佳实践（“写干净的代码”、“处理错误”）
行业标准术语的定义

积极信号（高知识增量的指标）：

非显而易见选择的决策树（“当 X 失败时，尝试 Y，因为 Z”）
只有专家才知道的权衡取舍（“A 更快但 B 能处理边缘情况 C”）
来自实际经验的边缘情况
“永远不要做 X，因为 [非显而易见的原因]”
特定领域的思维框架

对于每个部分，问：“Claude 已经知道这个了吗？”
如果是在解释某事，问：“这是在向 Claude 解释，还是为 Claude 解释？”
统计专家 vs 激活 vs 冗余的段落数量

D2: 思维方式 + 适当流程（15 分）

Skill 是否传递了专家的思维模式以及必要的特定领域流程？

专家和新手的区别不在于“知道如何操作”——而在于“如何思考问题”。但当 Claude 缺乏特定领域的流程知识时，仅有思维模式是不够的。

类型	示例	价值
思维模式	“设计前要问：什么让它令人难忘？”	高——塑造决策
特定领域流程	“OOXML 工作流：解包 → 编辑 XML → 验证 → 打包”	高——Claude 可能不知道这个
通用流程	“步骤 1：打开文件，步骤 2：编辑，步骤 3：保存”	低——Claude 已经知道

分数	标准
0-3	只有 Claude 已知的通用流程
4-7	有领域流程但缺乏思维框架
8-11	良好平衡：思维模式 + 特定领域工作流
12-15	专家级：塑造思维并提供 Claude 不知道的流程

什么算有价值的流程：

Claude 未训练过的工作流（新工具、专有系统）
非显而易见的正确顺序（例如，“验证在打包之前，而不是之后”）
容易遗漏的关键步骤（例如，“编辑后必须重新计算公式”）
特定领域的序列（例如，MCP 服务器的 4 阶段开发流程）

什么算冗余流程：

通用文件操作（打开、读取、写入、保存）
标准编程模式（循环、条件、错误处理）
有详细文档记录的常见库用法

专家思维模式示例：

Before [action], ask yourself:
- **Purpose**: What problem does this solve? Who uses it?
- **Constraints**: What are the hidden requirements?
- **Differentiation**: What makes this solution memorable?

有价值的领域流程示例：

### Redlining Workflow (Claude wouldn't know this sequence)
1. Convert to markdown: `pandoc --track-changes=all`
2. Map text to XML: grep for text in document.xml
3. Implement changes in batches of 3-10
4. Pack and verify: check ALL changes were applied

冗余通用流程示例：

Step 1: Open the file
Step 2: Find the section
Step 3: Make the change
Step 4: Save and test

它是否告诉 Claude 思考什么？（思维模式）
它是否告诉 Claude 如何做它不知道的事情？（领域流程）

一个好的 Skill 在需要时两者都提供。

D3: 反模式质量（15 分）

Skill 是否有有效的“永远不要”列表？

为什么这很重要：专家知识的一半在于知道不该做什么。资深设计师看到白色背景上的紫色渐变会本能地反感——“太像 AI 生成的了。”这种对“绝对不该做什么”的直觉来自于踩过无数地雷。

Claude 没有踩过这些地雷。它不知道 Inter 字体被过度使用，不知道紫色渐变是 AI 生成内容的标志。好的 Skills 必须明确说明这些“绝对禁忌”。

分数	标准
0-3	未提及反模式
4-7	通用警告（“避免错误”、“小心”、“考虑边缘情况”）
8-11	带有一些理由的具体“永远不要”列表
12-15	带有“为什么”的专家级反模式——只有经验才能教会的东西

专家反模式（具体 + 原因）：

NEVER use generic AI-generated aesthetics like:
- Overused font families (Inter, Roboto, Arial)
- Cliched color schemes (particularly purple gradients on white backgrounds)
- Predictable layouts and component patterns
- Default border-radius on everything

弱反模式（模糊，无理由）：

Avoid making mistakes.
Be careful with edge cases.
Don't write bad code.

测试：专家会阅读反模式列表并说“是的，我吃过苦头才学会这个”吗？还是他们会说“这对每个人来说都是显而易见的”？

D4: 规范合规性——特别是描述（15 分）

Skill 是否遵循官方格式要求？特别关注描述质量。

分数	标准
0-5	缺少 frontmatter 或格式无效
6-10	有 frontmatter 但描述模糊或不完整
11-13	有效的 frontmatter，描述有“是什么”但“何时用”较弱
14-15	完美：全面的描述，包含“是什么”、“何时用”和触发关键词

Frontmatter 要求：

name：小写，仅限字母数字 + 连字符，≤64 个字符
description：最关键的字段——决定 Skill 是否会被使用

为什么描述是最重要的字段：

┌─────────────────────────────────────────────────────────────────────┐
│  SKILL ACTIVATION FLOW                                              │
│                                                                     │
│  User Request → Agent sees ALL skill descriptions → Decides which  │
│                 (only descriptions, not bodies!)     to activate    │
│                                                                     │
│  If description doesn't match → Skill NEVER gets loaded            │
│  If description is vague → Skill might not trigger when it should  │
│  If description lacks keywords → Skill is invisible to the Agent   │
└─────────────────────────────────────────────────────────────────────┘

残酷的事实：内容完美但描述糟糕的 Skill 是无用的——它永远不会被激活。描述是告诉 Agent“在这些情况下使用我”的唯一机会。

描述必须回答三个问题：

是什么：这个 Skill 做什么？（功能）
何时用：在什么情况下应该使用它？（触发场景）
关键词：哪些术语应该触发这个 Skill？（可搜索术语）

优秀描述（包含所有三个要素）：

description: "Comprehensive document creation, editing, and analysis with support
for tracked changes, comments, formatting preservation, and text extraction.
When Claude needs to work with professional documents (.docx files) for:
(1) Creating new documents, (2) Modifying or editing content,
(3) Working with tracked changes, (4) Adding comments, or any other document tasks"

是什么：创建、编辑、分析、跟踪更改、评论
何时用：“当 Claude 需要处理...时，用于：(1)... (2)... (3)...”
关键词：.docx 文件、跟踪更改、专业文档

糟糕描述（缺少要素）：

description: "处理文档相关功能"

是什么：模糊（“文档相关功能”——具体是什么？）
何时用：缺失（Agent 应该在什么时候使用这个？）
关键词：缺失（没有“.docx”，没有具体场景）

另一个糟糕示例：

description: "A helpful skill for various tasks"

这毫无用处——Agent 不知道何时激活它。

描述质量检查清单：

列出具体能力（不仅仅是“帮助处理 X”）
包含明确的触发场景（“当...时使用”、“当用户要求...”）
包含可搜索的关键词（文件扩展名、领域术语、动作动词）
足够具体，让 Agent 确切知道何时使用
包含必须使用此 Skill 的场景（不仅仅是“可以使用”）

D5: 渐进式披露（15 分）

Skill 是否实现了适当的内容分层？

Skill 加载有三个层次：

Layer 1: Metadata (always in memory)
         Only name + description
         ~100 tokens per skill

Layer 2: SKILL.md Body (loaded after triggering)
         Detailed guidelines, code examples, decision trees
         Ideal: < 500 lines

Layer 3: Resources (loaded on demand)
         scripts/, references/, assets/
         No limit

分数	标准
0-5	所有内容都堆在 SKILL.md 中（>500 行，无结构）
6-10	有参考资料但不清楚何时加载
11-13	良好的分层，存在强制加载触发器
14-15	完美：决策树 + 显式触发器 + “不要加载”指导

对于有 references 目录的 Skills，检查加载触发器质量：

触发器质量	特征
差	参考资料列在末尾，无加载指导
中	有一些触发器但未嵌入工作流
好	工作流步骤中存在强制加载触发器
优秀	场景检测 + 条件触发器 + “不要加载”

Loading too little ◄─────────────────────────────────► Loading too much
- References sit unused                    - Wastes context space
- Agent doesn't know when to load          - Irrelevant info dilutes key content
- Knowledge is there but never accessed    - Unnecessary token overhead

良好加载触发器（嵌入工作流中）：

### Creating New Document

**MANDATORY - READ ENTIRE FILE**: Before proceeding, you MUST read
[`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish.
**NEVER set any range limits when reading this file.**

**Do NOT load** `ooxml.md` or `redlining.md` for this task.

糟糕加载触发器（仅列出）：

## References
- docx-js.md - for creating documents
- ooxml.md - for editing
- redlining.md - for tracking changes

对于简单的 Skills（无参考资料，<100 行）：根据简洁性和自包含性评分。

D6: 自由度校准（15 分）

对于任务的脆弱性，具体的程度是否合适？

不同的任务需要不同级别的约束。这是关于将自由度与脆弱性相匹配。

分数	标准
0-5	严重不匹配（创意任务用僵化脚本，脆弱操作用模糊指导）
6-10	部分合适，有些不匹配
11-13	对大多数场景有良好的校准
14-15	全程完美的自由度校准

自由度谱系：

任务类型	应该具有	原因	示例 Skill
创意/设计	高自由度	多种有效方法，差异化是价值所在	frontend-design
代码审查	中自由度	存在原则但需要判断	code-review
文件格式操作	低自由度	一个错误的字节就会损坏文件，一致性至关重要	docx, xlsx, pdf

高自由度（基于文本的指导）：

Commit to a BOLD aesthetic direction. Pick an extreme: brutally minimal,
maximalist chaos, retro-futuristic, organic natural...

中自由度（伪代码或参数化）：

Review priority:
1. Security vulnerabilities (must fix)
2. Logic errors (must fix)
3. Performance issues (should fix)
4. Maintainability (optional)

低自由度（特定脚本，精确步骤）：

**MANDATORY**: Use exact script in `scripts/create-doc.py`
Parameters: --title "X" --author "Y"
Do NOT modify the script.

测试：问“如果 Agent 犯了错误，后果是什么？”

后果严重 → 低自由度
后果轻微 → 高自由度

D7: 模式识别（10 分）

Skill 是否遵循既定的官方模式？

通过分析 17 个官方 Skills，我们确定了 5 种主要设计模式：

模式	~行数	关键特征	示例	何时使用
思维方式	~50	思维 > 技术，强大的“永远不要”列表，高自由度	frontend-design	需要品味的创意任务
导航	~30	最小的 SKILL.md，路由到子文件	internal-comms	多个不同的场景
哲学	~150	两步：哲学 → 表达，强调工艺	canvas-design	需要原创性的艺术/创作
流程	~200	分阶段工作流，检查点，中自由度	mcp-builder	复杂的多步骤项目
工具	~300	决策树，代码示例，低自由度	docx, pdf, xlsx	对特定格式的精确操作

分数	标准
0-3	无可识别模式，结构混乱
4-6	部分遵循模式但有显著偏差
7-8	清晰的模式，有轻微偏差
9-10	熟练应用适当的模式

模式选择指南：

你的任务特征	推荐模式
需要品味和创造力	思维方式（~50 行）
需要原创性和工艺质量	哲学（~150 行）
有多个不同的子场景	导航（~30 行）
复杂的多步骤项目	流程（~200 行）
对特定格式的精确操作	工具（~300 行）

D8: 实际可用性（15 分）

Agent 能否实际有效地使用这个 Skill？

分数	标准
0-5	指导令人困惑、不完整、矛盾或未经测试
6-10	可用但有明显缺陷
11-13	对常见情况有清晰的指导
14-15	全面覆盖，包括边缘情况和错误处理

决策树：对于多路径场景，是否有关于选择哪条路径的清晰指导？
代码示例：它们真的能工作吗？还是会导致崩溃的伪代码？
错误处理：如果主要方法失败怎么办？是否提供了备用方案？
边缘情况：是否涵盖了不寻常但现实的场景？
可操作性：Agent 是否可以立即行动，还是需要自己弄清楚？

良好可用性（决策树 + 备用方案）：

| Task | Primary Tool | Fallback | When to Use Fallback |
|------|-------------|----------|----------------------|
| Read text | pdftotext | PyMuPDF | Need layout info |
| Extract tables | camelot-py | tabula-py | camelot fails |

**Common issues**:
- Scanned PDF: pdftotext returns blank → Use OCR first
- Encrypted PDF: Permission error → Use PyMuPDF with password

糟糕可用性（模糊）：

Use appropriate tools for PDF processing.
Handle errors properly.
Consider edge cases.

评估时永远不要做

永远不要仅仅因为它“看起来很专业”或格式良好就给高分
永远不要忽略令牌浪费——每个冗余段落都应导致扣分
永远不要被长度所迷惑——一个 43 行的 Skill 可能优于一个 500 行的 Skill
永远不要跳过对决策树的心理测试——它们是否真的能导向正确的选择？
永远不要以“但它提供了有用的上下文”为由原谅对基础知识的解释
永远不要忽视缺失的反模式——如果没有“永远不要”列表，那就是一个重大缺陷
永远不要假设所有流程都有价值——区分特定领域流程和通用流程
永远不要低估描述字段——糟糕的描述 = Skill 永远不会被使用
永远不要把“何时使用”信息只放在正文中——Agent 在加载前只能看到描述

步骤 1：第一遍——知识增量扫描

完整阅读 SKILL.md，并对每个部分提问：

“Claude 已经知道这个了吗？”

将每个部分标记为：

[E] 专家：Claude 确实不知道这个——有价值
[A] 激活：Claude 知道但简短的提醒有用——可接受
[R] 冗余：Claude 肯定知道这个——应该删除

计算粗略比例：E:A:R

好的 Skill：>70% 专家，<20% 激活，<10% 冗余
一般的 Skill：40-70% 专家，高激活
差的 Skill：<40% 专家，高冗余

步骤 2：结构分析

[ ] Check frontmatter validity
[ ] Count total lines in SKILL.md
[ ] List all reference files and their sizes
[ ] Identify which pattern the Skill follows
[ ] Check for loading triggers (if references exist)

步骤 3：为每个维度评分

对于 8 个维度中的每一个：

找到具体证据（引用相关行）
用一句话理由分配分数
如果分数 < 最大值，注明具体改进点

步骤 4：计算总分和等级

Total = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8
Max = 120 points

等级标准（基于百分比）：

等级	百分比	含义
A	90%+ (108+)	优秀——生产就绪的专家级 Skill
B	80-89% (96-107)	良好——需要少量改进
C	70-79% (84-95)	合格——有明确的改进路径
D	60-69% (72-83)	低于平均水平——有显著问题
F	<60% (<72)	差——需要根本性重新设计

步骤 5：生成报告

# Skill Evaluation Report: [Skill Name]

## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence assessment]

## Dimension Scores

| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset vs Mechanics | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |

## Critical Issues
[List must-fix problems that significantly impact the Skill's effectiveness]

## Top 3 Improvements
1. [Highest impact improvement with specific guidance]
2. [Second priority improvement]
3. [Third priority improvement]

## Detailed Analysis
[For each dimension scoring below 80%, provide:
- What's missing or problematic
- Specific examples from the Skill
- Concrete suggestions for improvement]

常见失败模式

模式 1：教程

Symptom: Explains what PDF is, how Python works, basic library usage
Root cause: Author assumes Skill should "teach" the model
Fix: Claude already knows this. Delete all basic explanations.
     Focus on expert decisions, trade-offs, and anti-patterns.

模式 2：信息堆砌

Symptom: SKILL.md is 800+ lines with everything included
Root cause: No progressive disclosure design
Fix: Core routing and decision trees in SKILL.md (<300 lines ideal)
     Detailed content in references/, loaded on-demand

模式 3：孤立的参考资料

Symptom: References directory exists but files are never loaded
Root cause: No explicit loading triggers
Fix: Add "MANDATORY - READ ENTIRE FILE" at workflow decision points
     Add "Do NOT Load" to prevent over-loading

模式 4：复选框式流程

Symptom: Step 1, Step 2, Step 3... mechanical procedures
Root cause: Author thinks in procedures, not thinking frameworks
Fix: Transform into "Before doing X, ask yourself..."
     Focus on decision principles, not operation sequences

模式 5：模糊警告

Symptom: "Be careful", "avoid errors", "consider edge cases"
Root cause: Author knows things can go wrong but hasn't articulated specifics
Fix: Specific NEVER list with concrete examples and non-obvious reasons
     "NEVER use X because [specific problem that takes experience to learn]"

模式 6：隐形 Skill

Symptom: Great content but skill rarely gets activated
Root cause: Description is vague, missing keywords, or lacks trigger scenarios
Fix: Description must answer WHAT, WHEN, and include KEYWORDS
     "Use when..." + specific scenarios + searchable terms

Example fix:
BAD:  "Helps with document tasks"
GOOD: "Create, edit, and analyze .docx files. Use when working with
       Word documents, tracked changes, or professional document formatting."

模式 7：位置错误

Symptom: "When to use this Skill" section in body, not in description
Root cause: Misunderstanding of three-layer loading
Fix: Move all triggering information to description field
     Body is only loaded AFTER triggering decision is made

模式 8：过度工程化

Symptom: README.md, CHANGELOG.md, INSTALLATION_GUIDE.md, CONTRIBUTING.md
Root cause: Treating Skill like a software project
Fix: Delete all auxiliary files. Only include what Agent needs for the task.
     No documentation about the Skill itself.

模式 9：自由度不匹配

Symptom: Rigid scripts for creative tasks, vague guidance for fragile operations
Root cause: Not considering task fragility
Fix: High freedom for creative (principles, not steps)
     Low freedom for fragile (exact scripts, no parameters)

快速参考检查清单

┌─────────────────────────────────────────────────────────────────────────┐
│  SKILL EVALUATION QUICK CHECK                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  KNOWLEDGE DELTA (most important):                                      │
│    [ ] No "What is X" explanations for basic concepts                   │
│    [ ] No step-by-step tutorials for standard operations                │
│    [ ] Has decision trees for non-obvious choices                       │
│    [ ] Has trade-offs only experts would know                           │
│    [ ] Has edge cases from real-world experience                        │
│                                                                         │
│  MINDSET + PROCEDURES:                                                  │
│    [ ] Transfers thinking patterns (how to think about problems)        │
│    [ ] Has "Before doing X, ask yourself..." frameworks                 │
│    [ ] Includes domain-specific procedures Claude wouldn't know         │
│    [ ] Distinguishes valuable procedures from generic ones              │
│                                                                         │
│  ANTI-PATTERNS:                                                         │
│    [ ] Has explicit NEVER list                                          │
│    [ ] Anti-patterns are specific, not vague                            │
│    [ ] Includes WHY (non-obvious reasons)                               │
│                                                                         │
│  SPECIFICATION (description is critical!):                              │
│    [ ] Valid YAML frontmatter                                           │
│    [ ] name: lowercase, ≤64 chars                                       │
│    [ ] description answers: WHAT does it do?                            │
│    [ ] description answers: WHEN should it be used?                     │
│    [ ] description contains trigger KEYWORDS                            │
│    [ ] description is specific enough for Agent to know when to use     │
│    [ ] Includes scenarios where this skill MUST be used (not just "can be used")
│                                                                         │
│  STRUCTURE:                                                             │
│    [ ] SKILL.md < 500 lines (ideal < 300)                               │
│    [ ] Heavy content in references/                                     │
│    [ ] Loading triggers embedded in workflow                            │
│    [ ] Has "Do NOT Load" for preventing over-loading                    │
│                                                                         │
│  FREEDOM:                                                               │
│    [ ] Creative tasks → High freedom (principles)                       │
│    [ ] Fragile operations → Low freedom (exact scripts)                 │
│                                                                         │
│  USABILITY:                                                             │
│    [ ] Decision trees for multi-path scenarios                          │
│    [ ] Working code examples                                            │
│    [ ] Error handling and fallbacks                                     │
│    [ ] Edge cases covered                                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

评估任何 Skill 时，始终回到这个基本问题：

“该领域的专家看到这个 Skill 会说：‘是的，这捕捉了我花了多年才学到的知识’吗？”

如果答案是肯定的 → Skill 有真正的价值。如果答案是否定的 → 它只是在压缩 Claude 已经知道的东西。

最好的 Skills 是压缩的专家大脑——它们将设计师 10 年的美学积累压缩成 43 行，或将文档专家的操作经验压缩成 200 行的决策树。

被压缩的必须是 Claude 没有的东西。否则，就是垃圾压缩。

这个 Skill (skill-judge) 本身应该通过评估：

知识增量：提供了 Claude 自己无法生成的特定评估标准
思维方式：塑造了如何思考 Skill 质量，而不仅仅是检查清单项目
反模式：“评估时永远不要做”部分包含具体的禁忌
规范合规性：有效的 frontmatter 和全面的描述
渐进式披露：自包含，无需外部参考资料
自由度：适合评估任务的中等自由度
模式：遵循带有决策框架的工具模式
可用性：清晰的协议、报告模板、快速参考

将此 Skill 与自身进行对比评估，作为校准练习。

🇺🇸English

Skill Judge

Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.

Core Philosophy

What is a Skill?

A Skill is NOT a tutorial. A Skill is a knowledge externalization mechanism.

Traditional AI knowledge is locked in model parameters. To teach new capabilities:

Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months

Skills change this:

Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant

This is the paradigm shift from "training AI" to "educating AI" — like a hot-swappable LoRA adapter that requires no training. You edit a Markdown file in natural language, and the model's behavior changes.

The Core Formula

Good Skill = Expert-only Knowledge − What Claude Already Knows

A Skill's value is measured by its knowledge delta — the gap between what it provides and what the model already knows.

Expert-only knowledge: decision trees, trade-offs, edge cases, anti-patterns, and domain-specific thinking frameworks, the things that take years of experience to accumulate
What Claude already knows: basic concepts, standard library usage, common programming patterns, general best practices

When a Skill explains "what is PDF" or "how to write a for-loop", it is compressing knowledge Claude already has. That wastes tokens: the context window is a shared resource, also used by system prompts, conversation history, other Skills, and user requests.

Tool vs Skill

Concept	Essence	Function	Example
Tool	What model CAN do	Execute actions	bash, read_file, write_file, WebSearch
Skill	What model KNOWS how to do	Guide decisions	PDF processing, MCP building, frontend design

Tools define capability boundaries: without the bash tool, the model cannot execute commands. Skills inject knowledge: without the frontend-design Skill, the model produces generic UI.

The equation:

General Agent + Excellent Skill = Domain Expert Agent

The same Claude model, loaded with different Skills, becomes a different expert.

Three Types of Knowledge in Skills

When evaluating, categorize each section:

Type	Definition	Treatment
Expert	Claude genuinely doesn't know this	Must keep — this is the Skill's value
Activation	Claude knows but may not think of	Keep if brief — serves as reminder
Redundant	Claude definitely knows this	Should delete — wastes tokens

The art of Skill design is maximizing Expert content, using Activation sparingly, and eliminating Redundant ruthlessly.

Evaluation Dimensions (120 points total)

D1: Knowledge Delta (20 points) — THE CORE DIMENSION

The most important dimension. Does the Skill add genuine expert knowledge?

Score	Criteria
0-5	Explains basics Claude knows (what is X, how to write code, standard library tutorials)
6-10	Mixed: some expert knowledge diluted by obvious content
11-15	Mostly expert knowledge with minimal redundancy
16-20	Pure knowledge delta — every paragraph earns its tokens

Red flags (instant score ≤5):

"What is [basic concept]" sections
Step-by-step tutorials for standard operations
Explaining how to use common libraries
Generic best practices ("write clean code", "handle errors")
Definitions of industry-standard terms

Green flags (indicators of high knowledge delta):

Decision trees for non-obvious choices ("when X fails, try Y because Z")
Trade-offs only an expert would know ("A is faster but B handles edge case C")
Edge cases from real-world experience
"NEVER do X because [non-obvious reason]"
Domain-specific thinking frameworks

Evaluation questions:

For each section, ask: "Does Claude already know this?"
If explaining something, ask: "Is this explaining TO Claude or FOR Claude?"
Count paragraphs that are Expert vs Activation vs Redundant
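
The counting step above can be made concrete with a small tally. This is a minimal sketch, not part of the official tooling: the function names are hypothetical, the labels are assumed to come from a manual E/A/R tagging pass, and the cut-offs follow the rule of thumb given in the evaluation protocol (a good Skill is more than 70% Expert, under 20% Activation, under 10% Redundant; below 40% Expert is poor). Treating every other mix as "mediocre" is an assumption of this sketch.

```python
from collections import Counter


def knowledge_ratio(labels):
    """Return the percentage share of E, A and R labels, in that order."""
    counts = Counter(labels)
    total = sum(counts[key] for key in "EAR")
    if total == 0:
        raise ValueError("no labelled sections")
    return tuple(round(100 * counts[key] / total, 1) for key in "EAR")


def ratio_band(expert_pct, activation_pct, redundant_pct):
    """Map the shares to a rough band (assumed simplification of the rule of thumb)."""
    if expert_pct > 70 and activation_pct < 20 and redundant_pct < 10:
        return "good"
    if expert_pct >= 40:
        return "mediocre"
    return "poor"


# Hypothetical tagging of a twenty-section Skill.
labels = ["E"] * 16 + ["A"] * 3 + ["R"]
shares = knowledge_ratio(labels)
print(shares, ratio_band(*shares))  # (80.0, 15.0, 5.0) good
```

The ratio is only as good as the tagging behind it, so it supports the D1 score rather than replacing the judgment.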

D2: Mindset + Appropriate Procedures (15 points)

Does the Skill transfer expert thinking patterns along with necessary domain-specific procedures?

The difference between experts and novices isn't "knowing how to operate" — it's "how to think about the problem." But thinking patterns alone aren't enough when Claude lacks domain-specific procedural knowledge.

Key distinction:

Type	Example	Value
Thinking patterns	"Before designing, ask: What makes this memorable?"	High: shapes decision-making
Domain-specific procedures	"OOXML workflow: unpack → edit XML → validate → pack"	High: Claude may not know this
Generic procedures	"Step 1: Open file, Step 2: Edit, Step 3: Save"	Low: Claude already knows

Score	Criteria
0-3	Only generic procedures Claude already knows
4-7	Has domain procedures but lacks thinking frameworks
8-11	Good balance: thinking patterns + domain-specific workflows
12-15	Expert-level: shapes thinking and provides procedures Claude wouldn't know

What counts as valuable procedures:

Workflows Claude hasn't been trained on (new tools, proprietary systems)
Correct ordering that's non-obvious (e.g., "validate BEFORE packing, not after")
Critical steps that are easy to miss (e.g., "MUST recalculate formulas after editing")
Domain-specific sequences (e.g., MCP server's 4-phase development process)

What counts as redundant procedures:

Generic file operations (open, read, write, save)
Standard programming patterns (loops, conditionals, error handling)
Common library usage that's well-documented

Expert thinking patterns look like:

Before [action], ask yourself:
- **Purpose**: What problem does this solve? Who uses it?
- **Constraints**: What are the hidden requirements?
- **Differentiation**: What makes this solution memorable?

Valuable domain procedures look like:

### Redlining Workflow (Claude wouldn't know this sequence)
1. Convert to markdown: `pandoc --track-changes=all`
2. Map text to XML: grep for text in document.xml
3. Implement changes in batches of 3-10
4. Pack and verify: check ALL changes were applied

Redundant generic procedures look like:

Step 1: Open the file
Step 2: Find the section
Step 3: Make the change
Step 4: Save and test

The test:

Does it tell Claude WHAT to think about? (thinking patterns)
Does it tell Claude HOW to do things it wouldn't know? (domain procedures)

A good Skill provides both when needed.

D3: Anti-Pattern Quality (15 points)

Does the Skill have effective NEVER lists?

Why this matters: Half of expert knowledge is knowing what NOT to do. A senior designer sees a purple gradient on a white background and instinctively cringes: "too AI-generated." This intuition for "what absolutely not to do" comes from stepping on countless landmines.

Claude hasn't stepped on these landmines. It doesn't know Inter font is overused, doesn't know purple gradients are the signature of AI-generated content. Good Skills must explicitly state these "absolute don'ts."

Score	Criteria
0-3	No anti-patterns mentioned
4-7	Generic warnings ("avoid errors", "be careful", "consider edge cases")
8-11	Specific NEVER list with some reasoning
12-15	Expert-grade anti-patterns with WHY — things only experience teaches

Expert anti-patterns (specific + reason):

NEVER use generic AI-generated aesthetics like:
- Overused font families (Inter, Roboto, Arial)
- Cliched color schemes (particularly purple gradients on white backgrounds)
- Predictable layouts and component patterns
- Default border-radius on everything

Weak anti-patterns (vague, no reasoning):

Avoid making mistakes.
Be careful with edge cases.
Don't write bad code.

The test: Would an expert read the anti-pattern list and say "yes, I learned this the hard way"? Or would they say "this is obvious to everyone"?
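
A first-pass filter for the weak case can be sketched in a few lines. This is a crude heuristic under stated assumptions, not an official check: the phrase list is taken from the generic warnings quoted in this section, the function name is hypothetical, and treating "because" or "since" as the signal for a stated reason is an assumption. It cannot tell whether a reason is non-obvious, which is what the top score band requires.

```python
import re

# Generic warnings quoted in this section; extend the tuple as needed.
VAGUE_PHRASES = (
    "avoid errors",
    "avoid making mistakes",
    "be careful",
    "consider edge cases",
    "don't write bad code",
)


def classify_anti_pattern(line):
    """Label one anti-pattern line as 'vague', 'specific+why' or 'specific'."""
    text = line.lower()
    if any(phrase in text for phrase in VAGUE_PHRASES):
        return "vague"
    # Assumption: a stated reason is introduced by "because" or "since".
    if re.search(r"\b(because|since)\b", text):
        return "specific+why"
    return "specific"


for sample in (
    "Be careful with edge cases.",
    "NEVER use Inter because it is overused.",
    "NEVER use default border-radius on everything",
):
    print(classify_anti_pattern(sample), "<-", sample)
```

Lines flagged as vague are candidates for the 4-7 band. Everything else still needs the expert test above.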

D4: Specification Compliance — Especially Description (15 points)

Does the Skill follow official format requirements? Special focus on description quality.

Score	Criteria
0-5	Missing frontmatter or invalid format
6-10	Has frontmatter but description is vague or incomplete
11-13	Valid frontmatter, description has WHAT but weak on WHEN
14-15	Perfect: comprehensive description with WHAT, WHEN, and trigger keywords

Frontmatter requirements:

name: lowercase, alphanumeric + hyphens only, ≤64 characters
description: THE MOST CRITICAL FIELD — determines if skill gets used at all
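
The name rule is mechanical enough to check with a regular expression. The sketch below encodes only the constraints stated here (lowercase letters, digits, hyphens, at most 64 characters). The function name is hypothetical, and any further rules in the official specification are not covered.

```python
import re

# Lowercase alphanumerics and hyphens only, 1 to 64 characters.
NAME_RE = re.compile(r"[a-z0-9-]{1,64}")


def is_valid_skill_name(name):
    """Return True when the name satisfies the constraints listed above."""
    return NAME_RE.fullmatch(name) is not None


print(is_valid_skill_name("skill-judge"))   # True
print(is_valid_skill_name("Skill_Judge"))   # False
```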

Why description is THE MOST IMPORTANT field:

┌─────────────────────────────────────────────────────────────────────┐
│  SKILL ACTIVATION FLOW                                              │
│                                                                     │
│  User Request → Agent sees ALL skill descriptions → Decides which  │
│                 (only descriptions, not bodies!)     to activate    │
│                                                                     │
│  If description doesn't match → Skill NEVER gets loaded            │
│  If description is vague → Skill might not trigger when it should  │
│  If description lacks keywords → Skill is invisible to the Agent   │
└─────────────────────────────────────────────────────────────────────┘

The brutal truth: A Skill with perfect content but a poor description is useless, because it will never be activated. The description is the only chance to tell the Agent "use me in these situations."

Description must answer THREE questions:

WHAT: What does this Skill do? (functionality)
WHEN: In what situations should it be used? (trigger scenarios)
KEYWORDS: What terms should trigger this Skill? (searchable terms)

Excellent description (all three elements):

description: "Comprehensive document creation, editing, and analysis with support
for tracked changes, comments, formatting preservation, and text extraction.
When Claude needs to work with professional documents (.docx files) for:
(1) Creating new documents, (2) Modifying or editing content,
(3) Working with tracked changes, (4) Adding comments, or any other document tasks"

Analysis:

WHAT: creation, editing, analysis, tracked changes, comments
WHEN: "When Claude needs to work with... for: (1)... (2)... (3)..."
KEYWORDS: .docx files, tracked changes, professional documents

Poor description (missing elements):

description: "Handles document-related functions"

Problems:

WHAT: vague (which "document-related functions", specifically?)
WHEN: missing (when should Agent use this?)
KEYWORDS: missing (no ".docx", no specific scenarios)

Another poor example:

description: "A helpful skill for various tasks"

This is useless — Agent has no idea when to activate it.

Description quality checklist:

Lists specific capabilities (not just "helps with X")
Includes explicit trigger scenarios ("Use when...", "When user asks for...")
Contains searchable keywords (file extensions, domain terms, action verbs)
Specific enough that Agent knows EXACTLY when to use it
Includes scenarios where this skill MUST be used (not just "can be used")
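
Parts of this checklist can be approximated automatically before a human review. The sketch below is a heuristic under loud assumptions: the function name is hypothetical, "WHEN" is detected by the literal word "when", "KEYWORDS" by the presence of a file extension (which not every Skill needs), and generic wording by a short list of phrases taken from the poor examples above. A clean result means only that none of these obvious gaps were found.

```python
import re


def description_gaps(description):
    """Return the checklist items a description appears to miss (heuristic)."""
    text = description.lower()
    gaps = []
    if not re.search(r"\bwhen\b", text):
        gaps.append("WHEN: no trigger scenario such as 'Use when...'")
    # Assumption: a file extension like .docx stands in for searchable keywords.
    if not re.search(r"\.[a-z0-9]{2,5}\b", text):
        gaps.append("KEYWORDS: no file extension found")
    if re.search(r"\b(various tasks|helps with|helpful)\b", text):
        gaps.append("WHAT: generic wording instead of specific capabilities")
    return gaps


good = ("Create, edit, and analyze .docx files. Use when working with "
        "Word documents, tracked changes, or professional document formatting.")
bad = "A helpful skill for various tasks"

print(description_gaps(good))       # []
print(len(description_gaps(bad)))   # 3
```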

D5: Progressive Disclosure (15 points)

Does the Skill implement proper content layering?

Skill loading has three layers:

Layer 1: Metadata (always in memory)
         Only name + description
         ~100 tokens per skill

Layer 2: SKILL.md Body (loaded after triggering)
         Detailed guidelines, code examples, decision trees
         Ideal: < 500 lines

Layer 3: Resources (loaded on demand)
         scripts/, references/, assets/
         No limit
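
The Layer 2 budget can be checked by separating the frontmatter from the body and counting body lines. This is a minimal sketch with hypothetical function names. It assumes the frontmatter is delimited by two `---` lines at the top of SKILL.md, and it treats the 500-line figure as the ideal stated above rather than a hard limit.

```python
def split_frontmatter(skill_md_text):
    """Split SKILL.md text into (frontmatter, body); frontmatter is '' when absent."""
    lines = skill_md_text.splitlines()
    if lines and lines[0].strip() == "---":
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                return "\n".join(lines[1:index]), "\n".join(lines[index + 1:])
    return "", skill_md_text


def body_within_budget(skill_md_text, ideal_max_lines=500):
    """Return True when the body (Layer 2) stays under the ideal line count."""
    _, body = split_frontmatter(skill_md_text)
    return len(body.splitlines()) < ideal_max_lines


sample = "---\nname: demo\ndescription: x\n---\n# Title\nbody\n"
print(split_frontmatter(sample))
print(body_within_budget(sample))  # True
```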

Score	Criteria
0-5	Everything dumped in SKILL.md (>500 lines, no structure)
6-10	Has references but unclear when to load them
11-13	Good layering with MANDATORY triggers present
14-15	Perfect: decision trees + explicit triggers + "Do NOT Load" guidance

For Skills WITH a references directory, check Loading Trigger Quality:

Trigger Quality	Characteristics
Poor	References listed at end, no loading guidance
Mediocre	Some triggers but not embedded in workflow
Good	MANDATORY triggers in workflow steps
Excellent	Scenario detection + conditional triggers + "Do NOT Load"

The loading problem:

Loading too little ◄─────────────────────────────────► Loading too much
- References sit unused                    - Wastes context space
- Agent doesn't know when to load          - Irrelevant info dilutes key content
- Knowledge is there but never accessed    - Unnecessary token overhead

Good loading trigger (embedded in workflow):

### Creating New Document

**MANDATORY - READ ENTIRE FILE**: Before proceeding, you MUST read
[`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish.
**NEVER set any range limits when reading this file.**

**Do NOT load** `ooxml.md` or `redlining.md` for this task.

Bad loading trigger (just listed):

## References
- docx-js.md - for creating documents
- ooxml.md - for editing
- redlining.md - for tracking changes

For simple Skills (no references, <100 lines): Score based on conciseness and self-containment.
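
For Skills that do have references, a quick count of explicit loading markers gives the evaluator something to start from. This sketch uses a hypothetical function name and assumes the markers use the same wording as the examples in this section ("MANDATORY" and "Do NOT load"). It only counts occurrences. Whether the triggers are embedded in the workflow, which is what separates "Mediocre" from "Good", still has to be judged by reading.

```python
import re


def loading_markers(skill_md_text):
    """Count explicit loading markers in SKILL.md text."""
    return {
        "mandatory": len(re.findall(r"\bMANDATORY\b", skill_md_text)),
        "do_not_load": len(
            re.findall(r"\bdo not load\b", skill_md_text, flags=re.IGNORECASE)
        ),
    }


embedded = (
    "**MANDATORY - READ ENTIRE FILE**: read docx-js.md first.\n"
    "**Do NOT load** `ooxml.md` or `redlining.md` for this task."
)
listed_only = "## References\n- docx-js.md - for creating documents"

print(loading_markers(embedded))     # {'mandatory': 1, 'do_not_load': 1}
print(loading_markers(listed_only))  # {'mandatory': 0, 'do_not_load': 0}
```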

D6: Freedom Calibration (15 points)

Is the level of specificity appropriate for the task's fragility?

Different tasks need different levels of constraint. This is about matching freedom to fragility.

Score	Criteria
0-5	Severely mismatched (rigid scripts for creative tasks, vague for fragile ops)
6-10	Partially appropriate, some mismatches
11-13	Good calibration for most scenarios
14-15	Perfect freedom calibration throughout

The freedom spectrum:

Task Type	Should Have	Why	Example Skill
Creative/Design	High freedom	Multiple valid approaches, differentiation is value	frontend-design
Code review	Medium freedom	Principles exist but judgment required	code-review
File format operations	Low freedom	One wrong byte corrupts file, consistency critical	docx, xlsx, pdf

High freedom (text-based instructions):

Commit to a BOLD aesthetic direction. Pick an extreme: brutally minimal,
maximalist chaos, retro-futuristic, organic natural...

Medium freedom (pseudocode or parameterized):

Review priority:
1. Security vulnerabilities (must fix)
2. Logic errors (must fix)
3. Performance issues (should fix)
4. Maintainability (optional)

Low freedom (specific scripts, exact steps):

**MANDATORY**: Use exact script in `scripts/create-doc.py`
Parameters: --title "X" --author "Y"
Do NOT modify the script.

The test: Ask "if the Agent makes a mistake, what's the consequence?"

High consequence → Low freedom
Low consequence → High freedom

D7: Pattern Recognition (10 points)

Does the Skill follow an established official pattern?

An analysis of 17 official Skills identified five main design patterns:

Pattern	~Lines	Key Characteristics	Example	When to Use
Mindset	~50	Thinking > technique, strong NEVER list, high freedom	frontend-design	Creative tasks requiring taste
Navigation	~30	Minimal SKILL.md, routes to sub-files	internal-comms	Multiple distinct scenarios
Philosophy	~150	Two-step: Philosophy → Express, emphasizes craft	canvas-design	Art/creation requiring originality
Process	~200	Phased workflow, checkpoints, medium freedom	mcp-builder	Complex multi-step projects
Tool	~300	Decision trees, exact scripts, low freedom	docx, xlsx, pdf	Precise operations on a specific format

Pattern selection guide:

Your Task Characteristics	Recommended Pattern
Needs taste and creativity	Mindset (~50 lines)
Needs originality and craft quality	Philosophy (~150 lines)
Has multiple distinct sub-scenarios	Navigation (~30 lines)
Complex multi-step project	Process (~200 lines)
Precise operations on specific format	Tool (~300 lines)

D8: Practical Usability (15 points)

Can an Agent actually use this Skill effectively?

Score	Criteria
0-5	Confusing, incomplete, contradictory, or untested guidance
6-10	Usable but with noticeable gaps
11-13	Clear guidance for common cases
14-15	Comprehensive coverage including edge cases and error handling

Check for:

Decision trees: For multi-path scenarios, is there clear guidance on which path to take?
Code examples: Do they actually run, or are they pseudocode that breaks?
Error handling: What if the main approach fails? Are fallbacks provided?
Edge cases: Are unusual but realistic scenarios covered?
Actionability: Can the Agent act immediately, or must it first figure things out?

Good usability (decision tree + fallback):

| Task | Primary Tool | Fallback | When to Use Fallback |
|------|-------------|----------|----------------------|
| Read text | pdftotext | PyMuPDF | Need layout info |
| Extract tables | camelot-py | tabula-py | camelot fails |

**Common issues**:
- Scanned PDF: pdftotext returns blank → Use OCR first
- Encrypted PDF: Permission error → Use PyMuPDF with password

Poor usability (vague):

Use appropriate tools for PDF processing.
Handle errors properly.
Consider edge cases.

NEVER Do When Evaluating

NEVER give high scores just because a Skill "looks professional" or is well-formatted
NEVER ignore token waste: every redundant paragraph should cost points
NEVER let length impress you: a 43-line Skill can outperform a 500-line Skill
NEVER skip mentally testing the decision trees: do they actually lead to correct choices?
NEVER excuse explanations of basics with "but it provides helpful context"
NEVER overlook missing anti-patterns: if there is no NEVER list, that is a significant gap
NEVER assume all procedures are valuable: distinguish domain-specific from generic
NEVER undervalue the description field: a poor description means the Skill never gets used
NEVER put "when to use" info only in the body: the Agent sees only the description before loading

Evaluation Protocol

Step 1: First Pass — Knowledge Delta Scan

Read SKILL.md completely and for each section ask:

"Does Claude already know this?"

Mark each section as:

[E] Expert: Claude genuinely doesn't know this (value-add)
[A] Activation: Claude knows this, but a brief reminder is useful (acceptable)
[R] Redundant: Claude definitely knows this (should be deleted)

Calculate rough ratio: E:A:R

Good Skill: >70% Expert, <20% Activation, <10% Redundant
Mediocre Skill: 40-70% Expert, high Activation
Bad Skill: <40% Expert, high Redundant
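Once every section carries an [E], [A], or [R] mark, the ratio and the rough thresholds above can be computed mechanically. The following is a minimal sketch, not part of the original Skill: `knowledge_ratio` and `classify` are hypothetical names, and treating every result that misses the "Good" thresholds but still has at least 40% Expert content as "Mediocre" is an assumption, since the text does not say how to grade, for example, 75% Expert with 25% Activation.

```python
from collections import Counter

VALID_MARKS = ("E", "A", "R")


def knowledge_ratio(labels):
    """Return the percentage of sections marked E, A, and R.

    `labels` holds one single-letter mark per section of SKILL.md.
    """
    if not labels:
        raise ValueError("at least one labeled section is required")
    unknown = set(labels) - set(VALID_MARKS)
    if unknown:
        raise ValueError(f"unknown marks: {sorted(unknown)}")
    counts = Counter(labels)
    return {mark: 100.0 * counts[mark] / len(labels) for mark in VALID_MARKS}


def classify(ratio):
    """Apply the rough E:A:R thresholds to a percentage dict."""
    if ratio["E"] > 70 and ratio["A"] < 20 and ratio["R"] < 10:
        return "Good"
    # Assumption: anything else with >= 40% Expert content counts as Mediocre.
    if ratio["E"] >= 40:
        return "Mediocre"
    return "Bad"
```

For example, a Skill with 16 Expert, 3 Activation, and 1 Redundant sections comes out at 80:15:5 and is classified as "Good".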

Step 2: Structure Analysis

[ ] Check frontmatter validity
[ ] Count total lines in SKILL.md
[ ] List all reference files and their sizes
[ ] Identify which pattern the Skill follows
[ ] Check for loading triggers (if references exist)
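The mechanical parts of this checklist (frontmatter presence, line count, reference sizes) could be gathered with a short script. This is a sketch under stated assumptions: `analyze_structure` is a hypothetical helper, it treats every other `.md` file under the Skill directory as a reference, and it only checks that a `---` delimited frontmatter block exists rather than parsing the YAML. Pattern identification and loading-trigger review still need human judgment.

```python
from pathlib import Path


def analyze_structure(skill_dir):
    """Collect raw facts for the structure checklist of one Skill directory."""
    root = Path(skill_dir)
    lines = (root / "SKILL.md").read_text(encoding="utf-8").splitlines()

    # Frontmatter must open on the first line with '---' and be closed later.
    has_frontmatter = (
        len(lines) > 1
        and lines[0].strip() == "---"
        and any(line.strip() == "---" for line in lines[1:])
    )

    # Assumption: every other Markdown file in the directory is a reference.
    references = {
        path.relative_to(root).as_posix(): len(
            path.read_text(encoding="utf-8").splitlines()
        )
        for path in sorted(root.rglob("*.md"))
        if path.name != "SKILL.md"
    }

    return {
        "has_frontmatter": has_frontmatter,
        "skill_md_lines": len(lines),
        "references": references,  # relative path -> line count
    }
```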

Step 3: Score Each Dimension

For each of the 8 dimensions:

Find specific evidence (quote relevant lines)
Assign score with one-line justification
Note specific improvements if score < max

Step 4: Calculate Total & Grade

Total = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8
Max = 120 points

Grade Scale (percentage-based):

Grade	Percentage	Meaning
A	90%+ (108+)	Excellent — production-ready expert Skill
B	80-89% (96-107)	Good — minor improvements needed
C	70-79% (84-95)	Adequate — clear improvement path
D	60-69% (72-83)	Below Average — significant issues
F	<60% (<72)	Poor — needs fundamental redesign
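The total and grade follow directly from the dimension maxima and the cutoffs in the table. A minimal sketch, with `MAX_POINTS` and `grade` as hypothetical names chosen for illustration:

```python
# Maximum points per dimension, as listed in the dimension headings (sum: 120).
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}

# Point cutoffs from the grade scale: 90%, 80%, 70%, and 60% of 120.
GRADE_CUTOFFS = [(108, "A"), (96, "B"), (84, "C"), (72, "D")]


def grade(scores):
    """Return (total, letter) for a dict mapping 'D1'..'D8' to scores."""
    for dim, maximum in MAX_POINTS.items():
        if not 0 <= scores[dim] <= maximum:
            raise ValueError(f"{dim} must be between 0 and {maximum}")
    total = sum(scores[dim] for dim in MAX_POINTS)
    for cutoff, letter in GRADE_CUTOFFS:
        if total >= cutoff:
            return total, letter
    return total, "F"
```

Using the point cutoffs rather than floating-point percentages keeps boundary cases exact: a total of 108 is an A and 107 is a B.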

Step 5: Generate Report

# Skill Evaluation Report: [Skill Name]

## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence assessment]

## Dimension Scores

| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset + Appropriate Procedures | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |

## Critical Issues
[List must-fix problems that significantly impact the Skill's effectiveness]

## Top 3 Improvements
1. [Highest impact improvement with specific guidance]
2. [Second priority improvement]
3. [Third priority improvement]

## Detailed Analysis
[For each dimension scoring below 80%, provide:
- What's missing or problematic
- Specific examples from the Skill
- Concrete suggestions for improvement]

Common Failure Patterns

Pattern 1: The Tutorial

Symptom: Explains what PDF is, how Python works, basic library usage
Root cause: Author assumes Skill should "teach" the model
Fix: Claude already knows this. Delete all basic explanations.
     Focus on expert decisions, trade-offs, and anti-patterns.

Pattern 2: The Dump

Symptom: SKILL.md is 800+ lines with everything included
Root cause: No progressive disclosure design
Fix: Core routing and decision trees in SKILL.md (<300 lines ideal)
     Detailed content in references/, loaded on-demand

Pattern 3: The Orphan References

Symptom: References directory exists but files are never loaded
Root cause: No explicit loading triggers
Fix: Add "MANDATORY - READ ENTIRE FILE" at workflow decision points
     Add "Do NOT Load" to prevent over-loading
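Orphan references can be spotted with a rough text scan. The sketch below is a heuristic, not the Skill's own method: `classify_references` is a hypothetical helper, and treating the uppercase words MANDATORY, MUST, or READ on the same or the preceding line as evidence of a loading trigger is an assumption based on the good and bad trigger examples in D5.

```python
import re

# Assumption: these uppercase directives signal an explicit loading trigger.
TRIGGER_WORDS = re.compile(r"\b(MANDATORY|MUST|READ)\b")


def classify_references(skill_md_text, reference_files):
    """Label each reference file as 'triggered', 'listed', or 'orphan'."""
    lines = skill_md_text.splitlines()
    status = {}
    for name in reference_files:
        hits = [i for i, line in enumerate(lines) if name in line]
        if not hits:
            status[name] = "orphan"  # never mentioned in SKILL.md
            continue
        # Look at the mentioning line plus the one before it, because a
        # directive often wraps onto the line that names the file.
        windows = (" ".join(lines[max(0, i - 1): i + 1]) for i in hits)
        if any(TRIGGER_WORDS.search(window) for window in windows):
            status[name] = "triggered"
        else:
            status[name] = "listed"  # mentioned, but with no loading directive
    return status
```

Files labeled "listed" or "orphan" are the candidates for an embedded "MANDATORY - READ ENTIRE FILE" trigger at the relevant workflow step.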

Pattern 4: The Checkbox Procedure

Symptom: Step 1, Step 2, Step 3... mechanical procedures
Root cause: Author thinks in procedures, not thinking frameworks
Fix: Transform into "Before doing X, ask yourself..."
     Focus on decision principles, not operation sequences

Pattern 5: The Vague Warning

Symptom: "Be careful", "avoid errors", "consider edge cases"
Root cause: Author knows things can go wrong but hasn't articulated specifics
Fix: Specific NEVER list with concrete examples and non-obvious reasons
     "NEVER use X because [specific problem that takes experience to learn]"

Pattern 6: The Invisible Skill

Symptom: Great content but skill rarely gets activated
Root cause: Description is vague, missing keywords, or lacks trigger scenarios
Fix: Description must answer WHAT, WHEN, and include KEYWORDS
     "Use when..." + specific scenarios + searchable terms

Example fix:
BAD:  "Helps with document tasks"
GOOD: "Create, edit, and analyze .docx files. Use when working with
       Word documents, tracked changes, or professional document formatting."
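Part of this check can be automated as a crude lint. The sketch below is illustrative only: `check_description` is a hypothetical helper, the literal phrase "Use when" stands in for the WHEN requirement, and the keyword list must be supplied by the evaluator. Passing the lint does not mean the description is good, only that the obvious gaps are absent.

```python
def check_description(description, keywords):
    """Return a list of problems found in a Skill's description field."""
    problems = []
    lowered = description.lower()

    # WHEN: assume a usable description contains an explicit "Use when" clause.
    if "use when" not in lowered:
        problems.append("missing WHEN: no 'Use when...' clause")

    # KEYWORDS: every term the evaluator expects users to mention should appear.
    missing = [kw for kw in keywords if kw.lower() not in lowered]
    if missing:
        problems.append("missing KEYWORDS: " + ", ".join(missing))

    return problems
```

Run against the two descriptions above with the keywords `.docx` and `Word`, the BAD description yields both problems and the GOOD description yields none.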

Pattern 7: The Wrong Location

Symptom: "When to use this Skill" section in body, not in description
Root cause: Misunderstanding of three-layer loading
Fix: Move all triggering information to description field
     Body is only loaded AFTER triggering decision is made

Pattern 8: The Over-Engineered

Symptom: README.md, CHANGELOG.md, INSTALLATION_GUIDE.md, CONTRIBUTING.md
Root cause: Treating Skill like a software project
Fix: Delete all auxiliary files. Include only what the Agent needs for the task.
     No documentation about the Skill itself.

Pattern 9: The Freedom Mismatch

Symptom: Rigid scripts for creative tasks, vague guidance for fragile operations
Root cause: Not considering task fragility
Fix: High freedom for creative (principles, not steps)
     Low freedom for fragile (exact scripts, no parameters)

Quick Reference Checklist

┌─────────────────────────────────────────────────────────────────────────┐
│  SKILL EVALUATION QUICK CHECK                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  KNOWLEDGE DELTA (most important):                                      │
│    [ ] No "What is X" explanations for basic concepts                   │
│    [ ] No step-by-step tutorials for standard operations                │
│    [ ] Has decision trees for non-obvious choices                       │
│    [ ] Has trade-offs only experts would know                           │
│    [ ] Has edge cases from real-world experience                        │
│                                                                         │
│  MINDSET + PROCEDURES:                                                  │
│    [ ] Transfers thinking patterns (how to think about problems)        │
│    [ ] Has "Before doing X, ask yourself..." frameworks                 │
│    [ ] Includes domain-specific procedures Claude wouldn't know         │
│    [ ] Distinguishes valuable procedures from generic ones              │
│                                                                         │
│  ANTI-PATTERNS:                                                         │
│    [ ] Has explicit NEVER list                                          │
│    [ ] Anti-patterns are specific, not vague                            │
│    [ ] Includes WHY (non-obvious reasons)                               │
│                                                                         │
│  SPECIFICATION (description is critical!):                              │
│    [ ] Valid YAML frontmatter                                           │
│    [ ] name: lowercase, ≤64 chars                                       │
│    [ ] description answers: WHAT does it do?                            │
│    [ ] description answers: WHEN should it be used?                     │
│    [ ] description contains trigger KEYWORDS                            │
│    [ ] description is specific enough for Agent to know when to use     │
│                                                                         │
│  STRUCTURE:                                                             │
│    [ ] SKILL.md < 500 lines (ideal < 300)                               │
│    [ ] Heavy content in references/                                     │
│    [ ] Loading triggers embedded in workflow                            │
│    [ ] Has "Do NOT Load" for preventing over-loading                    │
│                                                                         │
│  FREEDOM:                                                               │
│    [ ] Creative tasks → High freedom (principles)                       │
│    [ ] Fragile operations → Low freedom (exact scripts)                 │
│                                                                         │
│  USABILITY:                                                             │
│    [ ] Decision trees for multi-path scenarios                          │
│    [ ] Working code examples                                            │
│    [ ] Error handling and fallbacks                                     │
│    [ ] Edge cases covered                                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The Meta-Question

When evaluating any Skill, always return to this fundamental question:

"Would an expert in this domain, looking at this Skill, say: 'Yes, this captures knowledge that took me years to learn'?"

If the answer is yes → the Skill has genuine value. If the answer is no → it's compressing what Claude already knows.

The best Skills are compressed expert brains — they take a designer's 10 years of aesthetic accumulation and compress it into 43 lines, or a document expert's operational experience into a 200-line decision tree.

What gets compressed must be things Claude doesn't have. Otherwise, it's garbage compression.

Self-Evaluation Note

This Skill (skill-judge) should itself pass evaluation:

Knowledge Delta: Provides specific evaluation criteria Claude wouldn't generate on its own
Mindset: Shapes how to think about Skill quality, not just checklist items
Anti-Patterns: "NEVER Do When Evaluating" section with specific don'ts
Specification: Valid frontmatter with comprehensive description
Progressive Disclosure: Self-contained, no external references needed
Freedom: Medium freedom appropriate for evaluation task
Pattern: Follows Tool pattern with decision frameworks
Usability: Clear protocol, report template, quick reference

Evaluate this Skill against itself as a calibration exercise.

First Seen

Jan 20, 2026

Security Audits

Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass

Installed on

codex: 416
gemini-cli: 416
cursor: 416
claude-code: 415
opencode: 400
cline: 398
