skill-judge by softaworks/agent-toolkit
npx skills add https://github.com/softaworks/agent-toolkit --skill skill-judge
Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.
A Skill is NOT a tutorial. A Skill is a knowledge externalization mechanism.
Traditional AI knowledge is locked in model parameters. To teach new capabilities:
Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months
Skills change this:
Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant
This is the paradigm shift from "training AI" to "educating AI" — like a hot-swappable LoRA adapter that requires no training. You edit a Markdown file in natural language, and the model's behavior changes.
Good Skill = Expert-only Knowledge − What Claude Already Knows
A Skill's value is measured by its knowledge delta — the gap between what it provides and what the model already knows.
When a Skill explains "what is PDF" or "how to write a for-loop", it is compressing knowledge Claude already has. That is token waste: the context window is a shared resource, divided among system prompts, conversation history, other Skills, and user requests.
| Concept | Essence | Function | Example |
|---|---|---|---|
| Tool | What model CAN do | Execute actions | bash, read_file, write_file, WebSearch |
| Skill | What model KNOWS how to do | Guide decisions | PDF processing, MCP building, frontend design |
Tools define capability boundaries: without the bash tool, the model can't execute commands. Skills inject knowledge: without the frontend-design Skill, the model produces generic UI.
The equation:
General Agent + Excellent Skill = Domain Expert Agent
The same Claude model, loaded with different Skills, becomes a different expert.
When evaluating, categorize each section:
| Type | Definition | Treatment |
|---|---|---|
| Expert | Claude genuinely doesn't know this | Must keep — this is the Skill's value |
| Activation | Claude knows but may not think of | Keep if brief — serves as reminder |
| Redundant | Claude definitely knows this | Should delete — wastes tokens |
The art of Skill design is maximizing Expert content, using Activation sparingly, and eliminating Redundant ruthlessly.
D1: Knowledge Delta (20 points)
The most important dimension. Does the Skill add genuine expert knowledge?
| Score | Criteria |
|---|---|
| 0-5 | Explains basics Claude knows (what is X, how to write code, standard library tutorials) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta — every paragraph earns its tokens |
Red flags (instant score ≤5):
Green flags (indicators of high knowledge delta):
Evaluation questions:
D2: Mindset vs Mechanics (15 points)
Does the Skill transfer expert thinking patterns along with necessary domain-specific procedures?
The difference between experts and novices isn't "knowing how to operate" — it's "how to think about the problem." But thinking patterns alone aren't enough when Claude lacks domain-specific procedural knowledge.
Key distinction:
| Type | Example | Value |
|---|---|---|
| Thinking patterns | "Before designing, ask: What makes this memorable?" | High — shapes decision-making |
| Domain-specific procedures | "OOXML workflow: unpack → edit XML → validate → pack" | High — Claude may not know this |
| Generic procedures | "Step 1: Open file, Step 2: Edit, Step 3: Save" | Low — Claude already knows |
| Score | Criteria |
|---|---|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |
What counts as valuable procedures:
What counts as redundant procedures:
Expert thinking patterns look like:
Before [action], ask yourself:
- **Purpose**: What problem does this solve? Who uses it?
- **Constraints**: What are the hidden requirements?
- **Differentiation**: What makes this solution memorable?
Valuable domain procedures look like:
### Redlining Workflow (Claude wouldn't know this sequence)
1. Convert to markdown: `pandoc --track-changes=all`
2. Map text to XML: grep for text in document.xml
3. Implement changes in batches of 3-10
4. Pack and verify: check ALL changes were applied
Redundant generic procedures look like:
Step 1: Open the file
Step 2: Find the section
Step 3: Make the change
Step 4: Save and test
The test:
A good Skill provides both when needed.
D3: Anti-Pattern Quality (15 points)
Does the Skill have effective NEVER lists?
Why this matters: Half of expert knowledge is knowing what NOT to do. A senior designer sees a purple gradient on a white background and instinctively cringes: "too AI-generated." This intuition for what absolutely not to do comes from stepping on countless landmines.
Claude hasn't stepped on these landmines. It doesn't know that the Inter font is overused, or that purple gradients are a signature of AI-generated content. Good Skills must state these "absolute don'ts" explicitly.
| Score | Criteria |
|---|---|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful", "consider edge cases") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Expert-grade anti-patterns with WHY — things only experience teaches |
Expert anti-patterns (specific + reason):
NEVER use generic AI-generated aesthetics like:
- Overused font families (Inter, Roboto, Arial)
- Cliched color schemes (particularly purple gradients on white backgrounds)
- Predictable layouts and component patterns
- Default border-radius on everything
Weak anti-patterns (vague, no reasoning):
Avoid making mistakes.
Be careful with edge cases.
Don't write bad code.
The test: Would an expert read the anti-pattern list and say "yes, I learned this the hard way"? Or would they say "this is obvious to everyone"?
D4: Specification Compliance (15 points)
Does the Skill follow the official format requirements? Pay special attention to description quality.
| Score | Criteria |
|---|---|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but description is vague or incomplete |
| 11-13 | Valid frontmatter, description has WHAT but weak on WHEN |
| 14-15 | Perfect: comprehensive description with WHAT, WHEN, and trigger keywords |
Frontmatter requirements:
- name: lowercase, alphanumeric + hyphens only, ≤64 characters
- description: THE MOST CRITICAL FIELD; it determines whether the Skill gets used at all
Why description is THE MOST IMPORTANT field:
SKILL ACTIVATION FLOW
User Request → Agent sees ALL skill descriptions (only descriptions, not bodies!) → Decides which to activate
If description doesn't match → Skill NEVER gets loaded
If description is vague → Skill might not trigger when it should
If description lacks keywords → Skill is invisible to the Agent
The brutal truth: A Skill with perfect content but a poor description is useless, because it will never be activated. The description is the only chance to tell the Agent "use me in these situations."
Description must answer THREE questions: WHAT does the Skill do? WHEN should it be used? Which KEYWORDS should trigger it?
Excellent description (all three elements):
description: "Comprehensive document creation, editing, and analysis with support
for tracked changes, comments, formatting preservation, and text extraction.
When Claude needs to work with professional documents (.docx files) for:
(1) Creating new documents, (2) Modifying or editing content,
(3) Working with tracked changes, (4) Adding comments, or any other document tasks"
Analysis:
Poor description (missing elements):
description: "Handles document-related functionality"
Problems:
Another poor example:
description: "A helpful skill for various tasks"
This is useless: the Agent has no idea when to activate it.
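The frontmatter rules above lend themselves to a quick mechanical pre-check. The following is a minimal sketch, not an official validator: the function name `check_frontmatter` and the `WHEN_CUES` phrase list are assumptions made for illustration, and the WHEN test is only a crude keyword heuristic that cannot replace reading the description.

```python
import re

# Rule stated in the text: lowercase, alphanumeric + hyphens only.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")

# Hypothetical cue phrases hinting that a description says WHEN to use the Skill.
WHEN_CUES = ("when ", "use for", "use this")


def check_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name: use lowercase letters, digits, and hyphens only")
    if len(name) > 64:
        problems.append("name: longer than 64 characters")

    text = description.strip().lower()
    if not text:
        problems.append("description: missing")
    elif not any(cue in text for cue in WHEN_CUES):
        problems.append("description: no WHEN cue found")
    return problems


if __name__ == "__main__":
    print(check_frontmatter("Skill_Judge", "A helpful skill for various tasks"))
```

A description that passes this check can still be vague, so treat a clean result as a floor rather than a score.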
Description quality checklist:
D5: Progressive Disclosure (15 points)
Does the Skill implement proper content layering?
Skill loading has three layers:
Layer 1: Metadata (always in memory)
Only name + description
~100 tokens per skill
Layer 2: SKILL.md Body (loaded after triggering)
Detailed guidelines, code examples, decision trees
Ideal: < 500 lines
Layer 3: Resources (loaded on demand)
scripts/, references/, assets/
No limit
| Score | Criteria |
|---|---|
| 0-5 | Everything dumped in SKILL.md (>500 lines, no structure) |
| 6-10 | Has references but unclear when to load them |
| 11-13 | Good layering with MANDATORY triggers present |
| 14-15 | Perfect: decision trees + explicit triggers + "Do NOT Load" guidance |
For Skills WITH a references directory, check the quality of the loading triggers:
| Trigger Quality | Characteristics |
|---|---|
| Poor | References listed at end, no loading guidance |
| Mediocre | Some triggers but not embedded in workflow |
| Good | MANDATORY triggers in workflow steps |
| Excellent | Scenario detection + conditional triggers + "Do NOT Load" |
The loading problem:
Loading too little:
- References sit unused
- Agent doesn't know when to load
- Knowledge is there but never accessed
Loading too much:
- Wastes context space
- Irrelevant info dilutes key content
- Unnecessary token overhead
Good loading trigger (embedded in workflow):
### Creating New Document
**MANDATORY - READ ENTIRE FILE**: Before proceeding, you MUST read
[`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish.
**NEVER set any range limits when reading this file.**
**Do NOT load** `ooxml.md` or `redlining.md` for this task.
Bad loading trigger (just listed):
## References
- docx-js.md - for creating documents
- ooxml.md - for editing
- redlining.md - for tracking changes
For simple Skills (no references, <100 lines): Score based on conciseness and self-containment.
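A first pass over loading triggers can be automated by scanning the SKILL.md text for the cues named above. This is a rough sketch under stated assumptions: `scan_loading_triggers` is a hypothetical helper, it only detects whether the cues exist, and it cannot tell whether a trigger is actually embedded in a workflow step, which still needs a human read.

```python
import re


def scan_loading_triggers(skill_md: str) -> dict:
    """Report which loading cues appear in the text of a SKILL.md file."""
    return {
        # Explicit "MANDATORY" trigger wording, as in the good example above.
        "mandatory": "MANDATORY" in skill_md,
        # Any casing of "Do NOT load" counts as over-loading guidance.
        "do_not_load": re.search(r"do not load", skill_md, re.IGNORECASE) is not None,
        # Distinct Markdown files mentioned anywhere in the text.
        "referenced_files": sorted(set(re.findall(r"[\w-]+\.md\b", skill_md))),
    }


if __name__ == "__main__":
    listed_only = "## References\n- docx-js.md - for creating documents\n"
    print(scan_loading_triggers(listed_only))
```

Referenced files with neither cue present correspond to the "Poor" row of the trigger-quality table: the references are listed but nothing tells the Agent when to load them.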
D6: Freedom Calibration (15 points)
Is the level of specificity appropriate for the task's fragility?
Different tasks need different levels of constraint. This is about matching freedom to fragility.
| Score | Criteria |
|---|---|
| 0-5 | Severely mismatched (rigid scripts for creative tasks, vague for fragile ops) |
| 6-10 | Partially appropriate, some mismatches |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |
The freedom spectrum:
| Task Type | Should Have | Why | Example Skill |
|---|---|---|---|
| Creative/Design | High freedom | Multiple valid approaches, differentiation is value | frontend-design |
| Code review | Medium freedom | Principles exist but judgment required | code-review |
| File format operations | Low freedom | One wrong byte corrupts file, consistency critical | docx, xlsx, pdf |
High freedom (text-based instructions):
Commit to a BOLD aesthetic direction. Pick an extreme: brutally minimal,
maximalist chaos, retro-futuristic, organic natural...
Medium freedom (pseudocode or parameterized):
Review priority:
1. Security vulnerabilities (must fix)
2. Logic errors (must fix)
3. Performance issues (should fix)
4. Maintainability (optional)
Low freedom (specific scripts, exact steps):
**MANDATORY**: Use exact script in `scripts/create-doc.py`
Parameters: --title "X" --author "Y"
Do NOT modify the script.
The test: Ask "If the Agent makes a mistake, what is the consequence?"
D7: Pattern Recognition (10 points)
Does the Skill follow an established official pattern?
By analyzing 17 official Skills, we identified 5 main design patterns:
| Pattern | ~Lines | Key Characteristics | Example | When to Use |
|---|---|---|---|---|
| Mindset | ~50 | Thinking > technique, strong NEVER list, high freedom | frontend-design | Creative tasks requiring taste |
| Navigation | ~30 | Minimal SKILL.md, routes to sub-files | internal-comms | Multiple distinct scenarios |
| Philosophy | ~150 | Two-step: Philosophy → Express, emphasizes craft | canvas-design | Art/creation requiring originality |
| Process | ~200 | Phased workflow, checkpoints, medium freedom | mcp-builder | Complex multi-step projects |
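| Tool | ~300 | Decision trees, code examples, low freedom | docx, pdf, xlsx | Precise operations on specific formats |

| Score | Criteria |
|---|---|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern with significant deviations |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of appropriate pattern |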
Pattern selection guide:
| Your Task Characteristics | Recommended Pattern |
|---|---|
| Needs taste and creativity | Mindset (~50 lines) |
| Needs originality and craft quality | Philosophy (~150 lines) |
| Has multiple distinct sub-scenarios | Navigation (~30 lines) |
| Complex multi-step project | Process (~200 lines) |
| Precise operations on specific format | Tool (~300 lines) |
D8: Practical Usability (15 points)
Can an Agent actually use this Skill effectively?
| Score | Criteria |
|---|---|
| 0-5 | Confusing, incomplete, contradictory, or untested guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive coverage including edge cases and error handling |
Check for:
Good usability (decision tree + fallback):
| Task | Primary Tool | Fallback | When to Use Fallback |
|------|-------------|----------|----------------------|
| Read text | pdftotext | PyMuPDF | Need layout info |
| Extract tables | camelot-py | tabula-py | camelot fails |
**Common issues**:
- Scanned PDF: pdftotext returns blank → Use OCR first
- Encrypted PDF: Permission error → Use PyMuPDF with password
Poor usability (vague):
Use appropriate tools for PDF processing.
Handle errors properly.
Consider edge cases.
Read SKILL.md completely and for each section ask:
"Does Claude already know this?"
Mark each section as:
Calculate rough ratio: E:A:R
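Once each section carries an E, A, or R label, the rough ratio is simple arithmetic. The sketch below assumes the ratio is taken over section counts (the text does not say whether to weight by sections or by lines), and `knowledge_ratio` is a hypothetical helper name.

```python
from collections import Counter


def knowledge_ratio(labels: list[str]) -> str:
    """Turn per-section labels ("E", "A", "R") into a rough percentage ratio."""
    if not labels:
        raise ValueError("no sections were labelled")
    counts = Counter(labels)
    # Each share is rounded independently, so the three may not sum to exactly 100.
    shares = [round(100 * counts[kind] / len(labels)) for kind in ("E", "A", "R")]
    return "E:A:R = {}:{}:{}".format(*shares)


if __name__ == "__main__":
    # Ten sections: six Expert, three Activation, one Redundant.
    print(knowledge_ratio(["E"] * 6 + ["A"] * 3 + ["R"]))
```

Weighting by line count instead would penalize a single long redundant section more heavily; either choice works as long as it is applied consistently across the Skills being compared.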
[ ] Check frontmatter validity
[ ] Count total lines in SKILL.md
[ ] List all reference files and their sizes
[ ] Identify which pattern the Skill follows
[ ] Check for loading triggers (if references exist)
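The first three items of this checklist can be gathered with a few lines of standard-library Python. This is a sketch under assumptions: `structural_summary` is a hypothetical helper, the Skill is assumed to live in a directory containing `SKILL.md` and an optional `references/` folder, and the frontmatter check only tests for a leading `---` line rather than validating the YAML.

```python
from pathlib import Path


def structural_summary(skill_dir: str) -> dict:
    """Collect basic structural facts about a Skill directory."""
    root = Path(skill_dir)
    body = (root / "SKILL.md").read_text(encoding="utf-8")

    refs_dir = root / "references"
    ref_files = sorted(p for p in refs_dir.rglob("*") if p.is_file()) if refs_dir.is_dir() else []

    return {
        # Crude presence check only; it does not parse or validate the YAML.
        "has_frontmatter": body.startswith("---"),
        "line_count": len(body.splitlines()),
        # Map each reference file (relative path) to its size in bytes.
        "references": {p.relative_to(root).as_posix(): p.stat().st_size for p in ref_files},
    }
```

Identifying the pattern and judging the loading triggers remain manual steps; the summary only supplies the numbers they are judged against, such as the 500-line guideline for SKILL.md.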
For each of the 8 dimensions:
Total = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8
Max = 120 points
Grade Scale (percentage-based):
| Grade | Percentage | Meaning |
|---|---|---|
| A | 90%+ (108+) | Excellent — production-ready expert Skill |
| B | 80-89% (96-107) | Good — minor improvements needed |
| C | 70-79% (84-95) | Adequate — clear improvement path |
| D | 60-69% (72-83) | Below Average — significant issues |
| F | <60% (<72) | Poor — needs fundamental redesign |
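The total and the grade follow directly from the per-dimension maxima and the scale above. Here is a minimal sketch of that arithmetic; the function name `grade_skill` and the dictionary layout are assumptions for illustration, while the maxima and point cut-offs are taken from the tables in this document.

```python
# Maximum points per dimension, as listed in the report template (sum = 120).
MAX_SCORES = {"D1": 20, "D2": 15, "D3": 15, "D4": 15, "D5": 15, "D6": 15, "D7": 10, "D8": 15}

# Point cut-offs from the grade scale: 108+ is A, 96-107 is B, and so on.
GRADE_CUTOFFS = [(108, "A"), (96, "B"), (84, "C"), (72, "D")]


def grade_skill(scores: dict[str, int]) -> tuple[int, float, str]:
    """Return (total points, percentage, letter grade) for eight dimension scores."""
    for dim, maximum in MAX_SCORES.items():
        if not 0 <= scores[dim] <= maximum:
            raise ValueError(f"{dim} must be between 0 and {maximum}")
    total = sum(scores[dim] for dim in MAX_SCORES)
    percentage = round(100 * total / sum(MAX_SCORES.values()), 1)
    letter = next((g for cutoff, g in GRADE_CUTOFFS if total >= cutoff), "F")
    return total, percentage, letter


if __name__ == "__main__":
    example = {"D1": 16, "D2": 12, "D3": 12, "D4": 12, "D5": 12, "D6": 12, "D7": 8, "D8": 12}
    print(grade_skill(example))
```

Comparing against point cut-offs rather than percentages avoids floating-point edge cases at the grade boundaries.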
# Skill Evaluation Report: [Skill Name]
## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence assessment]
## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset vs Mechanics | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |
## Critical Issues
[List must-fix problems that significantly impact the Skill's effectiveness]
## Top 3 Improvements
1. [Highest impact improvement with specific guidance]
2. [Second priority improvement]
3. [Third priority improvement]
## Detailed Analysis
[For each dimension scoring below 80%, provide:
- What's missing or problematic
- Specific examples from the Skill
- Concrete suggestions for improvement]
Symptom: Explains what PDF is, how Python works, basic library usage
Root cause: Author assumes Skill should "teach" the model
Fix: Claude already knows this. Delete all basic explanations.
Focus on expert decisions, trade-offs, and anti-patterns.
Symptom: SKILL.md is 800+ lines with everything included
Root cause: No progressive disclosure design
Fix: Core routing and decision trees in SKILL.md (<300 lines ideal)
Detailed content in references/, loaded on-demand
Symptom: References directory exists but files are never loaded
Root cause: No explicit loading triggers
Fix: Add "MANDATORY - READ ENTIRE FILE" at workflow decision points
Add "Do NOT Load" to prevent over-loading
Symptom: Step 1, Step 2, Step 3... mechanical procedures
Root cause: Author thinks in procedures, not thinking frameworks
Fix: Transform into "Before doing X, ask yourself..."
Focus on decision principles, not operation sequences
Symptom: "Be careful", "avoid errors", "consider edge cases"
Root cause: Author knows things can go wrong but hasn't articulated specifics
Fix: Specific NEVER list with concrete examples and non-obvious reasons
"NEVER use X because [specific problem that takes experience to learn]"
Symptom: Great content but skill rarely gets activated
Root cause: Description is vague, missing keywords, or lacks trigger scenarios
Fix: Description must answer WHAT, WHEN, and include KEYWORDS
"Use when..." + specific scenarios + searchable terms
Example fix:
BAD: "Helps with document tasks"
GOOD: "Create, edit, and analyze .docx files. Use when working with
Word documents, tracked changes, or professional document formatting."
Symptom: "When to use this Skill" section in body, not in description
Root cause: Misunderstanding of three-layer loading
Fix: Move all triggering information to description field
Body is only loaded AFTER triggering decision is made
Symptom: README.md, CHANGELOG.md, INSTALLATION_GUIDE.md, CONTRIBUTING.md
Root cause: Treating Skill like a software project
Fix: Delete all auxiliary files. Only include what Agent needs for the task.
No documentation about the Skill itself.
Symptom: Rigid scripts for creative tasks, vague guidance for fragile operations
Root cause: Not considering task fragility
Fix: High freedom for creative (principles, not steps)
Low freedom for fragile (exact scripts, no parameters)
SKILL EVALUATION QUICK CHECK

KNOWLEDGE DELTA (most important):
[ ] No "What is X" explanations for basic concepts
[ ] No step-by-step tutorials for standard operations
[ ] Has decision trees for non-obvious choices
[ ] Has trade-offs only experts would know
[ ] Has edge cases from real-world experience

MINDSET + PROCEDURES:
[ ] Transfers thinking patterns (how to think about problems)
[ ] Has "Before doing X, ask yourself..." frameworks
[ ] Includes domain-specific procedures Claude wouldn't know
[ ] Distinguishes valuable procedures from generic ones

ANTI-PATTERNS:
[ ] Has explicit NEVER list
[ ] Anti-patterns are specific, not vague
[ ] Includes WHY (non-obvious reasons)

SPECIFICATION (description is critical!):
[ ] Valid YAML frontmatter
[ ] name: lowercase, ≤64 chars
[ ] description answers: WHAT does it do?
[ ] description answers: WHEN should it be used?
[ ] description contains trigger KEYWORDS
[ ] description is specific enough for Agent to know when to use
[ ] Includes scenarios where this skill MUST be used (not just "can be used")

STRUCTURE:
[ ] SKILL.md < 500 lines (ideal < 300)
[ ] Heavy content in references/
[ ] Loading triggers embedded in workflow
[ ] Has "Do NOT Load" for preventing over-loading

FREEDOM:
[ ] Creative tasks → High freedom (principles)
[ ] Fragile operations → Low freedom (exact scripts)

USABILITY:
[ ] Decision trees for multi-path scenarios
[ ] Working code examples
[ ] Error handling and fallbacks
[ ] Edge cases covered
When evaluating any Skill, always return to this fundamental question:
"Would an expert in this domain, looking at this Skill, say: 'Yes, this captures knowledge that took me years to learn'?"
If the answer is yes → the Skill has genuine value. If the answer is no → it's compressing what Claude already knows.
The best Skills are compressed expert brains — they take a designer's 10 years of aesthetic accumulation and compress it into 43 lines, or a document expert's operational experience into a 200-line decision tree.
What gets compressed must be things Claude doesn't have. Otherwise, it's garbage compression.
This Skill (skill-judge) should itself pass evaluation:
Evaluate this Skill against itself as a calibration exercise.
Weekly Installs: 571
GitHub Stars: 1.2K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (416), gemini-cli (416), cursor (416), claude-code (415), opencode (400), cline (398)