Evals by danielmiessler/personal_ai_infrastructure
npx skills add https://github.com/danielmiessler/personal_ai_infrastructure --skill Evals
Before executing, check for user customizations at: ~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/
If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
When this skill is invoked, you MUST send this notification BEFORE doing anything else.
Send voice notification:
curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
Output text notification:
Running the WorkflowName workflow in the Evals skill to ACTION...
This is not optional. Execute this curl command immediately upon skill invocation.
A comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).
Key differentiator: evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |
| Request Pattern | Route To |
|---|---|
| Run eval, evaluate suite, run tests, benchmark | Workflows/RunEval.md |
| Compare models, model comparison, A/B test models | Workflows/CompareModels.md |
| Compare prompts, prompt comparison, test prompts | Workflows/ComparePrompts.md |
| Create judge, model grader, evaluation judge | Workflows/CreateJudge.md |
| Create use case, new eval, test case, create suite | Workflows/CreateUseCase.md |
| View results, eval results, scores, pass rate | Workflows/ViewResults.md |
| Trigger | Tool |
|---|---|
| Run suite | Tools/AlgorithmBridge.ts |
| Log failure | Tools/FailureToTask.ts log |
| Convert failures | Tools/FailureToTask.ts convert-all |
| Create suite | Tools/SuiteManager.ts create |
| Check saturation | Tools/SuiteManager.ts check-saturation |
# Run an eval suite
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s <suite>
# Log a failure for later conversion
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts log "description" -c category -s severity
# Convert failures to test tasks
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts convert-all
# Manage suites
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts graduate <name>
Evals is a verification method for THE ALGORITHM ISC rows:
# Run an eval and update the ISC row
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
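A code-based grader can be as small as a deterministic predicate over an agent's output. The following is an illustrative sketch of `string_match` and `regex_match` style graders (hypothetical shapes, not the skill's actual Graders/CodeBased/ implementations):

```typescript
// Minimal code-based graders: deterministic predicates over an output string.
// Illustrative sketch only; the skill's real graders live in Graders/CodeBased/.

type GradeResult = { pass: boolean; score: number };

function stringMatch(output: string, expected: string): GradeResult {
  const pass = output.includes(expected); // exact substring matching
  return { pass, score: pass ? 1 : 0 };
}

function regexMatch(output: string, pattern: RegExp): GradeResult {
  const pass = pattern.test(output); // pattern matching
  return { pass, score: pass ? 1 : 0 };
}

// Example: grade an agent's final message.
const out = "Patched auth.py; empty passwords are now rejected.";
console.log(stringMatch(out, "rejected").pass); // true
console.log(regexMatch(out, /auth\.py/).pass);  // true
```

Because these graders are pure functions, they are fast, reproducible, and cheap to run on every trial, which is exactly the trade-off the table above describes.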
| Grader | Use Case |
|---|---|
| llm_rubric | Score against a detailed rubric |
| natural_language_assert | Check that assertions are true |
| pairwise_comparison | Compare to a reference, with position swap |
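The position swap in pairwise_comparison guards against judges that favor whichever answer appears first. A sketch of the idea, with the judge injected as a plain function (a hypothetical interface; the real grader calls an LLM):

```typescript
// Pairwise comparison with position swap: ask the judge twice with the
// candidates in both orders, and only count a win if the verdicts agree.
// The judge is injected here for illustration; the real grader calls an LLM.

type Judge = (first: string, second: string) => "first" | "second";

function pairwise(judge: Judge, a: string, b: string): "a" | "b" | "tie" {
  const round1 = judge(a, b); // a shown in the first position
  const round2 = judge(b, a); // positions swapped
  if (round1 === "first" && round2 === "second") return "a"; // both rounds pick a
  if (round1 === "second" && round2 === "first") return "b"; // both rounds pick b
  return "tie"; // verdicts disagree: likely position bias, treat as inconclusive
}

// A judge that always prefers the first position is neutralized to a tie:
const biased: Judge = () => "first";
console.log(pairwise(biased, "A", "B")); // "tie"

// A content-based judge (here: prefers the longer answer) stays consistent:
const byLength: Judge = (f, s) => (f.length >= s.length ? "first" : "second");
console.log(pairwise(byLength, "short", "much longer answer")); // "b"
```

Disagreement between the two rounds can also be scored as 0.5 rather than a hard tie; the sketch uses the stricter interpretation.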
Pre-configured grader stacks for common agent types:
| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |
See Data/DomainPatterns.yaml for full configurations.
task:
id: "fix-auth-bypass_1"
description: "Fix authentication bypass when password is empty"
type: regression # or capability
domain: coding
graders:
- type: binary_tests
required: [test_empty_pw.py]
weight: 0.30
- type: tool_calls
weight: 0.20
params:
sequence: [read_file, edit_file, run_tests]
- type: llm_rubric
weight: 0.50
params:
rubric: prompts/security_review.md
trials: 3
pass_threshold: 0.75
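The grader weights in the task config above combine into a single score that is compared against pass_threshold. A sketch of that weighted aggregation, using the example's weights (0.30 / 0.20 / 0.50) with illustrative per-grader scores (the exact aggregation rule is an assumption, not the skill's confirmed implementation):

```typescript
// Weighted aggregation of grader scores, mirroring the task config above.
// The per-grader scores are illustrative; a real run would produce them.

type GraderScore = { type: string; weight: number; score: number }; // score in [0, 1]

function aggregate(graders: GraderScore[], passThreshold: number) {
  const total = graders.reduce((sum, g) => sum + g.weight * g.score, 0);
  return { total, pass: total >= passThreshold };
}

const result = aggregate(
  [
    { type: "binary_tests", weight: 0.3, score: 1.0 }, // all required tests pass
    { type: "tool_calls", weight: 0.2, score: 0.5 },   // partial sequence match
    { type: "llm_rubric", weight: 0.5, score: 0.8 },   // rubric score
  ],
  0.75, // pass_threshold from the config
);
console.log(result.total.toFixed(2), result.pass); // "0.80" true
```

With trials: 3, a run like this would be repeated three times and the per-trial pass results combined (e.g. as pass@k) before reporting.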
| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation checks |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |
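Tools/TrialRunner.ts reports pass@k: the probability that at least one of k trials sampled from n recorded trials (of which c passed) is a passing trial. The estimator commonly used for this is pass@k = 1 − C(n−c, k)/C(n, k); the sketch below is a standalone illustration of that formula, not TrialRunner's actual code:

```typescript
// pass@k estimator: probability that at least one of k samples, drawn without
// replacement from n recorded trials of which c passed, is a passing trial.
//   pass@k = 1 - C(n - c, k) / C(n, k)

function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer failures than samples: a pass is guaranteed
  let ratio = 1;
  for (let i = 0; i < k; i++) {
    ratio *= (n - c - i) / (n - i); // C(n-c, k) / C(n, k), built incrementally
  }
  return 1 - ratio;
}

console.log(passAtK(5, 2, 1)); // 0.4 (2 of 5 trials passed)
console.log(passAtK(5, 0, 3)); // 0   (no passing trials at all)
console.log(passAtK(5, 5, 1)); // 1   (every trial passed)
```

Computing the binomial ratio incrementally avoids the overflow that explicit factorials would hit for large n.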
Weekly Installs: 63
GitHub Stars: 10.5K
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: gemini-cli (56), codex (55), opencode (54), github-copilot (53), claude-code (50), amp (49)