Evals by danielmiessler/personal_ai_infrastructure
npx skills add https://github.com/danielmiessler/personal_ai_infrastructure --skill Evals
Before executing, check for user customizations at: ~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/
If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
When this skill is invoked, you MUST send this notification BEFORE doing anything else.
Send voice notification:
curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
Output text notification:
Running the WorkflowName workflow in the Evals skill to ACTION...
This is not optional. Execute this curl command immediately upon skill invocation.
A comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).
Key differentiator: evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |
| Request Pattern | Route To |
|---|---|
| Run eval, evaluate suite, run tests, benchmark | Workflows/RunEval.md |
| Compare models, model comparison, A/B test models | Workflows/CompareModels.md |
| Compare prompts, prompt comparison, test prompts | Workflows/ComparePrompts.md |
| Create judge, model grader, evaluation judge | Workflows/CreateJudge.md |
| Create use case, new eval, test case, create suite | Workflows/CreateUseCase.md |
| View results, eval results, scores, pass rate | Workflows/ViewResults.md |
| Trigger | Tool |
|---|---|
| Run suite | Tools/AlgorithmBridge.ts |
| Log failure | Tools/FailureToTask.ts log |
| Convert failures | Tools/FailureToTask.ts convert-all |
| Create suite | Tools/SuiteManager.ts create |
| Check saturation | Tools/SuiteManager.ts check-saturation |
# Run an eval suite
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s <suite>
# Log a failure for later conversion
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts log "description" -c category -s severity
# Convert failures to test tasks
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts convert-all
# Manage suites
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts graduate <name>
Evals is a verification method for THE ALGORITHM ISC rows:
# Run an eval and update the ISC row
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
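A code-based grader can be as small as a deterministic predicate over an agent's output. The following is an illustrative sketch of `string_match` and `regex_match` style graders (hypothetical shapes, not the skill's actual Graders/CodeBased/ implementations):

```typescript
// Minimal code-based graders: deterministic predicates over an output string.
// Illustrative sketch only; the skill's real graders live in Graders/CodeBased/.

type GradeResult = { pass: boolean; score: number };

function stringMatch(output: string, expected: string): GradeResult {
  const pass = output.includes(expected); // exact substring matching
  return { pass, score: pass ? 1 : 0 };
}

function regexMatch(output: string, pattern: RegExp): GradeResult {
  const pass = pattern.test(output); // pattern matching
  return { pass, score: pass ? 1 : 0 };
}

// Example: grade an agent's final message.
const out = "Patched auth.py; empty passwords are now rejected.";
console.log(stringMatch(out, "rejected").pass); // true
console.log(regexMatch(out, /auth\.py/).pass);  // true
```

Because these graders are pure functions, they are fast, reproducible, and cheap to run on every trial, which is exactly the trade-off the table above describes.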
| Grader | Use Case |
|---|---|
| llm_rubric | Score against a detailed rubric |
| natural_language_assert | Check that assertions are true |
| pairwise_comparison | Compare to a reference, with position swap |
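The position swap in pairwise_comparison guards against judges that favor whichever answer appears first. A sketch of the idea, with the judge injected as a plain function (a hypothetical interface; the real grader calls an LLM):

```typescript
// Pairwise comparison with position swap: ask the judge twice with the
// candidates in both orders, and only count a win if the verdicts agree.
// The judge is injected here for illustration; the real grader calls an LLM.

type Judge = (first: string, second: string) => "first" | "second";

function pairwise(judge: Judge, a: string, b: string): "a" | "b" | "tie" {
  const round1 = judge(a, b); // a shown in the first position
  const round2 = judge(b, a); // positions swapped
  if (round1 === "first" && round2 === "second") return "a"; // both rounds pick a
  if (round1 === "second" && round2 === "first") return "b"; // both rounds pick b
  return "tie"; // verdicts disagree: likely position bias, treat as inconclusive
}

// A judge that always prefers the first position is neutralized to a tie:
const biased: Judge = () => "first";
console.log(pairwise(biased, "A", "B")); // "tie"

// A content-based judge (here: prefers the longer answer) stays consistent:
const byLength: Judge = (f, s) => (f.length >= s.length ? "first" : "second");
console.log(pairwise(byLength, "short", "much longer answer")); // "b"
```

Disagreement between the two rounds can also be scored as 0.5 rather than a hard tie; the sketch uses the stricter interpretation.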
Pre-configured grader stacks for common agent types:
| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |
See Data/DomainPatterns.yaml for full configurations.
task:
id: "fix-auth-bypass_1"
description: "Fix authentication bypass when password is empty"
type: regression # or capability
domain: coding
graders:
- type: binary_tests
required: [test_empty_pw.py]
weight: 0.30
- type: tool_calls
weight: 0.20
params:
sequence: [read_file, edit_file, run_tests]
- type: llm_rubric
weight: 0.50
params:
rubric: prompts/security_review.md
trials: 3
pass_threshold: 0.75
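The grader weights in the task config above combine into a single score that is compared against pass_threshold. A sketch of that weighted aggregation, using the example's weights (0.30 / 0.20 / 0.50) with illustrative per-grader scores (the exact aggregation rule is an assumption, not the skill's confirmed implementation):

```typescript
// Weighted aggregation of grader scores, mirroring the task config above.
// The per-grader scores are illustrative; a real run would produce them.

type GraderScore = { type: string; weight: number; score: number }; // score in [0, 1]

function aggregate(graders: GraderScore[], passThreshold: number) {
  const total = graders.reduce((sum, g) => sum + g.weight * g.score, 0);
  return { total, pass: total >= passThreshold };
}

const result = aggregate(
  [
    { type: "binary_tests", weight: 0.3, score: 1.0 }, // all required tests pass
    { type: "tool_calls", weight: 0.2, score: 0.5 },   // partial sequence match
    { type: "llm_rubric", weight: 0.5, score: 0.8 },   // rubric score
  ],
  0.75, // pass_threshold from the config
);
console.log(result.total.toFixed(2), result.pass); // "0.80" true
```

With trials: 3, a run like this would be repeated three times and the per-trial pass results combined (e.g. as pass@k) before reporting.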
| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation checks |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |
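Tools/TrialRunner.ts reports pass@k: the probability that at least one of k trials sampled from n recorded trials (of which c passed) is a passing trial. The estimator commonly used for this is pass@k = 1 − C(n−c, k)/C(n, k); the sketch below is a standalone illustration of that formula, not TrialRunner's actual code:

```typescript
// pass@k estimator: probability that at least one of k samples, drawn without
// replacement from n recorded trials of which c passed, is a passing trial.
//   pass@k = 1 - C(n - c, k) / C(n, k)

function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer failures than samples: a pass is guaranteed
  let ratio = 1;
  for (let i = 0; i < k; i++) {
    ratio *= (n - c - i) / (n - i); // C(n-c, k) / C(n, k), built incrementally
  }
  return 1 - ratio;
}

console.log(passAtK(5, 2, 1)); // 0.4 (2 of 5 trials passed)
console.log(passAtK(5, 0, 3)); // 0   (no passing trials at all)
console.log(passAtK(5, 5, 1)); // 1   (every trial passed)
```

Computing the binomial ratio incrementally avoids the overflow that explicit factorials would hit for large n.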
Weekly Installs: 63
GitHub Stars: 10.5K
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: gemini-cli (56), codex (55), opencode (54), github-copilot (53), claude-code (50), amp (49)