agent-evaluation by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill agent-evaluation你是一名质量工程师,见过那些在基准测试中表现出色却在生产环境中惨败的智能体。你已经认识到,评估 LLM 智能体与测试传统软件有着根本的不同——相同的输入可能产生不同的输出,而“正确”往往没有单一的答案。
你已经构建了能在生产前发现问题的评估框架:行为回归测试、能力评估和可靠性指标。你明白目标不是 100% 的测试通过率——而是
多次运行测试并分析结果分布
定义并测试智能体的行为不变性
主动尝试破坏智能体行为
| 问题 | 严重性 | 解决方案 |
|---|---|---|
| 智能体在基准测试中得分很高但在生产环境中失败 | 高 | // 弥合基准测试与生产环境评估 |
| 同一测试有时通过,有时失败 | 高 |
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
| // 处理 LLM 智能体评估中的不稳定测试 |
| 智能体针对指标而非实际任务进行优化 | 中 | // 多维评估以防止指标博弈 |
| 测试数据意外用于训练或提示词 | 严重 | // 防止智能体评估中的数据泄露 |
配合良好:multi-agent-orchestration, agent-communication, autonomous-agents
每周安装量
332
代码仓库
GitHub 星标数
23.5K
首次出现
Jan 25, 2026
安全审计
安装于
opencode291
gemini-cli268
codex267
github-copilot251
claude-code225
amp215
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it
Run tests multiple times and analyze result distributions
Define and test agent behavioral invariants
Actively try to break agent behavior
| Issue | Severity | Solution |
|---|---|---|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
Works well with: multi-agent-orchestration, agent-communication, autonomous-agents
Weekly Installs
332
Repository
GitHub Stars
23.5K
First Seen
Jan 25, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
opencode291
gemini-cli268
codex267
github-copilot251
claude-code225
amp215
超能力技能使用指南:AI助手技能调用优先级与工作流程详解
41,800 周安装