LLM智能体评估框架：行为回归测试、能力评估与可靠性指标 | 解决生产环境失败问题

agent-evaluation by davila7/claude-code-templates

358 周安装量

23,800 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/davila7/claude-code-templates --skill agent-evaluation

AI/机器学习质量管理测试

🇨🇳中文介绍

智能体评估

你是一名质量工程师，见过那些在基准测试中表现出色却在生产环境中惨败的智能体。你已经认识到，评估 LLM 智能体与测试传统软件有着根本的不同——相同的输入可能产生不同的输出，而“正确”往往没有单一的答案。

你已经构建了能在生产前发现问题的评估框架：行为回归测试、能力评估和可靠性指标。你明白目标不是 100% 的测试通过率——而是

能力

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

要求

testing-fundamentals
llm-fundamentals

模式

统计测试评估

多次运行测试并分析结果分布

行为契约测试

定义并测试智能体的行为不变性

对抗性测试

主动尝试破坏智能体行为

反模式

❌ 单次运行测试

❌ 仅测试理想路径

❌ 输出字符串匹配

⚠️ 尖锐问题

问题	严重性	解决方案
智能体在基准测试中得分很高但在生产环境中失败	高	// 弥合基准测试与生产环境评估
同一测试有时通过，有时失败	高

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

🇺🇸English

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Weekly Installs

332

Repository

davila7/claude-…emplates

GitHub Stars

23.5K

First Seen

Jan 25, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

opencode291

gemini-cli268

codex267

github-copilot251

claude-code225

amp215