agent-eval: a coding-agent evaluation tool for automated, comparative testing of AI coding assistants

agent-eval by affaan-m/everything-claude-code

167 weekly installs

102,100 GitHub Stars

GitHub

Install command

npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval

AI/Machine Learning, Automated Testing




Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool makes them systematic.

When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
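
A task file like the one above can be sanity-checked before any agent is run. The sketch below is only an illustration: it assumes the YAML has already been parsed into a Python dict, the required field names are inferred from the example task rather than taken from agent-eval's actual schema, and `validate_task` is a hypothetical helper, not part of the tool.

```python
# Hypothetical pre-flight check for a parsed task definition.
# Field names are inferred from the example task, not from agent-eval's schema.
REQUIRED_FIELDS = ("name", "repo", "prompt", "judge")


def validate_task(task: dict) -> list[str]:
    """Return a list of problems; an empty list means the task looks usable."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in task]

    judges = task.get("judge")
    if judges is not None:
        if not isinstance(judges, list) or not judges:
            problems.append("judge must be a non-empty list")
        else:
            for i, judge in enumerate(judges):
                if not isinstance(judge, dict) or "type" not in judge:
                    problems.append(f"judge[{i}] has no type")

    if "commit" not in task:
        # Unpinned tasks still run, but results are harder to reproduce.
        problems.append("commit is not pinned")
    return problems


task = {
    "name": "add-retry-logic",
    "repo": "./my-project",
    "files": ["src/http_client.py"],
    "prompt": "Add retry logic with exponential backoff to all HTTP requests.",
    "judge": [{"type": "pytest", "command": "pytest tests/test_http_client.py -v"}],
    "commit": "abc1234",
}
print(validate_task(task))  # []
```

Returning a list of problems instead of raising on the first one lets a whole tasks/ directory be checked in a single pass.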

Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This provides reproducible isolation: agents cannot interfere with each other or corrupt the base repo.
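
The isolation step can be approximated with plain `git worktree` commands. The following is a minimal sketch of the idea, not agent-eval's own implementation; the helper names and the choice of a detached checkout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def worktree_add_command(worktree_dir: Path, commit: str) -> list[str]:
    """Build the git command that checks out `commit` into a new, detached worktree."""
    return ["git", "worktree", "add", "--detach", str(worktree_dir), commit]


def create_worktree(repo: Path, worktree_dir: Path, commit: str) -> None:
    """Create an isolated checkout of `commit`; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_command(worktree_dir, commit), cwd=repo, check=True)


def remove_worktree(repo: Path, worktree_dir: Path) -> None:
    """Discard the worktree, including uncommitted changes the agent left behind."""
    subprocess.run(
        ["git", "worktree", "remove", "--force", str(worktree_dir)],
        cwd=repo,
        check=True,
    )
```

Because every worktree shares the base repository's object store, creating one per run is cheap compared with cloning, and each agent edits files in a directory that no other run touches.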

Metrics Collected

Metric	What It Measures
Pass rate	Did the agent produce code that passes the judge?
Cost	API spend per task (when available)
Time	Wall-clock seconds to completion
Consistency	Pass rate across repeated runs (e.g., 3/3 = 100%)
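
As a rough illustration of how these four metrics could be derived from raw run records, consider the sketch below. The `RunResult` type and `summarize` function are hypothetical, and averaging cost and time across runs is an assumption; the tool may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent does not report spend


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("no runs to summarize")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": mean(costs) if costs else None,
        "mean_seconds": mean(r.seconds for r in runs),
    }


runs = [
    RunResult(True, 40.0, 0.08),
    RunResult(True, 36.0, 0.08),
    RunResult(False, 38.0, 0.08),
]
summary = summarize(runs)
print(summary["pass_rate"], summary["consistency_pct"])  # 2/3 67
```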

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Execute agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time
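
The four steps above can be sketched as a single function with the moving parts injected as callables. This is an illustrative outline only: `run_once` and its parameters are hypothetical, cost is left out because it depends on what each agent reports, and requiring every judge to pass is an assumption rather than documented behavior.

```python
import time
from typing import Callable


def run_once(
    prepare: Callable[[], str],
    run_agent: Callable[[str], None],
    judges: list[Callable[[str], bool]],
) -> dict:
    """Run one trial: prepare a workspace, let the agent work, then apply every judge."""
    workdir = prepare()                    # e.g. create a fresh git worktree
    start = time.perf_counter()
    run_agent(workdir)                     # hand the prompt to the agent
    seconds = time.perf_counter() - start  # wall-clock time spent by the agent
    passed = all(judge(workdir) for judge in judges)  # assumed: all judges must pass
    return {"passed": passed, "seconds": seconds}


# Stub callables stand in for the real workspace, agent, and judges.
result = run_once(
    prepare=lambda: "/tmp/example-worktree",
    run_agent=lambda workdir: None,
    judges=[lambda workdir: True, lambda workdir: False],
)
print(result["passed"])  # False
```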

3. Compare Results

Generate a comparison report:

agent-eval report --format table



Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
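
A deterministic judge of this kind reduces to "run a command and check its exit status". The sketch below shows that idea under stated assumptions: `command_judge` is a hypothetical helper, the timeout value is arbitrary, and treating a missing executable as a failure is a design choice made here, not something the tool documents.

```python
import subprocess
import sys


def command_judge(command: list[str], cwd: str = ".", timeout: float = 600.0) -> bool:
    """Pass when the command exits with status 0 within the timeout.

    `command` is an argument list, e.g. shlex.split("pytest tests/ -v").
    """
    try:
        completed = subprocess.run(
            command, cwd=cwd, capture_output=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return completed.returncode == 0


# The current Python interpreter stands in for a real test or build command.
print(command_judge([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(command_judge([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```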

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
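
A pattern judge can be approximated with a regular-expression search over files selected by a glob. In the sketch below, `grep_judge` is a hypothetical helper, and passing as soon as any one file matches is an assumption about the intended semantics.

```python
import re
import tempfile
from pathlib import Path


def grep_judge(root: Path, pattern: str, files_glob: str) -> bool:
    """Pass when any file matching the glob contains a line matching the regex."""
    regex = re.compile(pattern)
    for path in root.glob(files_glob):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if any(regex.search(line) for line in text.splitlines()):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src", "net")
    src.mkdir(parents=True)
    (src / "client.py").write_text("class HttpRetryPolicy:\n    pass\n", encoding="utf-8")
    print(grep_judge(Path(tmp), r"class.*Retry", "src/**/*.py"))    # True
    print(grep_judge(Path(tmp), r"class.*Backoff", "src/**/*.py"))  # False
```

A pattern match only shows that certain text exists, not that the code works, which is why it pairs well with a test-based judge.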

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
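
One way to turn a free-form model reply into a pass/fail result is to request a fixed verdict word and parse only that. The sketch below illustrates the approach; `llm_judge`, the verdict instruction, and the `ask_model` callable (a stand-in for whichever model client is used) are all hypothetical and not taken from agent-eval.

```python
from typing import Callable

VERDICT_INSTRUCTION = "Answer with PASS or FAIL on the first line, then a short reason."


def llm_judge(question: str, diff: str, ask_model: Callable[[str], str]) -> bool:
    """Ask a model to grade a diff; anything other than a leading PASS counts as failure."""
    prompt = f"{question.strip()}\n\n{VERDICT_INSTRUCTION}\n\nDiff under review:\n{diff}"
    reply = ask_model(prompt).strip()
    first_line = reply.splitlines()[0] if reply else ""
    return first_line.strip().upper().startswith("PASS")


# A stub replaces the real model call so the example runs offline.
def fake_model(prompt: str) -> str:
    return "PASS\nDelays double on each attempt and are capped."


print(llm_judge("Does this handle exponential backoff?", "+ delay *= 2", fake_model))  # True
```

Defaulting to failure on an empty or malformed reply keeps an unreliable judge from inflating pass rates.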

Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent to capture variance — agents are non-deterministic
Pin the commit in your task YAML so results are reproducible across days/weeks
Include at least one deterministic judge (tests, build) per task — LLM judges add noise
Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
Version your task definitions — they are test fixtures, treat them as code
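
To make the cost-versus-pass-rate trade-off concrete, one option is to compare agents by expected spend per successful run. The sketch below is an illustration, not a metric the tool reports: `cost_per_pass` is a hypothetical helper, and it assumes the reported cost is a per-run average and that failed runs cost the same as passing ones.

```python
def cost_per_pass(cost_per_run: float, passes: int, runs: int) -> float:
    """Expected spend to obtain one passing run, given an average per-run cost."""
    if passes == 0:
        return float("inf")  # the agent never passed, so no budget is enough
    return cost_per_run * runs / passes


# Figures from the sample report: $0.12 at 3/3 versus $0.08 at 2/3.
print(round(cost_per_pass(0.12, passes=3, runs=3), 2))
print(round(cost_per_pass(0.08, passes=2, runs=3), 2))
```

With the sample figures, the cheaper agent's lower pass rate cancels out its lower per-run price: both come to roughly $0.12 per passing run.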

