agent-eval by affaan-m/everything-claude-code
npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool makes them systematic.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```
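As a rough illustration of how a task definition like the one above might be checked before a run, the sketch below validates an already-parsed task mapping in Python. The `Task` and `Judge` dataclasses, the `parse_task` helper, and the list of required fields are all hypothetical: they are inferred from the example, not taken from agent-eval's actual schema or code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed required keys, inferred from the example task; not agent-eval's real schema.
REQUIRED_FIELDS = ("name", "repo", "prompt", "judge")


@dataclass
class Judge:
    type: str                                   # e.g. "pytest", "grep", "command", "llm"
    options: dict = field(default_factory=dict)  # remaining keys such as command or pattern


@dataclass
class Task:
    name: str
    repo: str
    prompt: str
    files: List[str]
    judges: List[Judge]
    commit: Optional[str] = None                # optional pin for reproducibility


def parse_task(raw: dict) -> Task:
    """Turn a parsed YAML mapping into a Task, failing early on missing keys."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError("task is missing required fields: " + ", ".join(missing))

    judges = []
    for entry in raw["judge"]:
        if "type" not in entry:
            raise ValueError("every judge entry needs a 'type'")
        options = {k: v for k, v in entry.items() if k != "type"}
        judges.append(Judge(type=entry["type"], options=options))

    return Task(
        name=raw["name"],
        repo=raw["repo"],
        prompt=raw["prompt"],
        files=list(raw.get("files", [])),
        judges=judges,
        commit=raw.get("commit"),
    )
```

Failing before any agent is invoked keeps a malformed task from consuming API spend on runs that could never be judged.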
Each agent run gets its own git worktree, so no Docker is required. This isolates runs reproducibly: agents cannot interfere with each other or corrupt the base repo.
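The worktree isolation described here can be approximated with plain git commands. The following is a minimal sketch, not agent-eval's implementation: the function names are hypothetical, and the only external behavior relied on is the standard `git worktree add --detach` and `git worktree remove --force` subcommands.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(worktree_path: str, commit: str) -> List[str]:
    """Build the git command that checks out `commit` into a new worktree.

    --detach checks the commit out without creating or moving any branch,
    so the base repository's branches stay untouched.
    """
    return ["git", "worktree", "add", "--detach", worktree_path, commit]


def create_worktree(repo: str, worktree_path: str, commit: str) -> Path:
    """Create an isolated checkout for one agent run (requires git on PATH)."""
    # A relative worktree_path is resolved against `repo`, because git runs there.
    subprocess.run(worktree_add_command(worktree_path, commit), cwd=repo, check=True)
    return Path(worktree_path)


def remove_worktree(repo: str, worktree_path: str) -> None:
    """Discard the checkout, including any uncommitted changes the agent left."""
    subprocess.run(
        ["git", "worktree", "remove", "--force", worktree_path],
        cwd=repo,
        check=True,
    )
```

Giving each run its own path (for example, one directory per agent and run index) means every agent starts from the same pinned commit.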
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
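To make the metrics concrete, here is a small sketch of how they could be aggregated over repeated runs. `RunResult` and `summarize` are hypothetical names, and averaging cost and time across runs is an assumption; the source does not say how agent-eval aggregates them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class RunResult:
    passed: bool                      # did the run satisfy the judge criteria?
    seconds: float                    # wall-clock time to completion
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: List[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(1 for r in runs if r.passed)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        # Consistency as a whole-number percentage, e.g. 2 of 3 passes rounds to 67.
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": mean(costs) if costs else None,
        "mean_seconds": mean(r.seconds for r in runs),
    }
```

Runs without cost data are left out of the cost average rather than counted as zero, which matches the "when available" qualifier in the table.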
Create a tasks/ directory with YAML files, one per task:
```bash
mkdir tasks
# Write task definitions (see template above)
```
Execute agents against your tasks:
```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```
Each run executes in its own git worktree and is scored against the task's judge criteria.
Generate a comparison report:
```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```
```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
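The judge types above differ mainly in what counts as a pass. The sketch below shows one plausible reading of the `grep` and `command` judges in Python. It is an assumption about their semantics, not agent-eval's code: `grep_judge` passes when any matching file contains the regex, and `command_judge` passes when the command exits with status 0. The `llm` judge is omitted because it depends on a model API the source does not specify.

```python
import glob
import os
import re
import shlex
import subprocess


def grep_judge(pattern: str, file_glob: str, root: str = ".") -> bool:
    """Pass if any file matching file_glob (relative to root) contains the regex."""
    regex = re.compile(pattern)
    # recursive=True lets "**" span zero or more directories, as in src/**/*.py
    for path in glob.glob(os.path.join(root, file_glob), recursive=True):
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            if regex.search(handle.read()):
                return True
    return False


def command_judge(command: str, cwd: str = ".") -> bool:
    """Pass if the command exits with status 0 when run inside cwd."""
    # shlex.split avoids invoking a shell; output is captured to keep reports quiet.
    result = subprocess.run(shlex.split(command), cwd=cwd, capture_output=True)
    return result.returncode == 0
```

For instance, `command_judge("pytest tests/ -v", cwd=worktree)` would pass only if the test suite succeeds in that run's worktree, where `worktree` stands for whatever directory the run was given.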