DEC Bench Eval Scenario Authoring Guide: Building Deterministic Tests for AI Agents

dec-bench-evals by 514-labs/agent-evals

Install command

npx skills add https://github.com/514-labs/agent-evals --skill dec-bench-evals

DEC Bench Evals

Use this skill when the user wants to create or extend a DEC Bench scenario. The goal is a deterministic, runnable evaluation, not a vague benchmark idea.

Quick Start

Default authoring loop:

dec-bench create --name <id> --domain <domain> --tier <tier>
dec-bench validate --scenario <id>
dec-bench build --scenario <id> --harness <harness> --agent <agent> --model <model> --version <version>
dec-bench run --scenario <id> --harness <harness> --persona naive --mode no-plan
dec-bench results --latest --scenario <id>
dec-bench audit open --scenario <id> --run-id <run-id>
dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>

Rules:

Run dec-bench validate before build or run.
Treat build and run as separate checks: build verifies the image path, run verifies scoring behavior.

Use results to inspect the latest run before opening the audit UI.

Use audit open for the browser view, or audit export if you only need the bundle.

If the workspace is not a DEC Bench repo, stop and ask whether the user wants a DEC Bench scenario scaffold or only a scenario design proposal.

Before You Scaffold

Decide these first:

Scenario ID: lowercase, hyphenated, specific to the task.
Domain: one of foo-bar, b2b-saas, b2c-saas, ugc, e-commerce, advertising, consumption-based-infra.
Tier: tier-1, tier-2, or tier-3.
Starting state: broken/incomplete or clean/greenfield.
Primary competency: the main reasoning skill the eval is testing.
Harness: base-rt, classic-de, olap-for-swe, or a justified custom harness.
Success criteria: concrete pass/fail checks, not subjective judgments.

Prefer the smallest tier that still exercises the intended competency. Keep the starting state deterministic and easy to reset.
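
As a rough illustration of the decisions above, the sketch below checks a set of pre-scaffold choices against the allowed values listed in this section. The `ScenarioChoices` shape, the `checkScenarioChoices` helper, and the ID pattern are hypothetical and are not part of the dec-bench CLI; `dec-bench validate` remains the real check.

```typescript
// Allowed values copied from the list above. Everything else is illustrative.
const DOMAINS: readonly string[] = [
  "foo-bar", "b2b-saas", "b2c-saas", "ugc",
  "e-commerce", "advertising", "consumption-based-infra",
];
const TIERS: readonly string[] = ["tier-1", "tier-2", "tier-3"];
const BUILT_IN_HARNESSES: readonly string[] = ["base-rt", "classic-de", "olap-for-swe"];

// Hypothetical shape for the choices made before scaffolding.
interface ScenarioChoices {
  id: string;
  domain: string;
  tier: string;
  harness: string;
}

// Assumed reading of "lowercase, hyphenated": lowercase alphanumeric
// words joined by single hyphens.
const ID_PATTERN = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns a list of human-readable problems; an empty list means no issues found.
function checkScenarioChoices(c: ScenarioChoices): string[] {
  const problems: string[] = [];
  if (!ID_PATTERN.test(c.id)) {
    problems.push(`id "${c.id}" is not lowercase and hyphenated`);
  }
  if (DOMAINS.indexOf(c.domain) === -1) {
    problems.push(`domain "${c.domain}" is not one of the listed domains`);
  }
  if (TIERS.indexOf(c.tier) === -1) {
    problems.push(`tier "${c.tier}" is not tier-1, tier-2, or tier-3`);
  }
  if (BUILT_IN_HARNESSES.indexOf(c.harness) === -1) {
    problems.push(`harness "${c.harness}" is custom and needs a justification`);
  }
  return problems;
}
```

A call such as `checkScenarioChoices({ id: "fix-broken-orders-view", domain: "e-commerce", tier: "tier-1", harness: "base-rt" })` returns an empty list; the scenario ID in that call is made up for the example.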

What Good Evals Look Like

Good DEC Bench scenarios:

test one clear workflow or failure mode
use realistic but compact seed data
make the agent resolve observable constraints
score behavior with deterministic assertions
keep setup reproducible across repeat runs

Avoid:

LLM-as-judge scoring
vague tasks like "improve the pipeline"
hidden state that changes between runs
prompts that move the goalposts between personas
assertions with side effects unless the gate explicitly needs rerun behavior

Scaffold Output

dec-bench create generates a scenario directory with:

prompts/naive.md
prompts/savvy.md
init/
assertions/functional.ts
assertions/correct.ts
assertions/robust.ts
assertions/performant.ts
assertions/production.ts
scenario.json
supervisord.conf

Work through those files in that order.

Prompt Rules

Both prompts must target the same acceptance criteria.

naive.md: plain language, minimal implementation hints, no named tools unless the task would naturally mention them.
savvy.md: explicit tools, schemas, paths, constraints, and operational details.
Do not make the savvy prompt easier by changing the required outcome.
Keep prompts specific enough that assertions feel inevitable rather than surprising.

Infrastructure Rules

Use init/ and supervisord.conf to create the starting state.

Broken/incomplete start: seed healthy-enough infrastructure plus one or more diagnosable defects.
Clean/greenfield start: seed healthy infrastructure and realistic source data, then let the agent build the missing solution.
Keep data deterministic.
Expose connection settings through environment variables that both the agent and assertions can consume.
Start only the services the scenario needs.
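
One way to keep seed data deterministic is to derive every value from a fixed seed and a fixed base timestamp, with no calls to `Math.random()` or `Date.now()`. The sketch below is only an illustration: the `OrderRow` shape, the `seedOrders` helper, and the small linear congruential generator are assumptions, not something the scaffold generates.

```typescript
// Hypothetical row shape for an illustrative "orders" seed table.
interface OrderRow {
  id: number;
  customerId: number;
  amountCents: number;
  createdAt: string;
}

// Small linear congruential generator: the same seed always yields the same sequence.
// Values stay below 2^53, so the arithmetic is exact in JavaScript numbers.
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // in [0, 1)
  };
}

// Builds `count` rows that depend only on `count` and `seed`.
function seedOrders(count: number, seed: number): OrderRow[] {
  const next = makeLcg(seed);
  const baseMs = Date.UTC(2024, 0, 1); // fixed base time, never the wall clock
  const rows: OrderRow[] = [];
  for (let i = 0; i < count; i++) {
    rows.push({
      id: i + 1,
      customerId: 1 + Math.floor(next() * 50),
      amountCents: 100 + Math.floor(next() * 9900),
      createdAt: new Date(baseMs + i * 60000).toISOString(),
    });
  }
  return rows;
}
```

Because the output depends only on the arguments, resetting the scenario and reseeding produces identical rows, which keeps assertions stable across repeat runs.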

In supervisord.conf:

use explicit programs and startup order
keep autorestart=false
avoid incidental background services

Assertion Rules

Assertions are the core of the eval. Write scenario assertions only; the framework provides universal core assertions.

Each exported async function should test one thing.
Function names become assertion keys in the scoring output.
Return an AssertionResult with passed, an actionable message, and useful details.
Keep assertions deterministic, fast, and side-effect free unless rerun behavior is the point.
Put helper functions like queryRows<T>() inside the same gate file.
Prefer database and artifact checks over log-text heuristics.

Use the framework context:

ctx.clickhouse for ClickHouse queries
ctx.postgres for Postgres queries
ctx.env() for connection settings and other environment variables
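
The rules above can be sketched as a single-purpose assertion with a local helper. This is a minimal sketch under stated assumptions: the `AssertionResult` and `AssertionContext` interfaces are guessed from the field names mentioned here, the `query()` signature on `ctx.postgres` is hypothetical, and the `orders` table is invented for the example. See guide.md for the real types.

```typescript
// Assumed result shape, based on the fields named above (passed, message, details).
interface AssertionResult {
  passed: boolean;
  message: string;
  details?: Record<string, unknown>;
}

// Hypothetical minimal context. Only ctx.postgres and ctx.env() are named in the text;
// the query() signature is an assumption made for this sketch.
interface AssertionContext {
  postgres: { query(sql: string): Promise<{ rows: unknown[] }> };
  env(name: string): string | undefined;
}

// Helper kept in the same gate file as the assertions that use it.
async function queryRows<T>(ctx: AssertionContext, sql: string): Promise<T[]> {
  const result = await ctx.postgres.query(sql);
  return result.rows as T[];
}

// One assertion, one check. The function name would become the assertion key.
// In a real gate file this function would be exported; `export` is omitted here
// so the sketch also runs as a plain script.
async function orders_have_unique_ids(ctx: AssertionContext): Promise<AssertionResult> {
  const rows = await queryRows<{ duplicates: number | string }>(
    ctx,
    "SELECT count(*) - count(DISTINCT id) AS duplicates FROM orders", // read-only query
  );
  const duplicates = rows.length > 0 ? Number(rows[0].duplicates) : NaN;
  const passed = duplicates === 0;
  return {
    passed,
    message: passed
      ? "orders.id is unique"
      : `Expected no duplicate order ids, found ${duplicates}; deduplicate before loading`,
    details: { duplicates },
  };
}
```

The check reads from the database instead of parsing logs, has no side effects, and returns the measured value in details so a failing run is easy to diagnose.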

The five gates, in order:

Functional: it runs
Correct: it is right
Robust: it handles messy or repeated execution
Performant: it meets runtime or query thresholds
Production: you would ship it

A gate only counts if earlier gates pass. Scenario assertions must clear the 80% gate threshold together with the framework's core assertions.
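
To make the sequencing concrete, here is a hedged sketch of one possible reading of the rule: each gate clears when at least 80% of its combined core and scenario assertions pass, and evaluation stops at the first gate that does not clear. The actual scoring lives in the framework and may differ; `highestGateCleared`, the `GateResults` shape, and the treatment of an empty gate are assumptions.

```typescript
// Gate names follow the assertion files in the scaffold, in order.
const GATE_ORDER = ["functional", "correct", "robust", "performant", "production"] as const;
type Gate = (typeof GATE_ORDER)[number];

// Threshold taken from the text above.
const GATE_THRESHOLD = 0.8;

// Assumed input: per gate, one boolean per assertion (core and scenario combined).
type GateResults = Record<Gate, boolean[]>;

// Returns the last gate cleared in sequence, or null if the first gate fails.
function highestGateCleared(results: GateResults): Gate | null {
  let cleared: Gate | null = null;
  for (const gate of GATE_ORDER) {
    const outcomes = results[gate];
    const passedCount = outcomes.filter((ok) => ok).length;
    // Assumption: a gate with no assertions does not clear.
    const ratio = outcomes.length === 0 ? 0 : passedCount / outcomes.length;
    if (ratio < GATE_THRESHOLD) break; // later gates do not count
    cleared = gate;
  }
  return cleared;
}
```

Under this reading, a scenario that passes every performant and production assertion but clears only 50% of its robust assertions still tops out at the correct gate.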

`scenario.json` Rules

Populate at least these fields:

id
title
description
tier
domain
harness
tasks
personaPrompts
infrastructure
tags
baselineMetrics
referenceMetrics

personaPrompts should point to prompts/naive.md and prompts/savvy.md.
tasks[] should be concrete and categorized.
infrastructure.services and infrastructure.description should describe the actual starting state.
Baseline and reference metrics should be plausible, not aspirational.

For the full contract, enum values, and worked examples, see guide.md.
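
As a quick illustration, the sketch below lists which of the fields named above are missing from a parsed scenario.json object. It checks presence only; value types and enum values are defined in guide.md, and `missingScenarioFields` is a hypothetical helper that does not replace `dec-bench validate`.

```typescript
// Top-level field names copied from the list above.
const REQUIRED_FIELDS: readonly string[] = [
  "id", "title", "description", "tier", "domain", "harness",
  "tasks", "personaPrompts", "infrastructure", "tags",
  "baselineMetrics", "referenceMetrics",
];

// Returns the required fields that are absent (or null) in the parsed object.
function missingScenarioFields(scenario: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter(
    (field) => scenario[field] === undefined || scenario[field] === null,
  );
}

// Usage with a deliberately incomplete object; the values are placeholders.
const draft: Record<string, unknown> = {
  id: "fix-broken-orders-view",
  title: "Fix the broken orders view",
};
console.log(missingScenarioFields(draft));
```

Running a check like this before `dec-bench validate` can catch an incomplete scaffold early, though the CLI's own validation is the authority on the contract.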

Harness Guidance

Use the built-in harnesses unless the toolchain requirements are truly new.

base-rt: base infrastructure plus common runtime tools
classic-de: dbt and heavier data engineering tooling
olap-for-swe: MooseStack-focused workflows

Create a custom harness only when the scenario genuinely needs additional packages or outbound policy changes.

Publishing Flow

To contribute an authored scenario upstream:

dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>

Use registry publish only after the scenario validates and runs locally.

Expected Output

When this skill activates, produce one of these:

a concrete scenario proposal with domain, tier, starting state, competency, harness, and assertion plan
direct edits to scaffolded scenario files
a targeted extension plan for an existing scenario

Do not stop at a list of ideas. Convert the user request into runnable scenario files or a file-by-file implementation plan.

Additional Resource

Read guide.md for:

full scenario.json schema and enum values
complete assertion examples for all five gates
naive vs. savvy prompt examples
harness selection details
registry publish flags and review checklist
skills.sh-compatible installation notes for this skill
