npx skills add https://github.com/514-labs/agent-evals --skill dec-bench-evals
Use this skill when the user wants to create or extend a DEC Bench scenario. The goal is a deterministic, runnable evaluation, not a vague benchmark idea.
Default authoring loop:
dec-bench create --name <id> --domain <domain> --tier <tier>
dec-bench validate --scenario <id>
dec-bench build --scenario <id> --harness <harness> --agent <agent> --model <model> --version <version>
dec-bench run --scenario <id> --harness <harness> --persona naive --mode no-plan
dec-bench results --latest --scenario <id>
dec-bench audit open --scenario <id> --run-id <run-id>
dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>
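The loop above can be written down as an ordered command list, which makes it easy to see that validate comes before build and run. This is only a sketch: the `LoopOptions` type and the `authoringLoop` helper are hypothetical names introduced here, and the flags are copied from the commands listed above without checking them against the CLI.

```typescript
// Hypothetical helper: option names mirror the <placeholders> in the commands above.
interface LoopOptions {
  id: string;
  domain: string;
  tier: string;
  harness: string;
  agent: string;
  model: string;
  version: string;
  runId: string; // only known once a run has produced results
}

// Returns the authoring-loop commands in the order they are listed above.
function authoringLoop(o: LoopOptions): string[] {
  return [
    `dec-bench create --name ${o.id} --domain ${o.domain} --tier ${o.tier}`,
    `dec-bench validate --scenario ${o.id}`,
    `dec-bench build --scenario ${o.id} --harness ${o.harness} --agent ${o.agent} --model ${o.model} --version ${o.version}`,
    `dec-bench run --scenario ${o.id} --harness ${o.harness} --persona naive --mode no-plan`,
    `dec-bench results --latest --scenario ${o.id}`,
    `dec-bench audit open --scenario ${o.id} --run-id ${o.runId}`,
    `dec-bench registry add --scenario scenarios/${o.id}`,
    `dec-bench registry publish --id ${o.id}`,
  ];
}
```

The helper only formats strings; it does not run anything, so it is safe to use for generating a checklist or a dry-run script.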
Rules:
Run dec-bench validate before build or run.
Treat build and run as separate checks: build verifies the image path, run verifies scoring behavior.
Use results to inspect the latest run before opening the audit UI.
Use audit open for the browser view, or audit export if you only need the bundle.
Decide these first:
Domain: one of foo-bar, b2b-saas, b2c-saas, ugc, e-commerce, advertising, consumption-based-infra.
Tier: tier-1, tier-2, or tier-3.
Harness: base-rt, classic-de, olap-for-swe, or a justified custom harness.
Prefer the smallest tier that still exercises the intended competency. Keep the starting state deterministic and easy to reset.
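A quick way to sanity-check these three decisions before running dec-bench create is a small lookup against the values listed above. The sketch below is illustrative only: `ScenarioChoices`, `customHarnessReason`, and `checkChoices` are hypothetical names, and the authoritative enum values live in guide.md.

```typescript
// Values copied from the list above; guide.md remains the source of truth.
const DOMAINS: readonly string[] = [
  "foo-bar", "b2b-saas", "b2c-saas", "ugc",
  "e-commerce", "advertising", "consumption-based-infra",
];
const TIERS: readonly string[] = ["tier-1", "tier-2", "tier-3"];
const BUILT_IN_HARNESSES: readonly string[] = ["base-rt", "classic-de", "olap-for-swe"];

// Hypothetical shape for the up-front decisions.
interface ScenarioChoices {
  domain: string;
  tier: string;
  harness: string;
  customHarnessReason?: string; // assumed field: why a non-built-in harness is needed
}

// Returns human-readable problems; an empty array means the choices look fine.
function checkChoices(c: ScenarioChoices): string[] {
  const problems: string[] = [];
  if (DOMAINS.indexOf(c.domain) === -1) {
    problems.push(`unknown domain "${c.domain}"`);
  }
  if (TIERS.indexOf(c.tier) === -1) {
    problems.push(`unknown tier "${c.tier}"`);
  }
  const builtIn = BUILT_IN_HARNESSES.indexOf(c.harness) !== -1;
  if (!builtIn && !c.customHarnessReason) {
    problems.push(`custom harness "${c.harness}" needs a justification`);
  }
  return problems;
}
```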
Good DEC Bench scenarios:
Avoid:
dec-bench create generates a scenario directory with:
prompts/naive.md
prompts/savvy.md
init/
assertions/functional.ts
assertions/correct.ts
assertions/robust.ts
assertions/performant.ts
assertions/production.ts
scenario.json
supervisord.conf
Work through those files in that order.
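Because the files are meant to be handled in a fixed order, it can help to track which ones are still missing. The following is a minimal sketch, assuming you already have a list of paths present in the scenario directory; `missingScenarioFiles` is a hypothetical helper and not part of dec-bench.

```typescript
// Expected entries, in the working order given above.
const EXPECTED_SCENARIO_FILES: readonly string[] = [
  "prompts/naive.md",
  "prompts/savvy.md",
  "init/",
  "assertions/functional.ts",
  "assertions/correct.ts",
  "assertions/robust.ts",
  "assertions/performant.ts",
  "assertions/production.ts",
  "scenario.json",
  "supervisord.conf",
];

// Returns the expected entries that are absent from `present`,
// preserving the working order so the result doubles as a to-do list.
function missingScenarioFiles(present: readonly string[]): string[] {
  return EXPECTED_SCENARIO_FILES.filter((p) => present.indexOf(p) === -1);
}
```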
Both prompts must target the same acceptance criteria.
naive.md: plain language, minimal implementation hints, no named tools unless the task would naturally mention them.
savvy.md: explicit tools, schemas, paths, constraints, and operational details.
Use init/ and supervisord.conf to create the starting state.
In supervisord.conf:
autorestart=false
Assertions are the core of the eval. Write scenario assertions only; the framework provides universal core assertions.
AssertionResult with passed plus actionable message and useful details.
queryRows<T>() inside the same gate file.
Use the framework context:
ctx.clickhouse for ClickHouse queries
ctx.postgres for Postgres queries
ctx.env() for connection settings and other environment variables
Gate model:
A gate only counts if earlier gates pass. Scenario assertions must clear the 80% gate threshold together with the framework's core assertions.
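The gating rule can be illustrated with a small scoring sketch. Several things here are assumptions rather than documented behavior: the gate order is taken from the assertion file names (functional, correct, robust, performant, production), the 80% threshold is read as "at least 80% of a gate's assertions pass", and the `AssertionResult` shape is reduced to the three fields named above. See guide.md for the actual contract.

```typescript
// Assumed minimal shape; the real type comes from the framework.
interface AssertionResult {
  passed: boolean;
  message: string;
  details?: Record<string, unknown>;
}

// Assumed gate order, inferred from the assertion file names.
type Gate = "functional" | "correct" | "robust" | "performant" | "production";
const GATE_ORDER: readonly Gate[] = [
  "functional", "correct", "robust", "performant", "production",
];

const GATE_THRESHOLD = 0.8; // the 80% gate threshold mentioned above

// Fraction of passing assertions; an empty gate is treated as not cleared.
function passRate(results: AssertionResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Walks the gates in order and stops at the first one below the threshold,
// so a later gate never counts unless every earlier gate was cleared.
function highestGateCleared(byGate: Record<Gate, AssertionResult[]>): Gate | null {
  let cleared: Gate | null = null;
  for (let i = 0; i < GATE_ORDER.length; i++) {
    const gate = GATE_ORDER[i];
    if (passRate(byGate[gate]) < GATE_THRESHOLD) break;
    cleared = gate;
  }
  return cleared;
}
```

In this sketch each gate's array would hold the framework's core assertions together with the scenario's own, since both contribute to the threshold.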
scenario.json Rules
Populate at least these fields:
id
title
description
tier
domain
harness
tasks
personaPrompts
infrastructure
tags
baselineMetrics
referenceMetrics
Important details:
personaPrompts should point to prompts/naive.md and prompts/savvy.md.
tasks[] should be concrete and categorized.
infrastructure.services and infrastructure.description should describe the actual starting state.
For the full contract, enum values, and worked examples, see guide.md.
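As a rough illustration of the field list, the sketch below checks an object for the twelve field names given above. The field names come from this document; everything else is a guess made for the example, including the nested shapes of `personaPrompts`, `tasks`, and `infrastructure`, the scenario id, and all values. The real schema is defined in guide.md.

```typescript
// Field names copied from the list above.
const REQUIRED_FIELDS: readonly string[] = [
  "id", "title", "description", "tier", "domain", "harness",
  "tasks", "personaPrompts", "infrastructure", "tags",
  "baselineMetrics", "referenceMetrics",
];

// Returns the required field names that are absent or null.
function missingFields(scenario: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter(
    (f) => scenario[f] === undefined || scenario[f] === null,
  );
}

// Hypothetical scenario object; nested shapes and values are illustrative only.
const exampleScenario: Record<string, unknown> = {
  id: "orders-dedupe",
  title: "Deduplicate an orders table",
  description: "Remove duplicate order rows without losing the latest update.",
  tier: "tier-1",
  domain: "e-commerce",
  harness: "base-rt",
  tasks: [{ description: "Deduplicate orders by order id", category: "example-category" }],
  personaPrompts: { naive: "prompts/naive.md", savvy: "prompts/savvy.md" },
  infrastructure: {
    services: ["postgres", "clickhouse"],
    description: "Postgres seeded with duplicate orders; empty ClickHouse target.",
  },
  tags: ["dedupe"],
  baselineMetrics: {},
  referenceMetrics: {},
};
```

Calling `missingFields(exampleScenario)` returns an empty array; removing any top-level key makes that key appear in the result.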
Use the built-in harnesses unless the toolchain requirements are truly new.
base-rt: base infrastructure plus common runtime tools
classic-de: dbt and heavier data engineering tooling
olap-for-swe: MooseStack-focused workflows
Create a custom harness only when the scenario genuinely needs additional packages or outbound policy changes.
To contribute an authored scenario upstream:
dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>
Use registry publish only after the scenario validates and runs locally.
When this skill activates, produce one of these:
Do not stop at a list of ideas. Convert the user request into runnable scenario files or a file-by-file implementation plan.
Read guide.md for:
scenario.json schema and enum values