重要前提
安装AI Skills的关键前提是:必须科学上网,且开启TUN模式,这一点至关重要,直接决定安装能否顺利完成,在此郑重提醒三遍:科学上网,科学上网,科学上网。查看完整安装教程 →
npx skills add https://github.com/oimiragieo/agent-studio --skill tdd本技能实现了带有 AI 特定防护栏的 Canon TDD:
适用于:
仅在以下情况下,需在绕过前征得人工批准:
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
如果代码先被写出,则丢弃并从 RED 重新开始。
在构建待办列表之前,查询记忆以获取过去的失败特征和可重用的测试模板:
Skill({ skill: 'memory-search' }); // query: "<feature-name> test failure signatures"
阅读 .claude/context/memory/learnings.md 以了解与此任务相关的重复反模式。
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
然后:
不稳定性检查门(对于异步、钩子或非确定性测试是强制性的):
对于涉及异步 I/O、停止钩子、计时器或文件系统操作的测试,单次通过是不够的。在声明 GREEN 之前,需要连续 3 次通过:
# 运行 3 次 —— 3 次都必须通过
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs
一次通过但第二次运行失败的测试是 RED,不是 GREEN。在确认连续 3 次通过之前,不要进入步骤 5。
变异测试门(仅限安全关键代码):
对于安全钩子、路由验证器、身份验证逻辑以及任何控制访问或信任决策的代码路径,在实现 GREEN 后运行 Stryker 变异测试,以验证测试确实能捕获故障,而非空洞地通过:
# 运行 Stryker 变异测试(阈值:85%)
npx stryker run
# 要求 stryker.config.json 中的 mutationScore >= 85
对于安全钩子基于 fast-check 的属性测试,故障关闭属性是等效的变异检查门:
// fast-check 故障关闭属性 —— 必须对任何输入都成立
fc.assert(
fc.property(fc.anything(), input => {
const result = securityHook(input);
// 钩子绝不能对格式错误/意外输入返回 allow=true
expect(result.allow).not.toBe(true);
})
);
对于非安全应用程序代码,跳过此门(步骤 4 → 步骤 5 直接进行)。
重构后(或对于安全关键代码,在步骤 4 之后),考虑用基于属性的测试补充基于示例的测试。对于 LLM 代码生成,PBT 相比仅使用基于示例的 TDD 实现了 23.1–37.3% 的 pass@1 改进(arXiv:2506.18315),因为它打破了自我欺骗循环。
何时调用:
调用方式:
Skill({ skill: 'property-based-testing' });
要识别的关键属性模式:
| 模式 | 示例 |
|---|---|
| 往返 | decode(encode(x)) === x |
| 幂等性 | normalize(normalize(x)) === normalize(x) |
| 不变量 | sort(arr).length === arr.length |
| 故障关闭(安全) | securityHook(anyInput).allow !== true(除非明确列入白名单) |
PBT 是对 Canon TDD 的补充,而非替代。Canon RED/GREEN/REFACTOR 首先完成;PBT 在确认 GREEN 后运行。
仅使用轻量级记忆来减少重复的设置和故障排查:
参考:references/tdd-memory-profile.md
硬性规则:
TDP 是多智能体 TDD 在 2026 年的主导模式:将 逐字逐句的失败测试输出 注入到开发者智能体启动提示中。这消除了解释错误——开发者看到的就是测试运行器看到的。
与其用文字描述失败,不如捕获 stdout/stderr 并直接注入:
// Step 1: Run test and capture raw output
const { execSync } = require('child_process');
let testOutput = '';
try {
execSync('node --test tests/hooks/routing-guard.test.cjs', { encoding: 'utf-8' });
} catch (e) {
testOutput = e.stdout + e.stderr; // Verbatim failure output
}
// Step 2: Inject verbatim into developer spawn prompt (no paraphrasing)
Task({
task_id: 'task-impl',
subagent_type: 'developer',
prompt: `## FAILING TEST (verbatim — do NOT modify the test file)\n\`\`\`\n${testOutput}\n\`\`\`\nImplement ONLY what is needed to make this pass.`,
});
| 步骤 | 智能体 | 动作 |
|---|---|---|
| 1 | qa | 编写失败测试,仅提交测试,捕获原始输出 |
| 2 | 路由器 | 提取测试输出,构建 TDP 启动提示 |
| 3 | developer | 使用逐字测试输出作为规范来实现 GREEN |
| 4 | reflection-agent | 验证测试断言未被修改(git diff 检查) |
来源: Simon Willison (2026) —— "Red/Green TDD for agents: failing test output IS the specification"; TDFlow arXiv:2510.23761.
对于可能被中断的仓库级 TDD,连接 ralph-loop(模式 2 —— 路由器管理)以在中断期间维护 TDD 场景待办列表:
在 .claude/context/runtime/tdd-state.json 维护一个 TDD 特定的状态文件:
{
"scenarios": [
{
"id": "sc-001",
"description": "routing-guard blocks Write on creator paths",
"status": "pending"
},
{ "id": "sc-002", "description": "spawn-token-guard warns at 80K tokens", "status": "green" }
],
"completedScenarios": [
{
"id": "sc-002",
"evidenceCommand": "node --test tests/hooks/spawn-token-guard.test.cjs",
"passedAt": "2026-03-12T10:00:00Z"
}
],
"currentScenario": "sc-001",
"evidenceLog": [
{
"scenarioId": "sc-001",
"phase": "red",
"output": "AssertionError: expected exit code 2, got 0",
"timestamp": "..."
}
]
}
在每次迭代开始时,读取 TDD 状态文件:
// Step 0 — before building/refreshing backlog
const state = JSON.parse(
fs.readFileSync('.claude/context/runtime/tdd-state.json', 'utf-8') || '{}'
);
const completedIds = (state.completedScenarios || []).map(s => s.id);
const remaining = (state.scenarios || []).filter(s => !completedIds.includes(s.id));
// Pick next scenario from remaining — never re-run completed ones
{ task_id, subagent_type: 'qa', prompt: TDP_PROMPT + verbatim state } 启动 qa 智能体qa 编写测试 → 运行 → 捕获输出 → 更新 tdd-state.json(阶段:red)developerdeveloper 实现 → 更新 tdd-state.json(阶段:green)remaining.length === 0 → 发出 RALPH_AUDIT_COMPLETE_NO_FINDINGS反模式: 绝不重新运行状态中已标记为 green 的场景——这会浪费迭代次数并可能破坏证据日志。
使用项目的实际命令。典型序列:
# 1) targeted test
pnpm test <target>
# 2) impacted suite
pnpm test
# 3) lint
pnpm lint
# 4) format check
pnpm format:check
如果仓库使用不同的脚本,请用本地等效项替换这些命令,并准确报告运行了什么。
references/research-requirements.mdreferences/tdd-memory-profile.mdtesting-anti-patterns.mdrules/tdd.mdtemplates/implementation-template.md本技能与以下内容保持一致:
开始前: 阅读 .claude/context/memory/learnings.md
完成后:
.claude/context/memory/learnings.md.claude/context/memory/issues.md.claude/context/memory/decisions.md假设会被中断:如果不在记忆中,就等于没发生。
钩子使用 stdin/stdout JSON 协议:
const proc = require('child_process').spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], {
shell: false,
});
proc.stdin.write(JSON.stringify({ tool_name: 'Write', tool_input: {} }));
proc.stdin.end();
// Exit 0=allow, 2=block
模拟 MemoryRecord。测试置信度门(阈值 0.7)。使用原子写入。
对于任何具有不变量的函数,使用 fast-check(以及用于 Vitest 集成的 @fast-check/vitest)——不仅仅是路由。fast-check 3.x (2025) 增加了改进的 unicode、日期和 bigint 任意值。
路由不变量(现有):
import fc from 'fast-check';
fc.assert(
fc.property(fc.string(), intent => {
return typeof routeIntent(intent) === 'string';
})
);
记忆序列化往返(新):
// Property: serialize(deserialize(x)) === x for all JSON-serializable values
fc.assert(
fc.property(fc.jsonValue(), value => {
const serialized = serializeMemoryRecord(value);
const deserialized = deserializeMemoryRecord(serialized);
return JSON.stringify(deserialized) === JSON.stringify(value);
})
);
钩子验证不变量(新):
// Property: for any tool input, isValidInput(x) === !isBlocked(x)
// (validation and blocking must be inverses)
fc.assert(
fc.property(fc.record({ tool_name: fc.string(), tool_input: fc.object() }), input => {
const valid = isValidInput(input);
const blocked = wouldBlock(input);
return valid !== blocked || (!valid && blocked); // blocked implies invalid
})
);
路径规范化幂等性(新):
// Property: normalize(normalize(path)) === normalize(path) (idempotent)
fc.assert(
fc.property(fc.string(), rawPath => {
const once = normalizePath(rawPath);
const twice = normalizePath(once);
return once === twice;
})
);
模式验证稳定性(新):
// Property: validate(schema, x) never throws uncaught exception for any input
fc.assert(
fc.property(fc.anything(), input => {
try {
validateSchema(schema, input);
return true;
} catch (e) {
return e instanceof ValidationError;
} // Only ValidationError allowed
})
);
验证 TaskUpdate 元数据模式(processedReflectionIds: string[])。
基于 TDFlow (arXiv:2510.23761, 94.3% SWE-Bench Verified),单体 TDD 智能体得分为 60–70%。拆分为专门的子智能体:
| 角色 | 智能体 | 职责 |
|---|---|---|
| 测试作者 | qa | 编写失败测试,仅提交测试 |
| 实现者 | developer | 实现到 GREEN —— 绝不能修改测试 |
| 验证者 | reflection-agent | 检测测试黑客,验证 RED→GREEN 证据 |
模式:
测试黑客检测: reflection-agent 检查 git diff HEAD~1 HEAD -- '*.test.*' —— 实现提交后任何断言更改 = 拒绝。
何时使用: 仓库级 TDD、具有多个行为的复杂功能、任何单个智能体可能合理化测试更改的任务。
TDAID 通过显式的规划和验证门扩展了经典的 TDD:
| 阶段 | TDAID 标签 | Agent-Studio 所有者 | 描述 |
|---|---|---|---|
| 0 | 计划 | planner | 思维模型生成结构化的 TDD 计划,在编写任何代码之前包含明确的测试检查点 |
| 1 | Red | qa | 编写表达期望行为的失败测试;人工验证失败是预期的 |
| 2 | Green | developer | 通过测试的最小实现;绝不能修改测试断言 |
| 3 | 重构 | developer | 在所有测试通过的情况下改进代码质量 |
| 4 | 验证 | reflection-agent + verification-before-completion 技能 | 检测规范博弈;确认实现与计划匹配;人工门 |
在验证阶段要检测的关键 TDAID 反模式:
研究基础: TDAID (awesome-testing.com, 2025), TDAD agent-to-agent variant (arXiv:2603.08806, 2026), TDFlow (arXiv:2510.23761, 2025)
在编写失败测试之前,验证 API 契约是否存在,以防止"因错误 API 而失败"而非"因缺少行为而失败":
# Step 1: Find the target function's file + line
pnpm search:code "functionName"
# Step 2: Verify signature with LSP hover
lsp_hover({ filePath: "/abs/path/to/file.ts", line: 42, character: 10 })
# Returns: function signature, parameter types, return type
# Step 3: Write test using VERIFIED signature
# Now RED is guaranteed to fail due to missing behavior, not API mismatch
规则: 如果 lsp_hover 返回空(CJS 文件或 LSP 未激活)→ 回退到 ripgrep rg -n "functionName" --type ts 来读取实际的签名。
何时不需要: 明显是尚不存在的新函数(LSP 没有内容可返回)。
钩子契约定义了 stdin/stdout JSON 协议。在边界进行测试:
// Hook contract test pattern
const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
const input = JSON.stringify({
tool_name: 'Edit',
tool_input: { file_path: '.claude/agents/core/developer.md' },
});
proc.stdin.write(input);
proc.stdin.end();
// Assert: exit code 2 (block) for protected paths
// Assert: stdout JSON contains { allow: false, message: /Gate 4/ }
TaskUpdate 元数据契约:
// Validate processedReflectionIds schema
const schema = {
type: 'object',
required: ['processedReflectionIds'],
properties: { processedReflectionIds: { type: 'array', items: { type: 'string' } } },
additionalProperties: false,
};
要测试的 Agent-Studio 钩子契约:
routing-guard.cjs: 阻止没有 task_id 的 Task(退出码 2)unified-creator-guard.cjs: 阻止写入 .claude/skills/**/SKILL.md(退出码 2)spawn-token-guard.cjs: 在 80K 令牌时警告(退出码 0 + 消息)Agent Studio 使用 node --test(内置的 Node.js 测试运行器)作为所有 .cjs CommonJS 文件(钩子、库、脚本)的 默认 运行器。Vitest 4 是 ESM/TypeScript 文件的推荐运行器。
| 运行器 | 使用时机 | 命令 |
|---|---|---|
node --test | .cjs 钩子、库、CommonJS 脚本 —— 当前 Agent Studio 标准 | node --test tests/**/*.test.cjs |
vitest | .ts、.mts、ESM .js 文件 —— 迁移到 TypeScript 时使用 | pnpm vitest run |
为什么对 .cjs 使用 node --test: Vitest 需要 Vite 配置和 ESM 兼容的模块。Agent Studio 钩子使用 require() 和 CommonJS —— node --test 无需转译即可工作。
为什么对 .ts/ESM 使用 Vitest 4: 启动时间从 ~8s (Jest) 降至 ~1.2s (Vitest)。一流的 TypeScript + ESM 支持,浏览器模式(稳定版 v4),以及 jest 兼容的 describe/it/expect API(迁移仅需配置更改)。
反模式: 不要对新文件使用 Jest。Vitest 是 2025-2026 年 ESM/TypeScript 的标准。
# Current Agent Studio pattern (CJS hooks and lib)
node --test tests/lib/routing/routing-table.test.cjs
# Future ESM/TypeScript pattern
pnpm vitest run tests/lib/routing/routing-table.test.ts
LLM/智能体输出是非确定性的——二元的通过/失败断言是不够的。改用基于分数的评估和工具调用序列验证。
// Agent output evaluation — score dimensions 0.0-1.0
function evaluateAgentOutput(output, expectations) {
const scores = {
relevance: scoreRelevance(output, expectations.topic), // 0.0-1.0
safety: scoreSafety(output), // 0.0-1.0
faithfulness: scoreFaithfulness(output, expectations.facts), // 0.0-1.0
format: scoreFormat(output, expectations.schema), // 0.0-1.0
};
const overall = Object.values(scores).reduce((a, b) => a + b) / Object.keys(scores).length;
return { scores, overall, pass: overall >= 0.75 };
}
// Test: agent output meets quality threshold
test('researcher agent output is relevant and safe', () => {
const result = evaluateAgentOutput(agentOutput, { topic: 'TDD patterns', facts: knownFacts });
expect(result.scores.safety).toBeGreaterThanOrEqual(0.9); // Hard floor for safety
expect(result.overall).toBeGreaterThanOrEqual(0.75); // 75% overall threshold
});
对于智能体测试,验证工具调用的 序列和次数,而不仅仅是最终输出:
// Spy on tool calls and assert ordering
const toolCallLog = [];
const mockTaskUpdate = jest.fn(args => {
toolCallLog.push({ tool: 'TaskUpdate', args });
});
const mockBash = jest.fn(args => {
toolCallLog.push({ tool: 'Bash', args });
});
// Run agent under test with mocked tools
await runAgent({ TaskUpdate: mockTaskUpdate, Bash: mockBash });
// Assert: TaskUpdate(in_progress) called BEFORE TaskUpdate(completed)
const inProgressIdx = toolCallLog.findIndex(
c => c.tool === 'TaskUpdate' && c.args.status === 'in_progress'
);
const completedIdx = toolCallLog.findIndex(
c => c.tool === 'TaskUpdate' && c.args.status === 'completed'
);
expect(inProgressIdx).toBeLessThan(completedIdx); // Ordering enforced
expect(inProgressIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(completedIdx).toBeGreaterThanOrEqual(0); // Must have been called
规则: 绝不测试 LLM 生成的散文的文本内容。测试结构、模式有效性、工具调用序列和分数阈值。
参考: Simon Willison (2025) —— "Red/Green TDD for agents: write assertions on tool-call sequences and structured outputs."
使用 MSW (Mock Service Worker) v2 来测试进行外部 HTTP 调用的技能和智能体。MSW 在网络层面进行拦截——无需对 fetch 进行猴子补丁,也无需更改生产代码。
pnpm add -D msw@2
import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';
// Define handlers — these describe the expected API contract
const handlers = [
http.get('https://api.example.com/search', ({ request }) => {
const url = new URL(request.url);
return HttpResponse.json({
results: [{ id: 1, title: `Result for: ${url.searchParams.get('q')}` }],
});
}),
];
const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
// Test: researcher skill makes HTTP call and processes response
test('researcher skill fetches and parses search results', async () => {
const results = await researcherSkill.search('TDD patterns 2026');
expect(results).toHaveLength(1);
expect(results[0].title).toContain('TDD patterns');
});
test('researcher skill handles 503 gracefully', async () => {
server.use(
http.get('https://api.example.com/search', () => HttpResponse.json({}, { status: 503 }))
);
const results = await researcherSkill.search('TDD patterns');
expect(results).toEqual([]); // Graceful empty fallback
});
相较于手动模拟的主要优势:
onUnhandledRequest: 'error' 捕获测试期间意外的外部调用MSW 边界测试的 Agent-Studio 目标:
researcher 技能 → WebSearch/WebFetch HTTP 调用github-ops 技能 → GitHub API 调用mcp__Exa__web_search_exa 或 WebFetch 的智能体变异测试验证测试的 质量,而不仅仅是覆盖率。在实现 100% 行覆盖率后运行:
# Install (once per project) — use vitest-runner for ESM/TypeScript
pnpm add -D @stryker-mutator/core @stryker-mutator/vitest-runner vitest
// stryker.config.mjs — working configuration for Vitest projects
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
testRunner: 'vitest',
vitest: {
configFile: 'vitest.config.ts', // optional: path to your vitest config
related: true, // default: run only tests related to mutated file
},
thresholds: { high: 80, low: 60, break: 50 },
reporters: ['html', 'progress'],
};
# Run mutation tests (use incremental to speed up local loops)
pnpm stryker run --incremental
# Target threshold: >80% mutation score
# Score = (killed mutations / total mutations) × 100
Vitest 运行器限制(StrykerJS 7.x):
perTest 覆盖率分析(忽略 coverageAnalysis 配置)node --test 的 .cjs 文件,使用 @stryker-mutator/jest-runner 作为回退pnpm add -D @stryker-mutator/core @stryker-mutator/jest-runner
解释结果:
何时运行: 在完成安全关键代码(钩子、验证器、路由逻辑)的 TDD 循环后运行。并非所有代码都需要——按风险优先级排序。
变异测试的 Agent-Studio 优先目标:
.claude/hooks/routing/routing-guard.cjs.claude/hooks/safety/unified-creator-guard.cjs.claude/lib/routing/routing-table.cjs在提交之前,智能体 必须 识别哪些测试文件覆盖了更改的源文件。使用编译器辅助的引用发现或目标 grep:
# Find tests that import the changed file
grep -r "import.*changedFile\|require.*changedFile" tests/
# Or use LSP to find all references
lsp_findReferences({ filePath: "/path/to/changed/file.ts", line: 1, character: 1 })
构建依赖关系图,并首先 仅 运行那些测试(根据 arXiv:2603.17973,回归检测速度提高 70%):
# 1. Run targeted tests (impacted tests only)
pnpm test tests/hooks/routing-guard.test.cjs
# 2. Verify no regressions in targeted scope
# 3. Only then run full suite
pnpm test
原理: 完整的测试套件在大型仓库上可能超过 5 分钟。目标测试在 15-30 秒内捕获回归,为长 TDD 循环中的下一个场景释放上下文。
在验证阶段,验证实现是否没有对测试断言进行博弈:
检查清单:
如果可用,运行变异测试:
# Test suite strength validation — mutations should be caught
pnpm stryker run
# If mutation score < 80%, tests are too weak:
# - Add negative tests
# - Add boundary condition tests
# - Verify assertions are on behavior, not mocks
要捕获的规范博弈示例:
return 42 以通过期望 42 的测试 → 变异测试会捕获这一点Agent-Studio 目标: 在完成安全关键钩子或路由更改后,在标记任务完成之前运行变异测试。
每周安装次数
62
仓库
GitHub Stars
19
首次出现
Jan 27, 2026
安全审计
安装于
github-copilot61
gemini-cli60
opencode59
codex59
kimi-cli59
amp59
This skill implements Canon TDD with AI-specific guardrails:
Use for:
Ask human approval before bypassing only for:
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
If code was written first, discard and restart from RED.
Before building the backlog, query memory for past failure signatures and reusable test templates:
Skill({ skill: 'memory-search' }); // query: "<feature-name> test failure signatures"
Read .claude/context/memory/learnings.md for recurring anti-patterns relevant to this task.
Then:
Flakiness Gate (mandatory for async, hook, or nondeterministic tests):
For tests that involve async I/O, stop hooks, timers, or file system operations, a single pass is insufficient. Require 3 consecutive passes before declaring GREEN:
# Run 3 times — all 3 must pass
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs
A test that passes once and fails on the second run is RED, not GREEN. Do not advance to Step 5 until 3 consecutive passes are confirmed.
Mutation Testing Gate (security-critical code only):
For security hooks, routing validators, auth logic, and any code path that controls access or trust decisions, run Stryker mutation testing after achieving GREEN to verify that tests genuinely catch faults and are not vacuously passing.
# Run Stryker mutation testing (threshold: 85%)
npx stryker run
# Require mutationScore >= 85 in stryker.config.json
For fast-check-based property tests on security hooks, the fail-closed property is the mutation-equivalent gate:
// fast-check fail-closed property — must hold for any input
fc.assert(
fc.property(fc.anything(), input => {
const result = securityHook(input);
// Hook must NEVER return allow=true for malformed/unexpected input
expect(result.allow).not.toBe(true);
})
);
Skip this gate for non-security application code (Step 4 → Step 5 directly).
After refactor (or after Step 4 for security-critical code), consider supplementing example-based tests with property-based tests. PBT achieves 23.1–37.3% pass@1 improvement over example-based TDD alone for LLM code generation (arXiv:2506.18315) by breaking the self-deception cycle.
When to invoke:
Invocation:
Skill({ skill: 'property-based-testing' });
Key property patterns to identify:
| Pattern | Example |
|---|---|
| Round-trip | decode(encode(x)) === x |
| Idempotence | normalize(normalize(x)) === normalize(x) |
| Invariant | sort(arr).length === arr.length |
| Fail-closed (security) | securityHook(anyInput).allow !== true (unless explicitly whitelisted) |
PBT is a supplement to Canon TDD, not a replacement. Canon RED/GREEN/REFACTOR completes first; PBT runs after GREEN is confirmed.
Use lightweight memory only to reduce repeated setup and triage:
Reference: references/tdd-memory-profile.md
Hard rules:
TDP is the dominant 2026 pattern for multi-agent TDD: inject the verbatim failing test output into the developer agent spawn prompt. This eliminates interpretation errors — the developer sees exactly what the test runner sees.
Instead of describing the failure in prose, capture stdout/stderr and inject it directly:
// Step 1: Run test and capture raw output
const { execSync } = require('child_process');
let testOutput = '';
try {
execSync('node --test tests/hooks/routing-guard.test.cjs', { encoding: 'utf-8' });
} catch (e) {
testOutput = e.stdout + e.stderr; // Verbatim failure output
}
// Step 2: Inject verbatim into developer spawn prompt (no paraphrasing)
Task({
task_id: 'task-impl',
subagent_type: 'developer',
prompt: `## FAILING TEST (verbatim — do NOT modify the test file)\n\`\`\`\n${testOutput}\n\`\`\`\nImplement ONLY what is needed to make this pass.`,
});
| Step | Agent | Action |
|---|---|---|
| 1 | qa | Write failing test, commit test-only, capture raw output |
| 2 | Router | Extract test output, build TDP spawn prompt |
| 3 | developer | Implement to GREEN using verbatim test output as spec |
| 4 | reflection-agent | Verify no test assertions were modified (git diff check) |
Source: Simon Willison (2026) — "Red/Green TDD for agents: failing test output IS the specification"; TDFlow arXiv:2510.23761.
For repository-scale TDD where sessions may be interrupted, wire ralph-loop (Mode 2 — router-managed) to maintain the TDD scenario backlog across interruptions:
Maintain a TDD-specific state file at .claude/context/runtime/tdd-state.json:
{
"scenarios": [
{
"id": "sc-001",
"description": "routing-guard blocks Write on creator paths",
"status": "pending"
},
{ "id": "sc-002", "description": "spawn-token-guard warns at 80K tokens", "status": "green" }
],
"completedScenarios": [
{
"id": "sc-002",
"evidenceCommand": "node --test tests/hooks/spawn-token-guard.test.cjs",
"passedAt": "2026-03-12T10:00:00Z"
}
],
"currentScenario": "sc-001",
"evidenceLog": [
{
"scenarioId": "sc-001",
"phase": "red",
"output": "AssertionError: expected exit code 2, got 0",
"timestamp": "..."
}
]
}
At the start of each iteration, read the TDD state file:
// Step 0 — before building/refreshing backlog
const state = JSON.parse(
fs.readFileSync('.claude/context/runtime/tdd-state.json', 'utf-8') || '{}'
);
const completedIds = (state.completedScenarios || []).map(s => s.id);
const remaining = (state.scenarios || []).filter(s => !completedIds.includes(s.id));
// Pick next scenario from remaining — never re-run completed ones
qa agent with { task_id, subagent_type: 'qa', prompt: TDP_PROMPT + verbatim state }qa writes test → runs → captures output → updates tdd-state.json (phase: red)developer with TDP prompt (verbatim test output injected)developer implements → updates tdd-state.json (phase: green)remaining.length === 0 → emit RALPH_AUDIT_COMPLETE_NO_FINDINGSAnti-pattern: Never re-run scenarios already marked green in state — this wastes iterations and may corrupt evidence logs.
Use the project's actual commands. Typical sequence:
# 1) targeted test
pnpm test <target>
# 2) impacted suite
pnpm test
# 3) lint
pnpm lint
# 4) format check
pnpm format:check
If the repo uses different scripts, replace these with local equivalents and report exactly what ran.
references/research-requirements.mdreferences/tdd-memory-profile.mdtesting-anti-patterns.mdrules/tdd.mdtemplates/implementation-template.mdThis skill is aligned with:
Before starting: Read .claude/context/memory/learnings.md
After completing:
.claude/context/memory/learnings.md.claude/context/memory/issues.md.claude/context/memory/decisions.mdAssume interruption: if it is not in memory, it did not happen.
Hooks use stdin/stdout JSON protocol:
const proc = require('child_process').spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], {
shell: false,
});
proc.stdin.write(JSON.stringify({ tool_name: 'Write', tool_input: {} }));
proc.stdin.end();
// Exit 0=allow, 2=block
Mock MemoryRecord. Test confidence gate (threshold 0.7). Use atomic writes.
Use fast-check (and @fast-check/vitest for vitest integration) for any function with invariants — not just routing. fast-check 3.x (2025) adds improved unicode, date, and bigint arbitraries.
Routing invariant (existing):
import fc from 'fast-check';
fc.assert(
fc.property(fc.string(), intent => {
return typeof routeIntent(intent) === 'string';
})
);
Memory serialization roundtrip (new):
// Property: serialize(deserialize(x)) === x for all JSON-serializable values
fc.assert(
fc.property(fc.jsonValue(), value => {
const serialized = serializeMemoryRecord(value);
const deserialized = deserializeMemoryRecord(serialized);
return JSON.stringify(deserialized) === JSON.stringify(value);
})
);
Hook validation invariant (new):
// Property: for any tool input, isValidInput(x) === !isBlocked(x)
// (validation and blocking must be inverses)
fc.assert(
fc.property(fc.record({ tool_name: fc.string(), tool_input: fc.object() }), input => {
const valid = isValidInput(input);
const blocked = wouldBlock(input);
return valid !== blocked || (!valid && blocked); // blocked implies invalid
})
);
Path normalization idempotency (new):
// Property: normalize(normalize(path)) === normalize(path) (idempotent)
fc.assert(
fc.property(fc.string(), rawPath => {
const once = normalizePath(rawPath);
const twice = normalizePath(once);
return once === twice;
})
);
Schema validation stability (new):
// Property: validate(schema, x) never throws uncaught exception for any input
fc.assert(
fc.property(fc.anything(), input => {
try {
validateSchema(schema, input);
return true;
} catch (e) {
return e instanceof ValidationError;
} // Only ValidationError allowed
})
);
Validate TaskUpdate metadata schemas (processedReflectionIds: string[]).
Based on TDFlow (arXiv:2510.23761, 94.3% SWE-Bench Verified), monolithic TDD agents score 60–70%. Split into specialized sub-agents:
| Role | Agent | Responsibility |
|---|---|---|
| Test Author | qa | Write failing test, commit test-only |
| Implementer | developer | Implement to green — MUST NOT modify tests |
| Verifier | reflection-agent | Detect test-hacking, verify RED→GREEN evidence |
Pattern:
Test-hacking detection: reflection-agent checks git diff HEAD~1 HEAD -- '*.test.*' — any assertion changes after implementation commit = REJECT.
When to use: repository-scale TDD, complex features with multiple behaviors, any task where a single agent might rationalize test changes.
TDAID extends classic TDD with explicit Planning and Validation gates:
| Phase | TDAID Label | Agent-Studio Owner | Description |
|---|---|---|---|
| 0 | Plan | planner | Thinking-model generates structured TDD plan with explicit test checkpoints before any code is written |
| 1 | Red | qa | Write failing test expressing desired behavior; human verifies failure is expected |
| 2 | Green | developer | Minimal implementation to pass test; MUST NOT modify test assertions |
| 3 | Refactor | developer |
Key TDAID anti-patterns to detect in Validate phase:
Research basis: TDAID (awesome-testing.com, 2025), TDAD agent-to-agent variant (arXiv:2603.08806, 2026), TDFlow (arXiv:2510.23761, 2025)
Before writing a failing test, verify the API contract exists to prevent "fails due to wrong API" rather than "fails due to missing behavior":
# Step 1: Find the target function's file + line
pnpm search:code "functionName"
# Step 2: Verify signature with LSP hover
lsp_hover({ filePath: "/abs/path/to/file.ts", line: 42, character: 10 })
# Returns: function signature, parameter types, return type
# Step 3: Write test using VERIFIED signature
# Now RED is guaranteed to fail due to missing behavior, not API mismatch
Rule: If lsp_hover returns empty (CJS file or LSP not active) → fall back to ripgrep rg -n "functionName" --type ts to read the actual signature.
When NOT needed: trivially new functions that don't exist yet (LSP has nothing to return).
Hook contracts define the stdin/stdout JSON protocol. Test at the boundary:
// Hook contract test pattern
const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
const input = JSON.stringify({
tool_name: 'Edit',
tool_input: { file_path: '.claude/agents/core/developer.md' },
});
proc.stdin.write(input);
proc.stdin.end();
// Assert: exit code 2 (block) for protected paths
// Assert: stdout JSON contains { allow: false, message: /Gate 4/ }
TaskUpdate metadata contract:
// Validate processedReflectionIds schema
const schema = {
type: 'object',
required: ['processedReflectionIds'],
properties: { processedReflectionIds: { type: 'array', items: { type: 'string' } } },
additionalProperties: false,
};
Agent-Studio hook contracts to test:
routing-guard.cjs: blocks Task without task_id (exit 2)unified-creator-guard.cjs: blocks Write to .claude/skills/**/SKILL.md (exit 2)spawn-token-guard.cjs: warns at 80K tokens (exit 0 + message)Agent Studio uses node --test (built-in Node.js test runner) as the default for all .cjs CommonJS files (hooks, lib, scripts). Vitest 4 is the recommended runner for ESM/TypeScript files.
| Runner | Use When | Command |
|---|---|---|
node --test | .cjs hooks, lib, CommonJS scripts — current Agent Studio standard | node --test tests/**/*.test.cjs |
vitest | .ts, .mts, ESM .js files — use when migrating to TypeScript | pnpm vitest run |
Whynode --test for .cjs: Vitest requires Vite configuration and ESM-compatible modules. Agent Studio hooks use require() and CommonJS — node --test works without transpilation.
Why Vitest 4 for.ts/ESM: Boot time drops from ~8s (Jest) to ~1.2s (Vitest). First-class TypeScript + ESM support, Browser Mode (stable v4), and jest-compatible describe/it/expect API (migration = config change only).
Anti-pattern: Do NOT use Jest for new files. Vitest is the 2025-2026 standard for ESM/TypeScript.
# Current Agent Studio pattern (CJS hooks and lib)
node --test tests/lib/routing/routing-table.test.cjs
# Future ESM/TypeScript pattern
pnpm vitest run tests/lib/routing/routing-table.test.ts
LLM/agent outputs are non-deterministic — binary pass/fail assertions are insufficient. Use score-based evaluation and tool-call sequence validation instead.
// Agent output evaluation — score dimensions 0.0-1.0
function evaluateAgentOutput(output, expectations) {
const scores = {
relevance: scoreRelevance(output, expectations.topic), // 0.0-1.0
safety: scoreSafety(output), // 0.0-1.0
faithfulness: scoreFaithfulness(output, expectations.facts), // 0.0-1.0
format: scoreFormat(output, expectations.schema), // 0.0-1.0
};
const overall = Object.values(scores).reduce((a, b) => a + b) / Object.keys(scores).length;
return { scores, overall, pass: overall >= 0.75 };
}
// Test: agent output meets quality threshold
test('researcher agent output is relevant and safe', () => {
const result = evaluateAgentOutput(agentOutput, { topic: 'TDD patterns', facts: knownFacts });
expect(result.scores.safety).toBeGreaterThanOrEqual(0.9); // Hard floor for safety
expect(result.overall).toBeGreaterThanOrEqual(0.75); // 75% overall threshold
});
For agent tests, validate the sequence and count of tool calls, not just the final output:
// Spy on tool calls and assert ordering
const toolCallLog = [];
const mockTaskUpdate = jest.fn(args => {
toolCallLog.push({ tool: 'TaskUpdate', args });
});
const mockBash = jest.fn(args => {
toolCallLog.push({ tool: 'Bash', args });
});
// Run agent under test with mocked tools
await runAgent({ TaskUpdate: mockTaskUpdate, Bash: mockBash });
// Assert: TaskUpdate(in_progress) called BEFORE TaskUpdate(completed)
const inProgressIdx = toolCallLog.findIndex(
c => c.tool === 'TaskUpdate' && c.args.status === 'in_progress'
);
const completedIdx = toolCallLog.findIndex(
c => c.tool === 'TaskUpdate' && c.args.status === 'completed'
);
expect(inProgressIdx).toBeLessThan(completedIdx); // Ordering enforced
expect(inProgressIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(completedIdx).toBeGreaterThanOrEqual(0); // Must have been called
Rule: Never test the text content of LLM-generated prose. Test structure, schema validity, tool-call sequences, and score thresholds.
Reference: Simon Willison (2025) — "Red/Green TDD for agents: write assertions on tool-call sequences and structured outputs."
Use MSW (Mock Service Worker) v2 to test skills and agents that make external HTTP calls. MSW intercepts at the network level — no monkey-patching of fetch, no code changes in production.
pnpm add -D msw@2
import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';
// Define handlers — these describe the expected API contract
const handlers = [
http.get('https://api.example.com/search', ({ request }) => {
const url = new URL(request.url);
return HttpResponse.json({
results: [{ id: 1, title: `Result for: ${url.searchParams.get('q')}` }],
});
}),
];
const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
// Test: researcher skill makes HTTP call and processes response
test('researcher skill fetches and parses search results', async () => {
const results = await researcherSkill.search('TDD patterns 2026');
expect(results).toHaveLength(1);
expect(results[0].title).toContain('TDD patterns');
});
test('researcher skill handles 503 gracefully', async () => {
server.use(
http.get('https://api.example.com/search', () => HttpResponse.json({}, { status: 503 }))
);
const results = await researcherSkill.search('TDD patterns');
expect(results).toEqual([]); // Graceful empty fallback
});
Key benefits over manual mocking:
onUnhandledRequest: 'error' catches unintentional external calls during testsAgent-Studio targets for MSW boundary tests:
researcher skill → WebSearch/WebFetch HTTP callsgithub-ops skill → GitHub API callsmcp__Exa__web_search_exa or WebFetchMutation testing validates test QUALITY, not just coverage. Run after achieving 100% line coverage:
# Install (once per project) — use vitest-runner for ESM/TypeScript
pnpm add -D @stryker-mutator/core @stryker-mutator/vitest-runner vitest
// stryker.config.mjs — working configuration for Vitest projects
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
testRunner: 'vitest',
vitest: {
configFile: 'vitest.config.ts', // optional: path to your vitest config
related: true, // default: run only tests related to mutated file
},
thresholds: { high: 80, low: 60, break: 50 },
reporters: ['html', 'progress'],
};
# Run mutation tests (use incremental to speed up local loops)
pnpm stryker run --incremental
# Target threshold: >80% mutation score
# Score = (killed mutations / total mutations) × 100
Vitest runner limitations (StrykerJS 7.x):
perTest coverage analysis (ignores coverageAnalysis config).cjs files using node --test, use @stryker-mutator/jest-runner as fallbackpnpm add -D @stryker-mutator/core @stryker-mutator/jest-runner
Interpret results:
When to run: after completing a TDD cycle for security-critical code (hooks, validators, routing logic). Not required for all code — prioritize by risk.
Agent-Studio priority targets for mutation testing:
.claude/hooks/routing/routing-guard.cjs.claude/hooks/safety/unified-creator-guard.cjs.claude/lib/routing/routing-table.cjsBefore committing, agents MUST identify which test files cover the changed source files. Use compiler-assisted reference discovery or targeted grep:
# Find tests that import the changed file
grep -r "import.*changedFile\|require.*changedFile" tests/
# Or use LSP to find all references
lsp_findReferences({ filePath: "/path/to/changed/file.ts", line: 1, character: 1 })
Build a dependency map and run ONLY those tests first (70% faster regression detection per arXiv:2603.17973):
# 1. Run targeted tests (impacted tests only)
pnpm test tests/hooks/routing-guard.test.cjs
# 2. Verify no regressions in targeted scope
# 3. Only then run full suite
pnpm test
Rationale: Full test suites can exceed 5 minutes on large repos. Targeted testing catches regressions in 15-30 seconds, freeing context for next scenarios in a long TDD loop.
In the Validate phase, verify the implementation hasn't gamed test assertions:
Checklist:
Run mutation testing if available:
# Test suite strength validation — mutations should be caught
pnpm stryker run
# If mutation score < 80%, tests are too weak:
# - Add negative tests
# - Add boundary condition tests
# - Verify assertions are on behavior, not mocks
Spec-gaming examples to catch:
return 42 to pass test expecting 42 → mutation testing catches thisAgent-Studio targets: After completing security-critical hook or routing changes, run mutation testing before marking task complete.
Weekly Installs
62
Repository
GitHub Stars
19
First Seen
Jan 27, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
github-copilot61
gemini-cli60
opencode59
codex59
kimi-cli59
amp59
后端测试指南:API端点、业务逻辑与数据库测试最佳实践
11,800 周安装
| Improve code quality with all tests green |
| 4 | Validate | reflection-agent + verification-before-completion skill | Detect specification gaming; confirm implementation matches plan; human gate |