AI-Assisted Test-Driven Development (TDD) Skill: Canon TDD Loop, Property Testing, and Guardrails

tdd by oimiragieo/agent-studio

62 weekly installs

19 GitHub Stars

GitHub

Install command

npx skills add https://github.com/oimiragieo/agent-studio --skill tdd

Software Engineering, Code Quality, Testing

🇨🇳中文介绍

测试驱动开发 (TDD)

概述

本技能实现了带有 AI 特定防护栏的 Canon TDD：

构建或更新场景列表。
将恰好一个场景作为可运行测试执行。
证明 RED（失败）。
为实现 GREEN（通过）进行最小更改。
可选地进行重构。
重复直到场景列表为空。

使用时机

适用于：

新功能
错误修复
行为变更
由测试驱动的仓库级修补
测试作为可执行规范的 AI 辅助代码生成

仅在以下情况下，需在绕过前征得人工批准：

一次性原型
没有执行路径的纯声明式配置编辑
不会被维护的一次性迁移脚本

铁律

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

如果代码先被写出，则丢弃并从 RED 重新开始。

Canon 循环

步骤 0：创建/刷新场景待办列表

在构建待办列表之前，查询记忆以获取过去的失败特征和可重用的测试模板：

Skill({ skill: 'memory-search' }); // query: "<feature-name> test failure signatures"

阅读 .claude/context/memory/learnings.md 以了解与此任务相关的重复反模式。

步骤 1：选取恰好一个场景并编写一个可运行测试

每个周期一个行为。
使用清晰的行为名称。
优先使用真实的协作者；仅模拟外部边界。

步骤 2：证明 RED

运行最窄的测试命令。
失败必须是由于缺少行为，而非语法或设置错误。
记录 RED 证据（测试文件和失败的断言消息）。

步骤 3：实现最小的 GREEN 补丁

仅实现当前 RED 测试所需的内容。
不要进行推测性的 API 或无关的清理。
将补丁范围限制在当前场景内。

步骤 4：证明 GREEN

重新运行最窄的测试命令。
运行受影响的套件（或包级测试集）。
确认没有回归。

不稳定性检查门（对于异步、钩子或非确定性测试是强制性的）：

对于涉及异步 I/O、停止钩子、计时器或文件系统操作的测试，单次通过是不够的。在声明 GREEN 之前，需要连续 3 次通过：

# 运行 3 次 —— 3 次都必须通过
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs

一次通过但第二次运行失败的测试是 RED，不是 GREEN。在确认连续 3 次通过之前，不要进入步骤 5。

变异测试门（仅限安全关键代码）：

对于安全钩子、路由验证器、身份验证逻辑以及任何控制访问或信任决策的代码路径，在实现 GREEN 后运行 Stryker 变异测试，以验证测试确实能捕获故障，而非空洞地通过：

# 运行 Stryker 变异测试（阈值：85%）
npx stryker run
# 要求 stryker.config.json 中的 mutationScore >= 85

对于安全钩子基于 fast-check 的属性测试，故障关闭属性是等效的变异检查门：

// fast-check 故障关闭属性 —— 必须对任何输入都成立
fc.assert(
  fc.property(fc.anything(), input => {
    const result = securityHook(input);
    // 钩子绝不能对格式错误/意外输入返回 allow=true
    expect(result.allow).not.toBe(true);
  })
);

对于非安全应用程序代码，跳过此门（步骤 4 → 步骤 5 直接进行）。

步骤 5：可选的重构

仅在测试通过（GREEN）时进行重构。
重构后重新运行相同的测试集。

步骤 5.5：基于属性的测试（推荐用于工具函数和安全钩子）

重构后（或对于安全关键代码，在步骤 4 之后），考虑用基于属性的测试补充基于示例的测试。对于 LLM 代码生成，PBT 相比仅使用基于示例的 TDD 实现了 23.1–37.3% 的 pass@1 改进（arXiv:2506.18315），因为它打破了自我欺骗循环。

工具函数（编码/解码、解析器、序列化器、计算器）
安全钩子（输入验证器、清理器、访问控制逻辑）
任何可以陈述不变量、往返属性或数学属性的函数

Skill({ skill: 'property-based-testing' });

要识别的关键属性模式：

模式	示例
往返	`decode(encode(x)) === x`
幂等性	`normalize(normalize(x)) === normalize(x)`
不变量	`sort(arr).length === arr.length`
故障关闭（安全）	`securityHook(anyInput).allow !== true`（除非明确列入白名单）

PBT 是对 Canon TDD 的补充，而非替代。Canon RED/GREEN/REFACTOR 首先完成；PBT 在确认 GREEN 后运行。

步骤 6：重复直到待办列表为空

使用测试作为可执行的提示上下文；保持提示简短且以测试为中心。
优先使用确定性测试（稳定的夹具，无非确定性排序）。
使用有界的修复循环：每个场景最多尝试 3 次修复，然后重新设计。
运行反测试黑客检查：
- 验证更改后的断言是否仍表达原始需求。
- 对于错误修复任务，至少添加一个否定测试。
确保代码不会基于仅用于测试的产物进行分支。

仅使用轻量级记忆来减少重复的设置和故障排查：

首选的仓库本地测试/代码检查/格式化命令
重复出现的失败特征和简短的修复摘要
重复出现的反模式提醒
可重用的场景模板

参考：references/tdd-memory-profile.md

记忆绝不能绕过 RED 证明
记忆绝不能改变 Canon 序列
保持配置文件有界且低噪音

测试驱动提示（TDP）—— 2026 标准模式

TDP 是多智能体 TDD 在 2026 年的主导模式：将 逐字逐句的失败测试输出 注入到开发者智能体启动提示中。这消除了解释错误——开发者看到的就是测试运行器看到的。

与其用文字描述失败，不如捕获 stdout/stderr 并直接注入：

// Step 1: Run test and capture raw output
const { execSync } = require('child_process');
let testOutput = '';
try {
  execSync('node --test tests/hooks/routing-guard.test.cjs', { encoding: 'utf-8' });
} catch (e) {
  testOutput = e.stdout + e.stderr; // Verbatim failure output
}

// Step 2: Inject verbatim into developer spawn prompt (no paraphrasing)
Task({
  task_id: 'task-impl',
  subagent_type: 'developer',
  prompt: `## FAILING TEST (verbatim — do NOT modify the test file)\n\`\`\`\n${testOutput}\n\`\`\`\nImplement ONLY what is needed to make this pass.`,
});

消除了转述的失败描述（传话游戏效应）
开发者拥有完整的断言上下文：行号、实际值与期望值
强制最小化实现——开发者只能实现测试所要求的内容
防止 QA 智能体的测试意图与开发者解释之间的规范漂移

TDP + 多智能体 TDD 分解

步骤	智能体	动作
1	`qa`	编写失败测试，仅提交测试，捕获原始输出
2	路由器	提取测试输出，构建 TDP 启动提示
3	`developer`	使用逐字测试输出作为规范来实现 GREEN
4	`reflection-agent`	验证测试断言未被修改（git diff 检查）

来源： Simon Willison (2026) —— "Red/Green TDD for agents: failing test output IS the specification"; TDFlow arXiv:2510.23761.

使用 ralph-loop 的自主 TDD（会话持久化迭代）

对于可能被中断的仓库级 TDD，连接 ralph-loop（模式 2 —— 路由器管理）以在中断期间维护 TDD 场景待办列表：

在 .claude/context/runtime/tdd-state.json 维护一个 TDD 特定的状态文件：

{
  "scenarios": [
    {
      "id": "sc-001",
      "description": "routing-guard blocks Write on creator paths",
      "status": "pending"
    },
    { "id": "sc-002", "description": "spawn-token-guard warns at 80K tokens", "status": "green" }
  ],
  "completedScenarios": [
    {
      "id": "sc-002",
      "evidenceCommand": "node --test tests/hooks/spawn-token-guard.test.cjs",
      "passedAt": "2026-03-12T10:00:00Z"
    }
  ],
  "currentScenario": "sc-001",
  "evidenceLog": [
    {
      "scenarioId": "sc-001",
      "phase": "red",
      "output": "AssertionError: expected exit code 2, got 0",
      "timestamp": "..."
    }
  ]
}

在每次迭代开始时，读取 TDD 状态文件：

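// Step 0: before building/refreshing backlog
const fs = require('fs');
const statePath = '.claude/context/runtime/tdd-state.json';
// readFileSync throws when the file is missing, so fall back to an empty state explicitly
const raw = fs.existsSync(statePath) ? fs.readFileSync(statePath, 'utf-8') : '';
const state = JSON.parse(raw || '{}');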
const completedIds = (state.completedScenarios || []).map(s => s.id);
const remaining = (state.scenarios || []).filter(s => !completedIds.includes(s.id));
// Pick next scenario from remaining — never re-run completed ones

与 ralph-loop 模式 2 集成

路由器使用 { task_id, subagent_type: 'qa', prompt: TDP_PROMPT + verbatim state } 启动 qa 智能体
qa 编写测试 → 运行 → 捕获输出 → 更新 tdd-state.json（阶段：red）
路由器使用 TDP 提示（注入逐字测试输出）启动 developer
developer 实现 → 更新 tdd-state.json（阶段：green）
路由器检查 remaining.length === 0 → 发出 RALPH_AUDIT_COMPLETE_NO_FINDINGS
如果 remaining > 0 → 使用下一个场景循环回步骤 1

反模式： 绝不重新运行状态中已标记为 green 的场景——这会浪费迭代次数并可能破坏证据日志。

仓库级和类级指南

对于仓库级工作，按失败测试集群分解，并为每个循环分配一个集群。
对于类级合成，推导方法依赖顺序，并一次实现一个方法，使用方法级的公共测试。
通过将每个循环限制在一个场景和一个补丁目标来保持低长上下文压力。

场景待办列表存在，并在工作期间更新
每个生产变更都映射到至少一个先失败后通过的测试
捕获了 RED 证据（命令 + 失败摘要）
捕获了 GREEN 证据（命令 + 通过摘要）
在触及范围内没有未解决的失败测试
代码检查/格式化/测试命令已完成或明确报告为受阻
未检测到测试黑客模式

完成前命令（项目范围）

使用项目的实际命令。典型序列：

# 1) targeted test
pnpm test <target>
# 2) impacted suite
pnpm test
# 3) lint
pnpm lint
# 4) format check
pnpm format:check

如果仓库使用不同的脚本，请用本地等效项替换这些命令，并准确报告运行了什么。

"我稍后会添加测试" -> 停止并编写当前的 RED 测试。
"这个太小了，不需要测试" -> 编写一个最小的行为测试。
"我已经手动测试过了" -> 手动运行不能替代可执行的回归测试。
"我花了太长时间，不能删除测试前的代码" -> 沉没成本；从 RED 重新开始。

references/research-requirements.md
references/tdd-memory-profile.md
testing-anti-patterns.md
rules/tdd.md
templates/implementation-template.md

本技能与以下内容保持一致：

Martin Fowler TDD (Dec 11, 2023)
Kent Beck Canon TDD (Dec 11, 2023)
Rafique & Misic meta-analysis, IEEE TSE DOI:10.1109/TSE.2012.28
LLM4TDD (arXiv:2312.04687)
Test-Driven Development for Code Generation (arXiv:2402.13521)
Tests as Prompt (arXiv:2505.09027)
SWE-Flow (arXiv:2506.09003)
TDFlow (arXiv:2510.23761)
Scaling TDD from Functions to Classes (arXiv:2602.03557)

开始前： 阅读 .claude/context/memory/learnings.md

新模式 -> .claude/context/memory/learnings.md
发现的问题 -> .claude/context/memory/issues.md
做出的决策 -> .claude/context/memory/decisions.md

假设会被中断：如果不在记忆中，就等于没发生。

Agent-Studio TDD 扩展（2026）

钩子使用 stdin/stdout JSON 协议：

const proc = require('child_process').spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], {
  shell: false,
});
proc.stdin.write(JSON.stringify({ tool_name: 'Write', tool_input: {} }));
proc.stdin.end();
// Exit 0=allow, 2=block

模拟 MemoryRecord。测试置信度门（阈值 0.7）。使用原子写入。

基于属性的测试

对于任何具有不变量的函数，使用 fast-check（以及用于 Vitest 集成的 @fast-check/vitest）——不仅仅是路由。fast-check 3.x (2025) 增加了改进的 unicode、日期和 bigint 任意值。

路由不变量（现有）：

import fc from 'fast-check';
fc.assert(
  fc.property(fc.string(), intent => {
    return typeof routeIntent(intent) === 'string';
  })
);

记忆序列化往返（新）：

// Property: deserialize(serialize(x)) round-trips to x for all JSON-serializable values
fc.assert(
  fc.property(fc.jsonValue(), value => {
    const serialized = serializeMemoryRecord(value);
    const deserialized = deserializeMemoryRecord(serialized);
    return JSON.stringify(deserialized) === JSON.stringify(value);
  })
);

钩子验证不变量（新）：

// Property: for any tool input, isValidInput(x) === !wouldBlock(x)
// (validation and blocking must be exact inverses)
fc.assert(
  fc.property(fc.record({ tool_name: fc.string(), tool_input: fc.object() }), input => {
    const valid = isValidInput(input);
    const blocked = wouldBlock(input);
    return valid === !blocked; // exactly one of valid/blocked holds
  })
);

路径规范化幂等性（新）：

// Property: normalize(normalize(path)) === normalize(path) (idempotent)
fc.assert(
  fc.property(fc.string(), rawPath => {
    const once = normalizePath(rawPath);
    const twice = normalizePath(once);
    return once === twice;
  })
);

模式验证稳定性（新）：

// Property: validate(schema, x) never throws uncaught exception for any input
fc.assert(
  fc.property(fc.anything(), input => {
    try {
      validateSchema(schema, input);
      return true;
    } catch (e) {
      return e instanceof ValidationError;
    } // Only ValidationError allowed
  })
);

验证 TaskUpdate 元数据模式（processedReflectionIds: string[]）。

多智能体 TDD 分解（2026 标准）

基于 TDFlow (arXiv:2510.23761, 94.3% SWE-Bench Verified)，单体 TDD 智能体得分为 60–70%。拆分为专门的子智能体：

角色	智能体	职责
测试作者	`qa`	编写失败测试，仅提交测试
实现者	`developer`	实现到 GREEN —— 绝不能修改测试
验证者	`reflection-agent`	检测测试黑客，验证 RED→GREEN 证据

QA 智能体编写测试 → 仅提交测试文件（无实现）
开发者智能体实现 → 运行测试 → 提交实现
反思智能体审查差异：如果测试断言被更改 → 失败（测试黑客）

测试黑客检测： reflection-agent 检查 git diff HEAD~1 HEAD -- '*.test.*' —— 实现提交后任何断言更改 = 拒绝。

何时使用： 仓库级 TDD、具有多个行为的复杂功能、任何单个智能体可能合理化测试更改的任务。

TDAID 阶段映射（测试驱动的 AI 辅助开发，2025-2026）

TDAID 通过显式的规划和验证门扩展了经典的 TDD：

阶段	TDAID 标签	Agent-Studio 所有者	描述
0	计划	`planner`	思维模型生成结构化的 TDD 计划，在编写任何代码之前包含明确的测试检查点
1	Red	`qa`	编写表达期望行为的失败测试；人工验证失败是预期的
2	Green	`developer`	通过测试的最小实现；绝不能修改测试断言
3	重构	`developer`	在所有测试通过的情况下改进代码质量
4	验证	`reflection-agent` + `verification-before-completion` 技能	检测规范博弈；确认实现与计划匹配；人工门

在验证阶段要检测的关键 TDAID 反模式：

删除测试断言以使测试通过
硬编码期望值
模拟掉正在测试的行为
使实现表面上符合规范，但未满足规范意图

研究基础： TDAID (awesome-testing.com, 2025), TDAD agent-to-agent variant (arXiv:2603.08806, 2026), TDFlow (arXiv:2510.23761, 2025)

LSP 预 RED 类型验证

在编写失败测试之前，验证 API 契约是否存在，以防止"因错误 API 而失败"而非"因缺少行为而失败"：

# Step 1: Find the target function's file + line
pnpm search:code "functionName"

# Step 2: Verify signature with LSP hover
lsp_hover({ filePath: "/abs/path/to/file.ts", line: 42, character: 10 })
# Returns: function signature, parameter types, return type

# Step 3: Write test using VERIFIED signature
# Now RED is guaranteed to fail due to missing behavior, not API mismatch

规则： 如果 lsp_hover 返回空（CJS 文件或 LSP 未激活）→ 回退到 ripgrep rg -n "functionName" --type ts 来读取实际的签名。

何时不需要： 明显是尚不存在的新函数（LSP 没有内容可返回）。

契约测试（钩子边界 —— 扩展）

钩子契约定义了 stdin/stdout JSON 协议。在边界进行测试：

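// Hook contract test pattern
const assert = require('node:assert');
const { spawn } = require('child_process');

const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
const input = JSON.stringify({
  tool_name: 'Edit',
  tool_input: { file_path: '.claude/agents/core/developer.md' },
});
let stdout = '';
proc.stdout.on('data', chunk => (stdout += chunk)); // collect the hook's JSON reply
proc.stdin.write(input);
proc.stdin.end();

proc.on('close', code => {
  // Assert: exit code 2 (block) for protected paths
  assert.strictEqual(code, 2);
  // Assert: stdout JSON contains { allow: false, message: /Gate 4/ }
  const result = JSON.parse(stdout);
  assert.strictEqual(result.allow, false);
  assert.match(result.message, /Gate 4/);
});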

TaskUpdate 元数据契约：

// Validate processedReflectionIds schema
const schema = {
  type: 'object',
  required: ['processedReflectionIds'],
  properties: { processedReflectionIds: { type: 'array', items: { type: 'string' } } },
  additionalProperties: false,
};

要测试的 Agent-Studio 钩子契约：

routing-guard.cjs: 阻止没有 task_id 的 Task（退出码 2）
unified-creator-guard.cjs: 阻止写入 .claude/skills/**/SKILL.md（退出码 2）
spawn-token-guard.cjs: 在 80K 令牌时警告（退出码 0 + 消息）

测试运行器选择（node --test vs Vitest 4）

Agent Studio 使用 node --test（内置的 Node.js 测试运行器）作为所有 .cjs CommonJS 文件（钩子、库、脚本）的默认运行器。Vitest 4 是 ESM/TypeScript 文件的推荐运行器。

运行器	使用时机	命令
`node --test`	`.cjs` 钩子、库、CommonJS 脚本（当前 Agent Studio 标准）	`node --test tests/**/*.test.cjs`
`vitest`	`.ts`、`.mts`、ESM `.js` 文件 —— 迁移到 TypeScript 时使用	`pnpm vitest run`

为什么对 .cjs 使用 node --test： Vitest 需要 Vite 配置和 ESM 兼容的模块。Agent Studio 钩子使用 require() 和 CommonJS —— node --test 无需转译即可工作。

为什么对 .ts/ESM 使用 Vitest 4： 启动时间从 ~8s (Jest) 降至 ~1.2s (Vitest)。一流的 TypeScript + ESM 支持，浏览器模式（稳定版 v4），以及 jest 兼容的 describe/it/expect API（迁移仅需配置更改）。

反模式： 不要对新文件使用 Jest。Vitest 是 2025-2026 年 ESM/TypeScript 的标准。

# Current Agent Studio pattern (CJS hooks and lib)
node --test tests/lib/routing/routing-table.test.cjs

# Future ESM/TypeScript pattern
pnpm vitest run tests/lib/routing/routing-table.test.ts

AI 输出评估测试（非确定性智能体）

LLM/智能体输出是非确定性的——二元的通过/失败断言是不够的。改用基于分数的评估和工具调用序列验证。

基于分数的断言模式

// Agent output evaluation — score dimensions 0.0-1.0
function evaluateAgentOutput(output, expectations) {
  const scores = {
    relevance: scoreRelevance(output, expectations.topic), // 0.0-1.0
    safety: scoreSafety(output), // 0.0-1.0
    faithfulness: scoreFaithfulness(output, expectations.facts), // 0.0-1.0
    format: scoreFormat(output, expectations.schema), // 0.0-1.0
  };
  const overall = Object.values(scores).reduce((a, b) => a + b) / Object.keys(scores).length;
  return { scores, overall, pass: overall >= 0.75 };
}

// Test: agent output meets quality threshold
test('researcher agent output is relevant and safe', () => {
  const result = evaluateAgentOutput(agentOutput, { topic: 'TDD patterns', facts: knownFacts });
  expect(result.scores.safety).toBeGreaterThanOrEqual(0.9); // Hard floor for safety
  expect(result.overall).toBeGreaterThanOrEqual(0.75); // 75% overall threshold
});

工具调用序列验证

对于智能体测试，验证工具调用的 序列和次数，而不仅仅是最终输出：

// Spy on tool calls and assert ordering (vi.fn from Vitest, matching the runner guidance above)
import { vi, expect } from 'vitest';

const toolCallLog = [];
const mockTaskUpdate = vi.fn(args => {
  toolCallLog.push({ tool: 'TaskUpdate', args });
});
const mockBash = vi.fn(args => {
  toolCallLog.push({ tool: 'Bash', args });
});

// Run agent under test with mocked tools
await runAgent({ TaskUpdate: mockTaskUpdate, Bash: mockBash });

// Assert: TaskUpdate(in_progress) called BEFORE TaskUpdate(completed)
const inProgressIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'in_progress'
);
const completedIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'completed'
);
// Check both calls exist first; findIndex returns -1 for a missing call,
// which would otherwise satisfy the ordering check by accident.
expect(inProgressIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(completedIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(inProgressIdx).toBeLessThan(completedIdx); // Ordering enforced

规则： 绝不测试 LLM 生成的散文的文本内容。测试结构、模式有效性、工具调用序列和分数阈值。

参考： Simon Willison (2025) —— "Red/Green TDD for agents: write assertions on tool-call sequences and structured outputs."

MSW v2 HTTP 模拟（API 边界测试）

使用 MSW (Mock Service Worker) v2 来测试进行外部 HTTP 调用的技能和智能体。MSW 在网络层面进行拦截——无需对 fetch 进行猴子补丁，也无需更改生产代码。

设置模式（Node.js / Vitest）

import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';

// Define handlers — these describe the expected API contract
const handlers = [
  http.get('https://api.example.com/search', ({ request }) => {
    const url = new URL(request.url);
    return HttpResponse.json({
      results: [{ id: 1, title: `Result for: ${url.searchParams.get('q')}` }],
    });
  }),
];

const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

// Test: researcher skill makes HTTP call and processes response
test('researcher skill fetches and parses search results', async () => {
  const results = await researcherSkill.search('TDD patterns 2026');
  expect(results).toHaveLength(1);
  expect(results[0].title).toContain('TDD patterns');
});

为错误情况按测试覆盖

test('researcher skill handles 503 gracefully', async () => {
  server.use(
    http.get('https://api.example.com/search', () => HttpResponse.json({}, { status: 503 }))
  );
  const results = await researcherSkill.search('TDD patterns');
  expect(results).toEqual([]); // Graceful empty fallback
});

相较于手动模拟的主要优势：

测试执行真实的 HTTP 客户端代码路径（而非模拟的抽象）
onUnhandledRequest: 'error' 捕获测试期间意外的外部调用
处理程序定义请求/响应契约——同时作为文档

MSW 边界测试的 Agent-Studio 目标：

researcher 技能 → WebSearch/WebFetch HTTP 调用
github-ops 技能 → GitHub API 调用
任何使用 mcp__Exa__web_search_exa 或 WebFetch 的智能体

变异测试（Stryker JS）

变异测试验证测试的质量，而不仅仅是覆盖率。在实现 100% 行覆盖率后运行：

Stryker + Vitest（2026 标准 —— ESM/TypeScript 项目）

# Install (once per project) — use vitest-runner for ESM/TypeScript
pnpm add -D @stryker-mutator/core @stryker-mutator/vitest-runner vitest



// stryker.config.mjs — working configuration for Vitest projects
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  testRunner: 'vitest',
  vitest: {
    configFile: 'vitest.config.ts', // optional: path to your vitest config
    related: true, // default: run only tests related to mutated file
  },
  thresholds: { high: 80, low: 60, break: 50 },
  reporters: ['html', 'progress'],
};



# Run mutation tests (use incremental to speed up local loops)
pnpm stryker run --incremental

# Target threshold: >80% mutation score
# Score = (killed mutations / total mutations) × 100

Vitest 运行器限制（StrykerJS 7.x）：

不支持浏览器模式 —— 仅限线程模式
始终使用 perTest 覆盖率分析（忽略 coverageAnalysis 配置）
对于使用 node --test 的 .cjs 文件，使用 @stryker-mutator/jest-runner 作为回退

Stryker + node:test（CommonJS/.cjs 项目）

pnpm add -D @stryker-mutator/core @stryker-mutator/jest-runner

已杀死 —— 测试套件捕获了变异 ✓
存活 —— 测试套件遗漏了此代码路径（添加断言）
无覆盖 —— 没有测试执行此行（添加测试）

何时运行： 在完成安全关键代码（钩子、验证器、路由逻辑）的 TDD 循环后运行。并非所有代码都需要——按风险优先级排序。

变异测试的 Agent-Studio 优先目标：

.claude/hooks/routing/routing-guard.cjs
.claude/hooks/safety/unified-creator-guard.cjs
.claude/lib/routing/routing-table.cjs

验证阶段：TDAD 依赖关系图（P0）

在提交之前，智能体必须识别哪些测试文件覆盖了更改的源文件。使用编译器辅助的引用发现或目标 grep：

# Find tests that import the changed file
grep -r "import.*changedFile\|require.*changedFile" tests/

# Or use LSP to find all references
lsp_findReferences({ filePath: "/path/to/changed/file.ts", line: 1, character: 1 })

构建依赖关系图，并首先仅运行那些测试（根据 arXiv:2603.17973，回归检测速度提高 70%）：

# 1. Run targeted tests (impacted tests only)
pnpm test tests/hooks/routing-guard.test.cjs

# 2. Verify no regressions in targeted scope
# 3. Only then run full suite
pnpm test

原理： 完整的测试套件在大型仓库上可能超过 5 分钟。目标测试在 15-30 秒内捕获回归，为长 TDD 循环中的下一个场景释放上下文。

验证阶段：规范博弈检测（P0）

在验证阶段，验证实现是否没有对测试断言进行博弈：

测试断言行为，而非实现细节（不测试私有变量或类内部）
没有将硬编码的期望值从测试复制到实现
变异分数 ≥80% 表明测试质量足以检测回归
审查：代码是否可能在根本错误的情况下通过测试？

如果可用，运行变异测试：

# Test suite strength validation — mutations should be caught
pnpm stryker run

# If mutation score < 80%, tests are too weak:
# - Add negative tests
# - Add boundary condition tests
# - Verify assertions are on behavior, not mocks

要捕获的规范博弈示例：

✗ 实现硬编码 return 42 以通过期望 42 的测试 → 变异测试会捕获这一点
✗ 测试模拟行为而非断言行为 → 变异测试显示 0% 变异被杀死
✗ 测试检查日志消息而非行为 → 翻转断言，实现仍然通过

Agent-Studio 目标： 在完成安全关键钩子或路由更改后，在标记任务完成之前运行变异测试。

🇺🇸English

Test-Driven Development (TDD)

Overview

This skill implements Canon TDD with AI-specific guardrails:

1. Build or update a scenario list.
2. Execute exactly one scenario as a runnable test.
3. Prove RED.
4. Implement minimum change for GREEN.
5. Optionally refactor.
6. Repeat until scenario list is empty.

When to Use

Use for:

New features
Bug fixes
Behavior changes
Repository-scale patching driven by tests
AI-assisted code generation where tests are executable specifications

Bypass TDD only with human approval, and only for:

Throwaway prototypes
Purely declarative config edits with no execution path
One-off migration scripts that will not be maintained

The Iron Law

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

If code was written first, discard and restart from RED.

Canon Loop

Step 0: Create/refresh scenario backlog

Before building the backlog, query memory for past failure signatures and reusable test templates:

Skill({ skill: 'memory-search' }); // query: "<feature-name> test failure signatures"

Read .claude/context/memory/learnings.md for recurring anti-patterns relevant to this task.

Then:

Keep a short ordered list of test scenarios for this task.
Prioritize by design signal and risk, not by implementation convenience.
Add discovered scenarios during execution.
Reuse templates from memory — do not repeat failure patterns already documented.
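
As a rough illustration of the backlog rules above, the sketch below keeps scenarios in a plain array and always hands back exactly one pending scenario, ordered by risk and design signal. The field names `risk` and `designSignal` and the helpers `orderBacklog` and `nextScenario` are hypothetical; the skill does not define a backlog schema beyond the state file shown later.

```javascript
// Hypothetical backlog shape: `risk` and `designSignal` are illustrative
// 1-5 scores, not fields defined by the skill.
function orderBacklog(scenarios) {
  // Highest combined score first; Array.prototype.sort is stable,
  // so ties keep their insertion order.
  return [...scenarios].sort(
    (a, b) => b.risk + b.designSignal - (a.risk + a.designSignal)
  );
}

function nextScenario(scenarios) {
  // Exactly one pending scenario per cycle, or null when nothing is left.
  return orderBacklog(scenarios).find(s => s.status === 'pending') ?? null;
}

const backlog = [
  { id: 'sc-001', description: 'rejects empty payload', risk: 2, designSignal: 1, status: 'pending' },
  { id: 'sc-002', description: 'blocks Write on protected path', risk: 5, designSignal: 3, status: 'pending' },
  { id: 'sc-003', description: 'allows Read everywhere', risk: 1, designSignal: 1, status: 'green' },
];

console.log(nextScenario(backlog).id); // sc-002
```

Scenarios discovered during execution can simply be pushed onto the array; the ordering is recomputed on every call.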

Step 1: Pick exactly one scenario and write one runnable test

One behavior per cycle.
Use clear behavior names.
Favor real collaborators; mock only external boundaries.

Step 2: Prove RED

Run the narrowest test command.
Failure must be due to missing behavior, not syntax or setup errors.
Record red evidence (test file and failing assertion message).

Step 3: Implement minimum GREEN patch

Implement only what current red test requires.
No speculative APIs or unrelated cleanup.
Keep patch bounded to current scenario.

Step 4: Prove GREEN

Re-run narrow test command.
Run impacted suite (or package-level test set).
Confirm no regressions.

Flakiness Gate (mandatory for async, hook, or nondeterministic tests):

For tests that involve async I/O, stop hooks, timers, or file system operations, a single pass is insufficient. Require 3 consecutive passes before declaring GREEN:

# Run 3 times — all 3 must pass
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs

A test that passes once and fails on the second run is RED, not GREEN. Do not advance to Step 5 until 3 consecutive passes are confirmed.
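
The same gate can be scripted so it works for any command and any run count. This is a minimal sketch: `requireConsecutivePasses` is a hypothetical helper name, and the demo runs a trivial inline Node script in place of a real test file.

```javascript
const { spawnSync } = require('node:child_process');

// Runs the command `runs` times in sequence and stops at the first failure.
function requireConsecutivePasses(command, args, runs = 3) {
  for (let i = 1; i <= runs; i++) {
    const { status } = spawnSync(command, args, { stdio: 'ignore' });
    // status is null when the process was killed by a signal; treat that as a failure too.
    if (status !== 0) return { green: false, failedRun: i };
  }
  return { green: true, failedRun: null };
}

// Demo with an always-passing stand-in. For a real gate, pass something like
// ('node', ['--test', 'tests/hooks/routing-guard.test.cjs']).
const gate = requireConsecutivePasses(process.execPath, ['-e', 'process.exit(0)']);
console.log(gate.green); // true
```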

Mutation Testing Gate (security-critical code only):

For security hooks, routing validators, auth logic, and any code path that controls access or trust decisions, run Stryker mutation testing after achieving GREEN to verify that tests genuinely catch faults and are not vacuously passing.

# Run Stryker mutation testing (target mutation score: 85%)
npx stryker run
# Enforce the floor by setting thresholds.break to 85 in the Stryker config (e.g. stryker.config.json)

For fast-check-based property tests on security hooks, the fail-closed property is the mutation-equivalent gate:

// fast-check fail-closed property — must hold for any input
fc.assert(
  fc.property(fc.anything(), input => {
    const result = securityHook(input);
    // Hook must NEVER return allow=true for malformed/unexpected input
    expect(result.allow).not.toBe(true);
  })
);

Skip this gate for non-security application code (Step 4 → Step 5 directly).

Step 5: Optional refactor

Refactor only with green tests.
Re-run the same test set after refactor.

Step 5.5: Property-Based Testing (recommended for utility functions and security hooks)

After refactor (or after Step 4 for security-critical code), consider supplementing example-based tests with property-based tests (PBT). For LLM code generation, PBT is reported to improve pass@1 by 23.1–37.3% over example-based TDD alone (arXiv:2506.18315), because properties break the self-deception cycle in which generated tests share the blind spots of the generated code.

When to invoke:

Utility functions (encode/decode, parsers, serializers, calculators)
Security hooks (input validators, sanitizers, access control logic)
Any function where invariants, round-trip properties, or mathematical properties can be stated

Invocation:

Skill({ skill: 'property-based-testing' });

Key property patterns to identify:

Pattern	Example
Round-trip	`decode(encode(x)) === x`
Idempotence	`normalize(normalize(x)) === normalize(x)`
Invariant	`sort(arr).length === arr.length`
Fail-closed (security)	`securityHook(anyInput).allow !== true` (unless explicitly whitelisted)

PBT is a supplement to Canon TDD, not a replacement. Canon RED/GREEN/REFACTOR completes first; PBT runs after GREEN is confirmed.
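
To make the round-trip and idempotence patterns concrete without any dependency, here is a hand-rolled property check. It is deliberately not fast-check: it has no shrinking and only generates printable ASCII strings. `lcg`, `randomString`, `checkProperty`, and the sample `encode`, `decode`, and `normalize` functions are all illustrative names, not part of the skill.

```javascript
// Small seeded linear congruential generator so any failure is reproducible.
function lcg(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Random printable-ASCII string of length 0..maxLen.
function randomString(rand, maxLen = 12) {
  const len = Math.floor(rand() * (maxLen + 1));
  let s = '';
  for (let i = 0; i < len; i++) {
    s += String.fromCharCode(32 + Math.floor(rand() * 95));
  }
  return s;
}

// Evaluates the property on `runs` generated inputs; returns the first counterexample.
function checkProperty(name, property, { runs = 200, seed = 42 } = {}) {
  const rand = lcg(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomString(rand);
    if (!property(input)) return { name, ok: false, counterexample: input };
  }
  return { name, ok: true, counterexample: null };
}

// Functions under test (examples only).
const encode = s => Buffer.from(s, 'utf8').toString('base64');
const decode = s => Buffer.from(s, 'base64').toString('utf8');
const normalize = s => s.trim().toLowerCase();

const roundTrip = checkProperty('round-trip', x => decode(encode(x)) === x);
const idempotent = checkProperty('idempotence', x => normalize(normalize(x)) === normalize(x));
console.log(roundTrip.ok, idempotent.ok); // true true
```

In a real project fast-check is the better choice because it shrinks counterexamples and covers far more input shapes; the sketch only shows what a property asserts.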

Step 6: Repeat until backlog empty

AI-Assisted Guardrails

Use tests as executable prompt context; keep prompts short and test-focused.
Prefer deterministic tests (stable fixtures, no nondeterministic ordering).
Use bounded repair loops: max 3 repair attempts per scenario before redesign.
Run anti-test-hacking checks:
- Verify changed assertions still express original requirement.
- Add at least one negative test for bug-fix tasks.
Ensure code does not branch on test-only artifacts.
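
The bounded repair loop can be sketched as a small driver that stops after three attempts and signals a redesign. `repairWithBudget` and its callback shapes are assumptions made for illustration; in practice `attemptFix` would spawn the developer agent and `runTest` would run the narrow test command.

```javascript
// Tries at most `maxAttempts` fixes; each attempt is followed by a test run.
function repairWithBudget(runTest, attemptFix, maxAttempts = 3) {
  const history = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    attemptFix(attempt, history); // earlier outputs are available as context
    const result = runTest();
    history.push({ attempt, passed: result.passed, output: result.output });
    if (result.passed) return { outcome: 'green', attempts: attempt, history };
  }
  // Budget exhausted: stop patching and redesign the scenario or the test.
  return { outcome: 'redesign', attempts: maxAttempts, history };
}

// Demo: a simulated test that only passes once two fixes have been applied.
let fixesApplied = 0;
const demo = repairWithBudget(
  () => ({ passed: fixesApplied >= 2, output: `fixes=${fixesApplied}` }),
  () => {
    fixesApplied += 1;
  }
);
console.log(demo.outcome, demo.attempts); // green 2
```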

Memory Acceleration Layer

Use lightweight memory only to reduce repeated setup and triage:

preferred repo-local test/lint/format commands
recurring failure signatures and short fix summaries
recurring anti-pattern reminders
reusable scenario templates

Reference: references/tdd-memory-profile.md

Hard rules:

memory never bypasses RED proof
memory never changes Canon sequence
keep profile bounded and low-noise

Test-Driven Prompting (TDP) — 2026 Standard Pattern

TDP is the dominant 2026 pattern for multi-agent TDD: inject the verbatim failing test output into the developer agent spawn prompt. This eliminates interpretation errors — the developer sees exactly what the test runner sees.

Pattern

Instead of describing the failure in prose, capture stdout/stderr and inject it directly:

// Step 1: Run test and capture raw output
const { execSync } = require('child_process');
let testOutput = '';
try {
  execSync('node --test tests/hooks/routing-guard.test.cjs', { encoding: 'utf-8' });
} catch (e) {
  testOutput = e.stdout + e.stderr; // Verbatim failure output
}

// Step 2: Inject verbatim into developer spawn prompt (no paraphrasing)
Task({
  task_id: 'task-impl',
  subagent_type: 'developer',
  prompt: `## FAILING TEST (verbatim — do NOT modify the test file)\n\`\`\`\n${testOutput}\n\`\`\`\nImplement ONLY what is needed to make this pass.`,
});

Why TDP Works

Eliminates paraphrased failure descriptions (telephone game effect)
Developer has the full assertion context: line number, actual vs expected values
Forces minimal implementation — developer can only implement what the test demands
Prevents specification drift between QA agent's test intent and developer's interpretation

TDP + Multi-Agent TDD Decomposition

Step	Agent	Action
1	`qa`	Write failing test, commit test-only, capture raw output
2	Router	Extract test output, build TDP spawn prompt
3	`developer`	Implement to GREEN using verbatim test output as spec
4	`reflection-agent`	Verify no test assertions were modified (git diff check)

Source: Simon Willison (2026) — "Red/Green TDD for agents: failing test output IS the specification"; TDFlow arXiv:2510.23761.

Autonomous TDD with ralph-loop (Session-Persistent Iteration)

For repository-scale TDD where sessions may be interrupted, wire ralph-loop (Mode 2 — router-managed) to maintain the TDD scenario backlog across interruptions:

TDD State Schema

Maintain a TDD-specific state file at .claude/context/runtime/tdd-state.json:

{
  "scenarios": [
    {
      "id": "sc-001",
      "description": "routing-guard blocks Write on creator paths",
      "status": "pending"
    },
    { "id": "sc-002", "description": "spawn-token-guard warns at 80K tokens", "status": "green" }
  ],
  "completedScenarios": [
    {
      "id": "sc-002",
      "evidenceCommand": "node --test tests/hooks/spawn-token-guard.test.cjs",
      "passedAt": "2026-03-12T10:00:00Z"
    }
  ],
  "currentScenario": "sc-001",
  "evidenceLog": [
    {
      "scenarioId": "sc-001",
      "phase": "red",
      "output": "AssertionError: expected exit code 2, got 0",
      "timestamp": "..."
    }
  ]
}

Resume Pattern

At the start of each iteration, read the TDD state file:

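// Step 0: before building/refreshing backlog
const fs = require('fs');
const statePath = '.claude/context/runtime/tdd-state.json';
// readFileSync throws when the file is missing, so fall back to an empty state explicitly
const raw = fs.existsSync(statePath) ? fs.readFileSync(statePath, 'utf-8') : '';
const state = JSON.parse(raw || '{}');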
const completedIds = (state.completedScenarios || []).map(s => s.id);
const remaining = (state.scenarios || []).filter(s => !completedIds.includes(s.id));
// Pick next scenario from remaining — never re-run completed ones

Integration with ralph-loop Mode 2

1. Router spawns qa agent with { task_id, subagent_type: 'qa', prompt: TDP_PROMPT + verbatim state }
2. qa writes test → runs → captures output → updates tdd-state.json (phase: red)
3. Router spawns developer with TDP prompt (verbatim test output injected)
4. developer implements → updates tdd-state.json (phase: green)
5. Router checks remaining.length === 0 → emit RALPH_AUDIT_COMPLETE_NO_FINDINGS
6. If remaining.length > 0 → loop back to step 1 with next scenario

Anti-pattern: Never re-run scenarios already marked green in state — this wastes iterations and may corrupt evidence logs.
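
A sketch of the state update the developer step performs, assuming the state schema shown earlier. `writeJsonAtomic` and `markScenarioGreen` are hypothetical helpers; the write-then-rename approach is one common way to avoid a half-written state file if the session is interrupted mid-write, and it relies on the temp file living on the same filesystem as the target.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Write to a temp file next to the target, then rename over it.
function writeJsonAtomic(filePath, data) {
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
  fs.renameSync(tmpPath, filePath);
}

// Marks one scenario green, records its evidence, and advances currentScenario.
function markScenarioGreen(statePath, scenarioId, evidenceCommand, now = new Date()) {
  const state = JSON.parse(fs.readFileSync(statePath, 'utf-8'));
  const scenario = (state.scenarios || []).find(s => s.id === scenarioId);
  if (!scenario) throw new Error(`unknown scenario: ${scenarioId}`);

  scenario.status = 'green';
  state.completedScenarios = [
    ...(state.completedScenarios || []).filter(s => s.id !== scenarioId),
    { id: scenarioId, evidenceCommand, passedAt: now.toISOString() },
  ];
  const next = state.scenarios.find(s => s.status === 'pending');
  state.currentScenario = next ? next.id : null;

  writeJsonAtomic(statePath, state);
  return state;
}

// Demo against a throwaway file in the OS temp directory.
const demoPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'tdd-state-')), 'tdd-state.json');
writeJsonAtomic(demoPath, {
  scenarios: [
    { id: 'sc-001', description: 'first scenario', status: 'pending' },
    { id: 'sc-002', description: 'second scenario', status: 'pending' },
  ],
  completedScenarios: [],
  currentScenario: 'sc-001',
  evidenceLog: [],
});
const updated = markScenarioGreen(demoPath, 'sc-001', 'node --test tests/hooks/routing-guard.test.cjs');
console.log(updated.currentScenario); // sc-002
```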

Repository-Scale and Class-Level Guidance

For repository-scale work, decompose by failing test cluster and assign one cluster per loop.
For class-level synthesis, derive a method dependency order and implement one method at a time with method-level public tests.
Keep long-context pressure low by limiting each loop to one scenario and one patch objective.

Verification Checklist

Scenario backlog exists and was updated during work
Every production change maps to at least one failing-then-passing test
RED evidence captured (command + failure summary)
GREEN evidence captured (command + pass summary)
No unresolved failing tests in touched scope
Lint/format/test commands completed or explicitly reported as blocked
No detected test-hacking pattern
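
Part of this checklist can be automated against the evidence log. The sketch below assumes the log is in chronological order and that GREEN runs are recorded with `phase: 'green'` in the same shape as the RED entries shown earlier; `findEvidenceGaps` is a hypothetical helper.

```javascript
// Returns one entry per scenario whose RED -> GREEN trail is incomplete or out of order.
function findEvidenceGaps(evidenceLog, scenarioIds) {
  const gaps = [];
  for (const id of scenarioIds) {
    const phases = evidenceLog.filter(e => e.scenarioId === id).map(e => e.phase);
    const red = phases.indexOf('red');
    const green = phases.indexOf('green');
    if (red === -1) gaps.push({ id, problem: 'missing RED evidence' });
    else if (green === -1) gaps.push({ id, problem: 'missing GREEN evidence' });
    else if (green < red) gaps.push({ id, problem: 'GREEN recorded before RED' });
  }
  return gaps;
}

const demoLog = [
  { scenarioId: 'sc-001', phase: 'red', output: 'AssertionError: expected exit code 2, got 0' },
  { scenarioId: 'sc-001', phase: 'green', output: 'pass 1' },
  { scenarioId: 'sc-002', phase: 'green', output: 'pass 1' },
];
const gaps = findEvidenceGaps(demoLog, ['sc-001', 'sc-002']);
console.log(gaps); // [ { id: 'sc-002', problem: 'missing RED evidence' } ]
```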

Pre-Completion Commands (Project-Scoped)

Use the project's actual commands. Typical sequence:

# 1) targeted test
pnpm test <target>
# 2) impacted suite
pnpm test
# 3) lint
pnpm lint
# 4) format check
pnpm format:check

If the repo uses different scripts, replace these with local equivalents and report exactly what ran.

Rationalization Countermeasures

"I will add tests later" -> stop and write current red test.
"This is too small to test" -> write one minimal behavior test.
"I already manually tested" -> manual runs do not replace executable regression tests.
"I spent too long to delete pre-test code" -> sunk cost; restart from RED.

Related Files

references/research-requirements.md
references/tdd-memory-profile.md
testing-anti-patterns.md
rules/tdd.md
templates/implementation-template.md

Research Basis

This skill is aligned with:

Martin Fowler TDD (Dec 11, 2023)
Kent Beck Canon TDD (Dec 11, 2023)
Rafique & Misic meta-analysis, IEEE TSE DOI:10.1109/TSE.2012.28
LLM4TDD (arXiv:2312.04687)
Test-Driven Development for Code Generation (arXiv:2402.13521)
Tests as Prompt (arXiv:2505.09027)
SWE-Flow (arXiv:2506.09003)
TDFlow (arXiv:2510.23761)
Scaling TDD from Functions to Classes (arXiv:2602.03557)

Memory Protocol

Before starting: Read .claude/context/memory/learnings.md

After completing:

New pattern -> .claude/context/memory/learnings.md
Issue found -> .claude/context/memory/issues.md
Decision made -> .claude/context/memory/decisions.md

Assume interruption: if it is not in memory, it did not happen.

Agent-Studio TDD Extensions (2026)

Hook Testing Pattern

Hooks use stdin/stdout JSON protocol:

const proc = require('child_process').spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], {
  shell: false,
});
proc.stdin.write(JSON.stringify({ tool_name: 'Write', tool_input: {} }));
proc.stdin.end();
// Exit 0=allow, 2=block
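
The snippet above writes to the hook but never reads the exit code. A synchronous variant using `spawnSync` with its `input` option makes the result easy to assert on. The sketch below is self-contained: `demoHookSource` is a stand-in hook written to a temp file purely for the demo (the real `routing-guard.cjs` is not reproduced here), and `runHook` is a hypothetical helper.

```javascript
const { spawnSync } = require('node:child_process');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Stand-in hook for the demo only: blocks Write, allows everything else.
const demoHookSource = `
let raw = '';
process.stdin.on('data', chunk => (raw += chunk));
process.stdin.on('end', () => {
  const { tool_name } = JSON.parse(raw);
  const allow = tool_name !== 'Write';
  process.stdout.write(JSON.stringify({ allow }));
  process.exitCode = allow ? 0 : 2; // 0 = allow, 2 = block
});
`;

// Sends one JSON payload on stdin and returns the exit code plus raw stdout.
function runHook(hookPath, payload) {
  const result = spawnSync(process.execPath, [hookPath], {
    input: JSON.stringify(payload),
    encoding: 'utf-8',
    shell: false,
  });
  return { exitCode: result.status, stdout: result.stdout };
}

const hookPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'hook-')), 'demo-hook.cjs');
fs.writeFileSync(hookPath, demoHookSource);

const blocked = runHook(hookPath, { tool_name: 'Write', tool_input: {} });
const allowed = runHook(hookPath, { tool_name: 'Read', tool_input: {} });
console.log(blocked.exitCode, allowed.exitCode); // 2 0
```

Pointing `runHook` at `.claude/hooks/routing/routing-guard.cjs` gives the same shape of result for the real hook.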

Memory TDD

Mock MemoryRecord. Test confidence gate (threshold 0.7). Use atomic writes.

Property-Based Testing

Use fast-check (and @fast-check/vitest for vitest integration) for any function with invariants — not just routing. fast-check 3.x (2025) adds improved unicode, date, and bigint arbitraries.

Routing invariant (existing):

import fc from 'fast-check';
fc.assert(
  fc.property(fc.string(), intent => {
    return typeof routeIntent(intent) === 'string';
  })
);

Memory serialization roundtrip (new):

// Property: deserialize(serialize(x)) round-trips to x for all JSON-serializable values
fc.assert(
  fc.property(fc.jsonValue(), value => {
    const serialized = serializeMemoryRecord(value);
    const deserialized = deserializeMemoryRecord(serialized);
    return JSON.stringify(deserialized) === JSON.stringify(value);
  })
);

Hook validation invariant (new):

// Property: for any tool input, isValidInput(x) === !wouldBlock(x)
// (validation and blocking must be exact inverses)
fc.assert(
  fc.property(fc.record({ tool_name: fc.string(), tool_input: fc.object() }), input => {
    const valid = isValidInput(input);
    const blocked = wouldBlock(input);
    return valid === !blocked; // exactly one of valid/blocked holds
  })
);

Path normalization idempotency (new):

// Property: normalize(normalize(path)) === normalize(path) (idempotent)
fc.assert(
  fc.property(fc.string(), rawPath => {
    const once = normalizePath(rawPath);
    const twice = normalizePath(once);
    return once === twice;
  })
);

Schema validation stability (new):

// Property: validate(schema, x) never throws uncaught exception for any input
fc.assert(
  fc.property(fc.anything(), input => {
    try {
      validateSchema(schema, input);
      return true;
    } catch (e) {
      return e instanceof ValidationError;
    } // Only ValidationError allowed
  })
);

Contract Testing

Validate TaskUpdate metadata schemas (processedReflectionIds: string[]).

Multi-Agent TDD Decomposition (2026 Standard)

Based on TDFlow (arXiv:2510.23761; 94.3% on SWE-Bench Verified, versus 60–70% for monolithic TDD agents), split the work into specialized sub-agents:

Role	Agent	Responsibility
Test Author	`qa`	Write failing test, commit test-only
Implementer	`developer`	Implement to green — MUST NOT modify tests
Verifier	`reflection-agent`	Detect test-hacking, verify RED→GREEN evidence

Pattern:

QA agent writes test → commits test file alone (no implementation)
Developer agent implements → runs tests → commits implementation
Reflection agent reviews diff: if test assertions changed → FAIL (test-hacking)

Test-hacking detection: reflection-agent checks git diff HEAD~1 HEAD -- '*.test.*' — any assertion changes after implementation commit = REJECT.
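The check can be sketched as a pure function over unified diff text. This is a heuristic illustration, not the reflection-agent's actual implementation: the function name findAssertionChanges and the assertion pattern (expect / assert) are assumptions, and in practice the diff text would come from the git diff HEAD~1 HEAD -- '*.test.*' command above.

```javascript
// Matches the assertion styles used in this document (expect(...) and assert.*)
const ASSERTION_PATTERN = /\b(expect|assert)\b/;
const TEST_FILE_PATTERN = /\.test\.[^/]+$/;

// Returns the added/removed assertion lines found in test files of a unified diff
function findAssertionChanges(diffText) {
  const findings = [];
  let currentFile = null;
  for (const line of diffText.split('\n')) {
    // File headers look like "--- a/path" and "+++ b/path"
    if (line.startsWith('+++ ') || line.startsWith('--- ')) {
      const filePath = line.slice(4).replace(/^[ab]\//, '');
      if (filePath !== '/dev/null') currentFile = filePath;
      continue;
    }
    const isChange = line.startsWith('+') || line.startsWith('-');
    const inTestFile = currentFile !== null && TEST_FILE_PATTERN.test(currentFile);
    if (isChange && inTestFile && ASSERTION_PATTERN.test(line)) {
      findings.push({ file: currentFile, change: line });
    }
  }
  return findings;
}

// Made-up diff: the implementation commit also edits an assertion
const sampleDiff = [
  'diff --git a/tests/sum.test.cjs b/tests/sum.test.cjs',
  '--- a/tests/sum.test.cjs',
  '+++ b/tests/sum.test.cjs',
  '@@ -3,3 +3,3 @@',
  " test('adds', () => {",
  '-  assert.strictEqual(sum(2, 2), 4);',
  '+  assert.strictEqual(sum(2, 2), 5);',
  ' });',
  'diff --git a/src/sum.cjs b/src/sum.cjs',
  '--- a/src/sum.cjs',
  '+++ b/src/sum.cjs',
  '@@ -1 +1 @@',
  '-module.exports = (a, b) => a + b;',
  '+module.exports = (a, b) => a + b + 1;',
].join('\n');

const findings = findAssertionChanges(sampleDiff);
console.log(findings.length > 0 ? 'REJECT' : 'PASS');
```

A line-based scan like this can misread removed lines that themselves begin with "-- "; a stricter version would parse hunk headers rather than rely on prefixes.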

When to use: repository-scale TDD, complex features with multiple behaviors, any task where a single agent might rationalize test changes.

TDAID Phase Mapping (Test-Driven AI-Assisted Development, 2025-2026)

TDAID extends classic TDD with explicit Planning and Validation gates:

Phase	TDAID Label	Agent-Studio Owner	Description
0	Plan	`planner`	Thinking-model generates structured TDD plan with explicit test checkpoints before any code is written
1	Red	`qa`	Write failing test expressing desired behavior; human verifies failure is expected
2	Green	`developer`	Minimal implementation to pass test; MUST NOT modify test assertions
3	Refactor	`developer`

Key TDAID anti-patterns to detect in Validate phase:

Deleting test assertions to make tests pass
Hardcoding expected values
Mocking away the behavior being tested
Making implementation superficially compliant without satisfying the specification intent

Research basis: TDAID (awesome-testing.com, 2025), TDAD agent-to-agent variant (arXiv:2603.08806, 2026), TDFlow (arXiv:2510.23761, 2025)

LSP Pre-RED Type Verification

Before writing a failing test, verify that the API contract exists, so that the test fails because the behavior is missing rather than because the API was called incorrectly:

# Step 1: Find the target function's file + line
pnpm search:code "functionName"

# Step 2: Verify signature with LSP hover
lsp_hover({ filePath: "/abs/path/to/file.ts", line: 42, character: 10 })
# Returns: function signature, parameter types, return type

# Step 3: Write test using VERIFIED signature
# Now RED is guaranteed to fail due to missing behavior, not API mismatch

Rule: If lsp_hover returns empty (CJS file or LSP not active) → fall back to ripgrep, e.g. rg -n "functionName" --type ts (use --type js when the target is a JavaScript file), to read the actual signature.

When NOT needed: trivially new functions that don't exist yet (LSP has nothing to return).

Contract Testing (Hook Boundaries — Expanded)

Hook contracts define the stdin/stdout JSON protocol. Test at the boundary:

// Hook contract test pattern
const { spawn } = require('child_process');

const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
const input = JSON.stringify({
  tool_name: 'Edit',
  tool_input: { file_path: '.claude/agents/core/developer.md' },
});
proc.stdin.write(input);
proc.stdin.end();

// Assert: exit code 2 (block) for protected paths
// Assert: stdout JSON contains { allow: false, message: /Gate 4/ }
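The two assertions above can be turned into executable checks. The sketch below uses spawnSync (a synchronous variant of the spawn call shown above) and, to stay self-contained, runs an inline stand-in hook via node -e instead of the real routing-guard.cjs. The stand-in's policy, its "Gate 4" message text, and the runHook helper are all hypothetical; a real test would pass the hook's file path instead.

```javascript
const { spawnSync } = require('node:child_process');

// Stand-in hook (hypothetical policy): blocks edits under .claude/agents/
const STAND_IN_HOOK = `
let raw = '';
process.stdin.on('data', chunk => (raw += chunk));
process.stdin.on('end', () => {
  const { tool_input } = JSON.parse(raw);
  const filePath = (tool_input && tool_input.file_path) || '';
  const allow = !filePath.startsWith('.claude/agents/');
  const message = allow ? 'ok' : 'Gate 4: protected path';
  process.stdout.write(JSON.stringify({ allow, message }));
  process.exitCode = allow ? 0 : 2;
});
`;

// Runs a hook with the JSON payload on stdin; returns exit code and parsed stdout
function runHook(nodeArgs, payload) {
  const result = spawnSync(process.execPath, nodeArgs, {
    input: JSON.stringify(payload),
    encoding: 'utf8',
    shell: false,
  });
  return { exitCode: result.status, response: JSON.parse(result.stdout) };
}

const blocked = runHook(['-e', STAND_IN_HOOK], {
  tool_name: 'Edit',
  tool_input: { file_path: '.claude/agents/core/developer.md' },
});
const allowed = runHook(['-e', STAND_IN_HOOK], {
  tool_name: 'Edit',
  tool_input: { file_path: 'src/index.js' },
});

console.log(blocked.exitCode, blocked.response.allow, allowed.exitCode);
```

Testing through the process boundary keeps the contract (stdin JSON in, exit code and stdout JSON out) independent of how the hook is implemented internally.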

TaskUpdate metadata contract:

// Validate processedReflectionIds schema
const schema = {
  type: 'object',
  required: ['processedReflectionIds'],
  properties: { processedReflectionIds: { type: 'array', items: { type: 'string' } } },
  additionalProperties: false,
};
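One way to exercise this contract without pulling in a JSON Schema library is a hand-written check that mirrors the schema above. The function name validateTaskUpdateMetadata is hypothetical; a project that already uses a schema validator would more likely run the schema through that instead.

```javascript
// Hand-written check mirroring the schema above (no JSON Schema library assumed).
// Returns a list of error strings; an empty list means the metadata is valid.
function validateTaskUpdateMetadata(metadata) {
  if (metadata === null || typeof metadata !== 'object' || Array.isArray(metadata)) {
    return ['metadata must be an object'];
  }
  const errors = [];
  const ids = metadata.processedReflectionIds;
  if (ids === undefined) {
    errors.push('processedReflectionIds is required');
  } else if (!Array.isArray(ids) || !ids.every(id => typeof id === 'string')) {
    errors.push('processedReflectionIds must be an array of strings');
  }
  // additionalProperties: false
  for (const key of Object.keys(metadata)) {
    if (key !== 'processedReflectionIds') errors.push(`unexpected property: ${key}`);
  }
  return errors;
}

console.log(validateTaskUpdateMetadata({ processedReflectionIds: ['r-1', 'r-2'] }));
console.log(validateTaskUpdateMetadata({ processedReflectionIds: [1], extra: true }));
```

Contract tests are most useful on the negative cases (missing field, wrong item type, extra property), since those are the inputs a producer is most likely to send by accident.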

Agent-Studio hook contracts to test:

routing-guard.cjs: blocks Task without task_id (exit 2)
unified-creator-guard.cjs: blocks Write to .claude/skills/**/SKILL.md (exit 2)
spawn-token-guard.cjs: warns at 80K tokens (exit 0 + message)

Test Runner Selection (node --test vs Vitest 4)

Agent Studio uses node --test (built-in Node.js test runner) as the default for all .cjs CommonJS files (hooks, lib, scripts). Vitest 4 is the recommended runner for ESM/TypeScript files.

Runner	Use When	Command
`node --test`	`.cjs` hooks, lib, CommonJS scripts (current Agent Studio standard)	`node --test tests/**/*.test.cjs`
`vitest`	`.ts`, `.mts`, ESM `.js` files — use when migrating to TypeScript	`pnpm vitest run`

Why node --test for .cjs: Vitest requires Vite configuration and ESM-compatible modules. Agent Studio hooks use require() and CommonJS, so node --test works without transpilation.

Why Vitest 4 for .ts/ESM: Boot time drops from ~8s (Jest) to ~1.2s (Vitest). First-class TypeScript + ESM support, Browser Mode (stable in v4), and a Jest-compatible describe/it/expect API (migration = config change only).

Anti-pattern: Do NOT use Jest for new files. Vitest is the 2025-2026 standard for ESM/TypeScript.

# Current Agent Studio pattern (CJS hooks and lib)
node --test tests/lib/routing/routing-table.test.cjs

# Future ESM/TypeScript pattern
pnpm vitest run tests/lib/routing/routing-table.test.ts
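For reference, a minimal CommonJS test file for the node --test runner might look like the sketch below. The routeIntent function here is a hypothetical stand-in defined inline so the file runs on its own; a real suite would require() the unit under test from lib instead.

```javascript
const test = require('node:test');
const { strictEqual } = require('node:assert');

// Hypothetical unit under test; a real suite would require() it from lib
function routeIntent(intent) {
  return /\btest/i.test(intent) ? 'qa' : 'developer';
}

// One behavior per test, named after the behavior
test('routes test-writing intents to the qa agent', () => {
  strictEqual(routeIntent('write tests for login'), 'qa');
});

test('routes other intents to the developer agent', () => {
  strictEqual(routeIntent('add a login endpoint'), 'developer');
});
```

Saved as a .test.cjs file, this runs with node --test and needs no configuration or transpilation, which is the property the runner table relies on.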

AI Output Evaluation Testing (Non-Deterministic Agents)

LLM/agent outputs are non-deterministic — binary pass/fail assertions are insufficient. Use score-based evaluation and tool-call sequence validation instead.

Score-Based Assertion Pattern

// Agent output evaluation — score dimensions 0.0-1.0
function evaluateAgentOutput(output, expectations) {
  const scores = {
    relevance: scoreRelevance(output, expectations.topic), // 0.0-1.0
    safety: scoreSafety(output), // 0.0-1.0
    faithfulness: scoreFaithfulness(output, expectations.facts), // 0.0-1.0
    format: scoreFormat(output, expectations.schema), // 0.0-1.0
  };
  const overall = Object.values(scores).reduce((a, b) => a + b) / Object.keys(scores).length;
  return { scores, overall, pass: overall >= 0.75 };
}

// Test: agent output meets quality threshold
test('researcher agent output is relevant and safe', () => {
  const result = evaluateAgentOutput(agentOutput, { topic: 'TDD patterns', facts: knownFacts });
  expect(result.scores.safety).toBeGreaterThanOrEqual(0.9); // Hard floor for safety
  expect(result.overall).toBeGreaterThanOrEqual(0.75); // 75% overall threshold
});

Tool-Call Sequence Validation

For agent tests, validate the sequence and count of tool calls, not just the final output:

// Spy on tool calls and assert ordering (vi.fn from Vitest, in line with the no-Jest rule above)
import { vi, expect } from 'vitest';

const toolCallLog = [];
const mockTaskUpdate = vi.fn(args => {
  toolCallLog.push({ tool: 'TaskUpdate', args });
});
const mockBash = vi.fn(args => {
  toolCallLog.push({ tool: 'Bash', args });
});

// Run agent under test with mocked tools
await runAgent({ TaskUpdate: mockTaskUpdate, Bash: mockBash });

// Assert: TaskUpdate(in_progress) called BEFORE TaskUpdate(completed)
const inProgressIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'in_progress'
);
const completedIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'completed'
);
expect(inProgressIdx).toBeLessThan(completedIdx); // Ordering enforced
expect(inProgressIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(completedIdx).toBeGreaterThanOrEqual(0); // Must have been called

Rule: Never test the text content of LLM-generated prose. Test structure, schema validity, tool-call sequences, and score thresholds.

Reference: Simon Willison (2025) — "Red/Green TDD for agents: write assertions on tool-call sequences and structured outputs."

MSW v2 HTTP Mocking (API Boundary Testing)

Use MSW (Mock Service Worker) v2 to test skills and agents that make external HTTP calls. MSW intercepts at the network level — no monkey-patching of fetch, no code changes in production.

pnpm add -D msw@2

Setup Pattern (Node.js / Vitest)

import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';

// Define handlers — these describe the expected API contract
const handlers = [
  http.get('https://api.example.com/search', ({ request }) => {
    const url = new URL(request.url);
    return HttpResponse.json({
      results: [{ id: 1, title: `Result for: ${url.searchParams.get('q')}` }],
    });
  }),
];

const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

// Test: researcher skill makes HTTP call and processes response
test('researcher skill fetches and parses search results', async () => {
  const results = await researcherSkill.search('TDD patterns 2026');
  expect(results).toHaveLength(1);
  expect(results[0].title).toContain('TDD patterns');
});

Override Per-Test for Error Cases

test('researcher skill handles 503 gracefully', async () => {
  server.use(
    http.get('https://api.example.com/search', () => HttpResponse.json({}, { status: 503 }))
  );
  const results = await researcherSkill.search('TDD patterns');
  expect(results).toEqual([]); // Graceful empty fallback
});

Key benefits over manual mocking:

Tests exercise real HTTP client code paths (not mocked abstractions)
onUnhandledRequest: 'error' catches unintentional external calls during tests
Handlers define request/response contracts — doubles as documentation

Agent-Studio targets for MSW boundary tests:

researcher skill → WebSearch/WebFetch HTTP calls
github-ops skill → GitHub API calls
Any agent using mcp__Exa__web_search_exa or WebFetch

Mutation Testing (Stryker JS)

Mutation testing validates test QUALITY, not just coverage. Run after achieving 100% line coverage:

Stryker + Vitest (2026 Standard — ESM/TypeScript projects)

# Install (once per project) — use vitest-runner for ESM/TypeScript
pnpm add -D @stryker-mutator/core @stryker-mutator/vitest-runner vitest



// stryker.config.mjs — working configuration for Vitest projects
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  testRunner: 'vitest',
  vitest: {
    configFile: 'vitest.config.ts', // optional: path to your vitest config
    related: true, // default: run only tests related to mutated file
  },
  thresholds: { high: 80, low: 60, break: 50 },
  reporters: ['html', 'progress'],
};



# Run mutation tests (use incremental to speed up local loops)
pnpm stryker run --incremental

# Target threshold: >80% mutation score
# Score = (killed mutations / total mutations) × 100

Vitest runner limitations (StrykerJS 7.x):

Browser Mode not supported — threads mode only
Always uses perTest coverage analysis (ignores coverageAnalysis config)
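For .cjs files using node --test, use @stryker-mutator/jest-runner as fallback only if the tests also run under Jest; otherwise Stryker's built-in command test runner can invoke node --test directly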

Stryker + node:test (CommonJS/.cjs projects)

pnpm add -D @stryker-mutator/core @stryker-mutator/jest-runner

Interpret results:

Killed — test suite caught the mutation ✓
Survived — test suite MISSED this code path (add assertion)
No coverage — no test exercises this line at all (add test)
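To make the score arithmetic concrete, the sketch below applies the formula from this section (killed / total × 100) to made-up counts and compares the result with the thresholds from the Stryker config earlier in this section. Treating "total" as killed + survived + no coverage is an assumption, and the function names are illustrative.

```javascript
// Thresholds taken from the Stryker config earlier in this section
const THRESHOLDS = { high: 80, low: 60, break: 50 };

// Score = (killed mutations / total mutations) × 100, as defined in the text.
// Total is assumed here to be killed + survived + noCoverage.
function mutationScore({ killed, survived, noCoverage }) {
  const total = killed + survived + noCoverage;
  return total === 0 ? 0 : (killed / total) * 100;
}

// Maps a score onto the threshold bands
function classify(score, thresholds = THRESHOLDS) {
  if (score < thresholds.break) return 'fail-build';
  if (score < thresholds.low) return 'danger';
  if (score < thresholds.high) return 'warning';
  return 'ok';
}

const run = { killed: 42, survived: 6, noCoverage: 2 }; // made-up counts
const score = mutationScore(run);
console.log(score.toFixed(1), classify(score)); // prints "84.0 ok"
```

Here 42 of 50 mutants are killed, so the score is 84%; the six survivors point at missing assertions and the two uncovered mutants at missing tests.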

When to run: after completing a TDD cycle for security-critical code (hooks, validators, routing logic). Not required for all code — prioritize by risk.

Agent-Studio priority targets for mutation testing:

.claude/hooks/routing/routing-guard.cjs
.claude/hooks/safety/unified-creator-guard.cjs
.claude/lib/routing/routing-table.cjs

Validation Phase: TDAD Dependency Map (P0)

Before committing, agents MUST identify which test files cover the changed source files. Use compiler-assisted reference discovery or targeted grep:

# Find tests that import the changed file
grep -r "import.*changedFile\|require.*changedFile" tests/

# Or use LSP to find all references
lsp_findReferences({ filePath: "/path/to/changed/file.ts", line: 1, character: 1 })

Build a dependency map and run ONLY those tests first (70% faster regression detection per arXiv:2603.17973):

# 1. Run targeted tests (impacted tests only)
pnpm test tests/hooks/routing-guard.test.cjs

# 2. Verify no regressions in targeted scope
# 3. Only then run full suite
pnpm test

Rationale: Full test suites can exceed 5 minutes on large repos. Targeted testing catches regressions in 15-30 seconds, freeing context for next scenarios in a long TDD loop.
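The dependency map itself can be approximated with a small script. The sketch below is an illustrative heuristic, not the TDAD implementation: it scans test file text for relative require()/import specifiers and resolves them against the test file's directory. The function names and the sample file contents are assumptions; in practice the text would be read from disk.

```javascript
const path = require('node:path');

// Builds a map of source file -> test files that import it.
// testSources: { [testFilePath]: fileText }
function buildDependencyMap(testSources) {
  const importPattern = /(?:require\(\s*|from\s+)['"]([^'"]+)['"]/g;
  const map = new Map();
  for (const [testFile, text] of Object.entries(testSources)) {
    for (const match of text.matchAll(importPattern)) {
      const specifier = match[1];
      if (!specifier.startsWith('.')) continue; // skip packages and node: builtins
      const resolved = path.posix.join(path.posix.dirname(testFile), specifier);
      if (!map.has(resolved)) map.set(resolved, []);
      map.get(resolved).push(testFile);
    }
  }
  return map;
}

// Returns the de-duplicated test files covering any of the changed files
function impactedTests(map, changedFiles) {
  return [...new Set(changedFiles.flatMap(file => map.get(file) || []))];
}

// Made-up test file contents
const testSources = {
  'tests/hooks/routing-guard.test.cjs':
    "const guard = require('../../.claude/hooks/routing/routing-guard.cjs');",
  'tests/lib/routing-table.test.cjs':
    "const table = require('../../.claude/lib/routing/routing-table.cjs');\n" +
    "const assert = require('node:assert');",
};

const dependencyMap = buildDependencyMap(testSources);
console.log(impactedTests(dependencyMap, ['.claude/hooks/routing/routing-guard.cjs']));
```

This only sees direct imports with explicit file extensions; transitive dependencies and extensionless specifiers would need the LSP reference lookup or a real module resolver.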

Validation Phase: Spec-Gaming Detection (P0)

In the Validate phase, verify the implementation hasn't gamed test assertions:

Checklist:

Tests assert behavior, not implementation details (no testing private variables or class internals)
No hardcoded expected values were copied from test to implementation
Mutation score ≥80% indicates test quality is sufficient for detecting regressions
Review: could the code pass tests while being fundamentally wrong?

Run mutation testing if available:

# Test suite strength validation — mutations should be caught
pnpm stryker run

# If mutation score < 80%, tests are too weak:
# - Add negative tests
# - Add boundary condition tests
# - Verify assertions are on behavior, not mocks

Spec-gaming examples to catch:

✗ Implementation hardcodes return 42 to pass test expecting 42 → mutation testing catches this
✗ Test mocks behavior instead of asserting it → mutation testing shows 0% of mutants killed
✗ Test checks log message instead of behavior → flip the assertion, implementation still passes
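The first example can be reproduced in a few lines. In this sketch (all names hypothetical), a hardcoded implementation passes a single-example suite but is caught as soon as the suite adds a second example and a boundary case, which is the kind of strengthening a low mutation score calls for.

```javascript
const { strictEqual } = require('node:assert');

const honest = (a, b) => a * b;
const hardcoded = () => 42; // spec-gaming: returns the value the test expects

// Returns true when the implementation passes every check in the suite
function passes(suite, impl) {
  try {
    suite(impl);
    return true;
  } catch {
    return false;
  }
}

// A single example cannot distinguish real behavior from a constant
const weakSuite = impl => {
  strictEqual(impl(6, 7), 42);
};

const strongerSuite = impl => {
  strictEqual(impl(6, 7), 42);
  strictEqual(impl(2, 3), 6); // second example defeats the constant
  strictEqual(impl(0, 9), 0); // boundary condition
};

console.log(passes(weakSuite, hardcoded)); // prints true: the gamed code slips through
console.log(passes(strongerSuite, hardcoded)); // prints false: the gamed code is caught
console.log(passes(strongerSuite, honest)); // prints true
```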

Agent-Studio targets: After completing security-critical hook or routing changes, run mutation testing before marking task complete.

First Seen

Jan 27, 2026

