npx skills add https://github.com/jwilger/agent-skills --skill code-review
Value: Feedback and communication -- structured review catches defects that the author cannot see, and separating review into stages prevents thoroughness in one area from crowding out another.
Teaches a systematic three-stage code review that evaluates spec compliance, code quality, and domain integrity as separate passes. Prevents combined reviews from letting issues slip through by ensuring each dimension gets focused attention.
Review code in three sequential stages. Do not combine them. Each stage has a single focus. A failure in an earlier stage blocks later stages -- there is no point reviewing code quality on code that does not meet the spec.
Stage 1: Spec Compliance. Does the code do what was asked? Not more, not less.
For each acceptance criterion or requirement:
Mark each criterion: PASS, FAIL (missing/incomplete/divergent), or CONCERN (implemented but potentially incorrect). Flag anything built beyond requirements as OVER-BUILT.
If any criterion is FAIL, stop. Return to implementation before continuing.
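The per-criterion verdicts and the stop-on-FAIL gate can be sketched as follows; the enum and function names are invented for illustration and are not part of the skill itself:

```rust
// Possible verdicts for a single acceptance criterion in Stage 1.
#[derive(Debug, PartialEq)]
enum CriterionVerdict {
    Pass,
    Fail,      // missing, incomplete, or divergent
    Concern,   // implemented but potentially incorrect
    OverBuilt, // built beyond what the spec asked for
}

// Stage 1 gate: any FAIL blocks progression to Stage 2.
// CONCERN and OVER-BUILT are recorded but do not stop the review.
fn stage1_may_proceed(verdicts: &[CriterionVerdict]) -> bool {
    !verdicts.iter().any(|v| *v == CriterionVerdict::Fail)
}
```

A single FAIL anywhere in the list returns control to implementation; a review with only PASS/CONCERN/OVER-BUILT verdicts continues to the architecture compliance check.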
Architecture Compliance Check (run after the per-criterion loop, before moving to Stage 2):
- If docs/ARCHITECTURE.md exists: verify this change complies with all documented constraints and patterns (Components, Patterns, Constraints sections). Non-compliance is a FAIL -- same severity as a missing acceptance criterion.
- If docs/ARCHITECTURE.md does not exist: flag as a Stage 2 CONCERN: "No architecture document found; architectural compliance cannot be verified."
- Include in Stage 1 output: Architecture Compliance: PASS / FAIL / N/A (no ARCHITECTURE.md)
For tasks that implement a vertical slice (adding user-observable behavior), perform the following checks in order:
Entry-point wiring check (diff-based): Examine whether the changeset includes modifications to the application's entry point or its wiring/routing layer. If the slice claims to add new user-observable behavior but the diff does not touch any wiring or entry-point code, the review fails unless the author explicitly documents why existing wiring already routes to the new behavior.
End-to-end traceability: Verify that a path can be traced from the application's external entry point, through any infrastructure or integration layer, to the new domain logic, and back to observable output. If any segment of this path is missing from the changeset and not already present in the codebase, flag the gap.
Boundary-level test coverage: Confirm that at least one test exercises the new behavior through the application's external boundary (e.g., an HTTP request, a CLI invocation, a message on a queue) rather than calling internal functions directly. Where the application architecture makes automated boundary tests feasible, their absence is a review concern.
Test-level smell check: If every test in the changeset is a unit test of isolated internal functions with no integration or acceptance-level test, flag this as a concern. The slice may be implementing domain logic without proving it is reachable through the running application.
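The boundary-versus-internal distinction can be sketched with a hypothetical CLI-style app (`run_app` and `greet` are invented names): a boundary-level test exercises `run_app` the way a user would, rather than calling the domain function directly.

```rust
// Hypothetical external entry point: takes raw argv-style input and
// returns the text the user would observe. A real app would route
// through main(), an HTTP server, or a queue consumer instead.
fn run_app(args: &[&str]) -> String {
    match args {
        // Wiring: the entry point routes to the new domain logic.
        ["greet", name] => greet(name),
        _ => "usage: app greet <name>".to_string(),
    }
}

// New domain logic added by the slice.
fn greet(name: &str) -> String {
    format!("Hello, {name}!")
}
```

A changeset whose only test calls `greet("Ada")` directly would trip the test-level smell check; a test that goes through `run_app` proves the behavior is reachable through the application's wiring.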
Stage 2: Code Quality. Is the code clear, maintainable, and well-tested?
Review each changed file for:
Use the domain-modeling skill for primitive obsession detection. Categorize findings by severity:
If any CRITICAL issue exists, stop. Return to implementation.
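One way to sketch the CRITICAL gate, assuming the SUGGESTION/IMPORTANT/CRITICAL severity levels this skill uses (the function name is hypothetical):

```rust
// Finding severities used during Stage 2 categorization.
#[derive(Debug, PartialEq)]
enum Severity {
    Suggestion, // non-blocking
    Important,  // blocking, but the review can continue
    Critical,   // blocking: stop and return to implementation
}

// Stage 2 gate: any CRITICAL finding halts the review immediately.
fn stage2_must_stop(findings: &[Severity]) -> bool {
    findings.iter().any(|s| *s == Severity::Critical)
}
```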
Stage 3: Domain Integrity. Final gate -- does the code respect domain boundaries?
Check for:
(See the domain-modeling bool-as-state check.) Flag issues but do not block on suggestions, EXCEPT convention violations -- those are blocking per the Convention Over Precedent rule.
Produce a structured summary after all three stages:
REVIEW SUMMARY
Stage 1 (Spec Compliance): PASS/FAIL
Stage 2 (Code Quality): PASS/FAIL/PASS with suggestions
Stage 3 (Domain Integrity): PASS/FAIL/PASS with flags
Overall: APPROVED / CHANGES REQUIRED
If CHANGES REQUIRED:
1. [specific required change]
2. [specific required change]
After completing all three stages, produce a REVIEW_RESULT evidence packet containing: per-stage verdicts {stage, verdict (PASS/FAIL), findings [{severity, description, file, line?, required_change?}]}, overall_verdict, required_changes_count, blocking_findings_count.
When pipeline-state is provided in context metadata, the code-review skill operates in pipeline mode and stores the evidence to .factory/audit-trail/slices/<slice-id>/review.json. When running standalone, the evidence is informational only (not stored).
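Assuming the fields listed above, a pipeline-mode review.json might look like this (the "stages" key, file paths, and finding text are illustrative only):

```json
{
  "stages": [
    {
      "stage": "spec-compliance",
      "verdict": "FAIL",
      "findings": [
        {
          "severity": "IMPORTANT",
          "description": "Acceptance criterion 3 not implemented",
          "file": "src/login.rs",
          "line": 42,
          "required_change": "Handle the locked-account case"
        }
      ]
    }
  ],
  "overall_verdict": "CHANGES REQUIRED",
  "required_changes_count": 1,
  "blocking_findings_count": 1
}
```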
In factory mode, the full team reviews before the pipeline pushes code -- this is the quality checkpoint that replaces consensus-during-build. All blocking review feedback must be addressed before push. See references/mob-review.md for the factory mode review subsection.
Review findings MUST be written to .reviews/ files as the default persistence mechanism. Messages are supplementary coordination signals only — they do not survive context compaction.
- File name: <reviewer-name>-<task-slug>.md (e.g., kent-beck-user-login.md)
- Location: the .reviews/ directory (add to .gitignore)
This ensures review findings survive context compaction, agent restarts, and harnesses that lack inter-agent messaging.
Non-blocking items (SUGGESTION severity) that appear in 2+ consecutive reviews of different slices MUST escalate to blocking (IMPORTANT severity). Track recurrence by checking previous review files in .reviews/.
This prevents persistent quality issues from being perpetually deferred as "just a suggestion."
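The escalation rule can be sketched as a pure function (names are hypothetical; real tracking reads prior review files in .reviews/ to count occurrences):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Severity {
    Suggestion, // non-blocking
    Important,  // blocking
}

// A SUGGESTION that has appeared in 2+ consecutive reviews of
// different slices escalates to blocking IMPORTANT severity.
fn effective_severity(found: Severity, consecutive_occurrences: u32) -> Severity {
    match found {
        Severity::Suggestion if consecutive_occurrences >= 2 => Severity::Important,
        other => other,
    }
}
```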
When a GWT scenario describes user-visible behavior (UI elements, displayed messages, visual changes), the changeset MUST include code that produces that visible output. An API-only implementation when the scenario describes UI interaction is a spec compliance failure — the slice is incomplete.
Written conventions override observed patterns. When a review finding conflicts with a project convention (CLAUDE.md, AGENTS.md, crate-level docs, architectural decision records) but matches existing code in the codebase, the finding is still valid. Existing code that violates a convention is tech debt, not precedent.
Rules:
Example: a project convention says "use the typestate pattern for state machines." The new code uses struct Foo { is_active: bool } because three existing files do the same. The review must block the new code AND note the three existing files as tech debt.
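A minimal sketch of the pattern the convention calls for, using an invented `Session` type: in the typestate version each state is a distinct type, so an illegal transition (e.g. activating an already-active session) is a compile error rather than a runtime `bool` check.

```rust
use std::marker::PhantomData;

// Convention-violating shape (tech debt): struct Session { is_active: bool }

// Typestate alternative: states are zero-sized marker types.
struct Active;
struct Inactive;

struct Session<State> {
    user: String,
    _state: PhantomData<State>,
}

impl Session<Inactive> {
    fn new(user: &str) -> Self {
        Session { user: user.to_string(), _state: PhantomData }
    }
    // Consumes the inactive session; only inactive sessions can activate.
    fn activate(self) -> Session<Active> {
        Session { user: self.user, _state: PhantomData }
    }
}

impl Session<Active> {
    // Only active sessions can deactivate.
    fn deactivate(self) -> Session<Inactive> {
        Session { user: self.user, _state: PhantomData }
    }
}
```

Calling `activate()` twice in a row fails to compile, which is exactly the class of bug the `is_active: bool` shape leaves to runtime.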
When your review finding conflicts with the implementation approach:
You exist to catch what the author missed, not to block progress.
During Stage 1, also consider:
These are not blocking concerns but should be noted when relevant.
Hard constraints:
[H][RP] See CONSTRAINT-RESOLUTION.md in the template directory for pipeline rework budget conflicts.
After completing a review guided by this skill, verify:
- Findings are written to .reviews/ files (not messages only)
If any criterion is not met, revisit the relevant stage before finalizing.
This skill works standalone. For enhanced workflows, it integrates with the domain-modeling skill (used in Stages 2 and 3).
Missing a dependency? Install with:
npx skills add jwilger/agent-skills --skill domain-modeling
Weekly Installs
92
Repository
GitHub Stars
2
First Seen
Feb 12, 2026
Security Audits
Gen Agent Trust Hub: Pass / Socket: Pass / Snyk: Pass
Installed on
codex: 79
cursor: 79
opencode: 77
github-copilot: 77
amp: 77
kimi-cli: 77