sadd:judge-with-debate by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:judge-with-debate

Key benefits:
This command implements iterative multi-judge debate:
Phase 0: Setup
  mkdir -p .specs/reports
      |
Phase 0.5: Dispatch Meta-Judge
  Meta-Judge (Opus)
      |
  Evaluation Specification YAML
      |
Phase 1: Independent Analysis (3 judges in parallel)
           +- Judge 1 -> {name}.1.md -+
  Solution +- Judge 2 -> {name}.2.md -+
           +- Judge 3 -> {name}.3.md -+
      |
Phase 2: Debate Round (iterative) <-------+
  Each judge reads others' reports        |
      |                                   |
  Argue + Defend + Challenge              |
  (grounded in eval specification)        |
      |                                   |
  Revise if convinced                     |
      |                                   |
  Check consensus                         |
      +- Yes -> Final Report              |
      +- No  -> Next Round ---------------+
Before starting evaluation, ensure the reports directory exists:
mkdir -p .specs/reports
Report naming convention: .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
Where:
- {solution-name} - derived from the solution filename (e.g., users-api from src/api/users.ts)
- {YYYY-MM-DD} - the current date
- [1|2|3] - the judge number

Before independent analysis, dispatch a meta-judge agent to generate a tailored evaluation specification. The meta-judge runs ONCE and produces the rubrics, checklists, and scoring criteria that ALL judges will use across ALL rounds.
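The directory setup and report naming convention above can be sketched in Python. This is an illustration only; the helper name `report_paths` is hypothetical and not part of the skill.

```python
from datetime import date
from pathlib import Path

def report_paths(solution_path: str, judges: int = 3) -> list[Path]:
    """Derive per-judge report paths following the convention
    .specs/reports/{solution-name}-{YYYY-MM-DD}.{N}.md
    """
    p = Path(solution_path)
    # e.g. "src/api/users.ts" -> stem "users" + parent "api" -> "users-api"
    name = f"{p.stem}-{p.parent.name}" if p.parent.name else p.stem
    today = date.today().isoformat()  # YYYY-MM-DD
    return [Path(f".specs/reports/{name}-{today}.{n}.md")
            for n in range(1, judges + 1)]
```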
Meta-judge prompt template:
## Task
Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that multiple judge agents will use to evaluate the solution through independent analysis and multi-round debate.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{task description - what the solution was supposed to accomplish}
## Context
{Any relevant context about the solution being evaluated}
## Artifact Type
{code | documentation | configuration | etc.}
## Evaluation Mode
Multi-judge debate with consensus-seeking across rounds
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support both independent analysis and debate-based refinement.
Dispatch:
Use Task tool:
- description: "Meta-judge: generate evaluation specification for {solution-name}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for the meta-judge to complete and extract the evaluation specification YAML from its output before proceeding to Phase 1.
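The extraction step can be sketched as follows, assuming the meta-judge wraps its specification in a fenced ```yaml block as instructed; the helper name is hypothetical.

```python
import re

def extract_yaml_spec(output: str) -> str:
    """Pull the evaluation specification out of a fenced ```yaml block;
    fall back to the raw output if no fence is present."""
    m = re.search(r"```yaml\n(.*?)```", output, re.DOTALL)
    return m.group(1).strip() if m else output.strip()
```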
Launch 3 independent judge agents in parallel (Opus for rigor):
Output: .specs/reports/{solution-name}-{date}.[1|2|3].md

Key principle: Independence in the initial analysis prevents groupthink.
Prompt template for initial judges:
You are Judge {N} evaluating a solution independently against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Solution
{path to solution file(s)}
## Task Description
{what the solution was supposed to accomplish}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output
.specs/reports/{solution-name}-{date}.{N}.md
Follow your full judge process as defined in your agent instructions!
Additional instructions:
Add "Done by Judge {N}" at the beginning of the report.
**Dispatch each judge:**
Use Task tool:
description: "Judge {N}: independent analysis of {solution-name}"
prompt: {judge prompt with evaluation specification YAML}
model: opus
subagent_type: "sadd:judge"
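The parallel launch can be sketched generically. The `dispatch_judge` function below is a hypothetical stand-in for the Task tool call above (model: opus, subagent_type: "sadd:judge"); the point is that all three judges are submitted before any result is awaited, so their analyses stay independent.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_judge(n: int, prompt: str) -> str:
    # Hypothetical stand-in for the Task tool dispatch shown above.
    return f"Done by Judge {n}"

def run_independent_analysis(prompt: str, judges: int = 3) -> list[str]:
    """Submit all judges before collecting any result, so they run in parallel."""
    with ThreadPoolExecutor(max_workers=judges) as pool:
        futures = [pool.submit(dispatch_judge, n, prompt)
                   for n in range(1, judges + 1)]
        return [f.result() for f in futures]
```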
For each debate round (max 3 rounds):
Launch 3 debate agents in parallel:
Each judge reads its own report (.specs/reports/{solution-name}-{date}.[1|2|3].md) and the other judges' reports (.specs/reports/{solution-name}-{date}.[1|2|3].md), then appends a ## Debate Round {R} section to its own file.

Key principle: Judges communicate only through the filesystem. The orchestrator does not mediate and must not read the report files itself, as doing so can overflow its context.
Prompt template for debate judges:
You are Judge {N} in debate round {R}.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Your Previous Report
{path to .specs/reports/{solution-name}-{date}.{N}.md}
## Other Judges' Reports
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
## Task Description
{what the solution was supposed to accomplish}
## Solution
{path to solution}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output
.specs/reports/{solution-name}-{date}.{N}.md (append to the existing file)
Follow your full judge process as defined in your agent instructions!
Additional debate instructions:
CRITICAL:
- Ground your arguments in the evaluation specification criteria
- Only revise if you find the other judges' evidence compelling
- Defend your original scores if you still believe them
- Quote specific evidence from the solution
Dispatch each debate judge:
Use Task tool:
description: "Judge {N}: debate round {R} for {solution-name}"
prompt: {debate judge prompt with evaluation specification YAML}
model: opus
subagent_type: "sadd:judge"
After each debate round, check for consensus:
Consensus is achieved if:
If there is no consensus after 3 rounds:
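The consensus check can be sketched as below. The exact conditions are defined by the evaluation specification; the rule used here (no criterion may have a score spread above a threshold) is an assumption for illustration.

```python
def scores_in_consensus(scores_by_judge: dict[str, dict[str, int]],
                        max_gap: int = 1) -> bool:
    """Assumed consensus rule: for every criterion, the spread between the
    highest and lowest judge score must not exceed `max_gap` points."""
    criteria = next(iter(scores_by_judge.values())).keys()
    for criterion in criteria:
        values = [judge[criterion] for judge in scores_by_judge.values()]
        if max(values) - min(values) > max_gap:
            return False
    return True
```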
Orchestration instructions:
Step 1: Dispatch the Meta-Judge (Phase 0.5)
Step 2: Run Independent Analysis (Phase 1)
Collect the reports from .specs/reports/{solution-name}-{date}.[1|2|3].md.
Step 3: Check for Consensus
Let's work through this systematically to ensure accurate consensus detection.
Read all three reports and extract:
Check consensus step by step:
Step 4: Decision Point
Step 5: Run Debate Round
Step 6: Reply with Report
Let's synthesize the evaluation results step by step.
Step 7: Report No Consensus
If consensus is achieved, synthesize the final report by working through each section methodically:
# Consensus Evaluation Report
Let's compile the final consensus by analyzing each component systematically.
## Consensus Scores
First, let's consolidate all judges' final scores:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|-----------|---------|---------|---------|-------|
| {Name} | {X}/5 | {X}/5 | {X}/5 | {X}/5 |
...
**Consensus Overall Score**: {avg}/5.0
## Consensus Strengths
[Review each judge's identified strengths and extract the common themes that all judges agreed upon]
## Consensus Weaknesses
[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]
## Debate Summary
Let's trace how consensus was reached:
- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}
## Final Recommendation
Based on the consensus scores and the key strengths/weaknesses identified:
{Pass/Fail/Needs Revision with clear justification tied to the evidence}
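The score-consolidation step in the template above (per-criterion averages plus an overall mean) can be sketched as follows; the helper name is hypothetical.

```python
def consensus_table(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each criterion across judges and add an 'Overall' mean,
    rounded to one decimal as in the report template."""
    criteria = list(next(iter(scores.values())).keys())
    final = {c: round(sum(j[c] for j in scores.values()) / len(scores), 1)
             for c in criteria}
    final["Overall"] = round(sum(final[c] for c in criteria) / len(criteria), 1)
    return final
```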
Outputs:
- .specs/reports/ (created if it does not exist)
- .specs/reports/{solution-name}-{date}.1.md, .specs/reports/{solution-name}-{date}.2.md, .specs/reports/{solution-name}-{date}.3.md

Example:
/judge-with-debate Implement REST API for user management --solution "src/api/users.ts"
Phase 0.5 - Meta-Judge (assuming date 2025-01-15):
Phase 1 - Independent Analysis (3 judges receive the specification):
- .specs/reports/users-api-2025-01-15.1.md - Judge 1 scores correctness 4/5, security 3/5
- .specs/reports/users-api-2025-01-15.2.md - Judge 2 scores correctness 4/5, security 5/5
- .specs/reports/users-api-2025-01-15.3.md - Judge 3 scores correctness 5/5, security 4/5

Disagreement detected: security scores range from 3 to 5.
Phase 2 - Debate Round 1 (judges reference the evaluation specification):
Debate Round 1 outputs:
Debate Round 2 (same evaluation specification):
Final consensus:
Correctness: 4.3/5
Design: 4.5/5
Security: 4.0/5 (2 debate rounds to consensus)
Performance: 4.7/5
Documentation: 4.0/5
Overall: 4.3/5 - PASS
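As a sanity check on the worked example above, the overall score is the mean of the five criterion scores:

```python
# Criterion scores from the example: correctness, design, security,
# performance, documentation
criterion_scores = [4.3, 4.5, 4.0, 4.7, 4.0]
overall = round(sum(criterion_scores) / len(criterion_scores), 1)
print(overall)  # 4.3
```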