sadd:judge-with-debate by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:judge-with-debate

Key benefits:
This command implements iterative multi-judge debate:
Phase 0: Setup
  mkdir -p .specs/reports
      |
Phase 0.5: Dispatch Meta-Judge
  Meta-Judge (Opus)
      |
  Evaluation Specification YAML
      |
Phase 1: Independent Analysis (3 judges in parallel)
           +- Judge 1 -> {name}.1.md -+
  Solution +- Judge 2 -> {name}.2.md -+
           +- Judge 3 -> {name}.3.md -+
      |
Phase 2: Debate Round (iterative) <-------+
  Each judge reads others' reports        |
      |                                   |
  Argue + Defend + Challenge              |
  (grounded in eval specification)        |
      |                                   |
  Revise if convinced                     |
      |                                   |
  Check consensus                         |
      +- Yes -> Final Report              |
      +- No  -> Next Round ---------------+
Before starting evaluation, ensure the reports directory exists:
mkdir -p .specs/reports
Report naming convention: .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
Where:
- {solution-name} - derived from the solution filename (e.g., users-api from src/api/users.ts)
- {YYYY-MM-DD} - the current date
- [1|2|3] - the judge number

Before independent analysis, dispatch a meta-judge agent to generate a tailored evaluation specification. The meta-judge runs ONCE and produces the rubrics, checklists, and scoring criteria that ALL judges will use across ALL rounds.
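The directory setup and report naming convention above can be sketched in Python. This is an illustration only; the helper name `report_paths` is hypothetical and not part of the skill.

```python
from datetime import date
from pathlib import Path

def report_paths(solution_path: str, judges: int = 3) -> list[Path]:
    """Derive per-judge report paths following the convention
    .specs/reports/{solution-name}-{YYYY-MM-DD}.{N}.md
    """
    p = Path(solution_path)
    # e.g. "src/api/users.ts" -> stem "users" + parent "api" -> "users-api"
    name = f"{p.stem}-{p.parent.name}" if p.parent.name else p.stem
    today = date.today().isoformat()  # YYYY-MM-DD
    return [Path(f".specs/reports/{name}-{today}.{n}.md")
            for n in range(1, judges + 1)]
```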
Meta-judge prompt template:
## Task
Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that multiple judge agents will use to evaluate the solution through independent analysis and multi-round debate.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{task description - what the solution was supposed to accomplish}
## Context
{Any relevant context about the solution being evaluated}
## Artifact Type
{code | documentation | configuration | etc.}
## Evaluation Mode
Multi-judge debate with consensus-seeking across rounds
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support both independent analysis and debate-based refinement.
Dispatch:
Use Task tool:
- description: "Meta-judge: generate evaluation specification for {solution-name}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for the meta-judge to complete and extract the evaluation specification YAML from its output before proceeding to Phase 1.
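The extraction step can be sketched as follows, assuming the meta-judge wraps its specification in a fenced ```yaml block as instructed; the helper name is hypothetical.

```python
import re

def extract_yaml_spec(output: str) -> str:
    """Pull the evaluation specification out of a fenced ```yaml block;
    fall back to the raw output if no fence is present."""
    m = re.search(r"```yaml\n(.*?)```", output, re.DOTALL)
    return m.group(1).strip() if m else output.strip()
```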
Launch 3 independent judge agents in parallel (Opus for rigor):
Output: .specs/reports/{solution-name}-{date}.[1|2|3].md

Key principle: Independence in the initial analysis prevents groupthink.
Prompt template for initial judges:
You are Judge {N} evaluating a solution independently against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Solution
{path to solution file(s)}
## Task Description
{what the solution was supposed to accomplish}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output
.specs/reports/{solution-name}-{date}.{N}.md
Follow your full judge process as defined in your agent instructions!
Additional instructions:
Add "Done by Judge {N}" at the beginning of the report.
**Dispatch each judge:**
Use Task tool:
description: "Judge {N}: independent analysis of {solution-name}"
prompt: {judge prompt with evaluation specification YAML}
model: opus
subagent_type: "sadd:judge"
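The parallel launch can be sketched generically. The `dispatch_judge` function below is a hypothetical stand-in for the Task tool call above (model: opus, subagent_type: "sadd:judge"); the point is that all three judges are submitted before any result is awaited, so their analyses stay independent.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_judge(n: int, prompt: str) -> str:
    # Hypothetical stand-in for the Task tool dispatch shown above.
    return f"Done by Judge {n}"

def run_independent_analysis(prompt: str, judges: int = 3) -> list[str]:
    """Submit all judges before collecting any result, so they run in parallel."""
    with ThreadPoolExecutor(max_workers=judges) as pool:
        futures = [pool.submit(dispatch_judge, n, prompt)
                   for n in range(1, judges + 1)]
        return [f.result() for f in futures]
```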
For each debate round (max 3 rounds):
Launch 3 debate agents in parallel:
Each judge reads its own report (.specs/reports/{solution-name}-{date}.[1|2|3].md) and the other judges' reports (.specs/reports/{solution-name}-{date}.[1|2|3].md), then appends a ## Debate Round {R} section to its own file.

Key principle: Judges communicate only through the filesystem. The orchestrator does not mediate and must not read the report files itself, as doing so can overflow its context.
Prompt template for debate judges:
You are Judge {N} in debate round {R}.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Your Previous Report
{path to .specs/reports/{solution-name}-{date}.{N}.md}
## Other Judges' Reports
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
## Task Description
{what the solution was supposed to accomplish}
## Solution
{path to solution}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output
.specs/reports/{solution-name}-{date}.{N}.md (append to the existing file)
Follow your full judge process as defined in your agent instructions!
Additional debate instructions:
CRITICAL:
- Ground your arguments in the evaluation specification criteria
- Only revise if you find the other judges' evidence compelling
- Defend your original scores if you still believe them
- Quote specific evidence from the solution
Dispatch each debate judge:
Use Task tool:
description: "Judge {N}: debate round {R} for {solution-name}"
prompt: {debate judge prompt with evaluation specification YAML}
model: opus
subagent_type: "sadd:judge"
After each debate round, check for consensus:
Consensus is achieved if:
If there is no consensus after 3 rounds:
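The consensus check can be sketched as below. The exact conditions are defined by the evaluation specification; the rule used here (no criterion may have a score spread above a threshold) is an assumption for illustration.

```python
def scores_in_consensus(scores_by_judge: dict[str, dict[str, int]],
                        max_gap: int = 1) -> bool:
    """Assumed consensus rule: for every criterion, the spread between the
    highest and lowest judge score must not exceed `max_gap` points."""
    criteria = next(iter(scores_by_judge.values())).keys()
    for criterion in criteria:
        values = [judge[criterion] for judge in scores_by_judge.values()]
        if max(values) - min(values) > max_gap:
            return False
    return True
```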
Orchestration instructions:
Step 1: Dispatch the Meta-Judge (Phase 0.5)
Step 2: Run Independent Analysis (Phase 1)
Collect the reports from .specs/reports/{solution-name}-{date}.[1|2|3].md.
Step 3: Check for Consensus
Let's work through this systematically to ensure accurate consensus detection.
Read all three reports and extract:
Check consensus step by step:
Step 4: Decision Point
Step 5: Run Debate Round
Step 6: Reply with Report
Let's synthesize the evaluation results step by step.
Step 7: Report No Consensus
If consensus is achieved, synthesize the final report by working through each section methodically:
# Consensus Evaluation Report
Let's compile the final consensus by analyzing each component systematically.
## Consensus Scores
First, let's consolidate all judges' final scores:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|-----------|---------|---------|---------|-------|
| {Name} | {X}/5 | {X}/5 | {X}/5 | {X}/5 |
...
**Consensus Overall Score**: {avg}/5.0
## Consensus Strengths
[Review each judge's identified strengths and extract the common themes that all judges agreed upon]
## Consensus Weaknesses
[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]
## Debate Summary
Let's trace how consensus was reached:
- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}
## Final Recommendation
Based on the consensus scores and the key strengths/weaknesses identified:
{Pass/Fail/Needs Revision with clear justification tied to the evidence}
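The score-consolidation step in the template above (per-criterion averages plus an overall mean) can be sketched as follows; the helper name is hypothetical.

```python
def consensus_table(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each criterion across judges and add an 'Overall' mean,
    rounded to one decimal as in the report template."""
    criteria = list(next(iter(scores.values())).keys())
    final = {c: round(sum(j[c] for j in scores.values()) / len(scores), 1)
             for c in criteria}
    final["Overall"] = round(sum(final[c] for c in criteria) / len(criteria), 1)
    return final
```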
Outputs:
- .specs/reports/ (created if it does not exist)
- .specs/reports/{solution-name}-{date}.1.md, .specs/reports/{solution-name}-{date}.2.md, .specs/reports/{solution-name}-{date}.3.md

Example:
/judge-with-debate Implement REST API for user management --solution "src/api/users.ts"
Phase 0.5 - Meta-Judge (assuming date 2025-01-15):
Phase 1 - Independent Analysis (3 judges receive the specification):
- .specs/reports/users-api-2025-01-15.1.md - Judge 1 scores correctness 4/5, security 3/5
- .specs/reports/users-api-2025-01-15.2.md - Judge 2 scores correctness 4/5, security 5/5
- .specs/reports/users-api-2025-01-15.3.md - Judge 3 scores correctness 5/5, security 4/5

Disagreement detected: security scores range from 3 to 5.
Phase 2 - Debate Round 1 (judges reference the evaluation specification):
Debate Round 1 outputs:
Debate Round 2 (same evaluation specification):
Final consensus:
Correctness: 4.3/5
Design: 4.5/5
Security: 4.0/5 (2 debate rounds to consensus)
Performance: 4.7/5
Documentation: 4.0/5
Overall: 4.3/5 - PASS
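As a sanity check on the worked example above, the overall score is the mean of the five criterion scores:

```python
# Criterion scores from the example: correctness, design, security,
# performance, documentation
criterion_scores = [4.3, 4.5, 4.0, 4.7, 4.0]
overall = round(sum(criterion_scores) / len(criterion_scores), 1)
print(overall)  # 4.3
```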