sadd:do-competitively by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:do-competitively

Key features:
CRITICAL: You are not an implementation agent or a judge; you should not read the context files provided for sub-agents or tasks. You should not read the reports, and you should not overwhelm your context with unnecessary information. You MUST follow the process step by step. Any deviation will be treated as a failure and you will be terminated immediately!
This command implements a multi-phase adaptive competitive orchestration pattern:
Phase 1: Competitive Generation with Self-Critique + Meta-Judge (IN PARALLEL)
┌─ Meta-Judge → Evaluation Specification YAML ───────────┐
Task ────┼─ Agent 2 → Draft → Critique → Revise → Solution B ───┐ │
├─ Agent 3 → Draft → Critique → Revise → Solution C ───┼─┤
└─ Agent 1 → Draft → Critique → Revise → Solution A ───┘ │
│
Phase 2: Multi-Judge Evaluation with Verification │
┌─ Judge 1 → Evaluate → Verify → Revise → Report A ─┐ │
├─ Judge 2 → Evaluate → Verify → Revise → Report B ─┼────┤
└─ Judge 3 → Evaluate → Verify → Revise → Report C ─┘ │
│
Phase 2.5: Adaptive Strategy Selection │
Analyze Consensus ───────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (return Phase 1) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 3: Evidence-Based Synthesis │ │
(Only if FULL_SYNTHESIS) │ │
Synthesizer ─────────────────────┴───────────────────────┴─→ Final Solution
Before starting, ensure the reports directory exists:
mkdir -p .specs/reports
Report naming convention: .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
Where:
- {solution-name} - derived from the output path (e.g., users-api from output specs/api/users.md)
- {YYYY-MM-DD} - the current date
- [1|2|3] - the judge number

Note: Solutions remain in their specified output locations; only evaluation reports go to .specs/reports/.
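The naming convention above can be sketched as follows. This is an illustrative guess at the derivation rule: the two examples in this document (specs/api/users.md -> users-api, specs/caching.md -> caching) suggest the stem is joined with its parent directory only when the file sits below the top-level specs/ directory; the real command may derive the name differently.

```python
from datetime import date
from pathlib import Path

def report_path(output_path: str, judge: int, on: date) -> str:
    """Build a judge's report path from the solution output path.

    Assumption (not confirmed by the source): {solution-name} joins the
    file stem with its parent directory name unless the file is directly
    under specs/ (e.g. specs/api/users.md -> users-api, specs/caching.md
    -> caching).
    """
    p = Path(output_path)
    parent = p.parent.name
    name = p.stem if parent in ("", "specs") else f"{p.stem}-{parent}"
    return f".specs/reports/{name}-{on.isoformat()}.{judge}.md"

print(report_path("specs/api/users.md", 1, date(2025, 1, 15)))
# .specs/reports/users-api-2025-01-15.1.md
```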
Launch 3 independent generator agents AND 1 meta-judge agent in parallel (4 agents total; Opus is recommended for all of them for quality):
The meta-judge runs in parallel with the 3 generators because it does not need their output; it only needs the task description to generate the evaluation criteria.
CRITICAL: Dispatch all 4 agents as foreground agents in a single message using 4 Task tool calls. The meta-judge MUST be the first tool call in the dispatch order so that it has time to gather context from the codebase before the generators modify it.
The meta-judge produces an evaluation specification YAML (rubrics, checklists, scoring criteria) tailored to this specific task, and returns the specification that all 3 judges will use.
Prompt template for the meta-judge:
## Task
Generate an evaluation specification yaml for the following task. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementation artifacts.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (competitive implementations to be compared)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
Dispatch:
Use Task tool:
- description: "Meta-judge: {brief task summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Solution naming convention: {solution-file}.[a|b|c].[ext]
Where:
- {solution-file} - derived from the task (e.g., a "create users.ts" task results in users as the solution file name)
- [a|b|c] - a unique identifier per sub-agent
- [ext] - the file extension (e.g., md, ts, etc.)

Key principle: Diversity through independence - agents explore different approaches.
CRITICAL: You MUST give agents and judges filenames that include the [a|b|c] identifier! Omitting it will result in your immediate termination!
Prompt template for generators:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{define expected output following such pattern: {solution-file}.[a|b|c].[ext] based on the task description and context. Each [a|b|c] is a unique identifier per sub-agent. You MUST provide filename with it!!!}
</output>
Instructions:
Let's approach this systematically to produce the best possible solution.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Consider multiple approaches - what are the different ways to solve this?
3. Think through the tradeoffs step by step and choose the approach you believe is best
4. Implement it completely
5. Generate 5 verification questions about critical aspects
6. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
7. Revise solution:
- Fix identified issues
8. Explain what was changed and why
Send ALL 4 Task tool calls in a single message: meta-judge first, then the generators:
Message with 4 tool calls:
Tool call 1 (meta-judge):
- description: "Meta-judge: {brief task summary}"
- model: opus
- subagent_type: "sadd:meta-judge"
Tool call 2 (generator A):
- description: "Generate solution A: {brief task summary}"
- model: opus
Tool call 3 (generator B):
- description: "Generate solution B: {brief task summary}"
- model: opus
Tool call 4 (generator C):
- description: "Generate solution C: {brief task summary}"
- model: opus
Wait for ALL 4 agents to return before proceeding to Phase 2.
Launch 3 independent judges in parallel (Opus recommended for rigor):
CRITICAL: Wait for ALL Phase 1 agents (meta-judge + 3 generators) to complete before dispatching the judges.
CRITICAL: Provide each judge with the EXACT meta-judge evaluation specification YAML. Do not skip or add anything, do not modify it in any way, and do not shorten or summarize any text in it!
Each judge writes its report to .specs/reports/{solution-name}-{date}.[1|2|3].md. Key principle: Multiple independent evaluations reduce bias and catch different issues.
Prompt template for judges:
You are evaluating {number} competitive solutions against an evaluation specification produced by the meta-judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all candidate solutions}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

Write full report to: {.specs/reports/{solution-name}-{date}.[1|2|3].md - each judge gets a unique number identifier}
CRITICAL: You must reply with this exact structured header format:
VOTE: [Solution A/B/C]
SCORES:
Solution A: [X.X]/5.0
Solution B: [X.X]/5.0
Solution C: [X.X]/5.0
CRITERIA:
[Summary of your evaluation]
Follow your full judge process as defined in your agent instructions!
CRITICAL: Base your evaluation on evidence, not impressions. Quote specific text.
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
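A minimal sketch of how the orchestrator might parse the structured header above. The regexes are illustrative assumptions; the actual judge agent may format its reply slightly differently.

```python
import re

def parse_judge_header(reply: str) -> dict:
    """Extract the VOTE and per-solution SCORES from a judge's reply header."""
    vote = re.search(r"VOTE:\s*Solution\s+([ABC])", reply)
    scores = {
        m.group(1): float(m.group(2))
        for m in re.finditer(r"Solution\s+([ABC]):\s*([\d.]+)/5\.0", reply)
    }
    return {"vote": vote.group(1) if vote else None, "scores": scores}

header = """VOTE: Solution A
SCORES:
Solution A: 4.5/5.0
Solution B: 3.2/5.0
Solution C: 2.8/5.0
CRITERIA:
Most RESTful design, good security coverage."""
print(parse_judge_header(header))
# {'vote': 'A', 'scores': {'A': 4.5, 'B': 3.2, 'C': 2.8}}
```

This keeps the orchestrator's context small: only the short header is parsed, never the full report files.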
**CRITICAL:** NEVER reveal the score threshold to the judges. Judges MUST NOT know what the score threshold is, so that they are not biased!
**Dispatch:**
Use the Task tool (3 calls in a single message):
* description: "Judge [1|2|3]: {brief task summary}"
* prompt: {judge prompt with exact meta-judge specification YAML}
* model: opus
* subagent_type: "sadd:judge"
### Phase 2.5: Adaptive Strategy Selection (early return)
The **orchestrator** (not a subagent) analyzes the judges' outputs to determine the optimal strategy.
#### Decision logic
**Step 1: Parse the structured headers from the judges' replies**
Parse the judges' replies.
**CRITICAL:** Do not read the report files themselves; they can overflow your context.
**Step 2: Check for a unanimous winner**
Compare all three VOTE values:
* If Judge 1 VOTE = Judge 2 VOTE = Judge 3 VOTE (same solution):
  * **Strategy: SELECT_AND_POLISH**
  * **Reason:** Clear consensus: all three judges prefer the same solution
**Step 3: Check whether all solutions are fundamentally flawed**
If there is no unanimous vote, calculate the average scores:
1. Average Solution A score: (Judge1_A + Judge2_A + Judge3_A) / 3
2. Average Solution B score: (Judge1_B + Judge2_B + Judge3_B) / 3
3. Average Solution C score: (Judge1_C + Judge2_C + Judge3_C) / 3
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0):
* **Strategy: REDESIGN**
* **Reason:** All solutions fall below the quality threshold; the approaches have fundamental problems
**Step 5: Default to full synthesis**
If none of the above conditions are met:
* **Strategy: FULL_SYNTHESIS**
* **Reason:** A split decision where each solution has merit; synthesis is needed to combine the best elements
#### Strategy 1: SELECT_AND_POLISH
**When:** Clear winner (unanimous votes)
**Process:**
1. Select the winning solution as the base
2. Launch a subagent to apply specific improvements based on judge feedback
3. Cherry-pick the 1-2 best elements from the runner-up solutions
4. Document what was added and why
**Benefits:**
* Saves synthesis cost (simpler than full synthesis)
* Preserves the winning solution's proven quality
* Makes targeted improvements instead of a full rebuild
**Prompt template:**
```markdown
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's work through this step by step to polish the winning solution effectively.
1. Take the winning solution as your base (do NOT rewrite it)
2. First, carefully review all judge feedback to understand what needs improvement
3. Apply improvements based on judge feedback:
- Fix identified weaknesses
- Add missing elements judges noted
4. Next, examine the runner-up solutions for standout elements
5. Cherry-pick 1-2 specific elements from runners-up if judges praised them
6. Document changes made:
- What was changed and why
- What was added from other solutions
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
#### Strategy 2: REDESIGN
**When:** All solutions scored <3.0/5.0 (fundamental issues across the board)
**Process:**
Prompt template for the new implementation:
You are analyzing why all solutions failed to meet quality standards, and implementing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all candidate solutions}
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports with low scores}
</evaluation_reports>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
#### Strategy 3: FULL_SYNTHESIS
**When:** No clear winner AND the solutions have merit (scores >= 3.0)
**Process:** Proceed to Phase 3 (Evidence-Based Synthesis)
### Phase 3: Evidence-Based Synthesis
Executed only when Strategy 3 (FULL_SYNTHESIS) is selected in Phase 2.5.
Launch 1 synthesis agent (Opus recommended for quality):
Key principle: Evidence-based synthesis leverages collective intelligence.
Prompt template for the synthesizer:
You are synthesizing the best solution from competitive implementations and evaluations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all candidate solutions}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<output>
{define expected output following such pattern: solution.md based on the task description and context. Result should be a complete solution to the task.}
</output>
Instructions:
Let's think through this synthesis step by step to create the best possible combined solution.
1. First, read all solutions and evaluation reports carefully
2. Map out the consensus:
- What strengths did multiple judges praise in each solution?
- What weaknesses did multiple judges criticize in each solution?
3. For each major component or section, think through:
- Which solution handles this best and why?
- Could a hybrid approach work better?
4. Create the best possible solution by:
- Copying text directly when one solution is clearly superior
- Combining approaches when a hybrid would be better
- Fixing all identified issues
- Preserving the best elements from each
5. Explain your synthesis decisions:
- What you took from each solution
- Why you made those choices
- How you addressed identified weaknesses
CRITICAL: Do not create something entirely new. Synthesize the best from what exists.
Output files:
- Candidate solutions: {solution-file}.[a|b|c].[ext] (in the specified output location)
- Evaluation reports: .specs/reports/{solution-name}-{date}.[1|2|3].md
- Final solution: {output_path}

Once command execution is complete, reply to the user with the following structure:
## Execution Summary
Original Task: {task_description}
Strategy Used: {strategy} ({reason})
### Results
| Phase | Agents | Models | Status |
|-------------------------|--------|----------|-------------|
| Phase 1: Competitive Generation + Meta-Judge | 4 (3 generators + 1 meta-judge) | opus x 4 | [Complete / Failed] |
| Phase 2: Multi-Judge Evaluation | 3 | opus x 3 | [Complete / Failed] |
| Phase 2.5: Adaptive Strategy Selection | orchestrator | - | {strategy} |
| Phase 3: [Synthesis/Polish/Redesign] | [N] | [model] | [Complete / Failed] |
### Files Created
Final Solution:
- {output_path} - Synthesized production-ready command
Candidate Solutions:
- {solution-file}.[a|b|c].[ext] (Score: [X.X]/5.0)
Evaluation Reports:
- .specs/reports/{solution-file}-{date}.[1|2|3].md (Vote: [Solution A/B/C])
### Synthesis Decisions
| Element | Source | Rationale |
|----------------------|------------------|-------------|
| [element] | Solution [B/A/C] | [rationale] |
Do:
/do-competitively "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
Phase 1 outputs (4 parallel agents):
- specs/api/users.a.md - resource-based design with nested routes
- specs/api/users.b.md - action-based design with RPC-style endpoints
- specs/api/users.c.md - minimal design, missing auth considerations

Phase 2 outputs (assuming date 2025-01-15, 3 judges using the meta-judge specification):
.specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.5/5.0, B=3.2/5.0, C=2.8/5.0
"Most RESTful, good security"
.specs/reports/users-api-2025-01-15.2.md:
VOTE: Solution A
SCORES: A=4.3/5.0, B=3.5/5.0, C=2.6/5.0
"Clean resource design, scalable"
.specs/reports/users-api-2025-01-15.3.md:
VOTE: Solution A
SCORES: A=4.6/5.0, B=3.0/5.0, C=2.9/5.0
"Best practices, clear structure"
Phase 2.5 decision (orchestrator parses headers):
Unanimous vote: all three judges chose Solution A
Strategy: SELECT_AND_POLISH
Phase 3 output:
specs/api/users.md - Solution A polished with:
/do-competitively "Design caching strategy for high-traffic API" \
--output "specs/caching.md" \
--criteria "performance,memory-efficiency,simplicity,reliability"
Phase 1 outputs (4 parallel agents):
- specs/caching.a.md - Redis with LRU eviction
- specs/caching.b.md - multi-tier cache (memory + Redis)
- specs/caching.c.md - CDN + application cache

Phase 2 outputs (assuming date 2025-01-15, 3 judges using the meta-judge specification):
.specs/reports/caching-2025-01-15.1.md:
VOTE: Solution B
SCORES: A=3.8/5.0, B=4.2/5.0, C=3.9/5.0
"Best performance, but complex"
.specs/reports/caching-2025-01-15.2.md:
VOTE: Solution A
SCORES: A=4.0/5.0, B=3.9/5.0, C=3.7/5.0
"Simple, reliable, proven"
.specs/reports/caching-2025-01-15.3.md:
VOTE: Solution C
SCORES: A=3.6/5.0, B=4.0/5.0, C=4.1/5.0
"Global reach, cost-effective"
Phase 2.5 decision (orchestrator parses headers):
Split votes: B, A, C (no consensus)
Average scores: A=3.8, B=4.0, C=3.9 (all >=3.0)
Strategy: FULL_SYNTHESIS
Phase 3 output:
specs/caching.md - hybrid approach:
/do-competitively "Design authentication system with social login" \
--output "specs/auth.md" \
--criteria "security,user-experience,maintainability"
Phase 1 outputs (4 parallel agents):
- specs/auth.a.md - custom OAuth2 implementation
- specs/auth.b.md - session-based with social providers
- specs/auth.c.md - JWT with password-only auth

Phase 2 outputs (assuming date 2025-01-15, 3 judges using the meta-judge specification):
.specs/reports/auth-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=2.5/5.0, B=2.2/5.0, C=2.3/5.0
"Security risks, reinventing the wheel"
.specs/reports/auth-2025-01-15.2.md:
VOTE: Solution B
SCORES: A=2.4/5.0, B=2.8/5.0, C=2.1/5.0
"Sessions don't scale, missing requirements"
.specs/reports/auth-2025-01-15.3.md:
VOTE: Solution C
SCORES: A=2.6/5.0, B=2.5/5.0, C=2.3/5.0
"No social login, security concerns"
Phase 2.5 decision (orchestrator parses headers):
Split votes: A, B, C (no consensus)
Average scores: A=2.5, B=2.5, C=2.2 (ALL <3.0)
Strategy: REDESIGN
Reason: All solutions are below the 3.0 threshold; fundamental issues
Do not stop: return to Phase 1; the run should eventually finish with the SELECT_AND_POLISH or FULL_SYNTHESIS strategy.
Weekly Installs: 210
Repository: https://github.com/neolabhq/context-engineering-kit
GitHub Stars: 699
First Seen: Feb 19, 2026
Installed on:
- opencode: 205
- github-copilot: 204
- codex: 204
- gemini-cli: 203
- kimi-cli: 201
- cursor: 201