sadd:tree-of-thoughts by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:tree-of-thoughts

Key benefits:

This command implements an eight-phase systematic reasoning pattern with meta-judge evaluation and adaptive strategy selection:
Phase 1: Exploration (Propose Approaches)
┌─ Agent A → Proposals A1, A2 (with probabilities) ─┐
Task ───┼─ Agent B → Proposals B1, B2 (with probabilities) ─┼─┐
└─ Agent C → Proposals C1, C2 (with probabilities) ─┘ │
│
Phase 1.5: Pruning Meta-Judge (runs in parallel with Phase 1) │
Meta-Judge → Pruning Evaluation Specification YAML ───┤
│
Phase 2: Pruning (Vote for Best 3) │
┌─ Judge 1 → Votes + Rationale ─┐ │
├─ Judge 2 → Votes + Rationale ─┼─────────────────────┤
└─ Judge 3 → Votes + Rationale ─┘ │
│ │
├─→ Select Top 3 Proposals │
│ │
Phase 3: Expansion (Develop Full Solutions) │
┌─ Agent A → Solution A (from proposal X) ─┐ │
├─ Agent B → Solution B (from proposal Y) ─┼──────────┤
└─ Agent C → Solution C (from proposal Z) ─┘ │
│
Phase 3.5: Evaluation Meta-Judge (runs in parallel w/ Phase 3)│
Meta-Judge → Evaluation Specification YAML ───────────┤
│
Phase 4: Evaluation (Judge Full Solutions) │
┌─ Judge 1 → Report 1 ─┐ │
├─ Judge 2 → Report 2 ─┼──────────────────────────────┤
└─ Judge 3 → Report 3 ─┘ │
│
Phase 4.5: Adaptive Strategy Selection │
Analyze Consensus ────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (Phase 3) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 5: Synthesis (Only if FULL_SYNTHESIS) │
Synthesizer ────────────────────┴──────────────────────┴─→ Final Solution
Before starting, ensure the directory structure exists:
mkdir -p .specs/research .specs/reports
Naming conventions:

- .specs/research/{solution-name}-{YYYY-MM-DD}.proposals.[a|b|c].md
- .specs/research/{solution-name}-{YYYY-MM-DD}.pruning.[1|2|3].md
- .specs/research/{solution-name}-{YYYY-MM-DD}.selection.md
- .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md

Where:

- {solution-name} - derived from the output path (e.g., users-api from the output specs/api/users.md)
- {YYYY-MM-DD} - the current date

Note: Solutions remain in their specified output locations; only research and evaluation files go to .specs/
Launch 3 independent agents in parallel (recommended: Sonnet for speed):

Output: .specs/research/{solution-name}-{date}.proposals.[a|b|c].md

Key principle: Systematic exploration through probabilistic sampling from the full distribution of possible approaches.

Prompt template for explorers:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{.specs/research/{solution-name}-{date}.proposals.[a|b|c].md - each agent gets unique letter identifier}
</output>
Instructions:
Let's approach this systematically by first understanding what we're solving, then exploring the solution space.
**Step 1: Decompose the problem**
Before generating approaches, break down the task:
- What is the core problem being solved?
- What are the key constraints and requirements?
- What subproblems must any solution address?
- What are the evaluation criteria for success?
**Step 2: Map the solution space**
Identify the major dimensions along which solutions can vary:
- Architecture patterns (e.g., monolithic vs distributed)
- Implementation strategies (e.g., eager vs lazy)
- Trade-off axes (e.g., performance vs simplicity)
**Step 3: Generate 6 distinct high-level approaches**
**Sampling guidance:**
Please sample approaches at random from the [full distribution / tails of the distribution]
- For first 3 approaches aim for high probability, over 0.80
- For last 3 approaches aim for diversity - explore different regions of the solution space, such that the probability of each response is less than 0.10
For each approach, provide:
- Name and one-sentence summary
- Detailed description (2-3 paragraphs)
- Key design decisions and rationale
- Trade-offs (what you gain vs what you sacrifice)
- Probability (0.0-1.0)
- Complexity estimate (low/medium/high)
- Potential risks and failure modes
**Step 4: Verify diversity**
Before finalizing, check:
- Are approaches genuinely different, not minor variations?
- Do they span different regions of the solution space?
- Have you covered both conventional and unconventional options?
CRITICAL:
- Do NOT implement full solutions yet - only high-level approaches
- Ensure approaches are genuinely different, not minor variations
CRITICAL: Launch the pruning meta-judge in parallel with the Phase 1 exploration agents. The meta-judge does not need the exploration output to generate pruning criteria; it only needs the original task description.
The pruning meta-judge generates an evaluation specification (rubrics, checklists, scoring criteria) tailored to evaluating high-level proposals for pruning.
Prompt template for the pruning meta-judge:
## Task
Generate an evaluation specification yaml for pruning high-level solution proposals. You will produce rubrics, checklists, and scoring criteria that judge agents will use to select the top 3 proposals for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
proposals (high-level approaches with probability estimates, not full implementations)
## Evaluation Focus
Feasibility, alignment with requirements, potential for high-quality result, risk manageability
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation and ranking of proposals.
Dispatch:
Use Task tool:
- description: "Pruning Meta-judge: {brief task summary}"
- prompt: {pruning meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for both the Phase 1 exploration agents and the Phase 1.5 pruning meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):

Input: all proposal files (.specs/research/) and the pruning meta-judge's evaluation specification YAML

Output: .specs/research/{solution-name}-{date}.pruning.[1|2|3].md

Key principle: Independent evaluation with meta-judge-generated criteria ensures consistent, tailored assessment without hardcoded weights.

CRITICAL: Provide each judge with the EXACT same pruning meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Prompt template for pruning judges:
You are evaluating {N} proposed approaches against an evaluation specification produced by the meta judge, to select the top 3 for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Proposals
{list of paths to all proposal files}
Read all proposals carefully before evaluating.
## Evaluation Specification
```yaml
{pruning meta-judge's evaluation specification YAML}
```

Write full report to: .specs/research/{solution-name}-{date}.pruning.[1|2|3].md
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Pruning Judge {1|2|3}: {brief task summary}"
prompt: {pruning judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
After judges complete voting:
Tally the votes and record the selected top 3 in .specs/research/{solution-name}-{date}.selection.md:
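One plausible sketch of how the orchestrator could tally the three judges' ranked votes into selection.md. The 3/2/1 point weighting is assumed for illustration; the source only requires that votes be tallied and the top 3 proposals kept:

```python
from collections import Counter

# Each judge returns an ordered top-3 list of proposal names.
# 3 points for 1st choice, 2 for 2nd, 1 for 3rd (illustrative assumption).
WEIGHTS = (3, 2, 1)

def tally(judge_rankings: list[list[str]], top_n: int = 3) -> list[tuple[str, int]]:
    scores: Counter[str] = Counter()
    for ranking in judge_rankings:
        for points, proposal in zip(WEIGHTS, ranking):
            scores[proposal] += points
    return scores.most_common(top_n)

# Hypothetical rankings shaped like the REST API example later in this document:
rankings = [
    ["resource-rest", "pure-rest", "monolithic"],
    ["pure-rest", "hybrid-services", "resource-rest"],
    ["resource-rest", "rest-graphql", "pure-rest"],
]
# resource-rest: 3+1+3 = 7 points; pure-rest: 2+3+1 = 6 points
```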
Launch 3 independent agents in parallel (recommended: Opus for quality):
Output: solution.a.md, solution.b.md, solution.c.md

Key principle: Focused development of validated approaches with awareness of evaluation feedback.
Prompt template for expansion agents:
You are developing a full solution based on a selected proposal.
<task>
{task_description}
</task>
<selected_proposal>
{write selected proposal EXACTLY as it is. Including all details provided by the agent}
Read this carefully - it is your starting point.
</selected_proposal>
<judge_feedback>
{concerns and questions from judges about this proposal}
Address these in your implementation.
</judge_feedback>
<output>
solution.[*].md where [*] is your unique identifier (a, b, or c)
</output>
Instructions:
Let's work through this systematically to ensure we build a complete, high-quality solution.
**Step 1: Understand the proposal deeply**
Before implementing, analyze:
- What is the core insight or approach of this proposal?
- What are the key design decisions already made?
- What gaps need to be filled for a complete solution?
**Step 2: Address judge feedback**
For each concern raised by judges:
- What specific change or addition addresses this concern?
- How does this change integrate with the proposal's approach?
**Step 3: Decompose into implementation subproblems**
Break the solution into logical parts:
- What are the main components or sections?
- What must be defined first for other parts to build upon?
- What are the dependencies between parts?
**Step 4: Implement each subproblem**
For each component, work through:
- Core functionality and behavior
- Edge cases and error handling
- Integration points with other components
**Step 5: Self-verification**
Generate 3-5 verification questions about critical aspects, then answer them:
- Review solution against each question
- Identify gaps or weaknesses
- Fix identified issues
**Step 6: Document changes**
Explain what was changed from the original proposal and why.
<example>
**Example of good expansion thinking:**
Proposal: "Use event-driven architecture with message queue"
Step 1 Analysis:
- Core insight: Decouple components via async messaging
- Key decisions: Events as primary communication, eventual consistency
- Gaps: Need to define event schemas, queue technology, error handling
Step 2 - Addressing judge concern "What about message ordering?":
- Add partition keys for ordered processing within entity scope
- Document ordering guarantees and limitations
Step 3 - Subproblems:
1. Event schema definitions (foundational - others depend on this)
2. Producer interfaces (depends on schemas)
3. Consumer handlers (depends on schemas)
4. Error handling and dead letter queues (depends on both)
5. Integration patterns (builds on all above)
</example>
CRITICAL:
- Stay faithful to the selected proposal's core approach
- Do not switch to a different approach midway
- Address judge feedback explicitly
- Produce a complete, implementable solution
CRITICAL: Launch the evaluation meta-judge in parallel with the Phase 3 expansion agents. The meta-judge does not need the expansion output to generate evaluation criteria; it only needs the original task description.
The evaluation meta-judge generates an evaluation specification (rubrics, checklists, scoring criteria) tailored to evaluating full solution implementations.
Prompt template for the evaluation meta-judge:
## Task
Generate an evaluation specification yaml for evaluating full solution implementations. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementations.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (full implementations developed from selected proposals)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
Dispatch:
Use Task tool:
- description: "Evaluation Meta-judge: {brief task summary}"
- prompt: {evaluation meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for both the Phase 3 expansion agents and the Phase 3.5 evaluation meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):

Output: .specs/reports/{solution-name}-{date}.[1|2|3].md

Key principle: Multiple independent evaluations with meta-judge-generated specifications and explicit evidence reduce bias and catch different quality aspects.

CRITICAL: Provide each judge with the EXACT same evaluation meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
CRITICAL: NEVER provide the score threshold to judges. Judges must not know what the threshold is, to avoid biasing their scores!
Prompt template for evaluation judges:
You are evaluating {number} full solutions against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all solution files}
Read all solutions carefully before evaluating.
## Evaluation Specification
```yaml
{evaluation meta-judge's evaluation specification YAML}
```

Write full report to: .specs/reports/{solution-name}-{date}.[1|2|3].md
CRITICAL: You must reply with this exact structured header format:
VOTE: [Solution A/B/C]
SCORES:
Solution A: [X.X]/5.0
Solution B: [X.X]/5.0
Solution C: [X.X]/5.0
CRITERIA:
[Summary of your evaluation]
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Evaluation Judge {1|2|3}: {brief task summary}"
prompt: {evaluation judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Step 1: Parse structured headers from judge replies
Parse the judges' replies. CRITICAL: Do not read the report files themselves, as they can overflow your context.
Step 2: Check for unanimous winner
Compare all three VOTE values. If all three judges voted for the same solution → SELECT_AND_POLISH.
Step 3: Check if all solutions are fundamentally flawed
If there is no unanimous vote, calculate average scores.
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0) → REDESIGN (return to Phase 3).
Step 4: Default to full synthesis
If none of the above conditions are met → FULL_SYNTHESIS (Phase 5).
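The three checks can be sketched as orchestrator logic. The header-parsing regexes assume the exact VOTE/SCORES format shown earlier and are illustrative, not part of the command:

```python
import re

def parse_header(reply: str) -> tuple[str, dict[str, float]]:
    """Parse VOTE and SCORES from a judge's structured header, e.g.:
    VOTE: Solution A
    SCORES: Solution A: 4.2/5.0 Solution B: 3.8/5.0 Solution C: 2.9/5.0
    """
    vote = re.search(r"VOTE:\s*\[?Solution\s+([ABC])", reply).group(1)
    scores = {m.group(1): float(m.group(2))
              for m in re.finditer(r"Solution\s+([ABC]):\s*([\d.]+)/5\.0", reply)}
    return vote, scores

def select_strategy(replies: list[str]) -> str:
    votes, all_scores = [], []
    for reply in replies:
        vote, scores = parse_header(reply)
        votes.append(vote)
        all_scores.append(scores)
    if len(set(votes)) == 1:                      # Step 2: unanimous winner
        return "SELECT_AND_POLISH"
    avgs = {s: sum(sc[s] for sc in all_scores) / len(all_scores)
            for s in ("A", "B", "C")}
    if all(avg < 3.0 for avg in avgs.values()):   # Step 3: all flawed
        return "REDESIGN"
    return "FULL_SYNTHESIS"                       # Step 4: default
```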
When: Clear winner (unanimous votes)
Process:
Benefits:
Prompt template:
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's approach this polishing task methodically to improve without disrupting what works.
**Step 1: Understand why this solution won**
Analyze the winning solution:
- What are its core strengths that judges praised?
- What makes its approach superior to alternatives?
- Which parts should remain untouched?
**Step 2: Catalog improvement opportunities**
From judge feedback, identify:
- Specific weaknesses mentioned (list each one)
- Missing elements judges noted
- Areas where runner-ups were praised
**Step 3: Prioritize changes by impact**
For each improvement opportunity:
- High impact: Directly addresses judge criticism
- Medium impact: Adds praised element from runner-up
- Low impact: Nice-to-have refinement
Focus on high-impact changes first.
**Step 4: Apply improvements surgically**
For each change:
- Locate the specific section to modify
- Make the minimal change needed to address the issue
- Verify the change integrates cleanly with surrounding content
**Step 5: Cherry-pick from runners-up**
Review runner-up solutions for:
- 1-2 specific elements that judges praised
- Elements that complement (not conflict with) the winning approach
- Only incorporate if clearly superior to winning solution's version
**Step 6: Document all changes**
Record:
- What was changed and why (with reference to judge feedback)
- What was added from other solutions (cite source)
- What was intentionally left unchanged
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
Note: If the redesign fails twice, escalate to the user for guidance.
Prompt template for the new implementation:
You are analyzing why all solutions failed to meet quality standards in order to inform a redesign, and you will implement a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all solution files}
Average scores: A={avg_a}/5.0, B={avg_b}/5.0, C={avg_c}/5.0
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
All solutions scored below 3.0/5.0 threshold.
</evaluation_reports>
<output>
.specs/research/{solution-name}-{date}.redesign-analysis.md
</output>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
When: No clear winner, and the solutions have merit (scores >= 3.0)
Process: Proceed to Phase 5 (evidence-based synthesis)
Executed only if Strategy 3 (FULL_SYNTHESIS) was selected in Phase 4.5
Launch 1 synthesis agent (recommended: Opus for quality):

Input: all evaluation reports (.specs/reports/) and research files (.specs/research/)

Key principle: Evidence-based synthesis leverages the collective intelligence produced during exploration and evaluation.

Prompt template for the synthesizer:
You are synthesizing the best solution from explored, pruned, and evaluated implementations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all solution files}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<selection_rationale>
{path to selection.md explaining why these proposals were chosen}
</selection_rationale>
<output>
{output_path} - The final synthesized solution
</output>
Instructions:
Let's approach this synthesis systematically by first analyzing, then decomposing, then building.
**Step 1: Build the evidence base**
Before synthesizing, gather evidence from judge reports:
- What did multiple judges praise? (consensus strengths)
- What did multiple judges criticize? (consensus weaknesses)
- Where did judges disagree? (areas needing careful analysis)
**Step 2: Decompose into synthesis subproblems**
Break the solution into logical sections or components. For each component:
- Which solution handles this best? (cite evidence)
- Are there complementary elements from multiple solutions?
- What issues were identified that need fixing?
**Step 3: Solve each subproblem**
For each component/section, determine the synthesis strategy:
*Strategy A - Clear winner:* If one solution is clearly superior for this component:
- Copy that section directly
- Document: "Taken from Solution X because [judge evidence]"
*Strategy B - Complementary combination:* If solutions have complementary strengths:
- Identify what each contributes
- Combine carefully, ensuring consistency
- Document: "Combined X from Solution A with Y from Solution B because [rationale]"
*Strategy C - All flawed:* If all solutions have issues in this area:
- Start with the best version
- Apply fixes based on judge criticism
- Document: "Based on Solution X, modified to address [specific issues]"
**Step 4: Integrate and verify consistency**
After synthesizing all components:
- Check that combined elements work together
- Resolve any contradictions between borrowed sections
- Ensure consistent terminology and style
**Step 5: Document synthesis decisions**
Create a synthesis log:
- What you took from each solution (with specific citations)
- Why you made those choices (reference judge feedback)
- How you addressed identified weaknesses
- Any novel combinations or improvements
<example>
**Example synthesis decision for an API design:**
Component: Authentication flow
- Solution A: JWT with refresh tokens (praised for security by 2/3 judges)
- Solution B: Session-based (praised for simplicity by 1 judge, criticized for scalability)
- Solution C: OAuth2 only (criticized as over-engineered for use case)
Decision: Take Solution A's authentication flow directly.
Evidence: Judges 1 and 3 both noted "JWT approach provides good balance of security and statelessness"
Modification: None needed - this section was rated highest across judges.
</example>
**Step 6: Revise your solution**
- Generate 5 verification questions about critical aspects
- Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
- Revise solution:
- Fix identified issues
- Explain what was changed and why
CRITICAL:
- Do not create something entirely new - synthesize the best from what exists
- Cite your sources (which solution, which section)
- Explain every major decision
- Address all consensus weaknesses identified by judges
.specs/research/ (created if it does not exist)

- .specs/research/{solution-name}-{date}.proposals.[a|b|c].md - high-level approaches with probabilities
- .specs/research/{solution-name}-{date}.pruning.[1|2|3].md - judge evaluations and votes
- .specs/research/{solution-name}-{date}.selection.md - vote tallies and the selected proposals
- solution.a.md, solution.b.md, solution.c.md - full implementations (at the specified output location)

.specs/reports/ (created if it does not exist)

- .specs/reports/{solution-name}-{date}.[1|2|3].md - final judge reports
- {output_path}

Example invocation:

/tree-of-thoughts "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
Phase 1 outputs (assuming date 2025-01-15):

- .specs/research/users-api-2025-01-15.proposals.a.md - 6 approaches from Agent A
- .specs/research/users-api-2025-01-15.proposals.b.md - 6 approaches from Agent B
- .specs/research/users-api-2025-01-15.proposals.c.md - 6 approaches from Agent C

Phase 1.5 outputs (run in parallel with Phase 1):

- Meta-judge (sadd:meta-judge) generates the pruning evaluation specification YAML

Phase 2 outputs (3 judges using the pruning meta-judge specification):

- .specs/research/users-api-2025-01-15.pruning.1.md - Top 3: Resource-based REST, Pure REST, Monolithic
- .specs/research/users-api-2025-01-15.pruning.2.md - Top 3: Pure REST, Hybrid (services), Resource-based REST
- .specs/research/users-api-2025-01-15.pruning.3.md - Top 3: Resource-based REST, REST+GraphQL hybrid, Pure REST
- .specs/research/users-api-2025-01-15.selection.md - Selected: Resource-based REST (8 points), Pure REST (7 points), Monolithic (4 points)

Phase 3 outputs:

- specs/api/users.a.md - full resource-based design with nested routes
- specs/api/users.b.md - flat REST design with simple endpoints
- specs/api/users.c.md - monolithic API with internal service orientation

Phase 3.5 outputs (run in parallel with Phase 3):

- Meta-judge (sadd:meta-judge) generates the evaluation specification YAML

Phase 4 outputs (3 judges using the evaluation meta-judge specification):

- .specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.2/5.0, B=3.8/5.0, C
Key benefits:
This command implements an eight-phase systematic reasoning pattern with meta-judge evaluation and adaptive strategy selection:
Phase 1: Exploration (Propose Approaches)
┌─ Agent A → Proposals A1, A2 (with probabilities) ─┐
Task ───┼─ Agent B → Proposals B1, B2 (with probabilities) ─┼─┐
└─ Agent C → Proposals C1, C2 (with probabilities) ─┘ │
│
Phase 1.5: Pruning Meta-Judge (runs in parallel with Phase 1) │
Meta-Judge → Pruning Evaluation Specification YAML ───┤
│
Phase 2: Pruning (Vote for Best 3) │
┌─ Judge 1 → Votes + Rationale ─┐ │
├─ Judge 2 → Votes + Rationale ─┼─────────────────────┤
└─ Judge 3 → Votes + Rationale ─┘ │
│ │
├─→ Select Top 3 Proposals │
│ │
Phase 3: Expansion (Develop Full Solutions) │
┌─ Agent A → Solution A (from proposal X) ─┐ │
├─ Agent B → Solution B (from proposal Y) ─┼──────────┤
└─ Agent C → Solution C (from proposal Z) ─┘ │
│
Phase 3.5: Evaluation Meta-Judge (runs in parallel w/ Phase 3)│
Meta-Judge → Evaluation Specification YAML ───────────┤
│
Phase 4: Evaluation (Judge Full Solutions) │
┌─ Judge 1 → Report 1 ─┐ │
├─ Judge 2 → Report 2 ─┼──────────────────────────────┤
└─ Judge 3 → Report 3 ─┘ │
│
Phase 4.5: Adaptive Strategy Selection │
Analyze Consensus ────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (Phase 3) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 5: Synthesis (Only if FULL_SYNTHESIS) │
Synthesizer ────────────────────┴──────────────────────┴─→ Final Solution
Before starting, ensure the directory structure exists:
mkdir -p .specs/research .specs/reports
Naming conventions:
.specs/research/{solution-name}-{YYYY-MM-DD}.proposals.[a|b|c].md.specs/research/{solution-name}-{YYYY-MM-DD}.pruning.[1|2|3].md.specs/research/{solution-name}-{YYYY-MM-DD}.selection.md.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].mdWhere:
{solution-name} - Derived from output path (e.g., users-api from output specs/api/users.md){YYYY-MM-DD} - Current dateNote: Solutions remain in their specified output locations; only research and evaluation files go to .specs/
Launch 3 independent agents in parallel (recommended: Sonnet for speed):
.specs/research/{solution-name}-{date}.proposals.[a|b|c].mdKey principle: Systematic exploration through probabilistic sampling from the full distribution of possible approaches.
Prompt template for explorers:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{.specs/research/{solution-name}-{date}.proposals.[a|b|c].md - each agent gets unique letter identifier}
</output>
Instructions:
Let's approach this systematically by first understanding what we're solving, then exploring the solution space.
**Step 1: Decompose the problem**
Before generating approaches, break down the task:
- What is the core problem being solved?
- What are the key constraints and requirements?
- What subproblems must any solution address?
- What are the evaluation criteria for success?
**Step 2: Map the solution space**
Identify the major dimensions along which solutions can vary:
- Architecture patterns (e.g., monolithic vs distributed)
- Implementation strategies (e.g., eager vs lazy)
- Trade-off axes (e.g., performance vs simplicity)
**Step 3: Generate 6 distinct high-level approaches**
**Sampling guidance:**
Please sample approaches at random from the [full distribution / tails of the distribution]
- For first 3 approaches aim for high probability, over 0.80
- For last 3 approaches aim for diversity - explore different regions of the solution space, such that the probability of each response is less than 0.10
For each approach, provide:
- Name and one-sentence summary
- Detailed description (2-3 paragraphs)
- Key design decisions and rationale
- Trade-offs (what you gain vs what you sacrifice)
- Probability (0.0-1.0)
- Complexity estimate (low/medium/high)
- Potential risks and failure modes
**Step 4: Verify diversity**
Before finalizing, check:
- Are approaches genuinely different, not minor variations?
- Do they span different regions of the solution space?
- Have you covered both conventional and unconventional options?
CRITICAL:
- Do NOT implement full solutions yet - only high-level approaches
- Ensure approaches are genuinely different, not minor variations
CRITICAL : Launch the pruning meta-judge in parallel with Phase 1 exploration agents. The meta-judge does not need exploration output to generate pruning criteria — it only needs the original task description.
The pruning meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating high-level proposals for pruning.
Prompt template for pruning meta-judge:
## Task
Generate an evaluation specification yaml for pruning high-level solution proposals. You will produce rubrics, checklists, and scoring criteria that judge agents will use to select the top 3 proposals for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
proposals (high-level approaches with probability estimates, not full implementations)
## Evaluation Focus
Feasibility, alignment with requirements, potential for high-quality result, risk manageability
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation and ranking of proposals.
Dispatch:
Use Task tool:
- description: "Pruning Meta-judge: {brief task summary}"
- prompt: {pruning meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for BOTH Phase 1 exploration agents AND Phase 1.5 pruning meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
.specs/research/) and the pruning meta-judge evaluation specification YAML.specs/research/{solution-name}-{date}.pruning.[1|2|3].mdKey principle: Independent evaluation with meta-judge-generated criteria ensures consistent, tailored assessment without hardcoded weights.
CRITICAL: Provide to each judge the EXACT pruning meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Prompt template for pruning judges:
You are evaluating {N} proposed approaches against an evaluation specification produced by the meta judge, to select the top 3 for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Proposals
{list of paths to all proposal files}
Read all proposals carefully before evaluating.
## Evaluation Specification
```yaml
{pruning meta-judge's evaluation specification YAML}
{.specs/research/{solution-name}-{date}.pruning.[1|2|3].md}
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Pruning Judge {1|2|3}: {brief task summary}"
prompt: {pruning judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
After judges complete voting:
.specs/research/{solution-name}-{date}.selection.md:
Launch 3 independent agents in parallel (recommended: Opus for quality):
CRITICAL : Launch the evaluation meta-judge in parallel with Phase 3 expansion agents. The meta-judge does not need expansion output to generate evaluation criteria — it only needs the original task description.
The evaluation meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating full solution implementations.
Prompt template for evaluation meta-judge:
## Task
Generate an evaluation specification yaml for evaluating full solution implementations. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementations.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (full implementations developed from selected proposals)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
Dispatch:
Use Task tool:
- description: "Evaluation Meta-judge: {brief task summary}"
- prompt: {evaluation meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for BOTH Phase 3 expansion agents AND Phase 3.5 evaluation meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
.specs/reports/{solution-name}-{date}.[1|2|3].mdKey principle: Multiple independent evaluations with meta-judge-generated specifications and explicit evidence reduce bias and catch different quality aspects.
CRITICAL: Provide to each judge the EXACT evaluation meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
CRITICAL: NEVER provide score threshold to judges. Judge MUST not know what threshold for score is, in order to not be biased!!!
Prompt template for evaluation judges:
You are evaluating {number} full solutions against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all solution files}
Read all solutions carefully before evaluating.
## Evaluation Specification
```yaml
{evaluation meta-judge's evaluation specification YAML}
Write full report to: .specs/reports/{solution-name}-{date}.[1|2|3].md
CRITICAL: You must reply with this exact structured header format:
VOTE: [Solution A/B/C] SCORES: Solution A: [X.X]/5.0 Solution B: [X.X]/5.0 Solution C: [X.X]/5.0 CRITERIA:
[Summary of your evaluation]
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Evaluation Judge {1|2|3}: {brief task summary}"
prompt: {evaluation judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Step 1: Parse structured headers from judge reply
Parse the judges reply. CRITICAL: Do not read report files themselves, as they can overflow your context.
Step 2: Check for unanimous winner
Compare all three VOTE values:
Step 3: Check if all solutions are fundamentally flawed
If no unanimous vote, calculate average scores:
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0) → REDESIGN (return to Phase 3)
Strategy 2: REDESIGN

When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
Note: If redesign fails twice, escalate to user for guidance.
Prompt template for new implementation:
You are analyzing why all solutions failed to meet quality standards in order to inform a redesign, and then implementing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all solution files}
Average scores: A={avg_a}/5.0, B={avg_b}/5.0, C={avg_c}/5.0
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
All solutions scored below 3.0/5.0 threshold.
</evaluation_reports>
<output>
.specs/research/{solution-name}-{date}.redesign-analysis.md
</output>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
Strategy 3: FULL_SYNTHESIS

When: No clear winner AND solutions have merit (scores >=3.0)
Process: Proceed to Phase 5 (Evidence-Based Synthesis)
Only executed when Strategy 3 (FULL_SYNTHESIS) is selected in Phase 4.5
Launch 1 synthesis agent (recommended: Opus for quality):
Inputs: all solution files, evaluation reports (.specs/reports/), and research files (.specs/research/)

Key principle: Evidence-based synthesis leverages collective intelligence from exploration and evaluation.
Prompt template for synthesizer:
You are synthesizing the best solution from explored, pruned, and evaluated implementations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all solution files}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<selection_rationale>
{path to selection.md explaining why these proposals were chosen}
</selection_rationale>
<output>
{output_path} - The final synthesized solution
</output>
Instructions:
Let's approach this synthesis systematically by first analyzing, then decomposing, then building.
**Step 1: Build the evidence base**
Before synthesizing, gather evidence from judge reports:
- What did multiple judges praise? (consensus strengths)
- What did multiple judges criticize? (consensus weaknesses)
- Where did judges disagree? (areas needing careful analysis)
**Step 2: Decompose into synthesis subproblems**
Break the solution into logical sections or components. For each component:
- Which solution handles this best? (cite evidence)
- Are there complementary elements from multiple solutions?
- What issues were identified that need fixing?
**Step 3: Solve each subproblem**
For each component/section, determine the synthesis strategy:
*Strategy A - Clear winner:* If one solution is clearly superior for this component:
- Copy that section directly
- Document: "Taken from Solution X because [judge evidence]"
*Strategy B - Complementary combination:* If solutions have complementary strengths:
- Identify what each contributes
- Combine carefully, ensuring consistency
- Document: "Combined X from Solution A with Y from Solution B because [rationale]"
*Strategy C - All flawed:* If all solutions have issues in this area:
- Start with the best version
- Apply fixes based on judge criticism
- Document: "Based on Solution X, modified to address [specific issues]"
**Step 4: Integrate and verify consistency**
After synthesizing all components:
- Check that combined elements work together
- Resolve any contradictions between borrowed sections
- Ensure consistent terminology and style
**Step 5: Document synthesis decisions**
Create a synthesis log:
- What you took from each solution (with specific citations)
- Why you made those choices (reference judge feedback)
- How you addressed identified weaknesses
- Any novel combinations or improvements
<example>
**Example synthesis decision for an API design:**
Component: Authentication flow
- Solution A: JWT with refresh tokens (praised for security by 2/3 judges)
- Solution B: Session-based (praised for simplicity by 1 judge, criticized for scalability)
- Solution C: OAuth2 only (criticized as over-engineered for use case)
Decision: Take Solution A's authentication flow directly.
Evidence: Judges 1 and 3 both noted "JWT approach provides good balance of security and statelessness"
Modification: None needed - this section was rated highest across judges.
</example>
**Step 6: Revise your solution**
- Generate 5 verification questions about critical aspects
- Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
- Revise solution:
- Fix identified issues
- Explain what was changed and why
CRITICAL:
- Do not create something entirely new - synthesize the best from what exists
- Cite your sources (which solution, which section)
- Explain every major decision
- Address all consensus weaknesses identified by judges
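Outside the prompt itself, the per-component strategy choice in Step 3 can be sketched in code. The rating representation below is purely hypothetical (`component_strategy` and the "praise"/"criticism" tags are invented for illustration):

```python
def component_strategy(ratings: dict[str, list[str]]) -> str:
    """Pick a synthesis strategy for one component.

    ratings maps each solution to judge verdicts for this component,
    e.g. {"A": ["praise", "praise"], "B": ["criticism"]}.
    """
    # A solution counts as a consensus strength if >=2 judges praised it here.
    praised = {s for s, r in ratings.items() if r.count("praise") >= 2}
    if len(praised) == 1:
        return "A: copy the clear winner's section directly"
    if len(praised) > 1:
        return "B: combine complementary strengths, ensuring consistency"
    return "C: start from the best version and fix judge criticisms"
```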
Research directory: .specs/research/ (created if not exists)
- .specs/research/{solution-name}-{date}.proposals.[a|b|c].md - High-level approaches with probabilities
- .specs/research/{solution-name}-{date}.pruning.[1|2|3].md - Judge evaluations and votes
- .specs/research/{solution-name}-{date}.selection.md - Vote tallies and selected proposals

Expansion outputs:
- solution.a.md, solution.b.md, solution.c.md - Full implementations (in specified output location)

/tree-of-thoughts "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
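The naming convention can be sketched as a small helper. One plausible reading of the derivation rule (specs/api/users.md → users-api, i.e. file stem plus parent directory) is assumed here; `artifact_paths` is a name invented for illustration:

```python
from datetime import date
from pathlib import Path

def artifact_paths(output_path: str, today: date) -> dict:
    """Derive research and report file paths from the solution output path."""
    p = Path(output_path)
    # Assumed rule: "specs/api/users.md" -> stem "users" + parent "api" -> "users-api"
    name = f"{p.stem}-{p.parent.name}"
    d = today.isoformat()  # YYYY-MM-DD
    return {
        "proposals": [f".specs/research/{name}-{d}.proposals.{x}.md" for x in "abc"],
        "pruning": [f".specs/research/{name}-{d}.pruning.{i}.md" for i in (1, 2, 3)],
        "selection": f".specs/research/{name}-{d}.selection.md",
        "reports": [f".specs/reports/{name}-{d}.{i}.md" for i in (1, 2, 3)],
    }
```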
Phase 1 outputs (assuming date 2025-01-15):
- .specs/research/users-api-2025-01-15.proposals.a.md - 6 approaches from Agent A
- .specs/research/users-api-2025-01-15.proposals.b.md - 6 approaches from Agent B
- .specs/research/users-api-2025-01-15.proposals.c.md - 6 approaches from Agent C

Phase 1.5 output (runs in parallel with Phase 1):
- Meta-judge (sadd:meta-judge) generates pruning evaluation specification YAML

Phase 2 outputs (3 judges with pruning meta-judge spec):
- .specs/research/users-api-2025-01-15.pruning.1.md - Top 3: Resource-based REST, Pure REST, Monolithic
- .specs/research/users-api-2025-01-15.pruning.2.md - Top 3: Pure REST, Hybrid (services), Resource-based REST
- .specs/research/users-api-2025-01-15.pruning.3.md - Top 3: Resource-based REST, REST+GraphQL hybrid, Pure REST
- .specs/research/users-api-2025-01-15.selection.md - Selected: Resource-based REST (8 pts), Pure REST (7 pts), Monolithic (4 pts)

Phase 3 outputs:
- specs/api/users.a.md - Full resource-based design with nested routes
- specs/api/users.b.md - Flat REST design with simple endpoints
- specs/api/users.c.md - Monolithic API with service-oriented internals

Phase 3.5 output (runs in parallel with Phase 3):
- Meta-judge (sadd:meta-judge) generates evaluation specification YAML

Phase 4 outputs (3 judges with evaluation meta-judge spec):
.specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.2/5.0, B=3.8/5.0, C=3.4/5.0
"Prefers A for RESTfulness, criticizes C complexity"
.specs/reports/users-api-2025-01-15.2.md:
VOTE: Solution B
SCORES: A=3.9/5.0, B=4.1/5.0, C=3.5/5.0
"Prefers B for simplicity, criticizes A deep nesting"
.specs/reports/users-api-2025-01-15.3.md:
VOTE: Solution A
SCORES: A=4.3/5.0, B=3.6/5.0, C=3.2/5.0
"Prefers A for discoverability, criticizes B lack of structure"
Phase 4.5 decision (orchestrator parses headers):
- Votes: A, B, A - no unanimous winner; all average scores >= 3.0 → FULL_SYNTHESIS
Phase 5 output (synthesis):
- specs/api/users.md - Resource-based structure (from A), max 2-level nesting (from B), internal services (from C)
solution.a.md, solution.b.md, solution.c.md

Key principle: Focused development of validated approaches with awareness of evaluation feedback.
Prompt template for expansion agents:
You are developing a full solution based on a selected proposal.
<task>
{task_description}
</task>
<selected_proposal>
{write selected proposal EXACTLY as it is. Including all details provided by the agent}
Read this carefully - it is your starting point.
</selected_proposal>
<judge_feedback>
{concerns and questions from judges about this proposal}
Address these in your implementation.
</judge_feedback>
<output>
solution.[*].md where [*] is your unique identifier (a, b, or c)
</output>
Instructions:
Let's work through this systematically to ensure we build a complete, high-quality solution.
**Step 1: Understand the proposal deeply**
Before implementing, analyze:
- What is the core insight or approach of this proposal?
- What are the key design decisions already made?
- What gaps need to be filled for a complete solution?
**Step 2: Address judge feedback**
For each concern raised by judges:
- What specific change or addition addresses this concern?
- How does this change integrate with the proposal's approach?
**Step 3: Decompose into implementation subproblems**
Break the solution into logical parts:
- What are the main components or sections?
- What must be defined first for other parts to build upon?
- What are the dependencies between parts?
**Step 4: Implement each subproblem**
For each component, work through:
- Core functionality and behavior
- Edge cases and error handling
- Integration points with other components
**Step 5: Self-verification**
Generate 3-5 verification questions about critical aspects, then answer them:
- Review solution against each question
- Identify gaps or weaknesses
- Fix identified issues
**Step 6: Document changes**
Explain what was changed from the original proposal and why.
<example>
**Example of good expansion thinking:**
Proposal: "Use event-driven architecture with message queue"
Step 1 Analysis:
- Core insight: Decouple components via async messaging
- Key decisions: Events as primary communication, eventual consistency
- Gaps: Need to define event schemas, queue technology, error handling
Step 2 - Addressing judge concern "What about message ordering?":
- Add partition keys for ordered processing within entity scope
- Document ordering guarantees and limitations
Step 3 - Subproblems:
1. Event schema definitions (foundational - others depend on this)
2. Producer interfaces (depends on schemas)
3. Consumer handlers (depends on schemas)
4. Error handling and dead letter queues (depends on both)
5. Integration patterns (builds on all above)
</example>
CRITICAL:
- Stay faithful to the selected proposal's core approach
- Do not switch to a different approach midway
- Address judge feedback explicitly
- Produce a complete, implementable solution
Step 4: Default to full synthesis
If none of the above conditions are met, select FULL_SYNTHESIS.
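The decision rules above can be sketched as a single function (`select_strategy` is a name invented for illustration; inputs come from the parsed judge headers):

```python
def select_strategy(votes: list[str], scores: dict[str, list[float]]) -> str:
    """Apply the Phase 4.5 decision rules.

    votes:  one VOTE value per judge, e.g. ["A", "B", "A"]
    scores: per-solution score lists across judges, e.g. {"A": [4.2, 3.9, 4.3], ...}
    """
    if len(set(votes)) == 1:  # unanimous winner
        return "SELECT_AND_POLISH"
    averages = {s: sum(v) / len(v) for s, v in scores.items()}
    if all(avg < 3.0 for avg in averages.values()):  # all fundamentally flawed
        return "REDESIGN"
    return "FULL_SYNTHESIS"  # split decision, but solutions have merit
```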
Strategy 1: SELECT_AND_POLISH

When: Clear winner (unanimous votes)
Process:
Benefits:
Prompt template:
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's approach this polishing task methodically to improve without disrupting what works.
**Step 1: Understand why this solution won**
Analyze the winning solution:
- What are its core strengths that judges praised?
- What makes its approach superior to alternatives?
- Which parts should remain untouched?
**Step 2: Catalog improvement opportunities**
From judge feedback, identify:
- Specific weaknesses mentioned (list each one)
- Missing elements judges noted
- Areas where runner-ups were praised
**Step 3: Prioritize changes by impact**
For each improvement opportunity:
- High impact: Directly addresses judge criticism
- Medium impact: Adds praised element from runner-up
- Low impact: Nice-to-have refinement
Focus on high-impact changes first.
**Step 4: Apply improvements surgically**
For each change:
- Locate the specific section to modify
- Make the minimal change needed to address the issue
- Verify the change integrates cleanly with surrounding content
**Step 5: Cherry-pick from runners-up**
Review runner-up solutions for:
- 1-2 specific elements that judges praised
- Elements that complement (not conflict with) the winning approach
- Only incorporate if clearly superior to winning solution's version
**Step 6: Document all changes**
Record:
- What was changed and why (with reference to judge feedback)
- What was added from other solutions (cite source)
- What was intentionally left unchanged
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
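The Step 3 prioritization in the polishing template can be sketched outside the prompt itself. The item shape below is hypothetical (`prioritize` and the `"impact"` field are invented for illustration):

```python
def prioritize(improvements: list[dict]) -> list[dict]:
    """Order improvement opportunities high -> medium -> low impact,
    so high-impact changes (direct judge criticisms) are applied first."""
    rank = {"high": 0, "medium": 1, "low": 2}
    return sorted(improvements, key=lambda item: rank[item["impact"]])
```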
Reports directory: .specs/reports/ (created if not exists)
- .specs/reports/{solution-name}-{date}.[1|2|3].md - Final judge reports

Resulting solution: {output_path}