sadd:tree-of-thoughts by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:tree-of-thoughts

Key benefits:

This command implements an eight-phase systematic reasoning pattern with meta-judge evaluation and adaptive strategy selection:
Phase 1: Exploration (Propose Approaches)
┌─ Agent A → Proposals A1, A2 (with probabilities) ─┐
Task ───┼─ Agent B → Proposals B1, B2 (with probabilities) ─┼─┐
└─ Agent C → Proposals C1, C2 (with probabilities) ─┘ │
│
Phase 1.5: Pruning Meta-Judge (runs in parallel with Phase 1) │
Meta-Judge → Pruning Evaluation Specification YAML ───┤
│
Phase 2: Pruning (Vote for Best 3) │
┌─ Judge 1 → Votes + Rationale ─┐ │
├─ Judge 2 → Votes + Rationale ─┼─────────────────────┤
└─ Judge 3 → Votes + Rationale ─┘ │
│ │
├─→ Select Top 3 Proposals │
│ │
Phase 3: Expansion (Develop Full Solutions) │
┌─ Agent A → Solution A (from proposal X) ─┐ │
├─ Agent B → Solution B (from proposal Y) ─┼──────────┤
└─ Agent C → Solution C (from proposal Z) ─┘ │
│
Phase 3.5: Evaluation Meta-Judge (runs in parallel w/ Phase 3)│
Meta-Judge → Evaluation Specification YAML ───────────┤
│
Phase 4: Evaluation (Judge Full Solutions) │
┌─ Judge 1 → Report 1 ─┐ │
├─ Judge 2 → Report 2 ─┼──────────────────────────────┤
└─ Judge 3 → Report 3 ─┘ │
│
Phase 4.5: Adaptive Strategy Selection │
Analyze Consensus ────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (Phase 3) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 5: Synthesis (Only if FULL_SYNTHESIS) │
Synthesizer ────────────────────┴──────────────────────┴─→ Final Solution
Before starting, ensure the directory structure exists:
mkdir -p .specs/research .specs/reports
Naming conventions:

- .specs/research/{solution-name}-{YYYY-MM-DD}.proposals.[a|b|c].md
- .specs/research/{solution-name}-{YYYY-MM-DD}.pruning.[1|2|3].md
- .specs/research/{solution-name}-{YYYY-MM-DD}.selection.md
- .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md

Where:

- {solution-name} - derived from the output path (e.g., users-api from the output specs/api/users.md)
- {YYYY-MM-DD} - the current date

Note: Solutions remain in their specified output locations; only research and evaluation files go to .specs/
Launch 3 independent agents in parallel (recommended: Sonnet for speed):

Output: .specs/research/{solution-name}-{date}.proposals.[a|b|c].md

Key principle: Systematic exploration through probabilistic sampling from the full distribution of possible approaches.

Prompt template for explorers:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{.specs/research/{solution-name}-{date}.proposals.[a|b|c].md - each agent gets unique letter identifier}
</output>
Instructions:
Let's approach this systematically by first understanding what we're solving, then exploring the solution space.
**Step 1: Decompose the problem**
Before generating approaches, break down the task:
- What is the core problem being solved?
- What are the key constraints and requirements?
- What subproblems must any solution address?
- What are the evaluation criteria for success?
**Step 2: Map the solution space**
Identify the major dimensions along which solutions can vary:
- Architecture patterns (e.g., monolithic vs distributed)
- Implementation strategies (e.g., eager vs lazy)
- Trade-off axes (e.g., performance vs simplicity)
**Step 3: Generate 6 distinct high-level approaches**
**Sampling guidance:**
Please sample approaches at random from the [full distribution / tails of the distribution]
- For first 3 approaches aim for high probability, over 0.80
- For last 3 approaches aim for diversity - explore different regions of the solution space, such that the probability of each response is less than 0.10
For each approach, provide:
- Name and one-sentence summary
- Detailed description (2-3 paragraphs)
- Key design decisions and rationale
- Trade-offs (what you gain vs what you sacrifice)
- Probability (0.0-1.0)
- Complexity estimate (low/medium/high)
- Potential risks and failure modes
**Step 4: Verify diversity**
Before finalizing, check:
- Are approaches genuinely different, not minor variations?
- Do they span different regions of the solution space?
- Have you covered both conventional and unconventional options?
CRITICAL:
- Do NOT implement full solutions yet - only high-level approaches
- Ensure approaches are genuinely different, not minor variations
CRITICAL: Launch the pruning meta-judge in parallel with the Phase 1 exploration agents. The meta-judge does not need the exploration output to generate pruning criteria; it only needs the original task description.
The pruning meta-judge generates an evaluation specification (rubrics, checklists, scoring criteria) tailored to evaluating high-level proposals for pruning.
Prompt template for the pruning meta-judge:
## Task
Generate an evaluation specification yaml for pruning high-level solution proposals. You will produce rubrics, checklists, and scoring criteria that judge agents will use to select the top 3 proposals for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
proposals (high-level approaches with probability estimates, not full implementations)
## Evaluation Focus
Feasibility, alignment with requirements, potential for high-quality result, risk manageability
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation and ranking of proposals.
Dispatch:
Use Task tool:
- description: "Pruning Meta-judge: {brief task summary}"
- prompt: {pruning meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for both the Phase 1 exploration agents and the Phase 1.5 pruning meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):

Input: all proposal files (.specs/research/) and the pruning meta-judge's evaluation specification YAML

Output: .specs/research/{solution-name}-{date}.pruning.[1|2|3].md

Key principle: Independent evaluation with meta-judge-generated criteria ensures consistent, tailored assessment without hardcoded weights.

CRITICAL: Provide each judge with the EXACT same pruning meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Prompt template for pruning judges:
You are evaluating {N} proposed approaches against an evaluation specification produced by the meta judge, to select the top 3 for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Proposals
{list of paths to all proposal files}
Read all proposals carefully before evaluating.
## Evaluation Specification
```yaml
{pruning meta-judge's evaluation specification YAML}
```

Write full report to: .specs/research/{solution-name}-{date}.pruning.[1|2|3].md
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Pruning Judge {1|2|3}: {brief task summary}"
prompt: {pruning judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
After judges complete voting:
Tally the votes and record the selected top 3 in .specs/research/{solution-name}-{date}.selection.md:
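One plausible sketch of how the orchestrator could tally the three judges' ranked votes into selection.md. The 3/2/1 point weighting is assumed for illustration; the source only requires that votes be tallied and the top 3 proposals kept:

```python
from collections import Counter

# Each judge returns an ordered top-3 list of proposal names.
# 3 points for 1st choice, 2 for 2nd, 1 for 3rd (illustrative assumption).
WEIGHTS = (3, 2, 1)

def tally(judge_rankings: list[list[str]], top_n: int = 3) -> list[tuple[str, int]]:
    scores: Counter[str] = Counter()
    for ranking in judge_rankings:
        for points, proposal in zip(WEIGHTS, ranking):
            scores[proposal] += points
    return scores.most_common(top_n)

# Hypothetical rankings shaped like the REST API example later in this document:
rankings = [
    ["resource-rest", "pure-rest", "monolithic"],
    ["pure-rest", "hybrid-services", "resource-rest"],
    ["resource-rest", "rest-graphql", "pure-rest"],
]
# resource-rest: 3+1+3 = 7 points; pure-rest: 2+3+1 = 6 points
```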
Launch 3 independent agents in parallel (recommended: Opus for quality):
Output: solution.a.md, solution.b.md, solution.c.md

Key principle: Focused development of validated approaches with awareness of evaluation feedback.
Prompt template for expansion agents:
You are developing a full solution based on a selected proposal.
<task>
{task_description}
</task>
<selected_proposal>
{write selected proposal EXACTLY as it is. Including all details provided by the agent}
Read this carefully - it is your starting point.
</selected_proposal>
<judge_feedback>
{concerns and questions from judges about this proposal}
Address these in your implementation.
</judge_feedback>
<output>
solution.[*].md where [*] is your unique identifier (a, b, or c)
</output>
Instructions:
Let's work through this systematically to ensure we build a complete, high-quality solution.
**Step 1: Understand the proposal deeply**
Before implementing, analyze:
- What is the core insight or approach of this proposal?
- What are the key design decisions already made?
- What gaps need to be filled for a complete solution?
**Step 2: Address judge feedback**
For each concern raised by judges:
- What specific change or addition addresses this concern?
- How does this change integrate with the proposal's approach?
**Step 3: Decompose into implementation subproblems**
Break the solution into logical parts:
- What are the main components or sections?
- What must be defined first for other parts to build upon?
- What are the dependencies between parts?
**Step 4: Implement each subproblem**
For each component, work through:
- Core functionality and behavior
- Edge cases and error handling
- Integration points with other components
**Step 5: Self-verification**
Generate 3-5 verification questions about critical aspects, then answer them:
- Review solution against each question
- Identify gaps or weaknesses
- Fix identified issues
**Step 6: Document changes**
Explain what was changed from the original proposal and why.
<example>
**Example of good expansion thinking:**
Proposal: "Use event-driven architecture with message queue"
Step 1 Analysis:
- Core insight: Decouple components via async messaging
- Key decisions: Events as primary communication, eventual consistency
- Gaps: Need to define event schemas, queue technology, error handling
Step 2 - Addressing judge concern "What about message ordering?":
- Add partition keys for ordered processing within entity scope
- Document ordering guarantees and limitations
Step 3 - Subproblems:
1. Event schema definitions (foundational - others depend on this)
2. Producer interfaces (depends on schemas)
3. Consumer handlers (depends on schemas)
4. Error handling and dead letter queues (depends on both)
5. Integration patterns (builds on all above)
</example>
CRITICAL:
- Stay faithful to the selected proposal's core approach
- Do not switch to a different approach midway
- Address judge feedback explicitly
- Produce a complete, implementable solution
CRITICAL: Launch the evaluation meta-judge in parallel with the Phase 3 expansion agents. The meta-judge does not need the expansion output to generate evaluation criteria; it only needs the original task description.
The evaluation meta-judge generates an evaluation specification (rubrics, checklists, scoring criteria) tailored to evaluating full solution implementations.
Prompt template for the evaluation meta-judge:
## Task
Generate an evaluation specification yaml for evaluating full solution implementations. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementations.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (full implementations developed from selected proposals)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
Dispatch:
Use Task tool:
- description: "Evaluation Meta-judge: {brief task summary}"
- prompt: {evaluation meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for both the Phase 3 expansion agents and the Phase 3.5 evaluation meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):

Output: .specs/reports/{solution-name}-{date}.[1|2|3].md

Key principle: Multiple independent evaluations with meta-judge-generated specifications and explicit evidence reduce bias and catch different quality aspects.

CRITICAL: Provide each judge with the EXACT same evaluation meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
CRITICAL: NEVER provide the score threshold to judges. Judges must not know what the threshold is, to avoid biasing their scores!
Prompt template for evaluation judges:
You are evaluating {number} full solutions against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all solution files}
Read all solutions carefully before evaluating.
## Evaluation Specification
```yaml
{evaluation meta-judge's evaluation specification YAML}
```

Write full report to: .specs/reports/{solution-name}-{date}.[1|2|3].md
CRITICAL: You must reply with this exact structured header format:
VOTE: [Solution A/B/C]
SCORES:
Solution A: [X.X]/5.0
Solution B: [X.X]/5.0
Solution C: [X.X]/5.0
CRITERIA:
[Summary of your evaluation]
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Evaluation Judge {1|2|3}: {brief task summary}"
prompt: {evaluation judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Step 1: Parse structured headers from judge replies
Parse the judges' replies. CRITICAL: Do not read the report files themselves, as they can overflow your context.
Step 2: Check for unanimous winner
Compare all three VOTE values. If all three judges voted for the same solution → SELECT_AND_POLISH.
Step 3: Check if all solutions are fundamentally flawed
If there is no unanimous vote, calculate average scores.
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0) → REDESIGN (return to Phase 3).
Step 4: Default to full synthesis
If none of the above conditions are met → FULL_SYNTHESIS (Phase 5).
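The three checks can be sketched as orchestrator logic. The header-parsing regexes assume the exact VOTE/SCORES format shown earlier and are illustrative, not part of the command:

```python
import re

def parse_header(reply: str) -> tuple[str, dict[str, float]]:
    """Parse VOTE and SCORES from a judge's structured header, e.g.:
    VOTE: Solution A
    SCORES: Solution A: 4.2/5.0 Solution B: 3.8/5.0 Solution C: 2.9/5.0
    """
    vote = re.search(r"VOTE:\s*\[?Solution\s+([ABC])", reply).group(1)
    scores = {m.group(1): float(m.group(2))
              for m in re.finditer(r"Solution\s+([ABC]):\s*([\d.]+)/5\.0", reply)}
    return vote, scores

def select_strategy(replies: list[str]) -> str:
    votes, all_scores = [], []
    for reply in replies:
        vote, scores = parse_header(reply)
        votes.append(vote)
        all_scores.append(scores)
    if len(set(votes)) == 1:                      # Step 2: unanimous winner
        return "SELECT_AND_POLISH"
    avgs = {s: sum(sc[s] for sc in all_scores) / len(all_scores)
            for s in ("A", "B", "C")}
    if all(avg < 3.0 for avg in avgs.values()):   # Step 3: all flawed
        return "REDESIGN"
    return "FULL_SYNTHESIS"                       # Step 4: default
```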
When: Clear winner (unanimous votes)
Process:
Benefits:
Prompt template:
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's approach this polishing task methodically to improve without disrupting what works.
**Step 1: Understand why this solution won**
Analyze the winning solution:
- What are its core strengths that judges praised?
- What makes its approach superior to alternatives?
- Which parts should remain untouched?
**Step 2: Catalog improvement opportunities**
From judge feedback, identify:
- Specific weaknesses mentioned (list each one)
- Missing elements judges noted
- Areas where runner-ups were praised
**Step 3: Prioritize changes by impact**
For each improvement opportunity:
- High impact: Directly addresses judge criticism
- Medium impact: Adds praised element from runner-up
- Low impact: Nice-to-have refinement
Focus on high-impact changes first.
**Step 4: Apply improvements surgically**
For each change:
- Locate the specific section to modify
- Make the minimal change needed to address the issue
- Verify the change integrates cleanly with surrounding content
**Step 5: Cherry-pick from runners-up**
Review runner-up solutions for:
- 1-2 specific elements that judges praised
- Elements that complement (not conflict with) the winning approach
- Only incorporate if clearly superior to winning solution's version
**Step 6: Document all changes**
Record:
- What was changed and why (with reference to judge feedback)
- What was added from other solutions (cite source)
- What was intentionally left unchanged
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
Note: If the redesign fails twice, escalate to the user for guidance.
Prompt template for the new implementation:
You are analyzing why all solutions failed to meet quality standards in order to inform a redesign, and you will implement a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all solution files}
Average scores: A={avg_a}/5.0, B={avg_b}/5.0, C={avg_c}/5.0
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
All solutions scored below 3.0/5.0 threshold.
</evaluation_reports>
<output>
.specs/research/{solution-name}-{date}.redesign-analysis.md
</output>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
When: No clear winner, and the solutions have merit (scores >= 3.0)
Process: Proceed to Phase 5 (evidence-based synthesis)
Executed only if Strategy 3 (FULL_SYNTHESIS) was selected in Phase 4.5
Launch 1 synthesis agent (recommended: Opus for quality):

Input: all evaluation reports (.specs/reports/) and research files (.specs/research/)

Key principle: Evidence-based synthesis leverages the collective intelligence produced during exploration and evaluation.

Prompt template for the synthesizer:
You are synthesizing the best solution from explored, pruned, and evaluated implementations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all solution files}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<selection_rationale>
{path to selection.md explaining why these proposals were chosen}
</selection_rationale>
<output>
{output_path} - The final synthesized solution
</output>
Instructions:
Let's approach this synthesis systematically by first analyzing, then decomposing, then building.
**Step 1: Build the evidence base**
Before synthesizing, gather evidence from judge reports:
- What did multiple judges praise? (consensus strengths)
- What did multiple judges criticize? (consensus weaknesses)
- Where did judges disagree? (areas needing careful analysis)
**Step 2: Decompose into synthesis subproblems**
Break the solution into logical sections or components. For each component:
- Which solution handles this best? (cite evidence)
- Are there complementary elements from multiple solutions?
- What issues were identified that need fixing?
**Step 3: Solve each subproblem**
For each component/section, determine the synthesis strategy:
*Strategy A - Clear winner:* If one solution is clearly superior for this component:
- Copy that section directly
- Document: "Taken from Solution X because [judge evidence]"
*Strategy B - Complementary combination:* If solutions have complementary strengths:
- Identify what each contributes
- Combine carefully, ensuring consistency
- Document: "Combined X from Solution A with Y from Solution B because [rationale]"
*Strategy C - All flawed:* If all solutions have issues in this area:
- Start with the best version
- Apply fixes based on judge criticism
- Document: "Based on Solution X, modified to address [specific issues]"
**Step 4: Integrate and verify consistency**
After synthesizing all components:
- Check that combined elements work together
- Resolve any contradictions between borrowed sections
- Ensure consistent terminology and style
**Step 5: Document synthesis decisions**
Create a synthesis log:
- What you took from each solution (with specific citations)
- Why you made those choices (reference judge feedback)
- How you addressed identified weaknesses
- Any novel combinations or improvements
<example>
**Example synthesis decision for an API design:**
Component: Authentication flow
- Solution A: JWT with refresh tokens (praised for security by 2/3 judges)
- Solution B: Session-based (praised for simplicity by 1 judge, criticized for scalability)
- Solution C: OAuth2 only (criticized as over-engineered for use case)
Decision: Take Solution A's authentication flow directly.
Evidence: Judges 1 and 3 both noted "JWT approach provides good balance of security and statelessness"
Modification: None needed - this section was rated highest across judges.
</example>
**Step 6: Revise your solution**
- Generate 5 verification questions about critical aspects
- Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
- Revise solution:
- Fix identified issues
- Explain what was changed and why
CRITICAL:
- Do not create something entirely new - synthesize the best from what exists
- Cite your sources (which solution, which section)
- Explain every major decision
- Address all consensus weaknesses identified by judges
.specs/research/ (created if it does not exist)

- .specs/research/{solution-name}-{date}.proposals.[a|b|c].md - high-level approaches with probabilities
- .specs/research/{solution-name}-{date}.pruning.[1|2|3].md - judge evaluations and votes
- .specs/research/{solution-name}-{date}.selection.md - vote tallies and the selected proposals
- solution.a.md, solution.b.md, solution.c.md - full implementations (at the specified output location)

.specs/reports/ (created if it does not exist)

- .specs/reports/{solution-name}-{date}.[1|2|3].md - final judge reports
- {output_path}

Example invocation:

/tree-of-thoughts "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
Phase 1 outputs (assuming date 2025-01-15):

- .specs/research/users-api-2025-01-15.proposals.a.md - 6 approaches from Agent A
- .specs/research/users-api-2025-01-15.proposals.b.md - 6 approaches from Agent B
- .specs/research/users-api-2025-01-15.proposals.c.md - 6 approaches from Agent C

Phase 1.5 outputs (run in parallel with Phase 1):

- Meta-judge (sadd:meta-judge) generates the pruning evaluation specification YAML

Phase 2 outputs (3 judges using the pruning meta-judge specification):

- .specs/research/users-api-2025-01-15.pruning.1.md - Top 3: Resource-based REST, Pure REST, Monolithic
- .specs/research/users-api-2025-01-15.pruning.2.md - Top 3: Pure REST, Hybrid (services), Resource-based REST
- .specs/research/users-api-2025-01-15.pruning.3.md - Top 3: Resource-based REST, REST+GraphQL hybrid, Pure REST
- .specs/research/users-api-2025-01-15.selection.md - Selected: Resource-based REST (8 points), Pure REST (7 points), Monolithic (4 points)

Phase 3 outputs:

- specs/api/users.a.md - full resource-based design with nested routes
- specs/api/users.b.md - flat REST design with simple endpoints
- specs/api/users.c.md - monolithic API with internal service orientation

Phase 3.5 outputs (run in parallel with Phase 3):

- Meta-judge (sadd:meta-judge) generates the evaluation specification YAML

Phase 4 outputs (3 judges using the evaluation meta-judge specification):

- .specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.2/5.0, B=3.8/5.0, C
Key benefits:
This command implements an eight-phase systematic reasoning pattern with meta-judge evaluation and adaptive strategy selection:
Phase 1: Exploration (Propose Approaches)
┌─ Agent A → Proposals A1, A2 (with probabilities) ─┐
Task ───┼─ Agent B → Proposals B1, B2 (with probabilities) ─┼─┐
└─ Agent C → Proposals C1, C2 (with probabilities) ─┘ │
│
Phase 1.5: Pruning Meta-Judge (runs in parallel with Phase 1) │
Meta-Judge → Pruning Evaluation Specification YAML ───┤
│
Phase 2: Pruning (Vote for Best 3) │
┌─ Judge 1 → Votes + Rationale ─┐ │
├─ Judge 2 → Votes + Rationale ─┼─────────────────────┤
└─ Judge 3 → Votes + Rationale ─┘ │
│ │
├─→ Select Top 3 Proposals │
│ │
Phase 3: Expansion (Develop Full Solutions) │
┌─ Agent A → Solution A (from proposal X) ─┐ │
├─ Agent B → Solution B (from proposal Y) ─┼──────────┤
└─ Agent C → Solution C (from proposal Z) ─┘ │
│
Phase 3.5: Evaluation Meta-Judge (runs in parallel w/ Phase 3)│
Meta-Judge → Evaluation Specification YAML ───────────┤
│
Phase 4: Evaluation (Judge Full Solutions) │
┌─ Judge 1 → Report 1 ─┐ │
├─ Judge 2 → Report 2 ─┼──────────────────────────────┤
└─ Judge 3 → Report 3 ─┘ │
│
Phase 4.5: Adaptive Strategy Selection │
Analyze Consensus ────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (Phase 3) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 5: Synthesis (Only if FULL_SYNTHESIS) │
Synthesizer ────────────────────┴──────────────────────┴─→ Final Solution
Before starting, ensure the directory structure exists:
mkdir -p .specs/research .specs/reports
Naming conventions:
.specs/research/{solution-name}-{YYYY-MM-DD}.proposals.[a|b|c].md.specs/research/{solution-name}-{YYYY-MM-DD}.pruning.[1|2|3].md.specs/research/{solution-name}-{YYYY-MM-DD}.selection.md.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].mdWhere:
{solution-name} - Derived from output path (e.g., users-api from output specs/api/users.md){YYYY-MM-DD} - Current dateNote: Solutions remain in their specified output locations; only research and evaluation files go to .specs/
Launch 3 independent agents in parallel (recommended: Sonnet for speed):
.specs/research/{solution-name}-{date}.proposals.[a|b|c].mdKey principle: Systematic exploration through probabilistic sampling from the full distribution of possible approaches.
Prompt template for explorers:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{.specs/research/{solution-name}-{date}.proposals.[a|b|c].md - each agent gets unique letter identifier}
</output>
Instructions:
Let's approach this systematically by first understanding what we're solving, then exploring the solution space.
**Step 1: Decompose the problem**
Before generating approaches, break down the task:
- What is the core problem being solved?
- What are the key constraints and requirements?
- What subproblems must any solution address?
- What are the evaluation criteria for success?
**Step 2: Map the solution space**
Identify the major dimensions along which solutions can vary:
- Architecture patterns (e.g., monolithic vs distributed)
- Implementation strategies (e.g., eager vs lazy)
- Trade-off axes (e.g., performance vs simplicity)
**Step 3: Generate 6 distinct high-level approaches**
**Sampling guidance:**
Please sample approaches at random from the [full distribution / tails of the distribution]
- For first 3 approaches aim for high probability, over 0.80
- For last 3 approaches aim for diversity - explore different regions of the solution space, such that the probability of each response is less than 0.10
For each approach, provide:
- Name and one-sentence summary
- Detailed description (2-3 paragraphs)
- Key design decisions and rationale
- Trade-offs (what you gain vs what you sacrifice)
- Probability (0.0-1.0)
- Complexity estimate (low/medium/high)
- Potential risks and failure modes
**Step 4: Verify diversity**
Before finalizing, check:
- Are approaches genuinely different, not minor variations?
- Do they span different regions of the solution space?
- Have you covered both conventional and unconventional options?
CRITICAL:
- Do NOT implement full solutions yet - only high-level approaches
- Ensure approaches are genuinely different, not minor variations
CRITICAL : Launch the pruning meta-judge in parallel with Phase 1 exploration agents. The meta-judge does not need exploration output to generate pruning criteria — it only needs the original task description.
The pruning meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating high-level proposals for pruning.
Prompt template for pruning meta-judge:
## Task
Generate an evaluation specification yaml for pruning high-level solution proposals. You will produce rubrics, checklists, and scoring criteria that judge agents will use to select the top 3 proposals for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
proposals (high-level approaches with probability estimates, not full implementations)
## Evaluation Focus
Feasibility, alignment with requirements, potential for high-quality result, risk manageability
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation and ranking of proposals.
Dispatch:
Use Task tool:
- description: "Pruning Meta-judge: {brief task summary}"
- prompt: {pruning meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for BOTH Phase 1 exploration agents AND Phase 1.5 pruning meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
.specs/research/) and the pruning meta-judge evaluation specification YAML.specs/research/{solution-name}-{date}.pruning.[1|2|3].mdKey principle: Independent evaluation with meta-judge-generated criteria ensures consistent, tailored assessment without hardcoded weights.
CRITICAL: Provide to each judge the EXACT pruning meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Prompt template for pruning judges:
You are evaluating {N} proposed approaches against an evaluation specification produced by the meta judge, to select the top 3 for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Proposals
{list of paths to all proposal files}
Read all proposals carefully before evaluating.
## Evaluation Specification
```yaml
{pruning meta-judge's evaluation specification YAML}
{.specs/research/{solution-name}-{date}.pruning.[1|2|3].md}
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Pruning Judge {1|2|3}: {brief task summary}"
prompt: {pruning judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
After judges complete voting:
.specs/research/{solution-name}-{date}.selection.md:
Launch 3 independent agents in parallel (recommended: Opus for quality):
CRITICAL : Launch the evaluation meta-judge in parallel with Phase 3 expansion agents. The meta-judge does not need expansion output to generate evaluation criteria — it only needs the original task description.
The evaluation meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating full solution implementations.
Prompt template for evaluation meta-judge:
## Task
Generate an evaluation specification yaml for evaluating full solution implementations. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementations.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (full implementations developed from selected proposals)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
Dispatch:
Use Task tool:
- description: "Evaluation Meta-judge: {brief task summary}"
- prompt: {evaluation meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for BOTH Phase 3 expansion agents AND Phase 3.5 evaluation meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
.specs/reports/{solution-name}-{date}.[1|2|3].mdKey principle: Multiple independent evaluations with meta-judge-generated specifications and explicit evidence reduce bias and catch different quality aspects.
CRITICAL: Provide to each judge the EXACT evaluation meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
CRITICAL: NEVER provide score threshold to judges. Judge MUST not know what threshold for score is, in order to not be biased!!!
Prompt template for evaluation judges:
You are evaluating {number} full solutions against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all solution files}
Read all solutions carefully before evaluating.
## Evaluation Specification
```yaml
{evaluation meta-judge's evaluation specification YAML}
Write full report to: .specs/reports/{solution-name}-{date}.[1|2|3].md
CRITICAL: You must reply with this exact structured header format:
VOTE: [Solution A/B/C] SCORES: Solution A: [X.X]/5.0 Solution B: [X.X]/5.0 Solution C: [X.X]/5.0 CRITERIA:
[Summary of your evaluation]
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
**Dispatch:**
Use Task tool:
description: "Evaluation Judge {1|2|3}: {brief task summary}"
prompt: {evaluation judge prompt with exact meta-judge specification YAML}
model: opus
subagent_type: "sadd:judge"
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Step 1: Parse structured headers from judge reply
Parse the judges reply. CRITICAL: Do not read report files themselves, as they can overflow your context.
Step 2: Check for unanimous winner
Compare all three VOTE values:
Step 3: Check if all solutions are fundamentally flawed
If no unanimous vote, calculate average scores:
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0) → REDESIGN (return to Phase 3)
Strategy 2: REDESIGN

When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
Note: If redesign fails twice, escalate to user for guidance.
Prompt template for new implementation:
You are analyzing why all solutions failed to meet quality standards in order to inform a redesign, and then implementing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all solution files}
Average scores: A={avg_a}/5.0, B={avg_b}/5.0, C={avg_c}/5.0
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
All solutions scored below 3.0/5.0 threshold.
</evaluation_reports>
<output>
.specs/research/{solution-name}-{date}.redesign-analysis.md
</output>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
Strategy 3: FULL_SYNTHESIS

When: No clear winner AND solutions have merit (scores >=3.0)
Process: Proceed to Phase 5 (Evidence-Based Synthesis)
Only executed when Strategy 3 (FULL_SYNTHESIS) is selected in Phase 4.5
Launch 1 synthesis agent (recommended: Opus for quality):
Inputs: all solution files, evaluation reports (.specs/reports/), and research files (.specs/research/)

Key principle: Evidence-based synthesis leverages collective intelligence from exploration and evaluation.
Prompt template for synthesizer:
You are synthesizing the best solution from explored, pruned, and evaluated implementations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all solution files}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<selection_rationale>
{path to selection.md explaining why these proposals were chosen}
</selection_rationale>
<output>
{output_path} - The final synthesized solution
</output>
Instructions:
Let's approach this synthesis systematically by first analyzing, then decomposing, then building.
**Step 1: Build the evidence base**
Before synthesizing, gather evidence from judge reports:
- What did multiple judges praise? (consensus strengths)
- What did multiple judges criticize? (consensus weaknesses)
- Where did judges disagree? (areas needing careful analysis)
**Step 2: Decompose into synthesis subproblems**
Break the solution into logical sections or components. For each component:
- Which solution handles this best? (cite evidence)
- Are there complementary elements from multiple solutions?
- What issues were identified that need fixing?
**Step 3: Solve each subproblem**
For each component/section, determine the synthesis strategy:
*Strategy A - Clear winner:* If one solution is clearly superior for this component:
- Copy that section directly
- Document: "Taken from Solution X because [judge evidence]"
*Strategy B - Complementary combination:* If solutions have complementary strengths:
- Identify what each contributes
- Combine carefully, ensuring consistency
- Document: "Combined X from Solution A with Y from Solution B because [rationale]"
*Strategy C - All flawed:* If all solutions have issues in this area:
- Start with the best version
- Apply fixes based on judge criticism
- Document: "Based on Solution X, modified to address [specific issues]"
**Step 4: Integrate and verify consistency**
After synthesizing all components:
- Check that combined elements work together
- Resolve any contradictions between borrowed sections
- Ensure consistent terminology and style
**Step 5: Document synthesis decisions**
Create a synthesis log:
- What you took from each solution (with specific citations)
- Why you made those choices (reference judge feedback)
- How you addressed identified weaknesses
- Any novel combinations or improvements
<example>
**Example synthesis decision for an API design:**
Component: Authentication flow
- Solution A: JWT with refresh tokens (praised for security by 2/3 judges)
- Solution B: Session-based (praised for simplicity by 1 judge, criticized for scalability)
- Solution C: OAuth2 only (criticized as over-engineered for use case)
Decision: Take Solution A's authentication flow directly.
Evidence: Judges 1 and 3 both noted "JWT approach provides good balance of security and statelessness"
Modification: None needed - this section was rated highest across judges.
</example>
**Step 6: Revise your solution**
- Generate 5 verification questions about critical aspects
- Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
- Revise solution:
- Fix identified issues
- Explain what was changed and why
CRITICAL:
- Do not create something entirely new - synthesize the best from what exists
- Cite your sources (which solution, which section)
- Explain every major decision
- Address all consensus weaknesses identified by judges
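Outside the prompt itself, the per-component strategy choice in Step 3 can be sketched in code. The rating representation below is purely hypothetical (`component_strategy` and the "praise"/"criticism" tags are invented for illustration):

```python
def component_strategy(ratings: dict[str, list[str]]) -> str:
    """Pick a synthesis strategy for one component.

    ratings maps each solution to judge verdicts for this component,
    e.g. {"A": ["praise", "praise"], "B": ["criticism"]}.
    """
    # A solution counts as a consensus strength if >=2 judges praised it here.
    praised = {s for s, r in ratings.items() if r.count("praise") >= 2}
    if len(praised) == 1:
        return "A: copy the clear winner's section directly"
    if len(praised) > 1:
        return "B: combine complementary strengths, ensuring consistency"
    return "C: start from the best version and fix judge criticisms"
```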
Research directory: .specs/research/ (created if not exists)
- .specs/research/{solution-name}-{date}.proposals.[a|b|c].md - High-level approaches with probabilities
- .specs/research/{solution-name}-{date}.pruning.[1|2|3].md - Judge evaluations and votes
- .specs/research/{solution-name}-{date}.selection.md - Vote tallies and selected proposals

Expansion outputs:
- solution.a.md, solution.b.md, solution.c.md - Full implementations (in specified output location)

/tree-of-thoughts "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
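The naming convention can be sketched as a small helper. One plausible reading of the derivation rule (specs/api/users.md → users-api, i.e. file stem plus parent directory) is assumed here; `artifact_paths` is a name invented for illustration:

```python
from datetime import date
from pathlib import Path

def artifact_paths(output_path: str, today: date) -> dict:
    """Derive research and report file paths from the solution output path."""
    p = Path(output_path)
    # Assumed rule: "specs/api/users.md" -> stem "users" + parent "api" -> "users-api"
    name = f"{p.stem}-{p.parent.name}"
    d = today.isoformat()  # YYYY-MM-DD
    return {
        "proposals": [f".specs/research/{name}-{d}.proposals.{x}.md" for x in "abc"],
        "pruning": [f".specs/research/{name}-{d}.pruning.{i}.md" for i in (1, 2, 3)],
        "selection": f".specs/research/{name}-{d}.selection.md",
        "reports": [f".specs/reports/{name}-{d}.{i}.md" for i in (1, 2, 3)],
    }
```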
Phase 1 outputs (assuming date 2025-01-15):
- .specs/research/users-api-2025-01-15.proposals.a.md - 6 approaches from Agent A
- .specs/research/users-api-2025-01-15.proposals.b.md - 6 approaches from Agent B
- .specs/research/users-api-2025-01-15.proposals.c.md - 6 approaches from Agent C

Phase 1.5 output (runs in parallel with Phase 1):
- Meta-judge (sadd:meta-judge) generates pruning evaluation specification YAML

Phase 2 outputs (3 judges with pruning meta-judge spec):
- .specs/research/users-api-2025-01-15.pruning.1.md - Top 3: Resource-based REST, Pure REST, Monolithic
- .specs/research/users-api-2025-01-15.pruning.2.md - Top 3: Pure REST, Hybrid (services), Resource-based REST
- .specs/research/users-api-2025-01-15.pruning.3.md - Top 3: Resource-based REST, REST+GraphQL hybrid, Pure REST
- .specs/research/users-api-2025-01-15.selection.md - Selected: Resource-based REST (8 pts), Pure REST (7 pts), Monolithic (4 pts)

Phase 3 outputs:
- specs/api/users.a.md - Full resource-based design with nested routes
- specs/api/users.b.md - Flat REST design with simple endpoints
- specs/api/users.c.md - Monolithic API with service-oriented internals

Phase 3.5 output (runs in parallel with Phase 3):
- Meta-judge (sadd:meta-judge) generates evaluation specification YAML

Phase 4 outputs (3 judges with evaluation meta-judge spec):
.specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.2/5.0, B=3.8/5.0, C=3.4/5.0
"Prefers A for RESTfulness, criticizes C complexity"
.specs/reports/users-api-2025-01-15.2.md:
VOTE: Solution B
SCORES: A=3.9/5.0, B=4.1/5.0, C=3.5/5.0
"Prefers B for simplicity, criticizes A deep nesting"
.specs/reports/users-api-2025-01-15.3.md:
VOTE: Solution A
SCORES: A=4.3/5.0, B=3.6/5.0, C=3.2/5.0
"Prefers A for discoverability, criticizes B lack of structure"
Phase 4.5 decision (orchestrator parses headers):
- Votes: A, B, A - no unanimous winner; all average scores >= 3.0 → FULL_SYNTHESIS
Phase 5 output (synthesis):
- specs/api/users.md - Resource-based structure (from A), max 2-level nesting (from B), internal services (from C)
solution.a.md, solution.b.md, solution.c.md

Key principle: Focused development of validated approaches with awareness of evaluation feedback.
Prompt template for expansion agents:
You are developing a full solution based on a selected proposal.
<task>
{task_description}
</task>
<selected_proposal>
{write selected proposal EXACTLY as it is. Including all details provided by the agent}
Read this carefully - it is your starting point.
</selected_proposal>
<judge_feedback>
{concerns and questions from judges about this proposal}
Address these in your implementation.
</judge_feedback>
<output>
solution.[*].md where [*] is your unique identifier (a, b, or c)
</output>
Instructions:
Let's work through this systematically to ensure we build a complete, high-quality solution.
**Step 1: Understand the proposal deeply**
Before implementing, analyze:
- What is the core insight or approach of this proposal?
- What are the key design decisions already made?
- What gaps need to be filled for a complete solution?
**Step 2: Address judge feedback**
For each concern raised by judges:
- What specific change or addition addresses this concern?
- How does this change integrate with the proposal's approach?
**Step 3: Decompose into implementation subproblems**
Break the solution into logical parts:
- What are the main components or sections?
- What must be defined first for other parts to build upon?
- What are the dependencies between parts?
**Step 4: Implement each subproblem**
For each component, work through:
- Core functionality and behavior
- Edge cases and error handling
- Integration points with other components
**Step 5: Self-verification**
Generate 3-5 verification questions about critical aspects, then answer them:
- Review solution against each question
- Identify gaps or weaknesses
- Fix identified issues
**Step 6: Document changes**
Explain what was changed from the original proposal and why.
<example>
**Example of good expansion thinking:**
Proposal: "Use event-driven architecture with message queue"
Step 1 Analysis:
- Core insight: Decouple components via async messaging
- Key decisions: Events as primary communication, eventual consistency
- Gaps: Need to define event schemas, queue technology, error handling
Step 2 - Addressing judge concern "What about message ordering?":
- Add partition keys for ordered processing within entity scope
- Document ordering guarantees and limitations
Step 3 - Subproblems:
1. Event schema definitions (foundational - others depend on this)
2. Producer interfaces (depends on schemas)
3. Consumer handlers (depends on schemas)
4. Error handling and dead letter queues (depends on both)
5. Integration patterns (builds on all above)
</example>
CRITICAL:
- Stay faithful to the selected proposal's core approach
- Do not switch to a different approach midway
- Address judge feedback explicitly
- Produce a complete, implementable solution
Step 4: Default to full synthesis
If none of the above conditions are met, select FULL_SYNTHESIS.
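The decision rules above can be sketched as a single function (`select_strategy` is a name invented for illustration; inputs come from the parsed judge headers):

```python
def select_strategy(votes: list[str], scores: dict[str, list[float]]) -> str:
    """Apply the Phase 4.5 decision rules.

    votes:  one VOTE value per judge, e.g. ["A", "B", "A"]
    scores: per-solution score lists across judges, e.g. {"A": [4.2, 3.9, 4.3], ...}
    """
    if len(set(votes)) == 1:  # unanimous winner
        return "SELECT_AND_POLISH"
    averages = {s: sum(v) / len(v) for s, v in scores.items()}
    if all(avg < 3.0 for avg in averages.values()):  # all fundamentally flawed
        return "REDESIGN"
    return "FULL_SYNTHESIS"  # split decision, but solutions have merit
```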
Strategy 1: SELECT_AND_POLISH

When: Clear winner (unanimous votes)
Process:
Benefits:
Prompt template:
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's approach this polishing task methodically to improve without disrupting what works.
**Step 1: Understand why this solution won**
Analyze the winning solution:
- What are its core strengths that judges praised?
- What makes its approach superior to alternatives?
- Which parts should remain untouched?
**Step 2: Catalog improvement opportunities**
From judge feedback, identify:
- Specific weaknesses mentioned (list each one)
- Missing elements judges noted
- Areas where runner-ups were praised
**Step 3: Prioritize changes by impact**
For each improvement opportunity:
- High impact: Directly addresses judge criticism
- Medium impact: Adds praised element from runner-up
- Low impact: Nice-to-have refinement
Focus on high-impact changes first.
**Step 4: Apply improvements surgically**
For each change:
- Locate the specific section to modify
- Make the minimal change needed to address the issue
- Verify the change integrates cleanly with surrounding content
**Step 5: Cherry-pick from runners-up**
Review runner-up solutions for:
- 1-2 specific elements that judges praised
- Elements that complement (not conflict with) the winning approach
- Only incorporate if clearly superior to winning solution's version
**Step 6: Document all changes**
Record:
- What was changed and why (with reference to judge feedback)
- What was added from other solutions (cite source)
- What was intentionally left unchanged
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
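The Step 3 prioritization in the polishing template can be sketched outside the prompt itself. The item shape below is hypothetical (`prioritize` and the `"impact"` field are invented for illustration):

```python
def prioritize(improvements: list[dict]) -> list[dict]:
    """Order improvement opportunities high -> medium -> low impact,
    so high-impact changes (direct judge criticisms) are applied first."""
    rank = {"high": 0, "medium": 1, "low": 2}
    return sorted(improvements, key=lambda item: rank[item["impact"]])
```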
Reports directory: .specs/reports/ (created if not exists)
- .specs/reports/{solution-name}-{date}.[1|2|3].md - Final judge reports

Resulting solution: {output_path}