sadd:judge by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:judge
Before launching the evaluation pipeline, identify what needs evaluation:
Identify the work to evaluate:
- Check the conversation history for completed work
- If arguments were provided: use them to focus on specific aspects
- If unclear: ask the user "What work should I evaluate? (code changes, analysis, documentation, etc.)"

Extract evaluation context:
- The original task or request that prompted the work
- The actual outputs/results produced
- Files created or modified, with brief descriptions
- Any constraints, requirements, or acceptance criteria mentioned
- The artifact type (code, documentation, configuration, etc.)

Present the scope to the user:
Evaluation Scope:
- Original request: [summary]
- Work produced: [description]
- Files involved: [list]
- Artifact type: [code | documentation | configuration | etc.]
- Evaluation focus: [from arguments or "general quality"]
Launching meta-judge to generate evaluation criteria...
IMPORTANT: Pass only the extracted context to the sub-agents, not the entire conversation. This prevents context pollution and enables focused assessment.
Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.
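The actual structure of the returned specification is defined by the meta-judge agent itself, not by this skill. Purely as a hypothetical illustration, assuming invented criterion names, weights, and field names, a spec containing rubrics, checklists, and scoring criteria might look like:

```yaml
# Hypothetical illustration only - the real structure is whatever sadd:meta-judge emits.
# All criterion names, weights, and field names below are invented.
rubric:
  - criterion: correctness
    weight: 0.4
    description: The work does what the original request asked for
  - criterion: clarity
    weight: 0.3
    description: Output is readable and well structured
  - criterion: robustness
    weight: 0.3
    description: Edge cases and failure paths are handled
checklist:
  - Every acceptance criterion from the request is addressed
  - No unrelated files were modified
scoring:
  scale: "1.00-5.00"
  method: weighted average of rubric criterion scores
```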
Meta-Judge Prompt:
## Task
Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task or request that prompted the work}
## Context
{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}
## Artifact Type
{code | documentation | configuration | etc.}
## Instructions
Return only the final evaluation specification YAML in your response.
Dispatch:
Use Task tool:
- description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
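As a concrete sketch of the dispatch above (the work summary and prompt body here are placeholder values invented for illustration, not part of the skill), a filled-in Task tool call might look like:

```yaml
# Placeholder example of the Task tool fields - the description and prompt
# contents are invented; only the field names mirror the dispatch above.
description: "Meta-judge: Generate evaluation criteria for auth-module refactor"
prompt: |
  ## Task
  Generate an evaluation specification yaml for the following evaluation task. ...
  ## User Prompt
  Refactor the authentication module to use token-based sessions.
  ## Artifact Type
  code
model: opus
subagent_type: "sadd:meta-judge"
```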
Wait for the meta-judge to complete before proceeding to Phase 3.
After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.
CRITICAL: Provide the judge with the EXACT meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Judge Agent Prompt:
You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta-judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Work Under Evaluation
[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]
[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]
[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
CRITICAL: NEVER provide the score threshold to the judge in any format. The judge MUST NOT know what the score threshold is, so that its evaluation is not biased!!!
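The binding report format lives in the sadd:judge agent definition, not in this document. Purely as a hypothetical sketch, with every field name below invented for illustration, a structured evaluation report could open like this:

```yaml
# Hypothetical sketch - field names are invented; the agent's own
# instructions define the real report format.
evaluation:
  overall_score: 4.2
  criteria:
    - name: correctness
      score: 4.5
      evidence: "All acceptance criteria are met"
    - name: clarity
      score: 4.0
      evidence: "Naming is consistent; one module lacks comments"
  summary: "Solid quality, minor improvements optional"
```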
**Dispatch:**
Use Task tool:
- description: "Judge: Evaluate {brief work summary}"
- prompt: {judge prompt with exact meta-judge specification YAML}
- model: opus
- subagent_type: "sadd:judge"
After receiving the judge's evaluation:
- Validate the evaluation:
- If validation fails:
- Present results to the user:
| Score Range | Verdict | Interpretation | Recommendation |
|---|---|---|---|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |

Weekly Installs: 228
Repository: https://github.com/neolabhq/context-engineering-kit
GitHub Stars: 699
First Seen: Feb 19, 2026
Installed on: opencode (223), codex (222), github-copilot (221), gemini-cli (220), kimi-cli (218), cursor (218)