write-judge-prompt by hamelsmu/evals-skills
npx skills add https://github.com/hamelsmu/evals-skills --skill write-judge-prompt
Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
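Every judge prompt requires exactly four components: the task and evaluation criterion, Pass/Fail definitions, few-shot examples, and a structured output format.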
State what the judge evaluates. One failure mode per judge.
You are an evaluator assessing whether a real estate assistant's email
uses the appropriate tone for the client's persona.
Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."
Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and Fail. These definitions come from your error analysis failure mode descriptions.
## Definitions
PASS: The email matches the expected communication style for the client persona:
- Luxury Buyers: formal language, emphasis on exclusive features, premium
market positioning, no casual slang
- First-Time Homebuyers: warm and encouraging tone, educational explanations,
avoids jargon, patient and supportive
- Investors: data-driven language, ROI-focused, market analytics, concise
and professional
FAIL: The email uses a tone mismatched to the client persona. Examples:
- Using casual slang ("hey, check out this pad!") for a luxury buyer
- Using heavy financial jargon for a first-time homebuyer
- Using overly emotional language for an investor
Include labeled Pass and Fail examples from your human-labeled data.
## Examples
### Example 1: PASS
Client Persona: Luxury Buyer
Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing
at 1200 Pacific Heights Drive. This distinguished property features..."
Critique: The email opens with a formal salutation and uses language
consistent with luxury positioning — "exclusive listing," "distinguished
property." No casual slang or informal phrasing. The tone matches the
luxury buyer persona throughout.
Result: Pass
### Example 2: FAIL
Client Persona: Luxury Buyer
Email: "Hey! Just found this awesome place you might like. It's got a
pool and stuff, super cool neighborhood..."
Critique: The greeting "Hey!" is informal. Phrases like "awesome place,"
"got a pool and stuff," and "super cool" are casual slang inappropriate
for a luxury buyer. The email reads like a text message, not a
professional communication for a high-end client.
Result: Fail
### Example 3: PASS (borderline)
Client Persona: First-Time Homebuyer
Email: "Hi Sarah, I found a property that might be a great fit for your
first home. The neighborhood has good schools nearby, and the monthly
payment would be similar to what you're currently paying in rent..."
Critique: The greeting is warm but not overly casual. The email explains
the property in relatable terms — comparing mortgage to rent, mentioning
schools — which is educational without being condescending. It avoids
jargon like "amortization" or "LTV ratio." While not deeply technical,
this matches the supportive tone expected for a first-time buyer.
Result: Pass
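The task statement, definitions, and labeled examples above can be assembled into a single prompt string. The following is a minimal sketch, not the skill's own implementation: `FewShotExample`, `build_judge_prompt`, and the `## Output format` heading are hypothetical names chosen for illustration, and the example texts are shortened from the ones above.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One human-labeled example: the judge's inputs, a critique, and the verdict."""
    inputs: dict[str, str]  # e.g. {"Client Persona": "...", "Email": "..."}
    critique: str
    result: str  # "Pass" or "Fail"


def build_judge_prompt(task: str, definitions: str,
                       examples: list[FewShotExample],
                       output_format: str) -> str:
    """Join the four components into one prompt, always in the same order."""
    parts = [task, "## Definitions", definitions, "## Examples"]
    for i, ex in enumerate(examples, start=1):
        if ex.result not in ("Pass", "Fail"):
            raise ValueError(f"example {i}: result must be 'Pass' or 'Fail'")
        parts.append(f"### Example {i}: {ex.result.upper()}")
        parts.extend(f"{name}: {value}" for name, value in ex.inputs.items())
        # The critique precedes the result, mirroring the order the judge should use.
        parts.append(f"Critique: {ex.critique}")
        parts.append(f"Result: {ex.result}")
    parts.extend(["## Output format", output_format])  # heading name is an assumption
    return "\n".join(parts)


prompt = build_judge_prompt(
    task=("You are an evaluator assessing whether a real estate assistant's "
          "email uses the appropriate tone for the client's persona."),
    definitions=("PASS: The email matches the expected communication style.\n"
                 "FAIL: The email uses a tone mismatched to the client persona."),
    examples=[
        FewShotExample(
            inputs={"Client Persona": "Luxury Buyer",
                    "Email": "Dear Mr. Harrington, I am pleased to present..."},
            critique="Formal salutation and luxury positioning; no casual slang.",
            result="Pass",
        ),
        FewShotExample(
            inputs={"Client Persona": "Luxury Buyer",
                    "Email": "Hey! Just found this awesome place..."},
            critique="Informal greeting and casual slang throughout.",
            result="Fail",
        ),
    ],
    output_format='Return JSON with "critique" first, then "result".',
)
print(prompt)
```

Keeping the assembly in one function makes it harder to ship a judge that is missing one of the components, and it keeps every example in the same critique-then-result order.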
Rules for selecting examples:
Enforce structured output using your LLM provider's schema enforcement (e.g., response_format in OpenAI, tool definitions in Anthropic) or a library like Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.
The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.
{
"critique": "string — detailed assessment of the output against the criterion",
"result": "Pass or Fail"
}
Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
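When the provider cannot enforce a schema, the judge's reply can be validated after the fact. The sketch below uses only the Python standard library and assumes the reply is a raw JSON string in the shape shown above; `JudgeVerdict` and `parse_judge_output` are hypothetical names, not part of any provider SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    critique: str
    passed: bool


def parse_judge_output(raw: str) -> JudgeVerdict:
    """Validate a raw JSON reply against the critique-then-result shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    # json.loads keeps keys in document order, so this also checks that the
    # critique was written before the verdict.
    if list(data.keys()) != ["critique", "result"]:
        raise ValueError("expected exactly 'critique' followed by 'result'")
    critique, result = data["critique"], data["result"]
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("critique must be a non-empty string")
    if result not in ("Pass", "Fail"):
        raise ValueError(f"result must be 'Pass' or 'Fail', got {result!r}")
    return JudgeVerdict(critique=critique, passed=(result == "Pass"))


verdict = parse_judge_output(
    '{"critique": "Formal salutation; no casual slang.", "result": "Pass"}'
)
print(verdict.passed)
```

Rejecting anything other than the two literal verdicts keeps partial scores or hedged answers from leaking into downstream metrics.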
Feed only what the judge needs for an accurate decision:
| Failure Mode | What the Judge Needs |
|---|---|
| Tone mismatch | Client persona + generated email |
| Answer faithfulness | Retrieved context + generated answer |
| SQL correctness | User query + generated SQL + schema |
| Instruction following | System prompt rules + generated response |
| Tool call justification | Conversation history + tool call + tool result |
For long documents, feed only the relevant snippet, not the entire document.
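One way to apply the table is to declare the required fields for each failure mode and pass the judge nothing else. This is a rough sketch under assumed names: the keys in `REQUIRED_INPUTS` and the fields of the `trace` dictionary are hypothetical and would need to match however your traces are actually stored.

```python
# Hypothetical field names; the pairings follow the table above.
REQUIRED_INPUTS = {
    "tone_mismatch": ["client_persona", "generated_email"],
    "answer_faithfulness": ["retrieved_context", "generated_answer"],
    "sql_correctness": ["user_query", "generated_sql", "schema"],
    "instruction_following": ["system_prompt_rules", "generated_response"],
    "tool_call_justification": ["conversation_history", "tool_call", "tool_result"],
}


def select_judge_inputs(failure_mode: str, trace: dict[str, str]) -> dict[str, str]:
    """Return only the trace fields this judge needs; fail loudly if one is absent."""
    needed = REQUIRED_INPUTS[failure_mode]
    missing = [name for name in needed if name not in trace]
    if missing:
        raise KeyError(f"trace is missing fields for {failure_mode}: {missing}")
    return {name: trace[name] for name in needed}


trace = {
    "client_persona": "Luxury Buyer",
    "generated_email": "Dear Mr. Harrington, ...",
    "conversation_history": "(long transcript the tone judge does not need)",
}
judge_inputs = select_judge_inputs("tone_mismatch", trace)
print(judge_inputs)
```

Here the conversation history stays out of the tone judge's context even though the trace contains it.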
Start with the most capable model available. The model used for the main task can also serve as the judge, because judging is a different, narrower task. Optimize for cost only after alignment with human labels is confirmed.
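A simple way to check alignment before switching to a cheaper model is to compare the judge's verdicts with human labels on the same examples. The sketch below is one possible check, not a method prescribed by this skill: `agreement_report` is a hypothetical helper, the labels are made-up illustration data, and the agreement level that counts as "confirmed" is left to you.

```python
def agreement_report(human: list[str], judge: list[str]) -> dict[str, float]:
    """Compare judge verdicts with human labels ("Pass"/"Fail") on the same items."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))

    def rate(label: str) -> float:
        # Of the items humans gave this label, the share the judge also gave it.
        relevant = [j for h, j in pairs if h == label]
        if not relevant:
            return float("nan")
        return sum(j == label for j in relevant) / len(relevant)

    return {
        "overall": sum(h == j for h, j in pairs) / len(pairs),
        "agreement_on_human_pass": rate("Pass"),
        "agreement_on_human_fail": rate("Fail"),
    }


human_labels = ["Pass", "Pass", "Fail", "Fail"]
judge_labels = ["Pass", "Fail", "Fail", "Fail"]
report = agreement_report(human_labels, judge_labels)
print(report)
```

Reporting agreement separately on human-labeled Pass and Fail items shows whether a judge is only agreeing on the majority class.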
Weekly Installs: 136
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (133), gemini-cli (132), kimi-cli (132), github-copilot (132), cursor (132), opencode (132)