prompt-engineer-toolkit by borghei/claude-skills
npx skills add https://github.com/borghei/claude-skills --skill prompt-engineer-toolkit
Tier: POWERFUL Category: Engineering Tags: prompt engineering, chain-of-thought, few-shot, evaluation, testing, prompt versioning
Prompt Engineer Toolkit provides the complete lifecycle for production prompts: design patterns that work, testing frameworks that catch regressions, versioning systems that track changes, and evaluation rubrics that replace subjective "looks good" with measurable quality. This is not about clever tricks -- it is about treating prompts as production code with the same rigor.
Every production prompt has a layered structure. Order matters.
┌──────────────────────────────────────┐
│ Layer 1: Identity & Role │ Who the model is
│ "You are a senior code reviewer..." │
├──────────────────────────────────────┤
│ Layer 2: Capabilities & Constraints │ What it can and cannot do
│ "You can read files, run tests..." │
├──────────────────────────────────────┤
│ Layer 3: Output Format │ How to structure responses
│ "Always respond with JSON..." │
├──────────────────────────────────────┤
│ Layer 4: Quality Standards │ What good output looks like
│ "Include edge cases, cite sources" │
├──────────────────────────────────────┤
│ Layer 5: Anti-Patterns │ What to avoid
│ "Never fabricate citations..." │
├──────────────────────────────────────┤
│ Layer 6: Examples │ Calibration via demonstration
│ "Here is an example..." │
└──────────────────────────────────────┘
| Layer | Principle | Common Mistake |
|---|---|---|
| Identity | Be specific about expertise level | "You are an AI assistant" (too generic) |
| Capabilities | Enumerate, don't imply | Assuming model knows available tools |
| Output Format | Show exact schema | Describing format in prose instead of schema |
| Quality Standards | Quantify when possible | "Be thorough" (unquantifiable) |
| Anti-Patterns | State the actual failure mode | "Don't be wrong" (useless) |
| Examples | Show edge cases, not just happy path | Only showing trivial examples |
Think through this step by step:
1. First, identify [what needs to be analyzed]
2. Then, evaluate [specific criteria]
3. Finally, synthesize [the conclusion]
Show your reasoning for each step.
When to use: Complex reasoning, math, multi-step logic
When NOT to use: Simple classification, formatting tasks, creative writing
Use the following reasoning process:
<scratchpad>
- List relevant facts
- Identify applicable rules
- Work through the logic
- Check for edge cases
</scratchpad>
Then provide your final answer outside the scratchpad tags.
Advantage: the model can reason messily inside the tags while the final output stays clean.
Solve this problem three different ways, then compare your answers.
If all three agree, that's your answer.
If they disagree, identify which approach is most reliable and explain why.
When to use: High-stakes decisions where correctness matters more than speed.
Cost: 3x token usage. Use selectively.
| Criterion | Good Example | Bad Example |
|---|---|---|
| Representative | Covers typical input patterns | Only edge cases |
| Diverse | Different input types/lengths | All same structure |
| Edge-covering | Includes tricky cases | Only happy path |
| Output-calibrating | Shows desired detail level | Overly verbose or terse |
| Ordered | Simple → complex progression | Random order |
Here are examples of the expected input and output:
Example 1 (simple case):
Input: [simple input]
Output: [simple output with annotation]
Example 2 (typical case):
Input: [typical input]
Output: [typical output with annotation]
Example 3 (edge case):
Input: [tricky input]
Output: [correct handling with annotation]
Now process this:
Input: {user_input}
Output:
For production systems with thousands of examples:
1. Embed all examples
2. Embed the current input
3. Find K nearest examples by embedding similarity
4. Include those K examples as shots
5. Typical K: 3-5 (diminishing returns after 5)
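The retrieval steps above can be sketched in pure Python; the toy 2-D "embeddings" are stand-ins for vectors from whatever embedding API you use:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def select_shots(input_emb, example_embs, examples, k=5):
    """Return the k examples most similar to the input embedding."""
    ranked = sorted(range(len(examples)),
                    key=lambda i: cosine(input_emb, example_embs[i]),
                    reverse=True)
    return [examples[i] for i in ranked[:k]]

# Toy 2-D "embeddings" for illustration; real ones come from an embedding model
examples = ["refund request", "login failure", "feature idea"]
embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
shots = select_shots([0.9, 0.1], embs, examples, k=2)
```

The selected `shots` are then formatted into the few-shot template shown earlier.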
Respond with a JSON object matching this exact schema:
{
"analysis": {
"summary": "string - one sentence summary",
"severity": "string - one of: critical, high, medium, low",
"findings": [
{
"issue": "string - description of the issue",
"location": "string - file:line",
"fix": "string - recommended fix",
"confidence": "number - 0.0 to 1.0"
}
],
"overall_score": "number - 0 to 100"
}
}
Rules:
- findings array must have at least one entry
- confidence must reflect actual certainty, not optimism
- overall_score: 90-100 (excellent), 70-89 (good), 50-69 (needs work), <50 (poor)
Structure your response with these exact sections:
## Assessment
[1-2 sentence bottom line]
## Evidence
[Specific observations supporting the assessment]
## Risks
[What could go wrong, with likelihood estimates]
## Recommendation
[Specific actionable next steps with owners]
Complex prompts that try to do everything fail. Decompose them.
| Bad (monolithic) | Good (decomposed) |
|---|---|
| "Review this code for bugs, style, performance, security, and suggest improvements" | Prompt 1: "Identify bugs" / Prompt 2: "Check style" / Prompt 3: "Find performance issues" / Prompt 4: "Security audit" / Prompt 5: "Synthesize findings" |
Prompt 1 (Extract): Input → structured data
Prompt 2 (Analyze): Structured data → findings
Prompt 3 (Synthesize): Findings → recommendation
Prompt 4 (Format): Recommendation → user-facing output
Each prompt is testable independently. A failure in Prompt 2 doesn't require re-running Prompt 1.
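The chain above is plain function composition; `call_model` here is a stub standing in for whatever LLM client you use:

```python
def call_model(prompt: str, payload: str) -> str:
    """Stub for an LLM call -- replace with your client of choice."""
    return f"[{prompt}] {payload}"

def run_chain(raw_input: str) -> str:
    extracted = call_model("Extract structured data", raw_input)
    findings = call_model("Analyze the data", extracted)
    recommendation = call_model("Synthesize a recommendation", findings)
    return call_model("Format for the user", recommendation)

# Each stage can be unit-tested by calling call_model with a fixed payload,
# so a failure in stage 2 never forces a re-run of stage 1.
result = run_chain("def f(x): return x * 2")
```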
| Task Type | Temperature | Rationale |
|---|---|---|
| Code generation | 0.0-0.2 | Correctness > creativity |
| Classification | 0.0 | Deterministic expected |
| Analysis/reasoning | 0.2-0.5 | Some flexibility in framing |
| Creative writing | 0.7-1.0 | Diversity of expression |
| Brainstorming | 0.8-1.2 | Maximum variety |
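As configuration, the table above might become a simple lookup; the values below are rough midpoints of the table's ranges and are assumptions to tune against your test suite, not fixed rules:

```python
# Suggested starting temperatures per task type (tune against your test suite)
TEMPERATURE = {
    "code_generation": 0.1,
    "classification": 0.0,
    "analysis": 0.3,
    "creative_writing": 0.8,
    "brainstorming": 1.0,
}

temp = TEMPERATURE["classification"]  # 0.0 -- deterministic output expected
```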
For each finding, rate your confidence:
Confidence levels:
- VERIFIED: I can point to specific evidence in the provided context
- LIKELY: Strong inference from available information
- UNCERTAIN: Reasonable guess, but limited evidence
- SPECULATIVE: Possible but I'm reaching
Never state SPECULATIVE findings as VERIFIED.
Every production prompt needs a test suite.
{
"test_id": "classify-urgent-001",
"input": "Server is down, customers can't access the product",
"expected": {
"contains": ["critical", "immediate"],
"not_contains": ["low priority", "can wait"],
"format_regex": "^\\{.*\\}$",
"max_tokens": 500,
"required_fields": ["severity", "category"]
},
"tags": ["classification", "urgency", "happy-path"]
}
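Checking a model output against a test case like the one above takes only a few lines; the check names mirror the `expected` fields (a sketch, assuming outputs arrive as raw strings; the `max_tokens` check is omitted because it needs a tokenizer):

```python
import json
import re

def check_case(output: str, expected: dict) -> list[str]:
    """Return the names of the failed checks for one test case."""
    failures = []
    if any(s not in output for s in expected.get("contains", [])):
        failures.append("contains")
    if any(s in output for s in expected.get("not_contains", [])):
        failures.append("not_contains")
    regex = expected.get("format_regex")
    if regex and not re.match(regex, output, re.DOTALL):
        failures.append("format_regex")
    fields = expected.get("required_fields", [])
    if fields:
        try:
            data = json.loads(output)
            if any(f not in data for f in fields):
                failures.append("required_fields")
        except json.JSONDecodeError:
            failures.append("required_fields")
    return failures

expected = {"contains": ["critical"], "not_contains": ["low priority"],
            "format_regex": "^\\{.*\\}$",
            "required_fields": ["severity", "category"]}
good = '{"severity": "critical", "category": "outage"}'
failed = check_case(good, expected)  # []
```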
| Category | % of Suite | Purpose |
|---|---|---|
| Happy path | 40% | Confirm basic functionality works |
| Edge cases | 30% | Boundary conditions, unusual inputs |
| Adversarial | 15% | Inputs designed to break the prompt |
| Regression | 15% | Cases that previously failed |
| Dimension | Measurement | Weight |
|---|---|---|
| Adherence | Contains required elements, matches schema | 30% |
| Accuracy | Correct classification/analysis/answer | 30% |
| Safety | No forbidden content, no hallucinations | 20% |
| Format | Matches expected structure, length bounds | 10% |
| Relevance | Response addresses the actual input | 10% |
score = (adherence * 0.30) + (accuracy * 0.30) + (safety * 0.20) + (format * 0.10) + (relevance * 0.10)
Pass threshold: 0.80
Warning threshold: 0.70
Fail threshold: < 0.70
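The weighted score and thresholds above translate directly to code:

```python
WEIGHTS = {"adherence": 0.30, "accuracy": 0.30, "safety": 0.20,
           "format": 0.10, "relevance": 0.10}

def composite_score(dims: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(dims[d] * w for d, w in WEIGHTS.items())

def verdict(score: float) -> str:
    if score >= 0.80:
        return "pass"
    return "warn" if score >= 0.70 else "fail"

dims = {"adherence": 0.9, "accuracy": 0.8, "safety": 1.0,
        "format": 0.7, "relevance": 0.9}
score = composite_score(dims)  # 0.27 + 0.24 + 0.20 + 0.07 + 0.09 = 0.87
```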
1. Before any prompt change:
- Run full test suite against current prompt (baseline)
- Record scores per test case
2. After prompt change:
- Run same test suite against new prompt (candidate)
- Compare scores per test case
3. Acceptance criteria:
- Average score: candidate >= baseline
- No individual test case drops by more than 10%
- Zero safety violations (any safety failure = reject)
- If criteria met: promote candidate
- If criteria not met: iterate on prompt or reject
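The acceptance criteria above can be encoded as a single gate function; this is a sketch that assumes per-case scores are keyed by `test_id` and safety failures are counted separately:

```python
def accept_candidate(baseline: dict, candidate: dict,
                     safety_failures: int, max_drop: float = 0.10) -> bool:
    """Promote only if the average holds, no case drops >10%, and safety is clean."""
    if safety_failures > 0:
        return False  # any safety failure = reject
    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate.values()) / len(candidate)
    if cand_avg < base_avg:
        return False
    for test_id, base_score in baseline.items():
        if candidate[test_id] < base_score * (1 - max_drop):
            return False  # individual regression exceeds the allowed drop
    return True

baseline = {"t1": 0.85, "t2": 0.90}
candidate = {"t1": 0.88, "t2": 0.89}
promoted = accept_candidate(baseline, candidate, safety_failures=0)  # True
```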
prompts/
├── support-classifier/
│ ├── v1.txt # Original version
│ ├── v2.txt # Added edge case handling
│ ├── v3.txt # Current production
│ ├── changelog.md # Change log with rationale
│ └── tests/
│ ├── suite.json # Test cases
│ └── baselines/
│ ├── v1-results.json
│ ├── v2-results.json
│ └── v3-results.json
├── code-reviewer/
│ ├── v1.txt
│ └── ...
## v3 (2026-03-09)
**Author:** borghei
**Change:** Added explicit handling for multi-language inputs
**Reason:** v2 defaulted to English analysis for non-English code comments
**Test results:** Average score 0.87 (v2 was 0.82). No regressions.
**Rollback plan:** Revert to v2.txt
## v2 (2026-02-15)
**Author:** borghei
**Change:** Added structured output format with JSON schema
**Reason:** Downstream parser needed consistent format
**Test results:** Average score 0.82 (v1 was 0.79). Format compliance 100% (v1 was 73%).
Before deploying a new version, always diff:
Key questions for prompt diffs:
1. Were any constraints removed? (Risk: safety regression)
2. Were any examples changed? (Risk: calibration shift)
3. Was the output format changed? (Risk: downstream parser breaks)
4. Were any anti-patterns removed? (Risk: known failure modes return)
5. Is the new prompt longer? (Risk: context budget impact)
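A pre-deployment diff needs nothing beyond the standard library; flagging removed lines that look like constraints is a heuristic sketch covering question 1 above (the keyword list is an assumption to adapt to your prompts):

```python
import difflib

def prompt_diff(old: str, new: str) -> dict:
    """Unified diff plus a flag for removed lines that look like constraints."""
    diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                     lineterm=""))
    removed = [line[1:] for line in diff
               if line.startswith("-") and not line.startswith("---")]
    keywords = ("never", "must", "only", "critical")  # heuristic constraint markers
    risky = [line for line in removed
             if any(k in line.lower() for k in keywords)]
    return {"diff": diff, "removed_constraints": risky}

old = "You are a reviewer.\nNever fabricate citations.\nRespond in JSON."
new = "You are a reviewer.\nRespond in JSON."
report = prompt_diff(old, new)
```

A non-empty `removed_constraints` list is exactly the safety-regression risk that question 1 asks about.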
| Failure Mode | Symptom | Fix |
|---|---|---|
| Instruction override | Model ignores constraints | Move constraints earlier, add "CRITICAL:" prefix |
| Format drift | Output structure varies between calls | Add JSON schema, reduce temperature |
| Sycophancy | Model agrees with wrong premise | Add "Challenge assumptions" instruction |
| Verbosity bloat | Output too long, buries the answer | Add word/token limits, "be concise" |
| Hallucination | Fabricated facts, citations, or code | Add "Only reference provided context" |
| Anchoring | First example dominates output style | Diversify examples, add "each input is independent" |
| Lost in the middle | Middle instructions get ignored | Front-load and back-load critical instructions |
1. Define the task precisely (input type, output type, quality criteria)
2. Write the system prompt using the 6-layer architecture
3. Create 10+ test cases (40% happy, 30% edge, 15% adversarial, 15% regression)
4. Run test suite, score results
5. Iterate until passing threshold (0.80+)
6. Version as v1, record baseline scores
7. Deploy with monitoring
1. Identify which test cases are failing
2. Categorize failures (format? accuracy? safety? relevance?)
3. Check: did the model change? (API version, model update)
4. Check: did the input distribution change? (new edge cases)
5. Check: was the prompt modified? (diff against last known good)
6. Fix the root cause (not the symptom)
7. Run full regression suite before deploying fix
1. Run full test suite on current model (baseline)
2. Run same suite on new model (no prompt changes)
3. Compare: if scores are equivalent, done
4. If scores drop: identify which dimensions degraded
5. Adjust prompt for new model's behavior patterns
6. Re-run suite until scores meet or exceed baseline
7. Document model-specific adjustments in changelog
| Skill | Integration |
|---|---|
| self-improving-agent | Prompts that degrade are a regression signal; test them |
| agent-designer | Agent system prompts are the highest-stakes prompts to test |
| context-engine | Context retrieval quality directly affects prompt effectiveness |
| ab-test-setup | A/B test prompt variants in production with statistical rigor |
references/prompt-patterns-catalog.md - Complete catalog of prompting techniques with examples
references/evaluation-rubric-templates.md - Reusable evaluation rubrics by task type
references/model-specific-behaviors.md - Known behavior differences across model families