prompt-engineer-toolkit by borghei/claude-skills
npx skills add https://github.com/borghei/claude-skills --skill prompt-engineer-toolkit
Tier: POWERFUL Category: Engineering Tags: prompt engineering, chain-of-thought, few-shot, evaluation, testing, prompt versioning
Prompt Engineer Toolkit provides the complete lifecycle for production prompts: design patterns that work, testing frameworks that catch regressions, versioning systems that track changes, and evaluation rubrics that replace subjective "looks good" with measurable quality. This is not about clever tricks -- it is about treating prompts as production code with the same rigor.
Every production prompt has a layered structure. Order matters.
┌──────────────────────────────────────┐
│ Layer 1: Identity & Role │ Who the model is
│ "You are a senior code reviewer..." │
├──────────────────────────────────────┤
│ Layer 2: Capabilities & Constraints │ What it can and cannot do
│ "You can read files, run tests..." │
├──────────────────────────────────────┤
│ Layer 3: Output Format │ How to structure responses
│ "Always respond with JSON..." │
├──────────────────────────────────────┤
│ Layer 4: Quality Standards │ What good output looks like
│ "Include edge cases, cite sources" │
├──────────────────────────────────────┤
│ Layer 5: Anti-Patterns │ What to avoid
│ "Never fabricate citations..." │
├──────────────────────────────────────┤
│ Layer 6: Examples │ Calibration via demonstration
│ "Here is an example..." │
└──────────────────────────────────────┘
| Layer | Principle | Common Mistake |
|---|---|---|
| Identity | Be specific about expertise level | "You are an AI assistant" (too generic) |
| Capabilities | Enumerate, don't imply | Assuming model knows available tools |
| Output Format | Show exact schema | Describing format in prose instead of schema |
| Quality Standards | Quantify when possible | "Be thorough" (unquantifiable) |
| Anti-Patterns | State the actual failure mode | "Don't be wrong" (useless) |
| Examples | Show edge cases, not just happy path | Only showing trivial examples |
Think through this step by step:
1. First, identify [what needs to be analyzed]
2. Then, evaluate [specific criteria]
3. Finally, synthesize [the conclusion]
Show your reasoning for each step.
When to use: Complex reasoning, math, multi-step logic
When NOT to use: Simple classification, formatting tasks, creative writing
Use the following reasoning process:
<scratchpad>
- List relevant facts
- Identify applicable rules
- Work through the logic
- Check for edge cases
</scratchpad>
Then provide your final answer outside the scratchpad tags.
Advantage: the model can reason messily inside the tags while the final output stays clean.
Solve this problem three different ways, then compare your answers.
If all three agree, that's your answer.
If they disagree, identify which approach is most reliable and explain why.
When to use: High-stakes decisions where correctness matters more than speed.
Cost: 3x token usage. Use selectively.
| Criterion | Good Example | Bad Example |
|---|---|---|
| Representative | Covers typical input patterns | Only edge cases |
| Diverse | Different input types/lengths | All same structure |
| Edge-covering | Includes tricky cases | Only happy path |
| Output-calibrating | Shows desired detail level | Overly verbose or terse |
| Ordered | Simple → complex progression | Random order |
Here are examples of the expected input and output:
Example 1 (simple case):
Input: [simple input]
Output: [simple output with annotation]
Example 2 (typical case):
Input: [typical input]
Output: [typical output with annotation]
Example 3 (edge case):
Input: [tricky input]
Output: [correct handling with annotation]
Now process this:
Input: {user_input}
Output:
For production systems with thousands of examples:
1. Embed all examples
2. Embed the current input
3. Find K nearest examples by embedding similarity
4. Include those K examples as shots
5. Typical K: 3-5 (diminishing returns after 5)
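The retrieval steps above can be sketched in pure Python; the toy 2-D "embeddings" are stand-ins for vectors from whatever embedding API you use:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def select_shots(input_emb, example_embs, examples, k=5):
    """Return the k examples most similar to the input embedding."""
    ranked = sorted(range(len(examples)),
                    key=lambda i: cosine(input_emb, example_embs[i]),
                    reverse=True)
    return [examples[i] for i in ranked[:k]]

# Toy 2-D "embeddings" for illustration; real ones come from an embedding model
examples = ["refund request", "login failure", "feature idea"]
embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
shots = select_shots([0.9, 0.1], embs, examples, k=2)
```

The selected `shots` are then formatted into the few-shot template shown earlier.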
Respond with a JSON object matching this exact schema:
{
"analysis": {
"summary": "string - one sentence summary",
"severity": "string - one of: critical, high, medium, low",
"findings": [
{
"issue": "string - description of the issue",
"location": "string - file:line",
"fix": "string - recommended fix",
"confidence": "number - 0.0 to 1.0"
}
],
"overall_score": "number - 0 to 100"
}
}
Rules:
- findings array must have at least one entry
- confidence must reflect actual certainty, not optimism
- overall_score: 90-100 (excellent), 70-89 (good), 50-69 (needs work), <50 (poor)
Structure your response with these exact sections:
## Assessment
[1-2 sentence bottom line]
## Evidence
[Specific observations supporting the assessment]
## Risks
[What could go wrong, with likelihood estimates]
## Recommendation
[Specific actionable next steps with owners]
Complex prompts that try to do everything fail. Decompose them.
| Bad (monolithic) | Good (decomposed) |
|---|---|
| "Review this code for bugs, style, performance, security, and suggest improvements" | Prompt 1: "Identify bugs" / Prompt 2: "Check style" / Prompt 3: "Find performance issues" / Prompt 4: "Security audit" / Prompt 5: "Synthesize findings" |
Prompt 1 (Extract): Input → structured data
Prompt 2 (Analyze): Structured data → findings
Prompt 3 (Synthesize): Findings → recommendation
Prompt 4 (Format): Recommendation → user-facing output
Each prompt is testable independently. A failure in Prompt 2 doesn't require re-running Prompt 1.
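The chain above is plain function composition; `call_model` here is a stub standing in for whatever LLM client you use:

```python
def call_model(prompt: str, payload: str) -> str:
    """Stub for an LLM call -- replace with your client of choice."""
    return f"[{prompt}] {payload}"

def run_chain(raw_input: str) -> str:
    extracted = call_model("Extract structured data", raw_input)
    findings = call_model("Analyze the data", extracted)
    recommendation = call_model("Synthesize a recommendation", findings)
    return call_model("Format for the user", recommendation)

# Each stage can be unit-tested by calling call_model with a fixed payload,
# so a failure in stage 2 never forces a re-run of stage 1.
result = run_chain("def f(x): return x * 2")
```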
| Task Type | Temperature | Rationale |
|---|---|---|
| Code generation | 0.0-0.2 | Correctness > creativity |
| Classification | 0.0 | Deterministic expected |
| Analysis/reasoning | 0.2-0.5 | Some flexibility in framing |
| Creative writing | 0.7-1.0 | Diversity of expression |
| Brainstorming | 0.8-1.2 | Maximum variety |
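As configuration, the table above might become a simple lookup; the values below are rough midpoints of the table's ranges and are assumptions to tune against your test suite, not fixed rules:

```python
# Suggested starting temperatures per task type (tune against your test suite)
TEMPERATURE = {
    "code_generation": 0.1,
    "classification": 0.0,
    "analysis": 0.3,
    "creative_writing": 0.8,
    "brainstorming": 1.0,
}

temp = TEMPERATURE["classification"]  # 0.0 -- deterministic output expected
```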
For each finding, rate your confidence:
Confidence levels:
- VERIFIED: I can point to specific evidence in the provided context
- LIKELY: Strong inference from available information
- UNCERTAIN: Reasonable guess, but limited evidence
- SPECULATIVE: Possible but I'm reaching
Never state SPECULATIVE findings as VERIFIED.
Every production prompt needs a test suite.
{
"test_id": "classify-urgent-001",
"input": "Server is down, customers can't access the product",
"expected": {
"contains": ["critical", "immediate"],
"not_contains": ["low priority", "can wait"],
"format_regex": "^\\{.*\\}$",
"max_tokens": 500,
"required_fields": ["severity", "category"]
},
"tags": ["classification", "urgency", "happy-path"]
}
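Checking a model output against a test case like the one above takes only a few lines; the check names mirror the `expected` fields (a sketch, assuming outputs arrive as raw strings; the `max_tokens` check is omitted because it needs a tokenizer):

```python
import json
import re

def check_case(output: str, expected: dict) -> list[str]:
    """Return the names of the failed checks for one test case."""
    failures = []
    if any(s not in output for s in expected.get("contains", [])):
        failures.append("contains")
    if any(s in output for s in expected.get("not_contains", [])):
        failures.append("not_contains")
    regex = expected.get("format_regex")
    if regex and not re.match(regex, output, re.DOTALL):
        failures.append("format_regex")
    fields = expected.get("required_fields", [])
    if fields:
        try:
            data = json.loads(output)
            if any(f not in data for f in fields):
                failures.append("required_fields")
        except json.JSONDecodeError:
            failures.append("required_fields")
    return failures

expected = {"contains": ["critical"], "not_contains": ["low priority"],
            "format_regex": "^\\{.*\\}$",
            "required_fields": ["severity", "category"]}
good = '{"severity": "critical", "category": "outage"}'
failed = check_case(good, expected)  # []
```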
| Category | % of Suite | Purpose |
|---|---|---|
| Happy path | 40% | Confirm basic functionality works |
| Edge cases | 30% | Boundary conditions, unusual inputs |
| Adversarial | 15% | Inputs designed to break the prompt |
| Regression | 15% | Cases that previously failed |
| Dimension | Measurement | Weight |
|---|---|---|
| Adherence | Contains required elements, matches schema | 30% |
| Accuracy | Correct classification/analysis/answer | 30% |
| Safety | No forbidden content, no hallucinations | 20% |
| Format | Matches expected structure, length bounds | 10% |
| Relevance | Response addresses the actual input | 10% |
score = (adherence * 0.30) + (accuracy * 0.30) + (safety * 0.20) + (format * 0.10) + (relevance * 0.10)
Pass threshold: 0.80
Warning threshold: 0.70
Fail threshold: < 0.70
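The weighted score and thresholds above translate directly to code:

```python
WEIGHTS = {"adherence": 0.30, "accuracy": 0.30, "safety": 0.20,
           "format": 0.10, "relevance": 0.10}

def composite_score(dims: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(dims[d] * w for d, w in WEIGHTS.items())

def verdict(score: float) -> str:
    if score >= 0.80:
        return "pass"
    return "warn" if score >= 0.70 else "fail"

dims = {"adherence": 0.9, "accuracy": 0.8, "safety": 1.0,
        "format": 0.7, "relevance": 0.9}
score = composite_score(dims)  # 0.27 + 0.24 + 0.20 + 0.07 + 0.09 = 0.87
```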
1. Before any prompt change:
- Run full test suite against current prompt (baseline)
- Record scores per test case
2. After prompt change:
- Run same test suite against new prompt (candidate)
- Compare scores per test case
3. Acceptance criteria:
- Average score: candidate >= baseline
- No individual test case drops by more than 10%
- Zero safety violations (any safety failure = reject)
- If criteria met: promote candidate
- If criteria not met: iterate on prompt or reject
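The acceptance criteria above can be encoded as a single gate function; this is a sketch that assumes per-case scores are keyed by `test_id` and safety failures are counted separately:

```python
def accept_candidate(baseline: dict, candidate: dict,
                     safety_failures: int, max_drop: float = 0.10) -> bool:
    """Promote only if the average holds, no case drops >10%, and safety is clean."""
    if safety_failures > 0:
        return False  # any safety failure = reject
    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate.values()) / len(candidate)
    if cand_avg < base_avg:
        return False
    for test_id, base_score in baseline.items():
        if candidate[test_id] < base_score * (1 - max_drop):
            return False  # individual regression exceeds the allowed drop
    return True

baseline = {"t1": 0.85, "t2": 0.90}
candidate = {"t1": 0.88, "t2": 0.89}
promoted = accept_candidate(baseline, candidate, safety_failures=0)  # True
```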
prompts/
├── support-classifier/
│ ├── v1.txt # Original version
│ ├── v2.txt # Added edge case handling
│ ├── v3.txt # Current production
│ ├── changelog.md # Change log with rationale
│ └── tests/
│ ├── suite.json # Test cases
│ └── baselines/
│ ├── v1-results.json
│ ├── v2-results.json
│ └── v3-results.json
├── code-reviewer/
│ ├── v1.txt
│ └── ...
## v3 (2026-03-09)
**Author:** borghei
**Change:** Added explicit handling for multi-language inputs
**Reason:** v2 defaulted to English analysis for non-English code comments
**Test results:** Average score 0.87 (v2 was 0.82). No regressions.
**Rollback plan:** Revert to v2.txt
## v2 (2026-02-15)
**Author:** borghei
**Change:** Added structured output format with JSON schema
**Reason:** Downstream parser needed consistent format
**Test results:** Average score 0.82 (v1 was 0.79). Format compliance 100% (v1 was 73%).
Before deploying a new version, always diff:
Key questions for prompt diffs:
1. Were any constraints removed? (Risk: safety regression)
2. Were any examples changed? (Risk: calibration shift)
3. Was the output format changed? (Risk: downstream parser breaks)
4. Were any anti-patterns removed? (Risk: known failure modes return)
5. Is the new prompt longer? (Risk: context budget impact)
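A pre-deployment diff needs nothing beyond the standard library; flagging removed lines that look like constraints is a heuristic sketch covering question 1 above (the keyword list is an assumption to adapt to your prompts):

```python
import difflib

def prompt_diff(old: str, new: str) -> dict:
    """Unified diff plus a flag for removed lines that look like constraints."""
    diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                     lineterm=""))
    removed = [line[1:] for line in diff
               if line.startswith("-") and not line.startswith("---")]
    keywords = ("never", "must", "only", "critical")  # heuristic constraint markers
    risky = [line for line in removed
             if any(k in line.lower() for k in keywords)]
    return {"diff": diff, "removed_constraints": risky}

old = "You are a reviewer.\nNever fabricate citations.\nRespond in JSON."
new = "You are a reviewer.\nRespond in JSON."
report = prompt_diff(old, new)
```

A non-empty `removed_constraints` list is exactly the safety-regression risk that question 1 asks about.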
| Failure Mode | Symptom | Fix |
|---|---|---|
| Instruction override | Model ignores constraints | Move constraints earlier, add "CRITICAL:" prefix |
| Format drift | Output structure varies between calls | Add JSON schema, reduce temperature |
| Sycophancy | Model agrees with wrong premise | Add "Challenge assumptions" instruction |
| Verbosity bloat | Output too long, buries the answer | Add word/token limits, "be concise" |
| Hallucination | Fabricated facts, citations, or code | Add "Only reference provided context" |
| Anchoring | First example dominates output style | Diversify examples, add "each input is independent" |
| Lost in the middle | Middle instructions get ignored | Front-load and back-load critical instructions |
1. Define the task precisely (input type, output type, quality criteria)
2. Write the system prompt using the 6-layer architecture
3. Create 10+ test cases (40% happy, 30% edge, 15% adversarial, 15% regression)
4. Run test suite, score results
5. Iterate until passing threshold (0.80+)
6. Version as v1, record baseline scores
7. Deploy with monitoring
1. Identify which test cases are failing
2. Categorize failures (format? accuracy? safety? relevance?)
3. Check: did the model change? (API version, model update)
4. Check: did the input distribution change? (new edge cases)
5. Check: was the prompt modified? (diff against last known good)
6. Fix the root cause (not the symptom)
7. Run full regression suite before deploying fix
1. Run full test suite on current model (baseline)
2. Run same suite on new model (no prompt changes)
3. Compare: if scores are equivalent, done
4. If scores drop: identify which dimensions degraded
5. Adjust prompt for new model's behavior patterns
6. Re-run suite until scores meet or exceed baseline
7. Document model-specific adjustments in changelog
| Skill | Integration |
|---|---|
| self-improving-agent | Prompts that degrade are a regression signal; test them |
| agent-designer | Agent system prompts are the highest-stakes prompts to test |
| context-engine | Context retrieval quality directly affects prompt effectiveness |
| ab-test-setup | A/B test prompt variants in production with statistical rigor |
references/prompt-patterns-catalog.md - Complete catalog of prompting techniques with examples
references/evaluation-rubric-templates.md - Reusable evaluation rubrics by task type
references/model-specific-behaviors.md - Known behavior differences across model families