promptfoo-evaluation by daymade/claude-code-skills
npx skills add https://github.com/daymade/claude-code-skills --skill promptfoo-evaluation
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml # Main configuration
├── prompts/
│ ├── system.md # System prompt
│ └── chat.json # Chat format prompt
├── tests/
│ └── cases.yaml # Test cases
└── scripts/
└── metrics.py # Custom Python assertions
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"
# Prompts to test
prompts:
- file://prompts/system.md
- file://prompts/chat.json
# Models to compare
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
- id: openai:gpt-4.1
label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
maxConcurrency: 2
# Default assertions for all tests
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:custom_assert
- type: llm-rubric
value: |
Evaluate the response quality on a 0-1 scale.
threshold: 0.7
# Output path
outputPath: results/eval-results.json
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "{{user_input}}"}
]
Embed examples directly in the prompt, or use a chat format with assistant messages:
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "Example input: {{example_input}}"},
{"role": "assistant", "content": "{{example_output}}"},
{"role": "user", "content": "Now process: {{actual_input}}"}
]
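Promptfoo renders the {{...}} placeholders with Nunjucks at eval time. For a quick local sanity check of a chat template, a regex-based substitution is a reasonable approximation — this is a sketch, not promptfoo's actual renderer:

```python
import json
import re

def render(template: str, variables: dict) -> str:
    """Naive {{var}} substitution. Promptfoo itself uses Nunjucks,
    so this only approximates simple variable references."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

# Mirror the chat-format prompt above as a JSON string
chat_prompt = json.dumps([
    {"role": "system", "content": "{{system_prompt}}"},
    {"role": "user", "content": "Now process: {{actual_input}}"},
])

rendered = json.loads(render(chat_prompt, {
    "system_prompt": "You are a helpful assistant.",
    "actual_input": "Hello world",
}))
print(rendered[1]["content"])  # -> Now process: Hello world
```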
- description: "Test case 1"
vars:
system_prompt: file://prompts/system.md
user_input: "Hello world"
# Load content from files
context: file://data/context.txt
assert:
- type: contains
value: "expected text"
- type: python
value: file://scripts/metrics.py:custom_check
threshold: 0.8
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
"""Default assertion function."""
vars_dict = context.get('vars', {})
# Access test variables
expected = vars_dict.get('expected', '')
# Return result
return {
"pass": expected in output,
"score": 0.8,
"reason": "Contains expected content",
"named_scores": {"relevance": 0.9}
}
def custom_check(output: str, context: dict) -> dict:
"""Custom named assertion."""
word_count = len(output.split())
passed = 100 <= word_count <= 500
return {
"pass": passed,
"score": min(1.0, word_count / 300),
"reason": f"Word count: {word_count}"
}
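Assertion functions don't have to return a dict: a bare float is treated as the score, with pass/fail decided by the assertion's threshold in the YAML. A minimal sketch (the keywords variable is illustrative, not a promptfoo built-in):

```python
def keyword_coverage(output: str, context: dict) -> float:
    """Float-returning assertion: promptfoo uses the float as the score;
    pass/fail comes from the 'threshold' set in the YAML."""
    # 'keywords' is a hypothetical test variable for this example
    keywords = context["vars"].get("keywords", ["promptfoo", "eval"])
    hits = sum(1 for k in keywords if k.lower() in output.lower())
    return hits / len(keywords) if keywords else 0.0

print(keyword_coverage("Run the eval with promptfoo", {"vars": {}}))  # -> 1.0
```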
Key points:
- The default assertion function name is get_assert; use file://path.py:function_name to target a different function
- Return bool, float (score), or a dict with pass/score/reason
- Access test variables via context['vars']
assert:
- type: llm-rubric
value: |
Evaluate the response based on:
1. Accuracy of information
2. Clarity of explanation
3. Completeness
Score 0.0-1.0 where 0.7+ is passing.
threshold: 0.7
provider: openai:gpt-4.1 # Optional: override grader model
When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
assert:
- type: llm-rubric
value: |
Evaluate quality on a 0-1 scale.
threshold: 0.7
provider:
id: anthropic:messages:claude-sonnet-4-6
config:
apiBaseUrl: https://your-relay.example.com/api
Best practices:
- Use threshold to set the minimum passing score
- With a relay API, each llm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited

| Type | Usage | Example |
|---|---|---|
| contains | Check substring | value: "hello" |
| icontains | Case-insensitive | value: "HELLO" |
| equals | Exact match | value: "42" |
| regex | Pattern match | value: "\\d{4}" |
| python | Custom logic | value: file://script.py |
| llm-rubric | LLM grading | value: "Is professional" |
| latency | Response time | threshold: 1000 |
All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.
# Load file content as variable
vars:
content: file://data/input.txt
# Load prompt from file
prompts:
- file://prompts/main.md
# Load test cases from file
tests: file://tests/cases.yaml
# Load Python assertion
assert:
- type: python
value: file://scripts/check.py:validate
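Because every file:// reference resolves from the config's directory, a small pre-flight script can catch broken paths before burning API calls. This sketch scans the raw config text rather than parsing YAML, so it is rough (it will also match references inside comments):

```python
import re
from pathlib import Path

def missing_file_refs(config_path: str) -> list:
    """Scan a promptfoo config for file:// references and report any
    that don't exist relative to the config's own directory."""
    base = Path(config_path).resolve().parent
    text = Path(config_path).read_text(encoding="utf-8")
    missing = []
    for ref in re.findall(r"file://([^\s\"']+)", text):
        # Drop the ':function_name' suffix used by Python assertions
        rel = ref.split(":", 1)[0]
        if not (base / rel).exists():
            missing.append(rel)
    return missing
```

Run it before eval; an empty list means every referenced file resolves.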
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
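The JSON written by --output can be summarized without extra tooling. The field names below (results.results[*].success) match recent promptfoo output files but have changed across versions, so treat this as a sketch and verify against your own file:

```python
import json

def pass_rate(results_path: str) -> float:
    """Rough pass rate from a promptfoo --output JSON file.
    Assumes the {'results': {'results': [{'success': bool}, ...]}}
    shape used by recent promptfoo versions -- verify against yours."""
    with open(results_path, encoding="utf-8") as f:
        data = json.load(f)
    rows = data.get("results", {}).get("results", [])
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get("success")) / len(rows)
```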
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
config:
max_tokens: 4096
apiBaseUrl: https://your-relay.example.com/api # Promptfoo appends /v1/messages
# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
maxConcurrency: 1 # Respect relay rate limits
Key rules:
apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automaticallymaxConcurrency must be under commandLineOptions: — placing it at top level is silently ignoredmaxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)ANTHROPIC_API_KEY env varPython not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated: Outputs over 30000 characters are truncated. Use head_limit in assertions.
File not found errors: All file:// paths resolve relative to promptfooconfig.yaml location.
maxConcurrency ignored (shows "up to N at a time"): maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.
LLM-as-judge returns 401 with relay API: Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.
HTML tags in model output inflating metrics: Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:
import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
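The regex is usually sufficient, but it can misfire on '>' inside attribute values or on malformed markup. The stdlib html.parser module is a more robust alternative, sketched here:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are dropped."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text: str) -> str:
    parser = _TextExtractor()
    parser.feed(text)
    return "".join(parser.parts)

print(strip_html("before <b>bold</b> after<br>"))  # -> before bold after
```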
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
- echo # Returns prompt as output, no API calls
tests:
- vars:
input: "test content"
Use cases:
Preview prompt rendering before expensive API calls
Verify few-shot examples are loaded correctly
Debug variable substitution issues
Validate prompt structure
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
For complex few-shot learning with full examples:
[
{"role": "system", "content": "{{system_prompt}}"},
// Few-shot Example 1
{"role": "user", "content": "Task: {{example_input_1}}"},
{"role": "assistant", "content": "{{example_output_1}}"},
// Few-shot Example 2 (optional)
{"role": "user", "content": "Task: {{example_input_2}}"},
{"role": "assistant", "content": "{{example_output_2}}"},
// Actual test
{"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
- vars:
system_prompt: file://prompts/system.md
# Few-shot examples
example_input_1: file://data/examples/input1.txt
example_output_1: file://data/examples/output1.txt
example_input_2: file://data/examples/input2.txt
example_output_2: file://data/examples/output2.txt
# Actual test
actual_input: file://data/test1.txt
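A missing or empty example file will break the run or silently weaken the few-shot prompt, so a pre-flight check is cheap insurance. The paths below mirror the hypothetical layout above; adjust them to your project:

```python
from pathlib import Path

# Mirrors the hypothetical few-shot layout above; adjust as needed.
FEW_SHOT_FILES = [
    "data/examples/input1.txt",
    "data/examples/output1.txt",
    "data/examples/input2.txt",
    "data/examples/output2.txt",
]

def bad_example_files(base_dir: str = ".") -> list:
    """Return few-shot example files that are missing or empty."""
    base = Path(base_dir)
    return [p for p in FEW_SHOT_FILES
            if not (base / p).is_file() or (base / p).stat().st_size == 0]
```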
Best practices:
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
- id: anthropic:messages:claude-sonnet-4-6
config:
max_tokens: 8192 # Increase for long outputs
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:check_length
Python assertion for text metrics:
import re
def strip_tags(text: str) -> str:
"""Remove HTML tags for pure text."""
return re.sub(r'<[^>]+>', '', text)
def check_length(output: str, context: dict) -> dict:
"""Check output length constraints."""
raw_input = context['vars'].get('raw_input', '')
input_len = len(strip_tags(raw_input))
output_len = len(strip_tags(output))
reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
return {
"pass": 0.7 <= reduction_ratio <= 0.9,
"score": reduction_ratio,
"reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
"named_scores": {
"input_length": input_len,
"output_length": output_len,
"reduction_ratio": reduction_ratio
}
}
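As a worked example of the reduction metric: a 1,000-character input cut to a 200-character output gives a ratio of 1 - 200/1000 = 0.8, inside the 70-90% pass window. The logic from check_length is restated here so the snippet runs standalone:

```python
import re

def strip_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def check_length(output: str, context: dict) -> dict:
    # Same logic as scripts/metrics.py above, restated to run standalone.
    raw_input = context["vars"].get("raw_input", "")
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    return {"pass": 0.7 <= ratio <= 0.9, "score": ratio}

# 1000-char input reduced to a 200-char output -> ratio 0.8
result = check_length("y" * 200, {"vars": {"raw_input": "x" * 1000}})
print(result["pass"], round(result["score"], 2))  # -> True 0.8
```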
Project: Chinese short-video content curation from long transcripts
Structure:
tiaogaoren/
├── promptfooconfig.yaml # Production config
├── promptfooconfig-preview.yaml # Preview config (echo provider)
├── prompts/
│ ├── tiaogaoren-prompt.json # Chat format with few-shot
│ └── v4/system-v4.md # System prompt
├── tests/cases.yaml # 3 test samples
├── scripts/metrics.py # Custom metrics (reduction ratio, etc.)
├── data/ # 5 samples (2 few-shot, 3 eval)
└── results/
See: ./tiaogaoren/ (example project root) for full implementation.
For detailed API reference and advanced patterns, see references/promptfoo_api.md.
Weekly Installs: 156
GitHub Stars: 708
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: claude-code (134), opencode (132), codex (128), gemini-cli (124), cursor (119), github-copilot (119)