promptfoo-evaluation by daymade/claude-code-skills
npx skills add https://github.com/daymade/claude-code-skills --skill promptfoo-evaluation
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml # Main configuration
├── prompts/
│ ├── system.md # System prompt
│ └── chat.json # Chat format prompt
├── tests/
│ └── cases.yaml # Test cases
└── scripts/
└── metrics.py # Custom Python assertions
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"
# Prompts to test
prompts:
- file://prompts/system.md
- file://prompts/chat.json
# Models to compare
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
- id: openai:gpt-4.1
label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
maxConcurrency: 2
# Default assertions for all tests
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:custom_assert
- type: llm-rubric
value: |
Evaluate the response quality on a 0-1 scale.
threshold: 0.7
# Output path
outputPath: results/eval-results.json
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "{{user_input}}"}
]
Embed examples directly in the prompt, or use a chat format with assistant messages:
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "Example input: {{example_input}}"},
{"role": "assistant", "content": "{{example_output}}"},
{"role": "user", "content": "Now process: {{actual_input}}"}
]
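Promptfoo renders the {{...}} placeholders with Nunjucks at eval time. For a quick local sanity check of a chat template, a regex-based substitution is a reasonable approximation — this is a sketch, not promptfoo's actual renderer:

```python
import json
import re

def render(template: str, variables: dict) -> str:
    """Naive {{var}} substitution. Promptfoo itself uses Nunjucks,
    so this only approximates simple variable references."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

# Mirror the chat-format prompt above as a JSON string
chat_prompt = json.dumps([
    {"role": "system", "content": "{{system_prompt}}"},
    {"role": "user", "content": "Now process: {{actual_input}}"},
])

rendered = json.loads(render(chat_prompt, {
    "system_prompt": "You are a helpful assistant.",
    "actual_input": "Hello world",
}))
print(rendered[1]["content"])  # -> Now process: Hello world
```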
- description: "Test case 1"
vars:
system_prompt: file://prompts/system.md
user_input: "Hello world"
# Load content from files
context: file://data/context.txt
assert:
- type: contains
value: "expected text"
- type: python
value: file://scripts/metrics.py:custom_check
threshold: 0.8
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
"""Default assertion function."""
vars_dict = context.get('vars', {})
# Access test variables
expected = vars_dict.get('expected', '')
# Return result
return {
"pass": expected in output,
"score": 0.8,
"reason": "Contains expected content",
"named_scores": {"relevance": 0.9}
}
def custom_check(output: str, context: dict) -> dict:
"""Custom named assertion."""
word_count = len(output.split())
passed = 100 <= word_count <= 500
return {
"pass": passed,
"score": min(1.0, word_count / 300),
"reason": f"Word count: {word_count}"
}
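Assertion functions don't have to return a dict: a bare float is treated as the score, with pass/fail decided by the assertion's threshold in the YAML. A minimal sketch (the keywords variable is illustrative, not a promptfoo built-in):

```python
def keyword_coverage(output: str, context: dict) -> float:
    """Float-returning assertion: promptfoo uses the float as the score;
    pass/fail comes from the 'threshold' set in the YAML."""
    # 'keywords' is a hypothetical test variable for this example
    keywords = context["vars"].get("keywords", ["promptfoo", "eval"])
    hits = sum(1 for k in keywords if k.lower() in output.lower())
    return hits / len(keywords) if keywords else 0.0

print(keyword_coverage("Run the eval with promptfoo", {"vars": {}}))  # -> 1.0
```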
Key points:
- The default assertion function name is get_assert; use file://path.py:function_name to target a different function
- Return bool, float (score), or a dict with pass/score/reason
- Access test variables via context['vars']
assert:
- type: llm-rubric
value: |
Evaluate the response based on:
1. Accuracy of information
2. Clarity of explanation
3. Completeness
Score 0.0-1.0 where 0.7+ is passing.
threshold: 0.7
provider: openai:gpt-4.1 # Optional: override grader model
When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
assert:
- type: llm-rubric
value: |
Evaluate quality on a 0-1 scale.
threshold: 0.7
provider:
id: anthropic:messages:claude-sonnet-4-6
config:
apiBaseUrl: https://your-relay.example.com/api
Best practices:
- Use threshold to set the minimum passing score
- With a relay API, each llm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited

| Type | Usage | Example |
|---|---|---|
| contains | Check substring | value: "hello" |
| icontains | Case-insensitive | value: "HELLO" |
| equals | Exact match | value: "42" |
| regex | Pattern match | value: "\\d{4}" |
| python | Custom logic | value: file://script.py |
| llm-rubric | LLM grading | value: "Is professional" |
| latency | Response time | threshold: 1000 |
All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.
# Load file content as variable
vars:
content: file://data/input.txt
# Load prompt from file
prompts:
- file://prompts/main.md
# Load test cases from file
tests: file://tests/cases.yaml
# Load Python assertion
assert:
- type: python
value: file://scripts/check.py:validate
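Because every file:// reference resolves from the config's directory, a small pre-flight script can catch broken paths before burning API calls. This sketch scans the raw config text rather than parsing YAML, so it is rough (it will also match references inside comments):

```python
import re
from pathlib import Path

def missing_file_refs(config_path: str) -> list:
    """Scan a promptfoo config for file:// references and report any
    that don't exist relative to the config's own directory."""
    base = Path(config_path).resolve().parent
    text = Path(config_path).read_text(encoding="utf-8")
    missing = []
    for ref in re.findall(r"file://([^\s\"']+)", text):
        # Drop the ':function_name' suffix used by Python assertions
        rel = ref.split(":", 1)[0]
        if not (base / rel).exists():
            missing.append(rel)
    return missing
```

Run it before eval; an empty list means every referenced file resolves.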
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
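The JSON written by --output can be summarized without extra tooling. The field names below (results.results[*].success) match recent promptfoo output files but have changed across versions, so treat this as a sketch and verify against your own file:

```python
import json

def pass_rate(results_path: str) -> float:
    """Rough pass rate from a promptfoo --output JSON file.
    Assumes the {'results': {'results': [{'success': bool}, ...]}}
    shape used by recent promptfoo versions -- verify against yours."""
    with open(results_path, encoding="utf-8") as f:
        data = json.load(f)
    rows = data.get("results", {}).get("results", [])
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get("success")) / len(rows)
```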
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
config:
max_tokens: 4096
apiBaseUrl: https://your-relay.example.com/api # Promptfoo appends /v1/messages
# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
maxConcurrency: 1 # Respect relay rate limits
Key rules:
apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automaticallymaxConcurrency must be under commandLineOptions: — placing it at top level is silently ignoredmaxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)ANTHROPIC_API_KEY env varPython not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated: Outputs over 30000 characters are truncated. Use head_limit in assertions.
File not found errors: All file:// paths resolve relative to promptfooconfig.yaml location.
maxConcurrency ignored (shows "up to N at a time"): maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.
LLM-as-judge returns 401 with relay API: Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.
HTML tags in model output inflating metrics: Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:
import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
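The regex is usually sufficient, but it can misfire on '>' inside attribute values or on malformed markup. The stdlib html.parser module is a more robust alternative, sketched here:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are dropped."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text: str) -> str:
    parser = _TextExtractor()
    parser.feed(text)
    return "".join(parser.parts)

print(strip_html("before <b>bold</b> after<br>"))  # -> before bold after
```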
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
- echo # Returns prompt as output, no API calls
tests:
- vars:
input: "test content"
Use cases:
Preview prompt rendering before expensive API calls
Verify few-shot examples are loaded correctly
Debug variable substitution issues
Validate prompt structure
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
For complex few-shot learning with full examples:
[
{"role": "system", "content": "{{system_prompt}}"},
// Few-shot Example 1
{"role": "user", "content": "Task: {{example_input_1}}"},
{"role": "assistant", "content": "{{example_output_1}}"},
// Few-shot Example 2 (optional)
{"role": "user", "content": "Task: {{example_input_2}}"},
{"role": "assistant", "content": "{{example_output_2}}"},
// Actual test
{"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
- vars:
system_prompt: file://prompts/system.md
# Few-shot examples
example_input_1: file://data/examples/input1.txt
example_output_1: file://data/examples/output1.txt
example_input_2: file://data/examples/input2.txt
example_output_2: file://data/examples/output2.txt
# Actual test
actual_input: file://data/test1.txt
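A missing or empty example file will break the run or silently weaken the few-shot prompt, so a pre-flight check is cheap insurance. The paths below mirror the hypothetical layout above; adjust them to your project:

```python
from pathlib import Path

# Mirrors the hypothetical few-shot layout above; adjust as needed.
FEW_SHOT_FILES = [
    "data/examples/input1.txt",
    "data/examples/output1.txt",
    "data/examples/input2.txt",
    "data/examples/output2.txt",
]

def bad_example_files(base_dir: str = ".") -> list:
    """Return few-shot example files that are missing or empty."""
    base = Path(base_dir)
    return [p for p in FEW_SHOT_FILES
            if not (base / p).is_file() or (base / p).stat().st_size == 0]
```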
Best practices:
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
- id: anthropic:messages:claude-sonnet-4-6
config:
max_tokens: 8192 # Increase for long outputs
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:check_length
Python assertion for text metrics:
import re
def strip_tags(text: str) -> str:
"""Remove HTML tags for pure text."""
return re.sub(r'<[^>]+>', '', text)
def check_length(output: str, context: dict) -> dict:
"""Check output length constraints."""
raw_input = context['vars'].get('raw_input', '')
input_len = len(strip_tags(raw_input))
output_len = len(strip_tags(output))
reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
return {
"pass": 0.7 <= reduction_ratio <= 0.9,
"score": reduction_ratio,
"reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
"named_scores": {
"input_length": input_len,
"output_length": output_len,
"reduction_ratio": reduction_ratio
}
}
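As a worked example of the reduction metric: a 1,000-character input cut to a 200-character output gives a ratio of 1 - 200/1000 = 0.8, inside the 70-90% pass window. The logic from check_length is restated here so the snippet runs standalone:

```python
import re

def strip_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def check_length(output: str, context: dict) -> dict:
    # Same logic as scripts/metrics.py above, restated to run standalone.
    raw_input = context["vars"].get("raw_input", "")
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    return {"pass": 0.7 <= ratio <= 0.9, "score": ratio}

# 1000-char input reduced to a 200-char output -> ratio 0.8
result = check_length("y" * 200, {"vars": {"raw_input": "x" * 1000}})
print(result["pass"], round(result["score"], 2))  # -> True 0.8
```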
Project: Chinese short-video content curation from long transcripts
Structure:
tiaogaoren/
├── promptfooconfig.yaml # Production config
├── promptfooconfig-preview.yaml # Preview config (echo provider)
├── prompts/
│ ├── tiaogaoren-prompt.json # Chat format with few-shot
│ └── v4/system-v4.md # System prompt
├── tests/cases.yaml # 3 test samples
├── scripts/metrics.py # Custom metrics (reduction ratio, etc.)
├── data/ # 5 samples (2 few-shot, 3 eval)
└── results/
See: ./tiaogaoren/ (example project root) for full implementation.
For detailed API reference and advanced patterns, see references/promptfoo_api.md.
Weekly Installs: 156
GitHub Stars: 708
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: claude-code (134), opencode (132), codex (128), gemini-cli (124), cursor (119), github-copilot (119)