Python LLM应用评估驱动开发指南：使用pixie-qa建立自动化测试流水线

eval-driven-dev by github/awesome-copilot

317 周安装量

26,700 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/github/awesome-copilot --skill eval-driven-dev

AI/机器学习自动化测试

🇨🇳中文介绍

Python LLM 应用的评估驱动开发

本技能注重实际操作，而非描述工作。当用户要求你为他们的应用设置评估时，你应该阅读他们的代码、编辑文件、运行命令并建立一个可运行的测试流水线——而不是为他们制定一个后续执行的计划。

启动检查（始终优先执行）

尝试升级用户环境中的 pixie-qa 包。根据项目检测包管理器（检查 uv.lock、poetry.lock、requirements.txt 或普通的 pip 环境）并运行相应的升级命令：

uv : uv add pixie-qa --upgrade (或 uv sync --upgrade-package pixie-qa)
poetry : poetry add pixie-qa@latest

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

设置与迭代：何时停止

这至关重要。 你的操作取决于用户的要求。

"设置 QA" / "设置评估" / "添加测试"（设置意图）

用户想要一个可运行的评估流水线。你的工作是完成阶段 0-7：安装、理解、插桩、构建运行工具、捕获真实跟踪、编写测试、构建数据集、运行测试。在首次测试运行后停止，无论测试通过与否。报告：

你设置了什么（插桩、运行工具、测试文件、数据集）
测试结果（通过/失败、分数）
如果测试失败：简要总结失败内容和可能原因——但不要修复任何问题

然后询问："QA 设置已完成。测试显示 N/M 通过。希望我调查失败原因并开始迭代吗？"

只有在用户确认后，才继续阶段 8（调查和修复）。

例外情况：如果测试运行本身出错（导入失败、缺少 API 密钥、配置错误）——这些是设置问题，而非评估失败。修复它们并重新运行，直到获得干净的测试执行，其中通过/失败反映的是实际应用质量，而非损坏的流水线。

"修复" / "改进" / "调试" / "为什么 X 失败"（迭代意图）

用户希望你调查并修复。继续执行所有阶段，包括阶段 8——调查失败、根本原因分析、应用修复、重建数据集、重新运行测试、迭代。

如果意图不明确，默认仅进行设置，并在迭代前询问。宁愿提前停止并询问，也不要对用户的应用代码进行不必要的更改。

硬性关卡：何时停止并让用户参与

有些障碍无法绕过。当你遇到时，停止工作并告诉用户你需要什么——不要猜测、伪造数据或跳过后续阶段。

缺少 API 密钥或凭据

如果应用或评估器需要 API 密钥（例如 OPENAI_API_KEY），而环境或 .env 文件中未设置，请明确告诉用户缺少哪个密钥，并等待他们提供。不要：

继续运行应用或评估（它们会失败）
硬编码占位符密钥
跳过到后续阶段，希望这无关紧要

无法从脚本运行应用

如果在阅读代码（阶段 1）后，你无法弄清楚如何从独立脚本调用应用的核心 LLM 调用函数——因为它需要运行中的服务器、webhook 触发器、复杂的身份验证或你无法模拟的外部基础设施——停止并询问用户：

"我已确定 <file> 中的 <function_name> 是需要评估的核心函数，但它需要 <dependency>，我无法轻松模拟。你能 (a) 展示如何独立调用此函数，或 (b) 自己使用一些代表性输入运行应用，以便我捕获跟踪吗？"

运行工具执行期间应用出错

如果运行工具脚本（阶段 4）出错，并且你在两次尝试后无法修复，请停止并与用户分享错误。常见的障碍包括数据库连接、缺少配置文件、身份验证/OAuth 流程以及硬件特定的依赖项。

为什么停止很重要

每个后续阶段都依赖于从实际应用中获取的真实跟踪。如果你无法运行应用，就无法捕获跟踪。如果你无法捕获跟踪，就无法构建真实的数据集。如果你伪造数据集，整个评估流水线测试的就是虚构内容，而非用户的应用。宁愿尽早停止并寻求用户帮助，也不要产生一个测试错误内容的评估流水线。

评估边界：评估什么

评估驱动开发专注于依赖于 LLM 的行为。 其目的是捕获系统中非确定性且难以用传统单元测试测试的部分的质量回归——即 LLM 调用及其驱动的决策。

范围内（评估这些）

LLM 响应质量：事实准确性、相关性、格式合规性、安全性
智能体路由决策：LLM 是否选择了正确的工具/交接/操作？
提示有效性：提示是否产生了期望的行为？
多轮对话连贯性：智能体是否在轮次间保持上下文？

范围外（不要用评估来测试这些）

工具实现（数据库查询、API 调用、关键词匹配、业务逻辑）——这些是传统软件；用单元测试测试它们
基础设施（身份验证、速率限制、缓存、序列化）
确定性后处理（格式化、过滤、排序结果）

边界是：LLM 调用下游的一切（工具、数据库、API）产生确定性输出，这些输出作为 LLM 驱动系统的输入。评估测试应将这些视为给定事实，并专注于 LLM 如何处理它们。

示例：如果一个 FAQ 工具有一个关键词匹配错误，返回了错误数据，那是一个传统错误——通过常规代码更改来修复，而不是调整评估阈值。评估测试的存在是为了验证_在给定正确的工具输出时_，LLM 智能体能产生正确的面向用户的响应。

在构建数据集和期望输出时，使用实际工具/系统的输出作为基准事实。评估用例的期望输出应反映_给定系统实际产生的工具结果_时，正确的 LLM 响应是什么样子。

阶段 0：确保 pixie-qa 已安装且 API 密钥已设置

在执行任何其他操作之前，检查 pixie-qa 包是否可用：

python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"

如果未安装，则安装它：

pip install pixie-qa

这提供了 pixie Python 模块、pixie CLI 和 pixie test 运行器——这些都是插桩和评估所必需的。不要跳过此步骤；本技能中的所有其他内容都依赖于它。

被测应用几乎肯定需要 LLM 提供商的 API 密钥（例如 OPENAI_API_KEY、ANTHROPIC_API_KEY）。像 FactualityEval 这样的 LLM 作为评判者的评估器也需要 OPENAI_API_KEY。在运行任何东西之前，验证密钥是否已设置：

[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"

如果密钥未设置：检查项目是否使用 .env 文件。如果是，请注意 python-dotenv 仅在应用显式调用 load_dotenv() 时加载 .env——shell 命令和 pixie CLI 将看不到来自 .env 的变量，除非它们被导出。告诉用户缺少哪个密钥以及如何设置它。不要继续运行应用或评估，除非确认 API 密钥已设置——否则你会得到失败，浪费了时间，看起来像是应用错误。

阶段 1：理解应用

在接触任何代码之前，花时间实际阅读源代码。代码比询问用户能告诉你更多信息，并且让你处于更好的位置来做出关于评估内容和方式的明智决策。

软件如何运行：入口点是什么？如何启动它？它是 CLI、服务器还是库函数？必需的参数、配置文件或环境变量是什么？
LLM 的所有输入：这不仅仅限于用户的消息。追踪每一个被纳入任何 LLM 提示的数据片段：
- 用户输入（查询、消息、上传的文件）
- 系统提示（硬编码或模板化的）
- 检索到的上下文（RAG 块、搜索结果、数据库记录）
- 工具定义和函数模式
- 对话历史 / 记忆
- 改变提示行为的配置或功能标志
所有中间步骤和输出：遍历从输入到最终输出的代码路径，并记录每个阶段：
- 检索 / 搜索结果
- 工具调用及其结果
- 智能体路由 / 交接决策
- 中间 LLM 调用（例如，最终答案前的摘要）
- 后处理或格式化步骤
最终输出：用户看到什么？是什么格式？质量期望是什么？
用例和期望行为：应用应该处理哪些不同的事情？对于每个用例，"好"的响应是什么样子？什么会构成失败？

识别评估边界函数

这是你将做出的最重要的决定，正确与否决定了评估流水线测试的是真实应用还是虚构内容。

评估边界函数是实际生产代码中的这样一个函数：

接收结构化输入（文本、字典、消息列表）——而非原始 HTTP 请求、音频流或 webhook 负载
调用 LLM（直接或通过一系列内部调用）
返回 LLM 的响应（或其处理后的版本）

此函数上游的一切（webhook 处理器、语音转文本处理、请求解析、身份验证、会话管理）在构建运行工具时将被模拟或绕过。此函数本身及下游的一切是你正在评估的真实代码。

示例：在一个 Twilio 语音 AI 应用中：

Twilio 发送带有音频的 webhook → 上游，模拟此
音频处理将语音转换为文本 → 上游，模拟此
从 Redis 加载通话状态 → 上游，模拟或简化此
agent.respond(user_text, conversation_history) 调用 LLM → 评估边界函数
响应文本被转换为语音 → 下游，不属于评估范围

示例：在一个 FastAPI RAG 聊天机器人中：

HTTP 端点接收 POST 请求 → 上游，绕过此
请求验证和身份验证 → 上游，绕过此
chatbot.answer(question, context) 检索文档并调用 LLM → 评估边界函数
响应被格式化为 JSON → 下游，不属于评估范围

示例：在一个简单的 CLI 问答工具中：

main() 从 stdin 读取用户输入 → 上游，绕过此
answer_question(question) 调用 LLM → 评估边界函数

在识别评估边界函数时，记录：

确切的函数名和文件位置
其签名（参数名和类型）
它需要哪些上游依赖项（客户端、配置对象、状态）
这些依赖项中哪些需要真实凭据，哪些可以模拟

如果你无法识别一个清晰的评估边界函数——如果 LLM 调用与无法分离的基础设施代码深度纠缠——停止并询问用户。参见上面的"硬性关卡"。

将你的发现写入 pixie_qa/MEMORY.md。这是评估工作的主要工作文档。它应该易于人类阅读，并且足够详细，以便不熟悉项目的人能够理解应用和评估策略。

关键：MEMORY.md 记录你对现有应用代码的理解。它绝不能包含对 pixie 命令、你计划添加的插桩代码或尚不存在的脚本/函数的引用。 这些属于后面的部分，只有在它们被实现之后才添加。

理解部分应包括：

# 评估笔记：<项目名称>

## 应用如何工作

### 入口点和执行流程

<描述如何启动/运行应用，逐步发生了什么>

### LLM 调用的输入

<对于代码库中的每个 LLM 调用，记录：>

- 它在代码中的位置（文件 + 函数名）
- 它使用什么系统提示（引用或总结）
- 什么用户/动态内容输入其中
- 哪些工具/函数对它可用

### 中间处理

<描述输入和输出之间的任何步骤：>
- 检索、路由、工具执行等。
- 为每个步骤包含代码指针（文件:行号）

### 最终输出

<用户看到什么，什么格式，质量门槛应该是什么>

### 用例

<列出应用处理的每个不同场景，以及好/坏输出的示例>

### 评估边界函数

- **函数**：`<类.方法 或 函数名>`
- **位置**：`<文件:行号>`
- **签名**：`<参数和返回类型>`
- **需要模拟的上游依赖项**：<列出独立执行需要模拟的内容>
- **为什么选择此边界**：<解释为什么这是要评估的正确函数>

## 评估计划

### 评估什么及原因

<质量维度：事实准确性、相关性、格式合规性、安全性等。>

### 评估粒度

<哪个函数/跨度边界捕获一个"测试用例"？为什么选择那个边界？>

### 评估器和标准

<对于每个评估测试，指定：评估器、数据集、阈值、推理>

### 评估所需的数据

<需要捕获哪些数据点，以及它们在代码中的位置指针>

如果代码中确实有不清楚的地方，询问用户——但大多数问题在你仔细阅读代码后都会自行解答。

阶段 2：决定评估什么

既然你理解了应用，就可以对测量内容做出深思熟虑的选择：

哪个质量维度最重要？ QA 应用的事实准确性，结构化提取的输出格式，RAG 的相关性，面向用户文本的安全性。
评估哪个跨度： 整个流水线（root）还是仅 LLM 调用（last_llm_call）？如果你在调试检索，你可能评估的点与检查最终答案质量时不同。
哪些评估器合适： 参见 references/pixie-api.md → Evaluators。对于事实性 QA：FactualityEval。对于结构化输出：ValidJSONEval / JSONDiffEval。对于 RAG 流水线：ContextRelevancyEval / FaithfulnessEval。
通过标准： ScoreThreshold(threshold=0.7, pct=0.8) 意味着 80% 的案例必须得分 ≥ 0.7。思考对于这个应用来说"足够好"是什么样子。
期望输出： FactualityEval 需要它们。格式评估器通常不需要。

在编写任何代码之前，用计划更新 pixie_qa/MEMORY.md。

阶段 3：对应用进行插桩

向现有的生产代码添加 pixie 插桩。目标是捕获已经是应用正常执行路径一部分的函数的输入和输出。插桩必须在真实的代码路径上——与应用在生产中使用时运行的代码相同——以便在评估运行和实际使用中都能捕获跟踪。

在应用启动时添加 `enable_storage()`

在应用启动代码的开头调用一次 enable_storage()——在 main() 内部，或在服务器初始化的顶部。绝不要在模块级别（文件顶部，在任何函数之外），因为这会导致存储设置在导入时触发。

在 if __name__ == "__main__": 块内部
在 FastAPI lifespan 或 on_startup 处理程序中
在 main() / run() 函数的顶部
在测试文件的 runnable 函数内部

✅ 正确 — 在应用启动时

async def main(): enable_storage() ...

✅ 正确 — 在测试的 runnable 中

def runnable(eval_input): enable_storage() my_function(**eval_input)

❌ 错误 — 在模块级别，导入时运行

from pixie import enable_storage enable_storage() # 当任何文件导入此模块时运行！

用 `@observe` 或 `start_observation` 包装现有函数

关键：对生产代码路径进行插桩。切勿为测试创建单独的函数或替代代码路径。

@observe 装饰器或 start_observation 上下文管理器应放在应用在正常操作期间实际调用的现有函数上。如果应用的入口点是一个交互式的 main() 循环，对 main() 或其每轮用户调用的核心函数进行插桩——而不是一个复制逻辑的新辅助函数。

# ✅ 正确 — 装饰现有的生产函数
from pixie import observe

@observe(name="answer_question")
def answer_question(question: str, context: str) -> str:  # 现有函数
    ...  # 现有代码，未更改



# ✅ 正确 — 在现有函数内部使用上下文管理器
from pixie import start_observation

async def main():  # 现有函数
    ...
    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
        result = await Runner.run(current_agent, input_items, context=context)
        # ... 现有的响应处理 ...
        obs.set_output(response_text)
    ...



# ❌ 错误 — 创建一个从 main() 复制逻辑的新函数
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
    # 这复制了 main() 的功能，创建了一个与生产环境不同的单独代码路径
    # 不要这样做。
    ...



# ❌ 错误 — 直接调用 LLM 而不是调用应用的函数
@observe(name="agent_answer_question")
def answer_question(question: str) -> str:
    # 这绕过了整个应用并直接调用 OpenAI。
    # 你正在测试你刚写的脚本，而不是用户的应用。
    response = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": question}],
    )
    return response.output_text

切勿为评估目的向应用代码添加新的包装函数。
切勿通过直接调用 LLM 提供者来绕过应用——如果你发现自己在测试或运行工具中编写 client.responses.create(...) 或 openai.ChatCompletion.create(...)，你并没有在测试应用。应该导入并调用应用自己的函数。
切勿更改函数的接口（参数、返回类型、行为）。
切勿将生产逻辑复制到单独的"可测试"函数中。
插桩纯粹是附加的——如果你移除所有 pixie 导入和装饰器，应用将完全一样工作。
插桩后，在运行结束时调用 flush() 以确保所有跨度都被写入。
对于交互式应用（CLI 循环、聊天界面），对每轮处理函数进行插桩——即接收用户输入并产生响应的函数。评估 runnable 应该调用这个相同的函数。

重要：所有 pixie 符号都可以从顶级 pixie 包导入。永远不要告诉用户从子模块导入（pixie.instrumentation、pixie.evals、pixie.storage.evaluable 等）——始终使用 from pixie import ...。

阶段 4：创建运行工具并验证跟踪

此阶段是一个硬性关卡。 在成功通过运行工具运行应用的真实代码并确认跟踪出现在数据库中之前，你无法继续编写测试或构建数据集。

运行工具是一个简短的脚本，它调用你在阶段 1 中识别的评估边界函数，绕过与 LLM 评估无关的外部基础设施。

如果评估边界函数是一个没有复杂依赖项的简单调用（例如，answer_question(question: str) -> str），那么工具可以非常简洁：

# pixie_qa/scripts/run_harness.py
from pixie import enable_storage, flush
from myapp import answer_question

enable_storage()
result = answer_question("What is the capital of France?")
print(f"Result: {result}")
flush()

运行它，验证跟踪出现，然后继续。

当应用有复杂依赖项时

大多数现实世界的应用需要更多设置。评估边界函数通常需要配置对象、数据库连接、API 客户端或状态对象才能运行。你的工作是模拟或存根最小必要的内容，以调用真实的生产函数。

# pixie_qa/scripts/run_harness.py
"""通过评估边界函数执行实际的应用代码。

模拟上游基础设施（webhooks、语音处理、通话状态等）
并使用代表性的文本输入调用真实的生产函数。
"""
from pixie import enable_storage, flush

# 如果项目使用 .env 存储 API 密钥，则加载它
from dotenv import load_dotenv
load_dotenv()

# 导入 ACTUAL 生产函数 — 不是副本，不是重新实现
from myapp.agents.llm.openai import OpenAILLM


def run_one_case(question: str) -> str:
    """使用最小化的模拟依赖项调用实际的生产函数。"""
    enable_storage()

    # 构建函数所需的最小上下文。
    # 使用真实的 API 客户端（需要真实密钥），模拟其他一切。
    llm = OpenAILLM(...)

    # 调用 ACTUAL 函数 — 生产环境使用的同一个
    result = llm.run_normal_ai_response(
        prompt=question,
        messages=[{"role": "user", "content": question}],
    )

    flush()
    return result


if __name__ == "__main__":
    test_inputs = [
        "What are your business hours?",
        "I need to update my account information.",
    ]
    for q in test_inputs:
        print(f"Q: {q}")
        print(f"A: {run_one_case(q)}")
        print("---")

运行工具的关键规则：

调用真实函数。 生产环境使用的同一个函数。如果你发现自己在工具中编写 client.responses.create(...) 或 openai.ChatCompletion.create(...)，而不是调用应用自己的函数，那么你正在绕过应用，测试完全不同的东西。
仅模拟上游基础设施。 数据库连接、webhook 负载、会话状态、音频处理——这些可以模拟或存根。LLM 调用本身必须是真实的，因为那正是你要评估的内容。
LLM API 密钥必须是真实的。 如果缺少，停止并询问用户。参见"硬性关卡"。
保持最小化。 这不是一个完整的集成测试。它是一种执行真实 LLM 调用代码路径并捕获跟踪的方式。
如果你在两次尝试后仍无法创建一个可工作的工具，停止并寻求用户帮助。

验证跟踪是否被捕获

运行工具后，验证跟踪是否实际被捕获：

python pixie_qa/scripts/run_harness.py

然后检查数据库：

import asyncio
from pixie import ObservationStore

async def check():
    store = ObservationStore()
    traces = await store.list_traces(limit=5)
    print(f"Found {len(traces)} traces")
    for t in traces:
        print(t)

asyncio.run(check())

需要检查什么：

数据库中至少出现一个跟踪
跟踪包含评估边界函数的跨度（跨度名称应与你在阶段 3 添加的 @observe(name=...) 匹配）
跨度已捕获具有合理值的 eval_input 和 eval_output

如果没有跟踪出现：

enable_storage() 是否在插桩函数运行之前被调用？
flush() 是否在函数返回后被调用？
@observe 装饰器是否在正确的函数上？
函数是否实际被执行（不仅仅是定义/导入）？

在数据库中看到来自实际应用的真实跟踪之前，不要继续阶段 5。 如果跟踪没有出现，立即调试问题或寻求用户帮助。这是一个设置问题，必须在其他任何事情之前解决。

阶段 5：编写评估测试文件

在构建数据集之前编写测试文件。这可能看起来是反过来的，但它迫使你在开始收集数据之前决定实际要测量什么——否则数据收集就没有方向。

创建 pixie_qa/tests/test_<feature>.py。模式是：一个调用应用现有生产函数的 runnable 适配器，加上一个调用 assert_dataset_pass 的异步测试函数：

from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """通过应用重放一个数据集项。

    调用生产应用使用的同一个函数。
    在此处 enable_storage() 确保在评估运行期间捕获跟踪。
    """
    enable_storage()
    answer_question(**eval_input)


async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )

注意 enable_storage() 属于 runnable 内部，而不是测试文件的模块级别——它需要在每次调用时触发，以便为该特定运行捕获跟踪。

runnable 导入并调用生产环境使用的同一个函数——你在阶段 1 识别并在阶段 4 验证的评估边界函数。如果 runnable 调用的函数与运行工具调用的函数不同，那就出了问题。

测试运行器是 pixie test（不是 pytest）：

pixie test                           # 运行当前目录中的所有 test_*.py
pixie test pixie_qa/tests/           # 指定路径
pixie test -k factuality             # 按名称过滤
pixie test -v                        # 详细模式：显示每个案例的分数和推理

pixie test 自动找到项目根目录（包含 pyproject.toml、setup.py 或 setup.cfg 的目录）并将其添加到 sys.path——就像 pytest 一样。测试文件中不需要 sys.path 技巧。

阶段 6：构建数据集

先决条件：你必须已成功运行应用并在阶段 4 验证了跟踪。如果你跳过了阶段 4 或它失败了，请返回——不要继续。

创建数据集，然后通过实际运行应用并使用代表性输入来填充它。数据集项必须包含从实际执行中捕获的真实应用输出。

pixie dataset create <dataset-name>
pixie dataset list   # 验证它存在

运行应用并将跟踪捕获到数据集

最简单的方法是将阶段 4 的运行工具扩展为数据集构建器。既然你已经有一个调用真实应用代码并产生跟踪的可工作脚本，就将其调整为保存结果：

# pixie_qa/scripts/build_dataset.py
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable

from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass

    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()

        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())

注意 eval_output=result 是运行应用的实际返回值——不是你手动输入的字符串。

或者，使用 CLI 进行逐案例捕获：

# 运行应用（enable_storage() 必须处于活动状态）
python -c "from myapp import main; main('What is the capital of France?')"

# 将根跨度保存到数据集
pixie dataset save <dataset-name>

# 或者专门保存最后一个 LLM 调用：
pixie dataset save <dataset-name> --select last_llm_call

# 添加上下文：
pixie dataset save <dataset-name> --notes "basic geography question"

# 为像 FactualityEval 这样的评估器附加期望输出：
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output

数据集构建的大忌

切勿手动伪造 eval_output 值。 如果你在数据集 JSON 文件中输入 "eval_output": "4" 而应用实际上并未产生该输出，那么数据集测试的就是虚构内容。伪造的数据集比没有数据集更糟糕，因为它给了错误的信心——用户认为他们的应用正在被测试，但事实并非如此。

如果你发现自己直接在 JSON 文件中编写或编辑 eval_output 值，请停止。回到阶段 4，运行应用，并捕获真实输出。

数据集构建的关键规则

每个 eval_output 必须来自评估边界函数的真实执行。没有例外。
为基于比较的评估器（如 FactualityEval）包含期望输出。期望输出应反映给定工具/系统实际返回的内容时，正确的 LLM 响应——而不是基于修复非 LLM 错误的理想化答案。
覆盖你关心的输入范围：正常案例、边缘案例、应用可能出错的情况。
当使用 pixie dataset save 时，可评估项的 eval_metadata 将自动包含 trace_id 和 span_id，以便后续调试。

阶段 7：运行测试

pixie test pixie_qa/tests/ -v

-v 标志显示每个案例的分数和推理，这使得查看什么通过、什么失败变得容易得多。根据你的 ScoreThreshold 检查通过率是否合理。

在此阶段之后，如果用户的意图是"设置"——停止。 报告结果并在继续前询问。参见上面的"设置与迭代"。

阶段 8：调查失败

仅当用户要求迭代/修复，或在设置后明确确认时，才继续此处。

当测试失败时，目标是理解_为什么_，而不是调整阈值直到通过。调查必须彻底并有文档记录——用户需要看到实际数据、你的推理和你的结论。

步骤 1：获取详细的测试输出

pixie test pixie_qa/tests/ -v    # 显示每个案例的分数和推理

捕获完整的详细输出。对于每个失败的案例，记录：

eval_input（发送了什么）
eval_output（应用产生了什么）
expected_output（期望什么，如果适用）
评估器分数和推理

步骤 2：检查跟踪数据

对于每个失败的案例，查找完整的跟踪以查看应用内部发生了什么：

from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)   # trace_id 在这里

然后检查完整的跨度树：

import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())   # 完整的跨度树：输入、输出、LLM 消息

asyncio.run(inspect("the-trace-id-here"))

步骤 3：根本原因分析

遍历跟踪并准确识别失败源自何处。常见模式：

与 LLM 相关的失败（通过提示/模型/评估更改修复）：

症状	可能原因
尽管工具结果正确，但输出事实错误	提示没有指示 LLM 忠实使用工具输出
智能体路由到错误的工具/交接	路由提示或交接描述不明确
输出格式错误	提示中缺少格式说明
LLM 产生幻觉而不是使用工具	提示没有强制使用工具

非 LLM 失败（通过传统代码更改修复，超出评估范围）：

症状	可能原因
工具返回了错误数据	工具实现中的错误——修复工具，而不是评估
由于关键词不匹配，工具根本没有被调用	工具选择逻辑损坏——修复代码
数据库返回了陈旧/错误的记录	数据问题——独立修复
API 调用因错误而失败	基础设施问题

对于非 LLM 失败：在调查日志中记录它们并建议代码修复，但不要调整评估期望或阈值以适应非 LLM 代码中的错误。评估测试应假设系统的其余部分正常工作，来衡量 LLM 质量。

步骤 4：在 MEMORY.md 中记录发现

每次失败调查都必须在 pixie_qa/MEMORY.md 中以结构化格式记录：

### 调查：<test_name> 失败 — <日期>

**测试**：`pixie_qa/tests/test_customer_service.py` 中的 `test_faq_factuality`
**结果**：3/5 个案例通过（60%），阈值是 80% ≥ 0.7

#### 失败案例 1："What rows have extra legroom?"

- **eval_input**：`{"user_message": "What rows have extra legroom?"}`
- **eval_output**："I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**："rows 5-8 Economy Plus with extra legroom"
- **评估器分数**：0.1 (FactualityEval)
- **评估器推理**："输出声称不知道答案，而参考明确说明是第 5-8 行..."

**跟踪分析**：
检查了跟踪 `abc123`。跨度树显示：

1. 分流智能体路由到 FAQ 智能体 ✓
2. FAQ 智能体调用了 `faq_lookup_tool("What rows have extra legroom?")` ✓
3. `faq_lookup_tool` 返回了 "I'm sorry, I don't know..." ← **根本原因**

**根本原因**：`faq_lookup_tool` (customer_service.py:112) 使用关键词匹配。
座位 FAQ 条目由关键词 `["seat", "seats", "seating", "plane"]` 触发。
问题 "What rows have extra legroom?" 不包含这些关键词中的任何一个，因此它
回退到默认的 "I don't know" 响应。

**分类**：非 LLM 失败——关键词匹配工具损坏。
LLM 智能体正确地路由到 FAQ 智能体并使用了工具；工具
本身返回了错误数据。

**修复**：将 `"row"`、`"rows"`、`"legroom"` 添加到
`faq_lookup_tool` (customer_service.py:130) 中的座位关键词列表。这是一个传统的代码修复，
而不是评估/提示更改。

**验证**：修复后，重新运行：
\`\`\`bash
python pixie_qa/scripts/build_dataset.py # 刷新数据集
pixie test pixie_qa/tests/ -k faq -v # 验证
\`\`\`

步骤 5：修复并重新运行

进行有针对性的更改，如果需要则重建数据集，并重新运行。最后总是给用户确切的命令来验证：

pixie test pixie_qa/tests/test_<feature>.py -v

# 评估笔记：<项目名称>

## 应用如何工作

### 入口点和执行流程

🇺🇸English

Evaluation-Driven Development for Python LLM Applications

This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.

Startup checks (always first)

Attempt to upgrade the pixie-qa package in the user's environment. Detect the package manager from the project (check for uv.lock, poetry.lock, requirements.txt, or a plain pip environment) and run the appropriate upgrade command:

uv : uv add pixie-qa --upgrade (or uv sync --upgrade-package pixie-qa)
poetry : poetry add pixie-qa@latest
pip : pip install --upgrade pixie-qa

If the upgrade fails (e.g., no network, version conflict), log the error and continue — a failed upgrade must not block the rest of the skill.

All pixie-generated files live in a singlepixie_qa directory at the project root:

pixie_qa/
  MEMORY.md              # your understanding and eval plan
  observations.db        # SQLite trace DB (auto-created by enable_storage)
  datasets/              # golden datasets (JSON files)
  tests/                 # eval test files (test_*.py)
  scripts/               # helper scripts (run_harness.py, build_dataset.py, etc.)

Setup vs. Iteration: when to stop

This is critical. What you do depends on what the user asked for.

"Setup QA" / "set up evals" / "add tests" (setup intent)

The user wants a working eval pipeline. Your job is Stages 0–7: install, understand, instrument, build a run harness, capture real traces, write tests, build dataset, run tests. Stop after the first test run , regardless of whether tests pass or fail. Report:

What you set up (instrumentation, run harness, test file, dataset)
The test results (pass/fail, scores)
If tests failed: a brief summary of what failed and likely causes — but do NOT fix anything

Then ask: "QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"

Only proceed to Stage 8 (investigation and fixes) if the user confirms.

Exception : If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are setup problems , not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.

"Fix" / "improve" / "debug" / "why is X failing" (iteration intent)

The user wants you to investigate and fix. Proceed through all stages including Stage 8 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.

Ambiguous requests

If the intent is unclear, default to setup only and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.

Hard gates: when to STOP and get the user involved

Some blockers cannot be worked around. When you hit one, stop working and tell the user what you need — do not guess, fabricate data, or skip ahead to later stages.

Missing API keys or credentials

If the app or evaluators need an API key (e.g. OPENAI_API_KEY) and it's not set in the environment or .env, tell the user exactly which key is missing and wait for them to provide it. Do not:

Proceed with running the app or evals (they will fail)
Hardcode a placeholder key
Skip to later stages hoping it won't matter

Cannot run the app from a script

If after reading the code (Stage 1) you cannot figure out how to invoke the app's core LLM-calling function from a standalone script — because it requires a running server, a webhook trigger, complex authentication, or external infrastructure you can't mock — stop and ask the user :

"I've identified <function_name> in <file> as the core function to evaluate, but it requires <dependency> which I can't easily mock. Can you either (a) show me how to call this function standalone, or (b) run the app yourself with a few representative inputs so I can capture traces?"

App errors during run harness execution

If the run harness script (Stage 4) errors out and you can't fix it after two attempts, stop and share the error with the user. Common blockers include database connections, missing configuration files, authentication/OAuth flows, and hardware-specific dependencies.

Why stopping matters

Every subsequent stage depends on having real traces from the actual app. If you can't run the app, you can't capture traces. If you can't capture traces, you can't build a real dataset. If you fabricate a dataset, the entire eval pipeline is testing a fiction, not the user's app. It's better to stop early and get the user's help than to produce an eval pipeline that tests the wrong thing.

The eval boundary: what to evaluate

Eval-driven development focuses on LLM-dependent behaviour. The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.

In scope (evaluate this)

LLM response quality: factual accuracy, relevance, format compliance, safety
Agent routing decisions: did the LLM choose the right tool/handoff/action?
Prompt effectiveness: does the prompt produce the desired behaviour?
Multi-turn coherence: does the agent maintain context across turns?

Out of scope (do NOT evaluate this with evals)

Tool implementations (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
Infrastructure (authentication, rate limiting, caching, serialization)
Deterministic post-processing (formatting, filtering, sorting results)

The boundary is: everything downstream of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as inputs to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.

Example : If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that given correct tool outputs , the LLM agent produces correct user-facing responses.

When building datasets and expected outputs, use the actual tool/system outputs as ground truth. The expected output for an eval case should reflect what a correct LLM response looks like given the tool results the system actually produces.

Stage 0: Ensure pixie-qa is Installed and API Keys Are Set

Before doing anything else, check that the pixie-qa package is available:

python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"

If it's not installed, install it:

pip install pixie-qa

This provides the pixie Python module, the pixie CLI, and the pixie test runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.

Verify API keys

The application under test almost certainly needs an LLM provider API key (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY). LLM-as-judge evaluators like FactualityEval also need OPENAI_API_KEY. Before running anything , verify the key is set:

[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"

If the key is not set: check whether the project uses a .env file. If it does, note that python-dotenv only loads .env when the app explicitly calls load_dotenv() — shell commands and the pixie CLI will not see variables from .env unless they're exported. Tell the user which key is missing and how to set it. Do not proceed with running the app or evals without a confirmed API key — you'll get failures that waste time and look like app bugs.

Stage 1: Understand the Application

Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.

What to investigate

How the software runs : What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
All inputs to the LLM : This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
- User input (queries, messages, uploaded files)
- System prompts (hardcoded or templated)
- Retrieved context (RAG chunks, search results, database records)
- Tool definitions and function schemas
- Conversation history / memory
- Configuration or feature flags that change prompt behavior
All intermediate steps and outputs : Walk through the code path from input to final output and document each stage:
- Retrieval / search results
- Tool calls and their results
- Agent routing / handoff decisions
- Intermediate LLM calls (e.g., summarization before final answer)
- Post-processing or formatting steps
The final output : What does the user see? What format is it in? What are the quality expectations?
Use cases and expected behaviors : What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?

Identify the eval-boundary function

This is the single most important decision you'll make, and getting it right determines whether the eval pipeline tests the real app or a fiction.

The eval-boundary function is the function in the actual production code that:

Takes structured input (text, dict, message list) — not raw HTTP requests, audio streams, or webhook payloads
Calls the LLM (directly or through a chain of internal calls)
Returns the LLM's response (or a processed version of it)

Everything upstream of this function (webhook handlers, voice-to-text processing, request parsing, authentication, session management) will be mocked or bypassed when building the run harness. Everything at and below this function is the real code you're evaluating.

Example : In a Twilio voice AI app:

Twilio sends a webhook with audio → upstream, mock this
Audio processing converts speech to text → upstream, mock this
Call state is loaded from Redis → upstream, mock or simplify this
agent.respond(user_text, conversation_history) calls the LLM → eval-boundary function
Response text is converted to speech → downstream, not part of eval

Example : In a FastAPI RAG chatbot:

HTTP endpoint receives POST request → upstream, bypass this
Request validation and auth → upstream, bypass this
chatbot.answer(question, context) retrieves docs and calls LLM → eval-boundary function
Response is formatted as JSON → downstream, not part of eval

Example : In a simple CLI Q&A tool:

main() reads user input from stdin → upstream, bypass this
answer_question(question) calls the LLM → eval-boundary function

When identifying the eval-boundary function, record:

The exact function name and file location
Its signature (parameter names and types)
What upstream dependencies it needs (clients, config objects, state)
Which of those dependencies require real credentials vs. can be mocked

If you cannot identify a clear eval-boundary function — if the LLM call is deeply entangled with infrastructure code that can't be separated — stop and ask the user. See "Hard gates" above.

Write MEMORY.md

Write your findings down in pixie_qa/MEMORY.md. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.

CRITICAL: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet. Those belong in later sections, only after they've been implemented.

The understanding section should include:

# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow

<Describe how to start/run the app, what happens step by step>

### Inputs to LLM calls

<For each LLM call in the codebase, document:>

- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it

### Intermediate processing

<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step

### Final output

<What the user sees, what format, what the quality bar should be>

### Use cases

<List each distinct scenario the app handles, with examples of good/bad outputs>

### Eval-boundary function

- **Function**: `<class.method or function_name>`
- **Location**: `<file:line>`
- **Signature**: `<parameters and return type>`
- **Upstream dependencies to mock**: <list what needs mocking for standalone execution>
- **Why this boundary**: <explain why this is the right function to evaluate>

## Evaluation plan

### What to evaluate and why

<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>

### Evaluation granularity

<Which function/span boundary captures one "test case"? Why that boundary?>

### Evaluators and criteria

<For each eval test, specify: evaluator, dataset, threshold, reasoning>

### Data needed for evaluation

<What data points need to be captured, with code pointers to where they live>

If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.

Stage 2: Decide What to Evaluate

Now that you understand the app, you can make thoughtful choices about what to measure:

What quality dimension matters most? Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
Which span to evaluate: the whole pipeline (root) or just the LLM call (last_llm_call)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
Which evaluators fit: see references/pixie-api.md → Evaluators. For factual QA: FactualityEval. For structured output: ValidJSONEval / JSONDiffEval. For RAG pipelines: ContextRelevancyEval / FaithfulnessEval.
Pass criteria: ScoreThreshold(threshold=0.7, pct=0.8) means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.

Update pixie_qa/MEMORY.md with the plan before writing any code.

Stage 3: Instrument the Application

Add pixie instrumentation to the existing production code. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the real code path — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.

Add `enable_storage()` at application startup

Call enable_storage() once at the beginning of the application's startup code — inside main(), or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.

Good places:

Inside if __name__ == "__main__": blocks
In a FastAPI lifespan or on_startup handler
At the top of main() / run() functions
Inside the runnable function in test files

✅ CORRECT — at application startup

async def main(): enable_storage() ...

✅ CORRECT — in a runnable for tests

def runnable(eval_input): enable_storage() my_function(**eval_input)

❌ WRONG — at module level, runs on import

from pixie import enable_storage enable_storage() # this runs when any file imports this module!

Wrap existing functions with `@observe` or `start_observation`

CRITICAL: Instrument the production code path. Never create separate functions or alternate code paths for testing.

The @observe decorator or start_observation context manager goes on the existing function that the app actually calls during normal operation. If the app's entry point is an interactive main() loop, instrument main() or the core function it calls per user turn — not a new helper function that duplicates logic.

# ✅ CORRECT — decorating the existing production function
from pixie import observe

@observe(name="answer_question")
def answer_question(question: str, context: str) -> str:  # existing function
    ...  # existing code, unchanged



# ✅ CORRECT — context manager inside an existing function
from pixie import start_observation

async def main():  # existing function
    ...
    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
        result = await Runner.run(current_agent, input_items, context=context)
        # ... existing response handling ...
        obs.set_output(response_text)
    ...



# ❌ WRONG — creating a new function that duplicates logic from main()
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
    # This duplicates what main() does, creating a separate code path
    # that diverges from production. Don't do this.
    ...



# ❌ WRONG — calling the LLM directly instead of calling the app's function
@observe(name="agent_answer_question")
def answer_question(question: str) -> str:
    # This bypasses the entire app and calls OpenAI directly.
    # You're testing a script you just wrote, not the user's app.
    response = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": question}],
    )
    return response.output_text

Rules:

Never add new wrapper functions to the application code for eval purposes.
Never bypass the app by calling the LLM provider directly — if you find yourself writing client.responses.create(...) or openai.ChatCompletion.create(...) in a test or run harness, you're not testing the app. Import and call the app's own function instead.
Never change the function's interface (arguments, return type, behavior).
Never duplicate production logic into a separate "testable" function.
The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
After instrumentation, call flush() at the end of runs to make sure all spans are written.
For interactive apps (CLI loops, chat interfaces), instrument the per-turn processing function — the one that takes user input and produces a response. The eval runnable should call this same function.

Important : All pixie symbols are importable from the top-level pixie package. Never tell users to import from submodules (pixie.instrumentation, pixie.evals, pixie.storage.evaluable, etc.) — always use from pixie import ....

Stage 4: Create a Run Harness and Verify Traces

This stage is a hard gate. You cannot proceed to writing tests or building datasets until you have successfully run the app's real code through the run harness and confirmed that traces appear in the database.

The run harness is a short script that calls the eval-boundary function you identified in Stage 1, bypassing external infrastructure that isn't relevant to LLM evaluation.

When the app is simple

If the eval-boundary function is a straightforward call with no complex dependencies (e.g., answer_question(question: str) -> str), the harness can be minimal:

# pixie_qa/scripts/run_harness.py
from pixie import enable_storage, flush
from myapp import answer_question

enable_storage()
result = answer_question("What is the capital of France?")
print(f"Result: {result}")
flush()

Run it, verify traces appear, and move on.

When the app has complex dependencies

Most real-world apps need more setup. The eval-boundary function often requires configuration objects, database connections, API clients, or state objects to run. Your job is to mock or stub the minimum necessary to call the real production function.

# pixie_qa/scripts/run_harness.py
"""Exercises the actual app code through the eval-boundary function.

Mocks upstream infrastructure (webhooks, voice processing, call state, etc.)
and calls the real production function with representative text inputs.
"""
from pixie import enable_storage, flush

# Load .env if the project uses one for API keys
from dotenv import load_dotenv
load_dotenv()

# Import the ACTUAL production function — not a copy, not a re-implementation
from myapp.agents.llm.openai import OpenAILLM


def run_one_case(question: str) -> str:
    """Call the actual production function with minimal mocked dependencies."""
    enable_storage()

    # Construct the minimum context the function needs.
    # Use real API client (needs real key), mock everything else.
    llm = OpenAILLM(...)

    # Call the ACTUAL function — the same one production uses
    result = llm.run_normal_ai_response(
        prompt=question,
        messages=[{"role": "user", "content": question}],
    )

    flush()
    return result


if __name__ == "__main__":
    test_inputs = [
        "What are your business hours?",
        "I need to update my account information.",
    ]
    for q in test_inputs:
        print(f"Q: {q}")
        print(f"A: {run_one_case(q)}")
        print("---")

Critical rules for the run harness:

Call the real function. The same function production uses. If you find yourself writing client.responses.create(...) or openai.ChatCompletion.create(...) in the harness instead of calling the app's own function, you are bypassing the app and testing something else entirely.
Mock only upstream infrastructure. Database connections, webhook payloads, session state, audio processing — these can be mocked or stubbed. The LLM call itself must be real because that's what you're evaluating.
The LLM API key must be real. If it's missing, stop and ask the user. See "Hard gates."
Keep it minimal. This is not a full integration test. It's a way to exercise the real LLM-calling code path and capture traces.
If you can't create a working harness after two attempts , stop and ask the user for help.

Verify traces are captured

After running the harness, verify that traces were actually captured:

python pixie_qa/scripts/run_harness.py

Then check the database:

import asyncio
from pixie import ObservationStore

async def check():
    store = ObservationStore()
    traces = await store.list_traces(limit=5)
    print(f"Found {len(traces)} traces")
    for t in traces:
        print(t)

asyncio.run(check())

What to check:

At least one trace appears in the database
The trace contains a span for the eval-boundary function (the span name should match the @observe(name=...) you added in Stage 3)
The span has captured eval_input and eval_output with sensible values

If no traces appear:

Is enable_storage() being called before the instrumented function runs?
Is flush() being called after the function returns?
Is the @observe decorator on the correct function?
Is the function actually being executed (not just defined/imported)?

Do not proceed to Stage 5 until you have seen real traces from the actual app in the database. If traces don't appear, debug the issue now or ask the user for help. This is a setup problem and must be resolved before anything else.

Stage 5: Write the Eval Test File

Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.

Create pixie_qa/tests/test_<feature>.py. The pattern is: a runnable adapter that calls the app's existing production function , plus an async test function that calls assert_dataset_pass:

from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)


async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )

Note that enable_storage() belongs inside the runnable, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.

The runnable imports and calls the same function that production uses — the eval-boundary function you identified in Stage 1 and verified in Stage 4. If the runnable calls a different function than what the run harness calls, something is wrong.

The test runner is pixie test (not pytest):

pixie test                           # run all test_*.py in current directory
pixie test pixie_qa/tests/           # specify path
pixie test -k factuality             # filter by name
pixie test -v                        # verbose: shows per-case scores and reasoning

pixie test automatically finds the project root (the directory containing pyproject.toml, setup.py, or setup.cfg) and adds it to sys.path — just like pytest. No sys.path hacks are needed in test files.

Stage 6: Build the Dataset

Prerequisite : You must have successfully run the app and verified traces in Stage 4. If you skipped Stage 4 or it failed, go back — do not proceed.

Create the dataset, then populate it by actually running the app with representative inputs. Dataset items must contain real app outputs captured from actual execution.

pixie dataset create <dataset-name>
pixie dataset list   # verify it exists

Run the app and capture traces to the dataset

The easiest approach is to extend the run harness from Stage 4 into a dataset builder. Since you already have a working script that calls the real app code and produces traces, adapt it to save results:

# pixie_qa/scripts/build_dataset.py
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable

from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass

    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()

        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())

Note that eval_output=result is the actual return value from running the app — not a string you typed in.

Alternatively, use the CLI for per-case capture:

# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call:
pixie dataset save <dataset-name> --select last_llm_call

# Add context:
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval:
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output

The cardinal sin of dataset building

Never fabricateeval_output values by hand. If you type "eval_output": "4" into a dataset JSON file without the app actually producing that output, the dataset is testing a fiction. A fabricated dataset is worse than no dataset because it gives false confidence — the user thinks their app is being tested, but it isn't.

If you catch yourself writing or editing eval_output values directly in a JSON file, stop. Go back to Stage 4, run the app, and capture real outputs.

Key rules for dataset building

Everyeval_output must come from a real execution of the eval-boundary function. No exceptions.
Include expected outputs for comparison-based evaluators like FactualityEval. Expected outputs should reflect the correct LLM response given what the tools/system actually return — not an idealized answer predicated on fixing non-LLM bugs.
Cover the range of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
When using pixie dataset save, the evaluable's eval_metadata will automatically include trace_id and span_id for later debugging.

Stage 7: Run the Tests

pixie test pixie_qa/tests/ -v

The -v flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your ScoreThreshold.

After this stage, if the user's intent was "setup" — STOP. Report results and ask before proceeding. See "Setup vs. Iteration" above.

Stage 8: Investigate Failures

Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.

When tests fail, the goal is to understand why , not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.

Step 1: Get the detailed test output

pixie test pixie_qa/tests/ -v    # shows score and reasoning per case

Capture the full verbose output. For each failing case, note:

The eval_input (what was sent)
The eval_output (what the app produced)
The expected_output (what was expected, if applicable)
The evaluator score and reasoning

Step 2: Inspect the trace data

For each failing case, look up the full trace to see what happened inside the app:

from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)   # trace_id is here

Then inspect the full span tree:

import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())   # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))

Step 3: Root-cause analysis

Walk through the trace and identify exactly where the failure originates. Common patterns:

LLM-related failures (fix with prompt/model/eval changes):

Symptom	Likely cause
Output is factually wrong despite correct tool results	Prompt doesn't instruct the LLM to use tool output faithfully
Agent routes to wrong tool/handoff	Routing prompt or handoff descriptions are ambiguous
Output format is wrong	Missing format instructions in prompt
LLM hallucinated instead of using tool	Prompt doesn't enforce tool usage

Non-LLM failures (fix with traditional code changes, out of eval scope):

Symptom	Likely cause
Tool returned wrong data	Bug in tool implementation — fix the tool, not the eval
Tool wasn't called at all due to keyword mismatch	Tool-selection logic is broken — fix the code
Database returned stale/wrong records	Data issue — fix independently
API call failed with error	Infrastructure issue

For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.

Step 4: Document findings in MEMORY.md

Every failure investigation must be documented inpixie_qa/MEMORY.md in a structured format:

### Investigation: <test_name> failure — <date>

**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
**Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7

#### Failing case 1: "What rows have extra legroom?"

- **eval_input**: `{"user_message": "What rows have extra legroom?"}`
- **eval_output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
- **Evaluator score**: 0.1 (FactualityEval)
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."

**Trace analysis**:
Inspected trace `abc123`. The span tree shows:

1. Triage Agent routed to FAQ Agent ✓
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**

**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.

**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.

**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.

**Verification**: After fix, re-run:
\`\`\`bash
python pixie_qa/scripts/build_dataset.py # refresh dataset
pixie test pixie_qa/tests/ -k faq -v # verify
\`\`\`

Step 5: Fix and re-run

Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:

pixie test pixie_qa/tests/test_<feature>.py -v

Memory Template

# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow

<How to start/run the app. Step-by-step flow from input to output.>

### Inputs to LLM calls

<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>

### Intermediate processing

<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>

### Final output

<What the user sees. Format. Quality expectations.>

### Use cases

<Each scenario with examples of good/bad outputs:>

1. <Use case 1>: <description>
   - Input example: ...
   - Good output: ...
   - Bad output: ...

### Eval-boundary function

- **Function**: `<fully qualified name>`
- **Location**: `<file:line>`
- **Signature**: `<params and return type>`
- **Upstream dependencies to mock**: <what needs mocking/stubbing>
- **Why this boundary**: <rationale>

## Evaluation plan

### What to evaluate and why

<Quality dimensions and rationale>

### Evaluators and criteria

| Test | Dataset | Evaluator | Criteria | Rationale |
| ---- | ------- | --------- | -------- | --------- |
| ...  | ...     | ...       | ...      | ...       |

### Data needed for evaluation

<What data to capture, with code pointers>

## Datasets

| Dataset | Items | Purpose |
| ------- | ----- | ------- |
| ...     | ...   | ...     |

## Investigation log

### <date> — <test_name> failure

<Full structured investigation as described in Stage 8>

Reference

See references/pixie-api.md for all CLI commands, evaluator signatures, and the Python dataset/store API.

Weekly Installs

317

Repository

github/awesome-copilot

GitHub Stars

26.7K

First Seen

8 days ago

Security Audits

Gen Agent Trust HubWarn SocketWarn SnykPass

Installed on

gemini-cli288

codex286

opencode278

cursor276

github-copilot275

cline274

AI Elements：基于shadcn/ui的AI原生应用组件库，快速构建对话界面

54,900 周安装

Expected outputs: FactualityEval needs them. Format evaluators usually don't.