LangSmith评估器使用指南 - 构建AI智能体评估系统与LLM评判者

langsmith-evaluator by langchain-ai/langsmith-skills

671 周安装量

76 GitHub Stars

安装命令

npx skills add https://github.com/langchain-ai/langsmith-skills --skill langsmith-evaluator

🇨🇳中文介绍

LANGSMITH_API_KEY=lsv2_pt_your_api_key_here # 必需 LANGSMITH_PROJECT=your-project-name # 检查此项以了解哪个项目包含追踪记录 LANGSMITH_WORKSPACE_ID=your-workspace-id # 可选：用于组织范围的密钥 OPENAI_API_KEY=your_openai_key # 用于 LLM 作为评判者

重要提示： 在查询或与 LangSmith 交互之前，请务必检查环境变量或 .env 文件中的 LANGSMITH_PROJECT。这会告诉你哪个项目包含相关的追踪记录和数据。如果 LangSmith 项目不可用，请运用你的最佳判断来确定正确的项目。

Python 依赖项

pip install langsmith langchain-openai python-dotenv

CLI 工具（用于上传评估器）

curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh

JavaScript 依赖项

npm install langsmith openai

<crucial_requirement>

黄金法则：先检查，后实现

关键： 在编写任何评估器或提取逻辑之前，你必须：

运行你的智能体 处理样本输入并捕获实际输出

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

离线评估器与在线评估器

离线评估器（附加到数据集）：

函数签名：(run, example) - 接收运行输出和数据集示例
用例：将智能体输出与数据集中的期望值进行比较
上传方式：--dataset "Dataset Name"

在线评估器（附加到项目）：

函数签名：(run) - 仅接收运行输出，没有 example 参数
用例：对生产运行进行实时质量检查（无参考数据）
上传方式：--project "Project Name"

关键 - 返回格式：

每个评估器仅返回一个指标。对于多个指标，请创建多个评估器函数。
请不要返回 {"metric_name": value} 或指标列表 - 这会导致错误。

关键 - 本地与上传的差异：

| 本地 evaluate() | 上传到 LangSmith
---|---|---
列名 | Python：自动从函数名派生。TypeScript：必须包含 key 字段，否则列无标题 | 来自上传时设置的评估器名称。不要包含 key - 这会导致重复列
Python run 类型 | RunTree 对象 → run.outputs（属性访问） | dict → run["outputs"]（下标访问）。处理两者：run.outputs if hasattr(run, "outputs") else run.get("outputs", {})
TypeScript run 类型 | 始终属性访问：run.outputs?.field | 始终属性访问：run.outputs?.field
Python 返回值 | {"score": value, "comment": "..."} | {"score": value, "comment": "..."}
TypeScript 返回值 | { key: "name", score: value, comment: "..." } | { score: value, comment: "..." }
</evaluator_format> | |

LLM 作为评判者 - 使用 LLM 对输出进行评分。最适合主观质量评估（准确性、帮助性、相关性）。
自定义代码 - 确定性逻辑。最适合客观检查（精确匹配、轨迹验证、格式合规性）。 </evaluator_types>

LLM 作为评判者的评估器

注意： CLI 目前不支持 LLM-as-Judge 评估器的上传 - 仅支持代码评估器。对于针对数据集的评估，强烈建议定义本地评估器与 evaluate(evaluators=[...]) 一起使用。

class Grade(TypedDict): reasoning: Annotated[str, ..., "Explain your reasoning"] is_accurate: Annotated[bool, ..., "True if response is accurate"]

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Grade, method="json_schema", strict=True)

async def accuracy_evaluator(run, example): run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {} example_outputs = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {} grade = await judge.ainvoke([{"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}]) return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}

</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
        { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
        { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
    ]
    });

    const grade = JSON.parse(response.choices[0].message.content);
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}

自定义代码评估器

编写评估器之前：

检查你的数据集以了解期望的字段名称（参见上面的黄金法则）
测试你的运行函数并验证其输出结构与数据集模式匹配
查询 LangSmith 追踪记录以调试任何不匹配的情况

运行函数执行你的智能体并返回用于评估的输出。

关键 - 首先测试你的运行函数： 在编写评估器之前，你必须测试你的运行函数并检查实际的输出结构。输出形态因框架、智能体类型和配置的不同而有差异。

调试工作流：

在样本输入上运行一次你的智能体
查询追踪记录以查看执行结构
打印原始输出并与追踪记录对比验证，确保输出包含正确的数据
根据需要调整运行函数
验证你的输出与你的数据集模式是否匹配

尽最大努力使你的运行函数输出与你的数据集模式相匹配。 这会使评估器变得简单且可重用。如果无法匹配，你的评估器必须知道如何从每一方提取和比较正确的字段。

对于轨迹评估，你的运行函数必须在执行期间捕获工具调用。

关键： 运行输出格式因框架和智能体类型的不同而有显著差异。你必须在实现之前进行检查：

LangGraph 智能体（LangChain OSS）： 使用 stream_mode="debug" 和 subgraphs=True 来捕获嵌套的子智能体工具调用。

import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, stream_mode="debug", subgraphs=True):
        # 步骤 1：打印数据块以理解结构
        print(f"DEBUG chunk: {chunk}")

        # 步骤 2：根据你观察到的结构编写提取逻辑
        # ... 你的提取逻辑写在这里 ...

    # 重要：运行后，查询 LangSmith 追踪记录以验证
    # 你的轨迹数据是否完整。默认输出可能缺少
    # 追踪记录中出现的工具调用。
    return {"output": final_result, "trajectory": trajectory}

自定义 / 非 LangChain 智能体：

首先检查输出 - 运行你的智能体并检查结果结构。轨迹数据可能已经包含在输出中（例如，result.tool_calls、result.steps 等）
回调/钩子 - 如果你的框架支持执行回调，注册一个钩子来记录每次调用时的工具名称
解析执行日志 - 作为最后的手段，从结构化日志或追踪数据中提取工具名称

关键是在执行时捕获工具名称，而不是在定义时。 </run_functions>

重要 - 自动运行行为： 上传到数据集的评估器在你对该数据集运行实验时会自动运行。你不需要将它们传递给 evaluate() - 只需针对数据集运行你的智能体，上传的评估器就会自动执行。

重要 - 本地与上传： 上传的评估器在沙盒环境中运行，包访问权限非常有限。仅使用内置/标准库导入，并将所有导入放在评估器函数体内。对于数据集（离线）评估器，建议首先使用 evaluate(evaluators=[...]) 在本地运行 - 这为你提供了完整的包访问权限。

重要 - 代码评估器与结构化评估器：

代码评估器（CLI 上传的内容）：在有限的环境中运行，没有外部包。用于确定性逻辑（精确匹配、轨迹验证）。
结构化评估器（LLM-as-Judge）：通过 LangSmith UI 配置，使用带有模型/提示/模式的特定有效载荷格式。CLI 目前尚不支持此格式。

重要 - 选择正确的目标：

--dataset：具有 (run, example) 签名的离线评估器 - 用于与期望值进行比较
--project：具有 (run) 签名的在线评估器 - 用于实时质量检查

你必须指定一个。不支持全局评估器。

# 列出所有评估器
langsmith evaluator list

# 上传离线评估器（附加到数据集）
langsmith evaluator upload my_evaluators.py \
  --name "Trajectory Match" --function trajectory_evaluator \
  --dataset "My Dataset" --replace

# 上传在线评估器（附加到项目）
langsmith evaluator upload my_evaluators.py \
  --name "Quality Check" --function quality_check \
  --project "Production Agent" --replace

# 删除
langsmith evaluator delete "Trajectory Match"

重要 - 安全提示：

CLI 在执行破坏性操作前会提示确认
除非用户明确要求，否则切勿使用 --yes 标志

对 LLM 评判者使用结构化输出 - 比解析自由文本更可靠
使评估器与数据集类型匹配 * 最终响应 → 使用 LLM 作为评判者进行质量评估 * 轨迹 → 使用自定义代码进行序列评估
对 LLM 评判者使用异步 - 支持并行评估
独立测试评估器 - 首先在已知的好/坏示例上进行验证
选择合适的语言 * Python：用于 Python 智能体、langchain 集成 * JavaScript：用于 TypeScript/Node.js 智能体 </best_practices>

<running_evaluations>

上传的评估器 在你运行实验时会自动运行 - 无需代码。本地评估器 直接传递用于开发/测试。

上传的评估器自动运行

results = evaluate(run_agent, data="My Dataset", experiment_prefix="eval-v1")

或者传递本地评估器进行测试

results = evaluate(run_agent, data="My Dataset", evaluators=[my_evaluator], experiment_prefix="eval-v1")

</python>

<typescript>
```javascript
import { evaluate } from "langsmith/evaluation";

// 上传的评估器自动运行
const results = await evaluate(runAgent, {
  data: "My Dataset",
  experimentPrefix: "eval-v1",
});

// 或者传递本地评估器进行测试
const results = await evaluate(runAgent, {
  data: "My Dataset",
  evaluators: [myEvaluator],
  experimentPrefix: "eval-v1",
});

输出不符合你的预期： 查询 LangSmith 追踪记录。它显示每个步骤的确切输入/输出 - 将你找到的内容与你试图提取的内容进行比较。

每个评估器一个指标： 返回 {"score": value, "comment": "..."}。对于多个指标，请创建单独的函数。

字段名称不匹配： 你的运行函数输出必须与数据集模式完全匹配。首先使用 client.read_example(example_id) 检查数据集。

RunTree 与 dict（仅限 Python）： 本地 evaluate() 传递 RunTree，上传的评估器接收 dict。处理两者：

run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}

TypeScript 始终使用属性访问：run.outputs?.field

🇺🇸English

LANGSMITH_API_KEY=lsv2_pt_your_api_key_here # Required LANGSMITH_PROJECT=your-project-name # Check this to know which project has traces LANGSMITH_WORKSPACE_ID=your-workspace-id # Optional: for org-scoped keys OPENAI_API_KEY=your_openai_key # For LLM as Judge

IMPORTANT: Always check the environment variables or .env file for LANGSMITH_PROJECT before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.

Python Dependencies

pip install langsmith langchain-openai python-dotenv

CLI Tool (for uploading evaluators)

curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh

JavaScript Dependencies

npm install langsmith openai

<crucial_requirement>

Golden Rule: Inspect Before You Implement

CRITICAL: Before writing ANY evaluator or extraction logic, you MUST:

Run your agent on sample inputs and capture the actual output
Inspect the output - print it, query LangSmith traces, understand the exact structure
Only then write code that processes that output

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. Query LangSmith traces to when outputs don't contain needed data to understand how to extract from execution. </crucial_requirement>

<evaluator_format>

Offline vs Online Evaluators

Offline Evaluators (attached to datasets):

Function signature: (run, example) - receives both run outputs and dataset example
Use case: Comparing agent outputs to expected values in a dataset
Upload with: --dataset "Dataset Name"

Online Evaluators (attached to projects):

Function signature: (run) - receives only run outputs, NO example parameter
Use case: Real-time quality checks on production runs (no reference data)
Upload with: --project "Project Name"

CRITICAL - Return Format:

Each evaluator returns ONE metric only. For multiple metrics, create multiple evaluator functions.
Do NOT return {"metric_name": value} or lists of metrics - this will error.

CRITICAL - Local vs Uploaded Differences:

| Local evaluate() | Uploaded to LangSmith
---|---|---
Column name | Python: auto-derived from function name. TypeScript: must include key field or column is untitled | Comes from evaluator name set at upload time. Do NOT include key — it creates a duplicate column
Pythonrun type | RunTree object → run.outputs (attribute) | dict → run["outputs"] (subscript). Handle both: run.outputs if hasattr(run, "outputs") else run.get("outputs", {})
TypeScriptrun type | Always attribute access: | Always attribute access: | | | | </evaluator_format> | |

<evaluator_types>

LLM as Judge - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
Custom Code - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance). </evaluator_types>

<llm_judge>

LLM as Judge Evaluators

NOTE: LLM-as-Judge upload is currently not supported by the CLI — only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with evaluate(evaluators=[...]).

class Grade(TypedDict): reasoning: Annotated[str, ..., "Explain your reasoning"] is_accurate: Annotated[bool, ..., "True if response is accurate"]

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Grade, method="json_schema", strict=True)

</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
        { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
        { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
    ]
    });

    const grade = JSON.parse(response.choices[0].message.content);
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}

<code_evaluators>

Custom Code Evaluators

Before writing an evaluator:

Inspect your dataset to understand expected field names (see Golden Rule above)
Test your run function and verify its output structure matches the dataset schema
Query LangSmith traces to debug any mismatches

<run_functions>

Defining Run Functions

Run functions execute your agent and return outputs for evaluation.

CRITICAL - Test Your Run Function First: Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.

Debugging workflow:

Run your agent once on sample input
Query the trace to see the execution structure
Print the raw output and verify against trace to output contains the right data
Adjust the run function as needed
Verify your output matches your dataset schema

Try your hardest to match your run function output to your dataset schema. This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.

Capturing Trajectories

For trajectory evaluation, your run function must capture tool calls during execution.

CRITICAL: Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:

LangGraph agents (LangChain OSS): Use stream_mode="debug" with subgraphs=True to capture nested subagent tool calls.

import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, stream_mode="debug", subgraphs=True):
        # STEP 1: Print chunks to understand the structure
        print(f"DEBUG chunk: {chunk}")

        # STEP 2: Write extraction based on YOUR observed structure
        # ... your extraction logic here ...

    # IMPORTANT: After running, query the LangSmith trace to verify
    # your trajectory data is complete. Default output may be missing
    # tool calls that appear in the trace.
    return {"output": final_result, "trajectory": trajectory}

Custom / Non-LangChain Agents:

Inspect output first - Run your agent and inspect the result structure. Trajectory data may already be included in the output (e.g., result.tool_calls, result.steps, etc.)
Callbacks/Hooks - If your framework supports execution callbacks, register a hook that records tool names on each invocation
Parse execution logs - As a last resort, extract tool names from structured logs or trace data

The key is to capture the tool name at execution time, not at definition time. </run_functions>

IMPORTANT - Auto-Run Behavior: Evaluators uploaded to a dataset automatically run when you run experiments on that dataset. You do NOT need to pass them to evaluate() - just run your agent against the dataset and the uploaded evaluators execute automatically.

IMPORTANT - Local vs Uploaded: Uploaded evaluators run in a sandboxed environment with very limited package access. Only use built-in/standard library imports, and place all imports inside the evaluator function body. For dataset (offline) evaluators, prefer running locally with evaluate(evaluators=[...]) first — this gives you full package access.

IMPORTANT - Code vs Structured Evaluators:

Code evaluators (what the CLI uploads): Run in a limited environment without external packages. Use for deterministic logic (exact match, trajectory validation).
Structured evaluators (LLM-as-Judge): Configured via LangSmith UI, use a specific payload format with model/prompt/schema. The CLI does not support this format yet.

IMPORTANT - Choose the right target:

--dataset: Offline evaluator with (run, example) signature - for comparing to expected values
--project: Online evaluator with (run) signature - for real-time quality checks

You must specify one. Global evaluators are not supported.

# List all evaluators
langsmith evaluator list

# Upload offline evaluator (attached to dataset)
langsmith evaluator upload my_evaluators.py \
  --name "Trajectory Match" --function trajectory_evaluator \
  --dataset "My Dataset" --replace

# Upload online evaluator (attached to project)
langsmith evaluator upload my_evaluators.py \
  --name "Quality Check" --function quality_check \
  --project "Production Agent" --replace

# Delete
langsmith evaluator delete "Trajectory Match"

IMPORTANT - Safety Prompts:

The CLI prompts for confirmation before destructive operations
NEVER use--yes flag unless the user explicitly requests it

<best_practices>

Use structured output for LLM judges - More reliable than parsing free-text
Match evaluator to dataset type
- Final Response → LLM as Judge for quality
- Trajectory → Custom Code for sequence
Use async for LLM judges - Enables parallel evaluation
Test evaluators independently - Validate on known good/bad examples first
Choose the right language
- Python: Use for Python agents, langchain integrations
- JavaScript: Use for TypeScript/Node.js agents </best_practices>

<running_evaluations>

Running Evaluations

Uploaded evaluators auto-run when you run experiments - no code needed. Local evaluators are passed directly for development/testing.

Uploaded evaluators run automatically

results = evaluate(run_agent, data="My Dataset", experiment_prefix="eval-v1")

Or pass local evaluators for testing

results = evaluate(run_agent, data="My Dataset", evaluators=[my_evaluator], experiment_prefix="eval-v1")

</python>

<typescript>
```javascript
import { evaluate } from "langsmith/evaluation";

// Uploaded evaluators run automatically
const results = await evaluate(runAgent, {
  data: "My Dataset",
  experimentPrefix: "eval-v1",
});

// Or pass local evaluators for testing
const results = await evaluate(runAgent, {
  data: "My Dataset",
  evaluators: [myEvaluator],
  experimentPrefix: "eval-v1",
});

Output doesn't match what you expect: Query the LangSmith trace. It shows exact inputs/outputs at each step - compare what you find to what you're trying to extract.

One metric per evaluator: Return {"score": value, "comment": "..."}. For multiple metrics, create separate functions.

Field name mismatch: Your run function output must match dataset schema exactly. Inspect dataset first with client.read_example(example_id).

RunTree vs dict (Python only): Local evaluate() passes RunTree, uploaded evaluators receive dict. Handle both:

run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}

TypeScript always uses attribute access: run.outputs?.field

Weekly Installs

647

Repository

langchain-ai/la…h-skills

GitHub Stars

First Seen

Mar 4, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

claude-code590

codex474

cursor413

gemini-cli401

github-copilot400

opencode400

AI 代码实施计划编写技能 | 自动化开发任务分解与 TDD 流程规划工具

41,400 周安装

run.outputs?.field

{"score": value, "comment": "..."}

TypeScript return

{ key: "name", score: value, comment: "..." }

{ score: value, comment: "..." }