arize-prompt-optimization by arize-ai/arize-skills
npx skills add https://github.com/arize-ai/arize-skills --skill arize-prompt-optimization
LLM applications emit spans following OpenInference semantic conventions. Prompts are stored in different span attributes depending on the span kind and instrumentation:
| Column | What it contains | When to use |
|---|---|---|
| attributes.llm.input_messages | Structured chat messages (system, user, assistant, tool) in role-based format | Primary source for chat-based LLM prompts |
| attributes.llm.input_messages.roles | Array of roles: system, user, assistant, tool | Extract individual message roles |
| attributes.llm.input_messages.contents | Array of message content strings | Extract message text |
| attributes.input.value | Serialized prompt or user question (generic, all span kinds) | Fallback when structured messages are not available |
| attributes.llm.prompt_template.template | Template with {variable} placeholders (e.g., "Answer {question} using {context}") | When the app uses prompt templates |
| attributes.llm.prompt_template.variables | Template variable values (JSON object) | See what values were substituted into the template |
| attributes.output.value | Model response text | See what the LLM produced |
| attributes.llm.output_messages | Structured model output (including tool calls) | Inspect tool-calling responses |
Where to look by span kind:

- LLM spans (attributes.openinference.span.kind = 'LLM'): check attributes.llm.input_messages for structured chat messages, or attributes.input.value for a serialized prompt. Check attributes.llm.prompt_template.template for the template.
- Chain/Agent spans: attributes.input.value contains the user's question. The actual LLM prompt lives on child LLM spans -- navigate down the trace tree.
- Tool spans: attributes.input.value has the tool input and attributes.output.value has the tool result. Not typically where prompts live.

These columns carry the feedback data used for optimization:
| Column pattern | Source | What it tells you |
|---|---|---|
| annotation.<name>.label | Human reviewers | Categorical grade (e.g., correct, incorrect, partial) |
| annotation.<name>.score | Human reviewers | Numeric quality score (e.g., 0.0 - 1.0) |
| annotation.<name>.text | Human reviewers | Freeform explanation of the grade |
| eval.<name>.label | LLM-as-judge evals | Automated categorical assessment |
| eval.<name>.score | LLM-as-judge evals | Automated numeric score |
| eval.<name>.explanation | LLM-as-judge evals | Why the eval gave that score -- most valuable for optimization |
| attributes.input.value | Trace data | What went into the LLM |
| attributes.output.value | Trace data | What the LLM produced |
| {experiment_name}.output | Experiment runs | Output from a specific experiment |
Three things are needed: ax CLI, an API key (env var or profile), and a project. A space ID is also needed when using project names.
If ax is not installed, not on PATH, or below version 0.8.0, see ax-setup.md.
Run a quick check for credentials:
macOS/Linux (bash):
ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "ARIZE_DEFAULT_PROJECT: ${ARIZE_DEFAULT_PROJECT:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1
Windows (PowerShell):
ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "ARIZE_DEFAULT_PROJECT: $env:ARIZE_DEFAULT_PROJECT"; Write-Host "--- profiles ---"; ax profiles show 2>&1
Read the output and proceed immediately if either the env var or the profile has an API key. Only ask the user if both are missing. Resolve failures:
- Missing or wrong space ID: run ax spaces list -o json to list all accessible spaces and pick the right one, or ask the user if they prefer to provide it directly.
- Missing project: run ax projects list -o json --limit 100 and present the projects as selectable options.

If ARIZE_DEFAULT_PROJECT is set (visible in the output above), use its value as the project for all commands in this session. Do NOT ask the user for a project ID -- just use it. Continue using this default until the user explicitly provides a different project.
If ARIZE_DEFAULT_PROJECT is not set and no project is provided, ask the user for one.
# List LLM spans (where prompts live)
ax spans list PROJECT_ID --filter "attributes.openinference.span.kind = 'LLM'" --limit 10
# Filter by model
ax spans list PROJECT_ID --filter "attributes.llm.model_name = 'gpt-4o'" --limit 10
# Filter by span name (e.g., a specific LLM call)
ax spans list PROJECT_ID --filter "name = 'ChatCompletion'" --limit 10
# Export all spans in a trace
ax spans export --trace-id TRACE_ID --project PROJECT_ID
# Export a single span
ax spans export --span-id SPAN_ID --project PROJECT_ID
# Extract structured chat messages (system + user + assistant)
jq '.[0] | {
messages: .attributes.llm.input_messages,
model: .attributes.llm.model_name
}' trace_*/spans.json
# Extract the system prompt specifically
jq '[.[] | select(.attributes.llm.input_messages.roles[]? == "system")] | .[0].attributes.llm.input_messages' trace_*/spans.json
# Extract prompt template and variables
jq '.[0].attributes.llm.prompt_template' trace_*/spans.json
# Extract from input.value (fallback for non-structured prompts)
jq '.[0].attributes.input.value' trace_*/spans.json
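Where jq is unavailable, the same fallback order can be sketched in Python. This is an illustrative sketch that assumes the export nests attributes as dicts ({"attributes": {"llm": {...}}}); some exports flatten keys into dotted strings, in which case adjust the lookups:

```python
def extract_prompt(span):
    """Pull the prompt from one exported span, trying the columns in the
    order described above: structured messages, then the prompt template,
    then the generic input.value fallback."""
    attrs = span.get("attributes", {})
    llm = attrs.get("llm") or {}
    if llm.get("input_messages"):
        return {"source": "input_messages", "prompt": llm["input_messages"]}
    template = (llm.get("prompt_template") or {}).get("template")
    if template:
        return {"source": "prompt_template", "prompt": template}
    return {"source": "input.value",
            "prompt": (attrs.get("input") or {}).get("value")}
```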
Once you have the span data, reconstruct the prompt as a messages array:
[
{"role": "system", "content": "You are a helpful assistant that..."},
{"role": "user", "content": "Given {input}, answer the question: {question}"}
]
If the span has attributes.llm.prompt_template.template, the prompt uses variables. Preserve these placeholders ({variable} or {{variable}}) -- they are substituted at runtime.
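One quick way to enforce this is a small, illustrative check that the revised prompt still contains every placeholder from the original:

```python
import re

def check_placeholders(original: str, revised: str) -> set:
    """Return the template placeholders present in the original prompt
    but missing from the revised one. Handles {var} and {{var}} styles."""
    pattern = r"\{\{?\s*(\w+)\s*\}?\}"
    orig_vars = set(re.findall(pattern, original))
    return orig_vars - set(re.findall(pattern, revised))

missing = check_placeholders(
    "Answer {question} using {context}",
    "Answer the {question} concisely.",
)
# missing -> {"context"}
```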
# Find error spans -- these indicate prompt failures
ax spans list PROJECT_ID \
--filter "status_code = 'ERROR' AND attributes.openinference.span.kind = 'LLM'" \
--limit 20
# Find spans with low eval scores
ax spans list PROJECT_ID \
--filter "annotation.correctness.label = 'incorrect'" \
--limit 20
# Find spans with high latency (may indicate overly complex prompts)
ax spans list PROJECT_ID \
--filter "attributes.openinference.span.kind = 'LLM' AND latency_ms > 10000" \
--limit 20
# Export error traces for detailed inspection
ax spans export --trace-id TRACE_ID --project PROJECT_ID
# Export a dataset (ground truth examples)
ax datasets export DATASET_ID
# -> dataset_*/examples.json
# Export experiment results (what the LLM produced)
ax experiments export EXPERIMENT_ID
# -> experiment_*/runs.json
Join the two files by example_id to see inputs alongside outputs and evaluations:
# Count examples and runs
jq 'length' dataset_*/examples.json
jq 'length' experiment_*/runs.json
# View a single joined record
jq -s '
.[0] as $dataset |
.[1][0] as $run |
($dataset[] | select(.id == $run.example_id)) as $example |
{
input: $example,
output: $run.output,
evaluations: $run.evaluations
}
' dataset_*/examples.json experiment_*/runs.json
# Find failed examples (where eval score < threshold)
jq '[.[] | select(.evaluations.correctness.score < 0.5)]' experiment_*/runs.json
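If jq is not available, the same join and failure filter can be sketched in Python. The field names (id, example_id, evaluations.correctness) mirror the jq examples above; treat this as an illustrative sketch of the export shapes, not an SDK call:

```python
import json  # for loading the exported files, as in the commented usage below

def join_failures(examples, runs, evaluator="correctness", threshold=0.5):
    """Join experiment runs to dataset examples on example_id and keep
    the records whose eval score falls below the threshold."""
    by_id = {ex["id"]: ex for ex in examples}
    failures = []
    for run in runs:
        ev = run.get("evaluations", {}).get(evaluator, {})
        if ev.get("score", 1.0) < threshold:
            failures.append({
                "input": by_id.get(run.get("example_id"), {}).get("input"),
                "output": run.get("output"),
                "explanation": ev.get("explanation"),
            })
    return failures

# with open("dataset_x/examples.json") as f:   # hypothetical export paths
#     examples = json.load(f)
# with open("experiment_x/runs.json") as f:
#     runs = json.load(f)
# failures = join_failures(examples, runs)
```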
Look for patterns across failures:
- eval.*.explanation tells you WHY something failed

Use this template to generate an improved version of the prompt. Fill in the placeholders and send it to your LLM (GPT-4o, Claude, etc.):
You are an expert in prompt optimization. Given the original baseline prompt
and the associated performance data (inputs, outputs, evaluation labels, and
explanations), generate a revised version that improves results.
ORIGINAL BASELINE PROMPT
========================
{PASTE_ORIGINAL_PROMPT_HERE}
========================
PERFORMANCE DATA
================
The following records show how the current prompt performed. Each record
includes the input, the LLM output, and evaluation feedback:
{PASTE_RECORDS_HERE}
================
HOW TO USE THIS DATA
1. Compare outputs: Look at what the LLM generated vs what was expected
2. Review eval scores: Check which examples scored poorly and why
3. Examine annotations: Human feedback shows what worked and what didn't
4. Identify patterns: Look for common issues across multiple examples
5. Focus on failures: The rows where the output DIFFERS from the expected
value are the ones that need fixing
ALIGNMENT STRATEGY
- If outputs have extra text or reasoning not present in the ground truth,
remove instructions that encourage explanation or verbose reasoning
- If outputs are missing information, add instructions to include it
- If outputs are in the wrong format, add explicit format instructions
- Focus on the rows where the output differs from the target -- these are
the failures to fix
RULES
Maintain Structure:
- Use the same template variables as the current prompt ({var} or {{var}})
- Don't change sections that are already working
- Preserve the exact return format instructions from the original prompt
Avoid Overfitting:
- DO NOT copy examples verbatim into the prompt
- DO NOT quote specific test data outputs exactly
- INSTEAD: Extract the ESSENCE of what makes good vs bad outputs
- INSTEAD: Add general guidelines and principles
- INSTEAD: If adding few-shot examples, create SYNTHETIC examples that
demonstrate the principle, not real data from above
Goal: Create a prompt that generalizes well to new inputs, not one that
memorizes the test data.
OUTPUT FORMAT
Return the revised prompt as a JSON array of messages:
[
{"role": "system", "content": "..."},
{"role": "user", "content": "..."}
]
Also provide a brief reasoning section (bulleted list) explaining:
- What problems you found
- How the revised prompt addresses each one
Format the records as a JSON array before pasting into the template:
# From dataset + experiment: join and select relevant columns
jq -s '
.[0] as $ds |
[.[1][] | . as $run |
($ds[] | select(.id == $run.example_id)) as $ex |
{
input: $ex.input,
expected: $ex.expected_output,
actual_output: $run.output,
eval_score: $run.evaluations.correctness.score,
eval_label: $run.evaluations.correctness.label,
eval_explanation: $run.evaluations.correctness.explanation
}
]
' dataset_*/examples.json experiment_*/runs.json
# From exported spans: extract input/output pairs with annotations
jq '[.[] | select(.attributes.openinference.span.kind == "LLM") | {
input: .attributes.input.value,
output: .attributes.output.value,
status: .status_code,
model: .attributes.llm.model_name
}]' trace_*/spans.json
After the LLM returns the revised messages array:
1. Extract prompt -> Phase 1 (once)
2. Run experiment -> ax experiments create ...
3. Export results -> ax experiments export EXPERIMENT_ID
4. Analyze failures -> jq to find low scores
5. Run meta-prompt -> Phase 3 with new failure data
6. Apply revised prompt
7. Repeat from step 2
# Compare scores across experiments
# Experiment A (baseline)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_a/runs.json
# Experiment B (optimized)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_b/runs.json
# Find examples that flipped from fail to pass
jq -s '
[.[0][] | select(.evaluations.correctness.label == "incorrect")] as $fails |
[.[1][] | select(.evaluations.correctness.label == "correct") |
select(.example_id as $id | $fails | any(.example_id == $id))
] | length
' experiment_a/runs.json experiment_b/runs.json
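The same comparison can be sketched in Python (illustrative; assumes each runs.json parses to a list of run objects shaped like the jq examples above):

```python
def compare_experiments(runs_a, runs_b, evaluator="correctness"):
    """Mean eval score for each experiment, plus how many examples
    flipped from incorrect in A to correct in B."""
    def mean(runs):
        scores = [r["evaluations"][evaluator]["score"] for r in runs]
        return sum(scores) / len(scores) if scores else 0.0

    failed_a = {r["example_id"] for r in runs_a
                if r["evaluations"][evaluator]["label"] == "incorrect"}
    flipped = sum(1 for r in runs_b
                  if r["example_id"] in failed_a
                  and r["evaluations"][evaluator]["label"] == "correct")
    return {"mean_a": mean(runs_a), "mean_b": mean(runs_b), "flipped": flipped}
```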
Produce the two files with ax experiments export EXP_A and ax experiments export EXP_B.

Apply these when writing or revising prompts:
| Technique | When to apply | Example |
|---|---|---|
| Clear, detailed instructions | Output is vague or off-topic | "Classify the sentiment as exactly one of: positive, negative, neutral" |
| Instructions at the beginning | Model ignores later instructions | Put the task description before examples |
| Step-by-step breakdowns | Complex multi-step processes | "First extract entities, then classify each, then summarize" |
| Specific personas | Need consistent style/tone | "You are a senior financial analyst writing for institutional investors" |
| Delimiter tokens | Sections blend together | Use ---, ###, or XML tags to separate input from instructions |
| Few-shot examples | Output format needs clarification | Show 2-3 synthetic input/output pairs |
| Output length specifications | Responses are too long or short | "Respond in exactly 2-3 sentences" |
| Reasoning instructions | Accuracy is critical | "Think step by step before answering" |
| "I don't know" guidelines | Hallucination is a risk | "If the answer is not in the provided context, say 'I don't have enough information'" |
When optimizing prompts that use template variables:
- Single braces ({variable}): Python f-string / Jinja style. Most common in Arize.
- Double braces ({{variable}}): Mustache style. Used when the framework requires it.

Find failing traces:
ax traces list PROJECT_ID --filter "status_code = 'ERROR'" --limit 5
Export the trace:
ax spans export --trace-id TRACE_ID --project PROJECT_ID
Extract the prompt from the LLM span:
jq '[.[] | select(.attributes.openinference.span.kind == "LLM")][0] | { messages: .attributes.llm.input_messages, template: .attributes.llm.prompt_template, output: .attributes.output.value, error: .attributes.exception.message }' trace_*/spans.json
Identify what failed from the error message or output
Fill in the optimization meta-prompt (Phase 3) with the prompt and error context
Apply the revised prompt
Find the dataset and experiment:
ax datasets list
ax experiments list --dataset-id DATASET_ID
Export both:
ax datasets export DATASET_ID
ax experiments export EXPERIMENT_ID
Prepare the joined data for the meta-prompt
Run the optimization meta-prompt
Create a new experiment with the revised prompt to measure improvement
Export spans where the output format is wrong:
ax spans list PROJECT_ID \
  --filter "attributes.openinference.span.kind = 'LLM' AND annotation.format.label = 'incorrect'" \
  --limit 10 -o json > bad_format.json
Look at what the LLM is producing vs what was expected
Add explicit format instructions to the prompt (JSON schema, examples, delimiters)
Common fix: add a few-shot example showing the exact desired output format
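As a quick triage step -- assuming the expected format is JSON -- a short illustrative script (not part of the ax CLI) can count how many exported outputs fail to parse, before and after the prompt revision:

```python
import json

def non_json_outputs(runs):
    """Return indices of runs whose output is not valid JSON -- a quick
    way to quantify a format problem before revising the prompt."""
    bad = []
    for i, run in enumerate(runs):
        try:
            json.loads(run["output"])
        except (json.JSONDecodeError, TypeError):
            bad.append(i)
    return bad

runs = [{"output": '{"label": "positive"}'},
        {"output": "Sure! The label is positive."}]
# non_json_outputs(runs) -> [1]
```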
Find traces where the model hallucinated:
ax spans list PROJECT_ID \
  --filter "annotation.faithfulness.label = 'unfaithful'" \
  --limit 20
Export and inspect the retriever + LLM spans together:
ax spans export --trace-id TRACE_ID --project PROJECT_ID
jq '[.[] | {kind: .attributes.openinference.span.kind, name, input: .attributes.input.value, output: .attributes.output.value}]' trace_*/spans.json
Check if the retrieved context actually contained the answer
Add grounding instructions to the system prompt: "Only use information from the provided context. If the answer is not in the context, say so."
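A rough, illustrative grounding triage (a naive substring check, not a substitute for a faithfulness eval) is to flag answer terms that never appear in the retrieved context:

```python
def unsupported_terms(answer: str, context: str, min_len: int = 5):
    """Naive grounding check: content words from the answer that never
    appear in the retrieved context. A rough triage signal only."""
    ctx = context.lower()
    words = {w.strip(".,!?\"'").lower() for w in answer.split()}
    return sorted(w for w in words if len(w) >= min_len and w not in ctx)

terms = unsupported_terms(
    "The refund window is 45 days.",
    "Our policy allows refunds within 30 days of purchase.",
)
# terms -> ["window"]
```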
| Problem | Solution |
|---|---|
| ax: command not found | See ax-setup.md |
| No profile found | No profile is configured. See ax-profiles.md to create one. |
| No input_messages on span | Check span kind -- Chain/Agent spans store prompts on child LLM spans, not on themselves |
| Prompt template is null | Not all instrumentations emit prompt_template. Use input_messages or input.value instead |
| Variables lost after optimization | Verify the revised prompt preserves all {var} placeholders from the original |
| Optimization makes things worse | Check for overfitting -- the meta-prompt may have memorized test data. Ensure few-shot examples are synthetic |
| No eval/annotation columns | Run evaluations first (via Arize UI or SDK), then re-export |
| Experiment output column not found | The column name is {experiment_name}.output -- check exact experiment name via ax experiments get |
| jq errors on span JSON | Ensure you're targeting the correct file path (e.g., trace_*/spans.json) |
Weekly Installs: 133
GitHub Stars: 6
First Seen: Mar 10, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (130), kimi-cli (130), amp (130), cline (130), cursor (130), opencode (130)