arize-prompt-optimization by arize-ai/arize-skills
npx skills add https://github.com/arize-ai/arize-skills --skill arize-prompt-optimization
LLM applications emit spans following OpenInference semantic conventions. Prompts are stored in different span attributes depending on the span kind and instrumentation:
| Column | What it contains | When to use |
|---|---|---|
| attributes.llm.input_messages | Structured chat messages (system, user, assistant, tool) in role-based format | Primary source for chat-based LLM prompts |
| attributes.llm.input_messages.roles | Array of roles: system, user, assistant, tool | Extract individual message roles |
| attributes.llm.input_messages.contents | Array of message content strings | Extract message text |
| attributes.input.value | Serialized prompt or user question (generic, all span kinds) | Fallback when structured messages are not available |
| attributes.llm.prompt_template.template | Template with {variable} placeholders (e.g., "Answer {question} using {context}") | When the app uses prompt templates |
| attributes.llm.prompt_template.variables | Template variable values (JSON object) | See what values were substituted into the template |
| attributes.output.value | Model response text | See what the LLM produced |
| attributes.llm.output_messages | Structured model output (including tool calls) | Inspect tool-calling responses |
Where to look by span kind:

- LLM spans (attributes.openinference.span.kind = 'LLM'): check attributes.llm.input_messages for structured chat messages, or attributes.input.value for a serialized prompt. Check attributes.llm.prompt_template.template for the template.
- Chain/Agent spans: attributes.input.value contains the user's question. The actual LLM prompt lives on child LLM spans -- navigate down the trace tree.
- Tool spans: attributes.input.value has the tool input and attributes.output.value has the tool result. Not typically where prompts live.

These columns carry the feedback data used for optimization:
| Column pattern | Source | What it tells you |
|---|---|---|
| annotation.<name>.label | Human reviewers | Categorical grade (e.g., correct, incorrect, partial) |
| annotation.<name>.score | Human reviewers | Numeric quality score (e.g., 0.0 - 1.0) |
| annotation.<name>.text | Human reviewers | Freeform explanation of the grade |
| eval.<name>.label | LLM-as-judge evals | Automated categorical assessment |
| eval.<name>.score | LLM-as-judge evals | Automated numeric score |
| eval.<name>.explanation | LLM-as-judge evals | Why the eval gave that score -- most valuable for optimization |
| attributes.input.value | Trace data | What went into the LLM |
| attributes.output.value | Trace data | What the LLM produced |
| {experiment_name}.output | Experiment runs | Output from a specific experiment |
Three things are needed: ax CLI, an API key (env var or profile), and a project. A space ID is also needed when using project names.
If ax is not installed, not on PATH, or below version 0.8.0, see ax-setup.md.
Run a quick check for credentials:
macOS/Linux (bash):
ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "ARIZE_DEFAULT_PROJECT: ${ARIZE_DEFAULT_PROJECT:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1
Windows (PowerShell):
ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "ARIZE_DEFAULT_PROJECT: $env:ARIZE_DEFAULT_PROJECT"; Write-Host "--- profiles ---"; ax profiles show 2>&1
Read the output and proceed immediately if either the env var or the profile has an API key. Only ask the user if both are missing. Resolve failures:
- Missing or wrong space ID: run ax spaces list -o json to list all accessible spaces and pick the right one, or ask the user if they prefer to provide it directly.
- Missing project: run ax projects list -o json --limit 100 and present the projects as selectable options.

If ARIZE_DEFAULT_PROJECT is set (visible in the output above), use its value as the project for all commands in this session. Do NOT ask the user for a project ID -- just use it. Continue using this default until the user explicitly provides a different project.
If ARIZE_DEFAULT_PROJECT is not set and no project is provided, ask the user for one.
# List LLM spans (where prompts live)
ax spans list PROJECT_ID --filter "attributes.openinference.span.kind = 'LLM'" --limit 10
# Filter by model
ax spans list PROJECT_ID --filter "attributes.llm.model_name = 'gpt-4o'" --limit 10
# Filter by span name (e.g., a specific LLM call)
ax spans list PROJECT_ID --filter "name = 'ChatCompletion'" --limit 10
# Export all spans in a trace
ax spans export --trace-id TRACE_ID --project PROJECT_ID
# Export a single span
ax spans export --span-id SPAN_ID --project PROJECT_ID
# Extract structured chat messages (system + user + assistant)
jq '.[0] | {
messages: .attributes.llm.input_messages,
model: .attributes.llm.model_name
}' trace_*/spans.json
# Extract the system prompt specifically
jq '[.[] | select(.attributes.llm.input_messages.roles[]? == "system")] | .[0].attributes.llm.input_messages' trace_*/spans.json
# Extract prompt template and variables
jq '.[0].attributes.llm.prompt_template' trace_*/spans.json
# Extract from input.value (fallback for non-structured prompts)
jq '.[0].attributes.input.value' trace_*/spans.json
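Where jq is unavailable, the same fallback order can be sketched in Python. This is an illustrative sketch that assumes the export nests attributes as dicts ({"attributes": {"llm": {...}}}); some exports flatten keys into dotted strings, in which case adjust the lookups:

```python
def extract_prompt(span):
    """Pull the prompt from one exported span, trying the columns in the
    order described above: structured messages, then the prompt template,
    then the generic input.value fallback."""
    attrs = span.get("attributes", {})
    llm = attrs.get("llm") or {}
    if llm.get("input_messages"):
        return {"source": "input_messages", "prompt": llm["input_messages"]}
    template = (llm.get("prompt_template") or {}).get("template")
    if template:
        return {"source": "prompt_template", "prompt": template}
    return {"source": "input.value",
            "prompt": (attrs.get("input") or {}).get("value")}
```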
Once you have the span data, reconstruct the prompt as a messages array:
[
{"role": "system", "content": "You are a helpful assistant that..."},
{"role": "user", "content": "Given {input}, answer the question: {question}"}
]
If the span has attributes.llm.prompt_template.template, the prompt uses variables. Preserve these placeholders ({variable} or {{variable}}) -- they are substituted at runtime.
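One quick way to enforce this is a small, illustrative check that the revised prompt still contains every placeholder from the original:

```python
import re

def check_placeholders(original: str, revised: str) -> set:
    """Return the template placeholders present in the original prompt
    but missing from the revised one. Handles {var} and {{var}} styles."""
    pattern = r"\{\{?\s*(\w+)\s*\}?\}"
    orig_vars = set(re.findall(pattern, original))
    return orig_vars - set(re.findall(pattern, revised))

missing = check_placeholders(
    "Answer {question} using {context}",
    "Answer the {question} concisely.",
)
# missing -> {"context"}
```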
# Find error spans -- these indicate prompt failures
ax spans list PROJECT_ID \
--filter "status_code = 'ERROR' AND attributes.openinference.span.kind = 'LLM'" \
--limit 20
# Find spans with low eval scores
ax spans list PROJECT_ID \
--filter "annotation.correctness.label = 'incorrect'" \
--limit 20
# Find spans with high latency (may indicate overly complex prompts)
ax spans list PROJECT_ID \
--filter "attributes.openinference.span.kind = 'LLM' AND latency_ms > 10000" \
--limit 20
# Export error traces for detailed inspection
ax spans export --trace-id TRACE_ID --project PROJECT_ID
# Export a dataset (ground truth examples)
ax datasets export DATASET_ID
# -> dataset_*/examples.json
# Export experiment results (what the LLM produced)
ax experiments export EXPERIMENT_ID
# -> experiment_*/runs.json
Join the two files by example_id to see inputs alongside outputs and evaluations:
# Count examples and runs
jq 'length' dataset_*/examples.json
jq 'length' experiment_*/runs.json
# View a single joined record
jq -s '
.[0] as $dataset |
.[1][0] as $run |
($dataset[] | select(.id == $run.example_id)) as $example |
{
input: $example,
output: $run.output,
evaluations: $run.evaluations
}
' dataset_*/examples.json experiment_*/runs.json
# Find failed examples (where eval score < threshold)
jq '[.[] | select(.evaluations.correctness.score < 0.5)]' experiment_*/runs.json
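If jq is not available, the same join and failure filter can be sketched in Python. The field names (id, example_id, evaluations.correctness) mirror the jq examples above; treat this as an illustrative sketch of the export shapes, not an SDK call:

```python
import json  # for loading the exported files, as in the commented usage below

def join_failures(examples, runs, evaluator="correctness", threshold=0.5):
    """Join experiment runs to dataset examples on example_id and keep
    the records whose eval score falls below the threshold."""
    by_id = {ex["id"]: ex for ex in examples}
    failures = []
    for run in runs:
        ev = run.get("evaluations", {}).get(evaluator, {})
        if ev.get("score", 1.0) < threshold:
            failures.append({
                "input": by_id.get(run.get("example_id"), {}).get("input"),
                "output": run.get("output"),
                "explanation": ev.get("explanation"),
            })
    return failures

# with open("dataset_x/examples.json") as f:   # hypothetical export paths
#     examples = json.load(f)
# with open("experiment_x/runs.json") as f:
#     runs = json.load(f)
# failures = join_failures(examples, runs)
```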
Look for patterns across failures:
- eval.*.explanation tells you WHY something failed

Use this template to generate an improved version of the prompt. Fill in the placeholders and send it to your LLM (GPT-4o, Claude, etc.):
You are an expert in prompt optimization. Given the original baseline prompt
and the associated performance data (inputs, outputs, evaluation labels, and
explanations), generate a revised version that improves results.
ORIGINAL BASELINE PROMPT
========================
{PASTE_ORIGINAL_PROMPT_HERE}
========================
PERFORMANCE DATA
================
The following records show how the current prompt performed. Each record
includes the input, the LLM output, and evaluation feedback:
{PASTE_RECORDS_HERE}
================
HOW TO USE THIS DATA
1. Compare outputs: Look at what the LLM generated vs what was expected
2. Review eval scores: Check which examples scored poorly and why
3. Examine annotations: Human feedback shows what worked and what didn't
4. Identify patterns: Look for common issues across multiple examples
5. Focus on failures: The rows where the output DIFFERS from the expected
value are the ones that need fixing
ALIGNMENT STRATEGY
- If outputs have extra text or reasoning not present in the ground truth,
remove instructions that encourage explanation or verbose reasoning
- If outputs are missing information, add instructions to include it
- If outputs are in the wrong format, add explicit format instructions
- Focus on the rows where the output differs from the target -- these are
the failures to fix
RULES
Maintain Structure:
- Use the same template variables as the current prompt ({var} or {{var}})
- Don't change sections that are already working
- Preserve the exact return format instructions from the original prompt
Avoid Overfitting:
- DO NOT copy examples verbatim into the prompt
- DO NOT quote specific test data outputs exactly
- INSTEAD: Extract the ESSENCE of what makes good vs bad outputs
- INSTEAD: Add general guidelines and principles
- INSTEAD: If adding few-shot examples, create SYNTHETIC examples that
demonstrate the principle, not real data from above
Goal: Create a prompt that generalizes well to new inputs, not one that
memorizes the test data.
OUTPUT FORMAT
Return the revised prompt as a JSON array of messages:
[
{"role": "system", "content": "..."},
{"role": "user", "content": "..."}
]
Also provide a brief reasoning section (bulleted list) explaining:
- What problems you found
- How the revised prompt addresses each one
Format the records as a JSON array before pasting into the template:
# From dataset + experiment: join and select relevant columns
jq -s '
.[0] as $ds |
[.[1][] | . as $run |
($ds[] | select(.id == $run.example_id)) as $ex |
{
input: $ex.input,
expected: $ex.expected_output,
actual_output: $run.output,
eval_score: $run.evaluations.correctness.score,
eval_label: $run.evaluations.correctness.label,
eval_explanation: $run.evaluations.correctness.explanation
}
]
' dataset_*/examples.json experiment_*/runs.json
# From exported spans: extract input/output pairs with annotations
jq '[.[] | select(.attributes.openinference.span.kind == "LLM") | {
input: .attributes.input.value,
output: .attributes.output.value,
status: .status_code,
model: .attributes.llm.model_name
}]' trace_*/spans.json
After the LLM returns the revised messages array:
1. Extract prompt -> Phase 1 (once)
2. Run experiment -> ax experiments create ...
3. Export results -> ax experiments export EXPERIMENT_ID
4. Analyze failures -> jq to find low scores
5. Run meta-prompt -> Phase 3 with new failure data
6. Apply revised prompt
7. Repeat from step 2
# Compare scores across experiments
# Experiment A (baseline)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_a/runs.json
# Experiment B (optimized)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_b/runs.json
# Find examples that flipped from fail to pass
jq -s '
[.[0][] | select(.evaluations.correctness.label == "incorrect")] as $fails |
[.[1][] | select(.evaluations.correctness.label == "correct") |
select(.example_id as $id | $fails | any(.example_id == $id))
] | length
' experiment_a/runs.json experiment_b/runs.json
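The same comparison can be sketched in Python (illustrative; assumes each runs.json parses to a list of run objects shaped like the jq examples above):

```python
def compare_experiments(runs_a, runs_b, evaluator="correctness"):
    """Mean eval score for each experiment, plus how many examples
    flipped from incorrect in A to correct in B."""
    def mean(runs):
        scores = [r["evaluations"][evaluator]["score"] for r in runs]
        return sum(scores) / len(scores) if scores else 0.0

    failed_a = {r["example_id"] for r in runs_a
                if r["evaluations"][evaluator]["label"] == "incorrect"}
    flipped = sum(1 for r in runs_b
                  if r["example_id"] in failed_a
                  and r["evaluations"][evaluator]["label"] == "correct")
    return {"mean_a": mean(runs_a), "mean_b": mean(runs_b), "flipped": flipped}
```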
Produce the two files with ax experiments export EXP_A and ax experiments export EXP_B.

Apply these when writing or revising prompts:
| Technique | When to apply | Example |
|---|---|---|
| Clear, detailed instructions | Output is vague or off-topic | "Classify the sentiment as exactly one of: positive, negative, neutral" |
| Instructions at the beginning | Model ignores later instructions | Put the task description before examples |
| Step-by-step breakdowns | Complex multi-step processes | "First extract entities, then classify each, then summarize" |
| Specific personas | Need consistent style/tone | "You are a senior financial analyst writing for institutional investors" |
| Delimiter tokens | Sections blend together | Use ---, ###, or XML tags to separate input from instructions |
| Few-shot examples | Output format needs clarification | Show 2-3 synthetic input/output pairs |
| Output length specifications | Responses are too long or short | "Respond in exactly 2-3 sentences" |
| Reasoning instructions | Accuracy is critical | "Think step by step before answering" |
| "I don't know" guidelines | Hallucination is a risk | "If the answer is not in the provided context, say 'I don't have enough information'" |
When optimizing prompts that use template variables:
- Single braces ({variable}): Python f-string / Jinja style. Most common in Arize.
- Double braces ({{variable}}): Mustache style. Used when the framework requires it.

Find failing traces:
ax traces list PROJECT_ID --filter "status_code = 'ERROR'" --limit 5
Export the trace:
ax spans export --trace-id TRACE_ID --project PROJECT_ID
Extract the prompt from the LLM span:
jq '[.[] | select(.attributes.openinference.span.kind == "LLM")][0] | { messages: .attributes.llm.input_messages, template: .attributes.llm.prompt_template, output: .attributes.output.value, error: .attributes.exception.message }' trace_*/spans.json
Identify what failed from the error message or output
Fill in the optimization meta-prompt (Phase 3) with the prompt and error context
Apply the revised prompt
Find the dataset and experiment:
ax datasets list
ax experiments list --dataset-id DATASET_ID
Export both:
ax datasets export DATASET_ID
ax experiments export EXPERIMENT_ID
Prepare the joined data for the meta-prompt
Run the optimization meta-prompt
Create a new experiment with the revised prompt to measure improvement
Export spans where the output format is wrong:
ax spans list PROJECT_ID \
  --filter "attributes.openinference.span.kind = 'LLM' AND annotation.format.label = 'incorrect'" \
  --limit 10 -o json > bad_format.json
Look at what the LLM is producing vs what was expected
Add explicit format instructions to the prompt (JSON schema, examples, delimiters)
Common fix: add a few-shot example showing the exact desired output format
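As a quick triage step -- assuming the expected format is JSON -- a short illustrative script (not part of the ax CLI) can count how many exported outputs fail to parse, before and after the prompt revision:

```python
import json

def non_json_outputs(runs):
    """Return indices of runs whose output is not valid JSON -- a quick
    way to quantify a format problem before revising the prompt."""
    bad = []
    for i, run in enumerate(runs):
        try:
            json.loads(run["output"])
        except (json.JSONDecodeError, TypeError):
            bad.append(i)
    return bad

runs = [{"output": '{"label": "positive"}'},
        {"output": "Sure! The label is positive."}]
# non_json_outputs(runs) -> [1]
```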
Find traces where the model hallucinated:
ax spans list PROJECT_ID \
  --filter "annotation.faithfulness.label = 'unfaithful'" \
  --limit 20
Export and inspect the retriever + LLM spans together:
ax spans export --trace-id TRACE_ID --project PROJECT_ID
jq '[.[] | {kind: .attributes.openinference.span.kind, name, input: .attributes.input.value, output: .attributes.output.value}]' trace_*/spans.json
Check if the retrieved context actually contained the answer
Add grounding instructions to the system prompt: "Only use information from the provided context. If the answer is not in the context, say so."
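A rough, illustrative grounding triage (a naive substring check, not a substitute for a faithfulness eval) is to flag answer terms that never appear in the retrieved context:

```python
def unsupported_terms(answer: str, context: str, min_len: int = 5):
    """Naive grounding check: content words from the answer that never
    appear in the retrieved context. A rough triage signal only."""
    ctx = context.lower()
    words = {w.strip(".,!?\"'").lower() for w in answer.split()}
    return sorted(w for w in words if len(w) >= min_len and w not in ctx)

terms = unsupported_terms(
    "The refund window is 45 days.",
    "Our policy allows refunds within 30 days of purchase.",
)
# terms -> ["window"]
```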
| Problem | Solution |
|---|---|
| ax: command not found | See ax-setup.md |
| No profile found | No profile is configured. See ax-profiles.md to create one. |
| No input_messages on span | Check span kind -- Chain/Agent spans store prompts on child LLM spans, not on themselves |
| Prompt template is null | Not all instrumentations emit prompt_template. Use input_messages or input.value instead |
| Variables lost after optimization | Verify the revised prompt preserves all {var} placeholders from the original |
| Optimization makes things worse | Check for overfitting -- the meta-prompt may have memorized test data. Ensure few-shot examples are synthetic |
| No eval/annotation columns | Run evaluations first (via Arize UI or SDK), then re-export |
| Experiment output column not found | The column name is {experiment_name}.output -- check exact experiment name via ax experiments get |
| jq errors on span JSON | Ensure you're targeting the correct file path (e.g., trace_*/spans.json) |
Weekly Installs: 133
GitHub Stars: 6
First Seen: Mar 10, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (130), kimi-cli (130), amp (130), cline (130), cursor (130), opencode (130)