phoenix-evals by arize-ai/phoenix
`npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals`
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
| Task | Files |
|---|---|
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | `evaluators-code-{python\|typescript}` |
| Build LLM evaluator | `evaluators-llm-{python\|typescript}` |
| Batch evaluate DataFrame | evaluate-dataframe-python (sketch below) |
| Run experiment | `experiments-running-{python\|typescript}` |
| Create dataset | `experiments-datasets-{python\|typescript}` |
| Generate synthetic data | `experiments-synthetic-{python\|typescript}` |
| Validate evaluator accuracy | validation, `validation-evaluators-{python\|typescript}` |
| Sample traces for review | `observe-sampling-{python\|typescript}` |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
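A minimal sketch of the "Batch evaluate DataFrame" row, assuming the `arize-phoenix-evals` Python package and an `OPENAI_API_KEY` in the environment; the template text, column names, and judge model are illustrative, not prescribed by the skill:

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Illustrative template; llm_classify fills {placeholders} from same-named
# DataFrame columns and constrains the judge's output to the rails below.
TEMPLATE = """Decide whether the answer is grounded in the reference.
Reference: {reference}
Answer: {answer}
Reply with exactly one word: grounded or ungrounded."""

df = pd.DataFrame({
    "reference": ["Phoenix traces LLM applications.", "The sky is blue."],
    "answer": ["Phoenix traces LLM apps.", "The sky is green."],
})

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge choice: see fundamentals-model-selection
    template=TEMPLATE,
    rails=["grounded", "ungrounded"],  # binary labels, per Binary > Likert
)
print(results["label"].tolist())  # e.g. ['grounded', 'ungrounded']
```

Note the binary rails rather than a 1-5 score, matching the principles table below.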
Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
Building an Evaluator: fundamentals → common-mistakes-python → evaluators-{code\|llm}-{python\|typescript} → validation-evaluators-{python\|typescript}
RAG Systems: evaluators-rag → evaluators-code-* (retrieval; sketch below) → evaluators-llm-* (faithfulness)
Production: production-overview → production-guardrails → production-continuous
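For the code-first step in the paths above (evaluators-code-*, e.g. retrieval in the RAG path), a deterministic evaluator needs no LLM at all. A minimal sketch with hypothetical field names, returning a binary pass/fail:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool        # binary verdict (Binary > Likert)
    explanation: str    # why, for human review

def retrieval_hit(expected_doc_ids: set[str], retrieved_doc_ids: list[str]) -> EvalResult:
    """Deterministic retrieval check: pass if any expected doc was retrieved."""
    hits = expected_doc_ids.intersection(retrieved_doc_ids)
    return EvalResult(
        passed=bool(hits),
        explanation=f"hit on {sorted(hits)}" if hits else "no expected doc retrieved",
    )

print(retrieval_hit({"doc-3"}, ["doc-1", "doc-3"]))  # EvalResult(passed=True, ...)
```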
| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR (sketch below) |
| Binary > Likert | Pass/fail, not 1-5 |
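A minimal sketch of the judge-validation arithmetic behind the "Validate judges" row, assuming you already have paired human and judge labels (the full recipe is in validation-evaluators-*):

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of the judge against human labels."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

human = [True, True, True, False, False]   # human pass/fail labels
judge = [True, True, False, False, True]   # judge's labels on the same items
tpr, tnr = tpr_tnr(human, judge)
print(f"TPR={tpr:.0%} TNR={tnr:.0%}")      # TPR=67% TNR=50%: below the >80% bar,
                                           # so this judge prompt needs iteration
```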
Weekly Installs: 86
Repository: https://github.com/arize-ai/phoenix
GitHub Stars: 8.8K
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (77), claude-code (74), github-copilot (74), codex (74), opencode (73), cursor (70)