Building an AI Trace Annotation Interface: A Guide to Developing Custom Data Review and Evaluation Tools

build-review-interface by hamelsmu/evals-skills

183 weekly installs

1,100 GitHub Stars


Install command

npx skills add https://github.com/hamelsmu/evals-skills --skill build-review-interface

Tags: AI/Machine Learning, Data Visualization, Testing


Build a Custom Annotation Interface

Overview

Build an HTML page that loads traces from a data source (JSON/CSV file), displays one trace at a time with Pass/Fail buttons, a free-text notes field, and Next/Previous navigation. Save labels to a local file (CSV/SQLite/JSON). Then customize to the domain using the guidelines below.
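
The load and save steps described above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the function names and the `trace_id` / `verdict` / `notes` column names are assumptions, and the JSON file is assumed to hold a list of trace objects.

```python
import csv
import json
from pathlib import Path


def load_traces(path):
    """Load traces from a JSON file (a list of objects) or a CSV file (one row per trace)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported trace file type: {path.suffix}")


def save_labels(labels, path):
    """Write all labels to a CSV file, one row per labeled trace.

    `labels` maps a trace ID to {"verdict": ..., "notes": ...} (hypothetical shape).
    """
    fields = ["trace_id", "verdict", "notes"]
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for trace_id, label in labels.items():
            writer.writerow(
                {
                    "trace_id": trace_id,
                    "verdict": label["verdict"],
                    "notes": label.get("notes", ""),
                }
            )
```

Rewriting the whole label file on each save keeps the sketch simple; for large label sets an append-only log or SQLite may be a better fit.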

Data Display

Format all data in the most human-readable representation for the domain. Emails should look like emails. Code should have syntax highlighting. Markdown should be rendered. Tables should be tables. JSON should be pretty-printed and collapsible.

Collapse repetitive elements. If every trace shares the same system prompt, put it in a <details> toggle.
Extract and surface key metadata. If traces contain a property name, client type, or session ID buried in the data, extract it and display it prominently as a header or badge.
Color-code by role or status. Use left-border colors to distinguish user messages, assistant messages, tool calls, and system prompts at a glance.
Group related elements visually. Tool calls and their responses should be visually linked (indentation, shared border).
Collapse what doesn't help judgment. Verbose tool response JSON, intermediate reasoning steps, and debugging context go behind toggles.
Highlight what matters most. Make the primary content reviewers judge visually dominant. Bold key entities (prices, dates, names). Use font size and spacing to create hierarchy.
Show the full trace. Include all intermediate steps (tool calls, retrieved context, reasoning), not just the final output. Collapse them by default but keep them accessible.
Sanitize rendered content. Strip raw HTML from LLM outputs before rendering. Disable images in rendered markdown if they could be tracking pixels.
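
A few of the display rules above (color-coding by role, pretty-printed JSON, collapsing behind a `<details>` toggle, and neutralizing raw HTML) can be sketched as a small server-side rendering helper. The function name, CSS class names, and colors are hypothetical. Note that this sketch escapes raw HTML so it shows as text rather than stripping it, which is a simpler substitute for the stripping the guideline describes.

```python
import html
import json

# Hypothetical role-to-color mapping for the left border.
ROLE_COLORS = {
    "user": "#2563eb",
    "assistant": "#16a34a",
    "tool": "#d97706",
    "system": "#6b7280",
}


def render_message(role, content, collapsed=False):
    """Return an HTML fragment for one message, escaped and color-coded by role."""
    color = ROLE_COLORS.get(role, "#9ca3af")
    if isinstance(content, (dict, list)):
        # Pretty-print structured payloads such as tool responses.
        pretty = json.dumps(content, indent=2, ensure_ascii=False)
        body = "<pre>" + html.escape(pretty) + "</pre>"
    else:
        # Escaping keeps any raw HTML in LLM output from being rendered.
        body = "<p>" + html.escape(str(content)) + "</p>"
    if collapsed:
        # Collapsed by default, but still one click away.
        body = f"<details><summary>{html.escape(role)}</summary>{body}</details>"
    return (
        f'<div class="msg msg-{html.escape(role)}" '
        f'style="border-left: 4px solid {color}">{body}</div>'
    )
```

For example, `render_message("system", shared_prompt, collapsed=True)` would tuck a repeated system prompt behind a toggle while leaving user and assistant messages expanded.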

Feedback Collection

Annotate at the trace level. The reviewer judges the whole trace, not individual spans.

Binary Pass/Fail buttons as the primary action.
Free-text notes field for the reviewer to describe what went wrong (or right).
Defer button for uncertain cases.
Auto-save on every action.

Once you have established failure categories from error analysis, you can later add predefined failure mode tags as clickable checkboxes, dropdowns or picklists so reviewers can select from known categories in addition to writing notes. But don't add these in the initial build.
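
One way to get auto-save on every action is to write each verdict straight to SQLite as it is recorded. The sketch below assumes a single `labels` table keyed by trace ID; the table layout and function names are illustrative, and failure-mode tags are deliberately left out, in line with the advice above for the initial build.

```python
import sqlite3
from datetime import datetime, timezone

VERDICTS = {"pass", "fail", "defer"}


def open_label_store(path="labels.db"):
    """Open (or create) the SQLite file that holds one row per labeled trace."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels ("
        "trace_id TEXT PRIMARY KEY, "
        "verdict TEXT NOT NULL, "
        "notes TEXT NOT NULL, "
        "updated_at TEXT NOT NULL)"
    )
    return conn


def record_label(conn, trace_id, verdict, notes=""):
    """Store a trace-level verdict and commit immediately, so nothing waits on a Save click."""
    if verdict not in VERDICTS:
        raise ValueError(f"Unknown verdict: {verdict!r}")
    conn.execute(
        "INSERT OR REPLACE INTO labels (trace_id, verdict, notes, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (trace_id, verdict, notes, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Because the trace ID is the primary key, relabeling a trace overwrites the earlier verdict instead of creating a second row.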

Navigation and Status

Next/Previous buttons and keyboard arrow keys.
Trace counter showing position and progress ("12 of 87 remaining").
Jump to specific trace by ID.
Counts of labeled vs unlabeled traces.
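
The navigation state behind these controls is small enough to model directly. The class below is a hypothetical sketch (the name `ReviewSession` and its methods are not from the skill) that tracks the current position, supports jumping by ID, and reports labeled versus unlabeled counts.

```python
class ReviewSession:
    """Track the reviewer's position and progress through a list of trace IDs."""

    def __init__(self, trace_ids):
        self.trace_ids = list(trace_ids)
        self.labels = {}  # trace_id -> verdict
        self.index = 0

    def next(self):
        # Stop at the last trace instead of wrapping around.
        self.index = min(self.index + 1, max(len(self.trace_ids) - 1, 0))

    def previous(self):
        self.index = max(self.index - 1, 0)

    def jump_to(self, trace_id):
        # list.index raises ValueError when the ID is unknown.
        self.index = self.trace_ids.index(trace_id)

    def label_current(self, verdict):
        self.labels[self.trace_ids[self.index]] = verdict

    def status(self):
        total = len(self.trace_ids)
        labeled = sum(1 for t in self.trace_ids if t in self.labels)
        return {
            "position": self.index + 1,
            "total": total,
            "labeled": labeled,
            "unlabeled": total - labeled,
        }
```

The dictionary returned by `status()` carries everything the counter and progress display need, so the UI layer can format it however the domain prefers.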

Keyboard Shortcuts

Arrow keys = Navigate traces
1 = Pass              2 = Fail
D = Defer             U = Undo last action
Cmd+S = Save          Cmd+Enter = Save and next
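
The shortcut table above can be kept as data so that the key handler and any on-screen help stay in sync. In this sketch the key names follow the browser's `KeyboardEvent.key` values, the action names are hypothetical, and mapping the right arrow to "next" and the left arrow to "previous" is an assumption (the table only says the arrow keys navigate).

```python
# (key, cmd_held) -> action name. Action names are illustrative.
SHORTCUTS = {
    ("ArrowRight", False): "next",
    ("ArrowLeft", False): "previous",
    ("1", False): "pass",
    ("2", False): "fail",
    ("d", False): "defer",
    ("u", False): "undo",
    ("s", True): "save",
    ("Enter", True): "save_and_next",
}


def action_for_key(key, meta=False):
    """Return the action bound to a key press, or None if the key is unbound."""
    # Single characters are matched case-insensitively so "D" and "d" both defer.
    normalized = key.lower() if len(key) == 1 else key
    return SHORTCUTS.get((normalized, meta))
```

A handler would typically ignore these shortcuts while focus is in the notes field, so that typing "1" or "d" in a note does not label the trace.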

Selecting Traces to Load

Build the app to accept traces from any source (JSON/CSV file). Keep sampling logic outside the app in a separate script. Start with random sampling.
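
A separate sampling step can be as small as the function below, which draws a reproducible random sample before the app ever sees the data. The function name and the fixed default seed are assumptions made for illustration.

```python
import random


def sample_traces(traces, n, seed=0):
    """Return up to n traces chosen uniformly at random, reproducibly for a given seed."""
    traces = list(traces)
    if n >= len(traces):
        return traces
    # A dedicated Random instance avoids disturbing global random state.
    return random.Random(seed).sample(traces, n)
```

Fixing the seed means two reviewers who run the script on the same file get the same set of traces to label.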

Additional Features

Reference panel: Toggle-able panel showing ground truth, expected answers, or rubric definitions alongside the trace.

Filtering: Filter traces by metadata dimensions relevant to the product (channel, user type, pipeline version).

Clustering: Group traces by metadata or semantic similarity. Show representative traces per cluster with drill-down.
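
Filtering and metadata-based grouping can be sketched with plain dictionaries. The example assumes each trace carries a `metadata` dict (a hypothetical shape) and picks the first trace in each group as its representative; clustering by semantic similarity would need embeddings and is not covered here.

```python
from collections import defaultdict


def filter_traces(traces, **criteria):
    """Keep traces whose metadata matches every key=value pair in criteria."""
    return [
        t
        for t in traces
        if all(t.get("metadata", {}).get(k) == v for k, v in criteria.items())
    ]


def group_by_metadata(traces, key):
    """Group traces by one metadata field; traces missing the field land under None."""
    groups = defaultdict(list)
    for t in traces:
        groups[t.get("metadata", {}).get(key)].append(t)
    return dict(groups)


def representatives(groups):
    """Pick the first trace of each group as a simple stand-in for the cluster."""
    return {name: members[0] for name, members in groups.items()}
```

For instance, `filter_traces(traces, channel="email")` narrows the queue to one channel, and `representatives(group_by_metadata(traces, "user_type"))` gives one trace per user type to start a drill-down from.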

Design Checklist

Same layout, controls, and terminology on every trace
Pass and Fail buttons are visually distinct (color, size)
Keyboard shortcuts work for all primary actions
Full trace accessible even when sections are collapsed
Labels persist automatically without explicit save
Trace-level annotation (not span-level) as the default
All data rendered in its native format (markdown as HTML, code with highlighting, JSON pretty-printed, tables as HTML tables, URLs as clickable links)

Testing

After building the interface, verify it with Playwright.

Visual review: Take screenshots of the interface with representative trace data loaded. Review each screenshot for:

Layout and spacing: is the visual hierarchy clear? Can you immediately see what matters?
Readability: is all data rendered in its native format? Are there any raw JSON blobs, unrendered markdown, or unstyled content?
Aesthetics: does the interface look professional and clean? Would a domain expert use this?
Responsiveness: does the layout hold at different window sizes?
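
The screenshot step might look like the sketch below, which takes a Playwright `page` object and saves one full-page screenshot per window size. The viewport sizes, file names, and the `capture_screenshots` helper itself are assumptions; only `goto`, `set_viewport_size`, and `screenshot` are Playwright calls.

```python
from pathlib import Path

# Hypothetical viewport sizes (width, height) chosen for illustration.
VIEWPORTS = {"desktop": (1440, 900), "laptop": (1024, 768), "narrow": (480, 800)}


def capture_screenshots(page, url, out_dir="screenshots"):
    """Load the app once, then save one full-page screenshot per viewport size."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page.goto(url)
    saved = []
    for name, (width, height) in VIEWPORTS.items():
        page.set_viewport_size({"width": width, "height": height})
        target = out / f"{name}.png"
        page.screenshot(path=str(target), full_page=True)
        saved.append(target)
    return saved
```

With Playwright for Python installed, a `page` can be obtained inside `with sync_playwright() as p:` via `p.chromium.launch()` and `browser.new_page()`. The saved images still need a human (or model) pass against the four questions above.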

Functional test: Write a Playwright script that performs a full annotation workflow:

Load the app and verify traces are displayed
Click Pass on a trace, verify the label is saved
Click Fail on a trace, add a note, verify both are saved
Click Defer, verify it is recorded
Navigate forward and backward with buttons and keyboard shortcuts
Verify the trace counter updates correctly
Verify auto-save by reloading the page and checking labels persist
Expand collapsed sections (system prompts, tool calls) and verify content is accessible
Test that all keyboard shortcuts trigger the correct actions
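
Part of the workflow above can be sketched as a function that drives a Playwright `page`. Every selector here (`#trace`, `#pass-btn`, `#fail-btn`, `#prev-btn`, `#next-btn`, `#notes`, `#trace-counter`, `#current-label`, `#labeled-count`) is hypothetical and would need to match the real interface. The sketch covers Pass, Fail with a note, navigation, and the reload check; Defer, collapsed sections, and the remaining shortcuts would follow the same pattern.

```python
def run_annotation_workflow(page, url):
    """Drive one pass through the review UI; raises AssertionError on the first failed check."""
    page.goto(url)
    page.wait_for_selector("#trace")
    first_counter = page.inner_text("#trace-counter")

    # Pass the first trace and confirm the saved verdict is shown.
    page.click("#pass-btn")
    assert page.inner_text("#current-label").strip() == "pass"

    # Move forward with the keyboard and check that the counter changed.
    page.keyboard.press("ArrowRight")
    assert page.inner_text("#trace-counter") != first_counter

    # Fail the second trace with a note.
    note = "Quoted the wrong date"
    page.fill("#notes", note)
    page.click("#fail-btn")
    assert page.inner_text("#current-label").strip() == "fail"

    # Navigate back and forth with the buttons; labels and notes should stick.
    page.click("#prev-btn")
    assert page.inner_text("#current-label").strip() == "pass"
    page.click("#next-btn")
    assert page.input_value("#notes") == note

    # Reload to check that both labels were auto-saved.
    page.reload()
    page.wait_for_selector("#trace")
    assert page.inner_text("#labeled-count").strip() == "2"
```

Taking `page` as a parameter keeps the workflow independent of how the browser is launched, so the same function can run headless in CI or headed during local debugging.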

Weekly Installs: 134
Repository: hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Installed on

codex: 132
gemini-cli: 131
kimi-cli: 131
github-copilot: 131
cursor: 131
opencode: 131
