qa-agent-testing by vasilyu1983/ai-agents-public
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-agent-testing为 LLM 智能体/角色设计和运行可靠的评估套件,包括使用工具和多智能体系统。
使用 assets/ 中的复制粘贴模板进行第 0 天设置。
分别评估推理层和行动层:
| 层级 | 测试内容 | 关键指标 |
|---|---|---|
| 推理 | 规划、决策、意图 | 意图解析、任务粘附性、上下文保持 |
| 行动 | 工具调用、执行、副作用 |
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
| 工具调用准确性、完成率、错误恢复 |
| 维度 | 衡量内容 | 层级 |
|---|---|---|
| 任务成功率 | 正确的结果和满足的约束 | 智能体 |
| 安全性/策略 | 正确的拒绝和安全替代方案 | 智能体 |
| 可靠性 | 多次运行和小幅提示更改下的稳定性 | 智能体 |
| 延迟/成本 | 每个任务和每个套件的预算 | 业务 |
| 可调试性 | 失败时产生证据 | 智能体 |
| 事实依据 | 幻觉率、引用准确性 | 模型 |
| 偏见检测 | 跨人口统计输入的公平性 | 模型 |
应做:
避免:
| 需求 | 使用 | 位置 |
|---|---|---|
| 构建 10 个任务 | 任务模式 + 示例 | references/test-case-design.md |
| 设计拒绝案例 | 拒绝类别 + 模板 | references/refusal-patterns.md |
| 评分运行 | 详细评分标准 + 阈值 | references/scoring-rubric.md |
| 快速计算套件统计 | CLI 工具脚本 | scripts/score_suite.py |
| 管理回归 | 重新运行工作流 + 基线策略 | references/regression-protocol.md |
| 沙箱化工具 | 隔离层级 + 加固 | references/tool-sandboxing.md |
| 测试多智能体系统 | 协调模式 + 套件模板 | references/multi-agent-testing.md |
| 安全使用 LLM 作为评判者 | 偏见 + 缓解措施 | references/llm-judge-limitations.md |
| 测试提示注入攻击 | 注入分类 + 测试用例 | references/prompt-injection-testing.md |
| 检测幻觉 | 检测方法 + 评分 | references/hallucination-detection.md |
| 设计评估数据集 | 数据集构建 + 维护 | references/eval-dataset-design.md |
| 从模板开始 | 测试框架 + 评分表 + 日志 | assets/ |
测试一个智能体?
- 新智能体?
- 创建 QA 测试框架 -> 定义 10 个任务 + 5 个拒绝案例 -> 运行基线测试
- 提示改变了?
- 重新运行完整的 15 项检查套件 -> 与基线比较
- 工具/知识改变了?
- 重新运行受影响的测试 -> 记录到回归日志
- 质量审查?
- 根据评分标准评分 -> 识别薄弱环节 -> 修复提示
scripts/score_suite.py 计算平均值、标准化分数和基本的 PASS/CONDITIONAL/FAIL 分类。references/scoring-rubric.md。references/test-case-design.md - 10 任务模式 + 验证 + 变形附加项references/refusal-patterns.md - 拒绝类别 + 响应模板 + 测试策略references/scoring-rubric.md - 评分指南、阈值、差异指标、评判者校准references/regression-protocol.md - 重新运行范围、基线策略、恢复流程references/tool-sandboxing.md - 沙箱层级、工具加固、注入/渗出测试思路references/multi-agent-testing.md - 协调测试模式 + 套件模板references/llm-judge-limitations.md - LLM 作为评判者的偏见、局限性、缓解措施references/prompt-injection-testing.md - 注入分类法、测试用例和防御验证references/hallucination-detection.md - 幻觉检测方法、评分和基准references/eval-dataset-design.md - 评估数据集构建、版本控制和维护assets/qa-harness-template.md - 可复制粘贴的测试框架assets/scoring-sheet.md - 评分跟踪表assets/regression-log.md - 版本跟踪日志查看 data/sources.json 获取:
成功标准:10 个任务中每个得分 >= 12/18,每个拒绝案例得分 >= 2/3,多次运行结果稳定,且没有新的严重失败。
每周安装数
76
代码仓库
GitHub 星标数
49
首次出现
Jan 23, 2026
安全审计
安装于
opencode59
gemini-cli58
codex57
cursor57
github-copilot55
cline50
Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.
Use the copy-paste templates in assets/ for day-0 setup.
Evaluate reasoning and action layers separately:
| Layer | What to Test | Key Metrics |
|---|---|---|
| Reasoning | Planning, decision-making, intent | Intent resolution, task adhesion, context retention |
| Action | Tool calls, execution, side effects | Tool call accuracy, completion rate, error recovery |
| Dimension | What to Measure | Level |
|---|---|---|
| Task success | Correct outcome and constraints met | Agent |
| Safety/policy | Correct refusals and safe alternatives | Agent |
| Reliability | Stability across reruns and small prompt changes | Agent |
| Latency/cost | Budgets per task and per suite | Business |
| Debuggability | Failures produce evidence (logs, traces) | Agent |
| Factual grounding | Hallucination rate, citation accuracy | Model |
| Bias detection | Fairness across demographic inputs | Model |
Do:
Avoid:
| Need | Use | Location |
|---|---|---|
| Build the 10 tasks | Task patterns + examples | references/test-case-design.md |
| Design refusals | Refusal categories + templates | references/refusal-patterns.md |
| Score runs | Detailed rubric + thresholds | references/scoring-rubric.md |
| Compute suite math quickly | CLI utility script | scripts/score_suite.py |
| Manage regressions | Re-run workflow + baseline policy |
Testing an agent?
- New agent?
- Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
- Prompt changed?
- Re-run full 15-check suite -> Compare to baseline
- Tool/knowledge changed?
- Re-run affected tests -> Log in regression log
- Quality review?
- Score against rubric -> Identify weak areas -> Fix prompt
scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.references/scoring-rubric.md.references/test-case-design.md - 10-task patterns + validation + metamorphic add-onsreferences/refusal-patterns.md - refusal categories + response templates + test tacticsreferences/scoring-rubric.md - scoring guide, thresholds, variance metrics, judge calibrationreferences/regression-protocol.md - re-run scope, baseline policy, recovery proceduresreferences/tool-sandboxing.md - sandbox tiers, tool hardening, injection/exfil test ideasreferences/multi-agent-testing.md - coordination testing patterns + suite templatereferences/llm-judge-limitations.md - LLM-as-judge biases, limits, mitigationsreferences/prompt-injection-testing.md - Injection taxonomy, test cases, and defense validationassets/qa-harness-template.md - copy-paste harnessassets/scoring-sheet.md - scoring trackerassets/regression-log.md - version trackingSee data/sources.json for:
Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.
Weekly Installs
76
Repository
GitHub Stars
49
First Seen
Jan 23, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykPass
Installed on
opencode59
gemini-cli58
codex57
cursor57
github-copilot55
cline50
AI Elements:基于shadcn/ui的AI原生应用组件库,快速构建对话界面
69,600 周安装
references/regression-protocol.md| Sandbox tools | Isolation tiers + hardening | references/tool-sandboxing.md |
| Test multi-agent systems | Coordination patterns + suite template | references/multi-agent-testing.md |
| Use LLM-as-judge safely | Biases + mitigations | references/llm-judge-limitations.md |
| Test prompt injection attacks | Injection taxonomy + test cases | references/prompt-injection-testing.md |
| Detect hallucinations | Detection methods + scoring | references/hallucination-detection.md |
| Design eval datasets | Dataset construction + maintenance | references/eval-dataset-design.md |
| Start from templates | Harness + scoring sheet + log | assets/ |
references/hallucination-detection.md - Hallucination detection methods, scoring, and benchmarksreferences/eval-dataset-design.md - Evaluation dataset construction, versioning, and maintenance