LLM智能体QA测试指南：构建可靠评估套件与自动化测试流程

qa-agent-testing by vasilyu1983/ai-agents-public

76 周安装量

49 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-agent-testing

AI/机器学习自动化测试

🇨🇳中文介绍

QA Agent Testing (Jan 2026)

为 LLM 智能体/角色设计和运行可靠的评估套件，包括使用工具和多智能体系统。

默认 QA 工作流程

定义待测角色：范围、非范围和安全边界。
定义 10 个代表性任务。
定义 5 个拒绝处理的边界情况。
定义输出契约。
在确定性控制和工具追踪下运行套件。
使用 6 维度评分标准进行评分；跟踪多次运行间的差异。
记录基线和回归情况；根据阈值控制合并/部署。

使用 assets/ 中的复制粘贴模板进行第 0 天设置。

确定性和稳定性控制

控制输入：固定提示/配置、测试夹具、稳定的工具响应，尽可能冻结时间/时区。
控制采样：在支持的情况下使用固定的种子/温度；记录模型/配置版本。
记录工具追踪：工具名称、参数、输出、延迟、错误、重试和副作用。

双层评估

分别评估推理层和行动层：

层级	测试内容	关键指标
推理	规划、决策、意图	意图解析、任务粘附性、上下文保持
行动	工具调用、执行、副作用

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

维度	衡量内容	层级
任务成功率	正确的结果和满足的约束	智能体
安全性/策略	正确的拒绝和安全替代方案	智能体
可靠性	多次运行和小幅提示更改下的稳定性	智能体
延迟/成本	每个任务和每个套件的预算	业务
可调试性	失败时产生证据	智能体
事实依据	幻觉率、引用准确性	模型
偏见检测	跨人口统计输入的公平性	模型

鲁棒性和安全性测试

变形测试：运行保持语义的小幅提示/输入改写；对输出强制执行不变性。
提示注入测试：将工具输出、检索到的文本和用户提供的文档视为不可信；验证智能体不会遵循与系统/开发者约束相冲突的嵌入指令。
工具故障注入：模拟超时、重试、部分数据和工具错误；验证优雅恢复。
差异测试：比较不同模型/配置版本的行为，以发现回归和意外变化。

除了人工审查外，使用客观的预言机。
对有问题的评估进行隔离，并指定负责人和过期时间，就像处理 CI 中的不稳定测试一样。

仅评估没有工具故障和对抗性输入的“理想提示”。
让自我评估替代真实情况检查。

需求	使用	位置
构建 10 个任务	任务模式 + 示例	`references/test-case-design.md`
设计拒绝案例	拒绝类别 + 模板	`references/refusal-patterns.md`
评分运行	详细评分标准 + 阈值	`references/scoring-rubric.md`
快速计算套件统计	CLI 工具脚本	`scripts/score_suite.py`
管理回归	重新运行工作流 + 基线策略	`references/regression-protocol.md`
沙箱化工具	隔离层级 + 加固	`references/tool-sandboxing.md`
测试多智能体系统	协调模式 + 套件模板	`references/multi-agent-testing.md`
安全使用 LLM 作为评判者	偏见 + 缓解措施	`references/llm-judge-limitations.md`
测试提示注入攻击	注入分类 + 测试用例	`references/prompt-injection-testing.md`
检测幻觉	检测方法 + 评分	`references/hallucination-detection.md`
设计评估数据集	数据集构建 + 维护	`references/eval-dataset-design.md`
从模板开始	测试框架 + 评分表 + 日志	`assets/`

测试一个智能体？
  - 新智能体？
    - 创建 QA 测试框架 -> 定义 10 个任务 + 5 个拒绝案例 -> 运行基线测试
  - 提示改变了？
    - 重新运行完整的 15 项检查套件 -> 与基线比较
  - 工具/知识改变了？
    - 重新运行受影响的测试 -> 记录到回归日志
  - 质量审查？
    - 根据评分标准评分 -> 识别薄弱环节 -> 修复提示

使用 6 维度评分标准对每次运行进行评分。
优先采用考虑差异的套件级门控；避免将非确定性视为免费通行证。
使用 scripts/score_suite.py 计算平均值、标准化分数和基本的 PASS/CONDITIONAL/FAIL 分类。
有关详细方法，请参阅 references/scoring-rubric.md。

references/test-case-design.md - 10 任务模式 + 验证 + 变形附加项
references/refusal-patterns.md - 拒绝类别 + 响应模板 + 测试策略
references/scoring-rubric.md - 评分指南、阈值、差异指标、评判者校准
references/regression-protocol.md - 重新运行范围、基线策略、恢复流程
references/tool-sandboxing.md - 沙箱层级、工具加固、注入/渗出测试思路
references/multi-agent-testing.md - 协调测试模式 + 套件模板
references/llm-judge-limitations.md - LLM 作为评判者的偏见、局限性、缓解措施
references/prompt-injection-testing.md - 注入分类法、测试用例和防御验证
references/hallucination-detection.md - 幻觉检测方法、评分和基准
references/eval-dataset-design.md - 评估数据集构建、版本控制和维护

assets/qa-harness-template.md - 可复制粘贴的测试框架
assets/scoring-sheet.md - 评分跟踪表
assets/regression-log.md - 版本跟踪日志

查看 data/sources.json 获取：

LLM 评估研究
红队方法
提示测试框架

qa-testing-strategy : ../qa-testing-strategy/SKILL.md - 通用测试策略
ai-prompt-engineering : ../ai-prompt-engineering/SKILL.md - 提示设计模式

复制 assets/qa-harness-template.md
填写 PUT 部分
为你的智能体定义 10 个代表性任务
添加 5 个拒绝处理的边界情况
指定输出契约
运行基线测试
将结果记录到回归日志

成功标准：10 个任务中每个得分 >= 12/18，每个拒绝案例得分 >= 2/3，多次运行结果稳定，且没有新的严重失败。

在给出最终答案前，使用网络搜索/网页抓取来验证当前的外部事实、版本、价格、截止日期、法规或平台行为。
优先使用一手来源；对于易变信息，报告来源链接和日期。
如果无法访问网络，请说明限制并将指导标记为未经验证。

🇺🇸English

QA Agent Testing (Jan 2026)

Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.

Default QA Workflow

Define the Persona Under Test (PUT): scope, out-of-scope, and safety boundaries.
Define 10 representative tasks (Must Ace).
Define 5 refusal edge cases (Must Decline + redirect).
Define an output contract (format, tone, structure, citations).
Run the suite with determinism controls and tool tracing.
Score with the 6-dimension rubric; track variance across reruns.
Log baselines and regressions; gate merges/deploys on thresholds.

Use the copy-paste templates in assets/ for day-0 setup.

Determinism and Flake Control

Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
Control sampling: fixed seeds/temperatures where supported; log model/config versions.
Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.

Two-Layer Evaluation (2026)

Evaluate reasoning and action layers separately:

Layer	What to Test	Key Metrics
Reasoning	Planning, decision-making, intent	Intent resolution, task adhesion, context retention
Action	Tool calls, execution, side effects	Tool call accuracy, completion rate, error recovery

Evaluation Dimensions (Score What Matters)

Dimension	What to Measure	Level
Task success	Correct outcome and constraints met	Agent
Safety/policy	Correct refusals and safe alternatives	Agent
Reliability	Stability across reruns and small prompt changes	Agent
Latency/cost	Budgets per task and per suite	Business
Debuggability	Failures produce evidence (logs, traces)	Agent
Factual grounding	Hallucination rate, citation accuracy	Model
Bias detection	Fairness across demographic inputs	Model

CI Economics

PR gate: small, high-signal smoke eval suite.
Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (track separately from quality scoring).

Robustness and Security Tests (Recommended)

Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.

Do / Avoid

Do:

Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
Quarantine flaky evals with owners and expiry, just like flaky tests in CI.

Avoid:

Evaluating only "happy prompts" with no tool failures and no adversarial inputs.
Letting self-evaluations substitute for ground-truth checks.

Quick Reference

Need	Use	Location
Build the 10 tasks	Task patterns + examples	`references/test-case-design.md`
Design refusals	Refusal categories + templates	`references/refusal-patterns.md`
Score runs	Detailed rubric + thresholds	`references/scoring-rubric.md`
Compute suite math quickly	CLI utility script	`scripts/score_suite.py`
Manage regressions	Re-run workflow + baseline policy

Decision Tree

Testing an agent?
  - New agent?
    - Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
  - Prompt changed?
    - Re-run full 15-check suite -> Compare to baseline
  - Tool/knowledge changed?
    - Re-run affected tests -> Log in regression log
  - Quality review?
    - Score against rubric -> Identify weak areas -> Fix prompt

Scoring and Gates

Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
Use scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.
For detailed methodology (including judge calibration and variance metrics), see references/scoring-rubric.md.

Navigation

Resources

references/test-case-design.md - 10-task patterns + validation + metamorphic add-ons
references/refusal-patterns.md - refusal categories + response templates + test tactics
references/scoring-rubric.md - scoring guide, thresholds, variance metrics, judge calibration
references/regression-protocol.md - re-run scope, baseline policy, recovery procedures
references/tool-sandboxing.md - sandbox tiers, tool hardening, injection/exfil test ideas
references/multi-agent-testing.md - coordination testing patterns + suite template
references/llm-judge-limitations.md - LLM-as-judge biases, limits, mitigations
references/prompt-injection-testing.md - Injection taxonomy, test cases, and defense validation

Templates

assets/qa-harness-template.md - copy-paste harness
assets/scoring-sheet.md - scoring tracker
assets/regression-log.md - version tracking

External Resources

See data/sources.json for:

LLM evaluation research
Red-teaming methodologies
Prompt testing frameworks

Related Skills

qa-testing-strategy : ../qa-testing-strategy/SKILL.md - General testing strategies
ai-prompt-engineering : ../ai-prompt-engineering/SKILL.md - Prompt design patterns

Quick Start

Copy assets/qa-harness-template.md
Fill in PUT (Persona Under Test) section
Define 10 representative tasks for your agent
Add 5 refusal edge cases
Specify output contracts
Run baseline test
Log results in regression log

Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.

Fact-Checking

Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
Prefer primary sources; report source links and dates for volatile information.
If web access is unavailable, state the limitation and mark guidance as unverified.

Weekly Installs

Repository

vasilyu1983/ai-…s-public

GitHub Stars

First Seen

Jan 23, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

opencode59

gemini-cli58

codex57

cursor57

github-copilot55

cline50

AI Elements：基于shadcn/ui的AI原生应用组件库，快速构建对话界面

69,600 周安装

references/regression-protocol.md

references/hallucination-detection.md - Hallucination detection methods, scoring, and benchmarks

references/eval-dataset-design.md - Evaluation dataset construction, versioning, and maintenance

LLM智能体QA测试指南：构建可靠评估套件与自动化测试流程

🇨🇳中文介绍

QA Agent Testing (Jan 2026)

默认 QA 工作流程

确定性和稳定性控制

双层评估

相关 Skills

评估维度

CI 经济学

鲁棒性和安全性测试

应做 / 避免

快速参考

决策树

评分与门控

导航

资源

模板

外部资源

相关技能

快速开始

事实核查

🇺🇸English

QA Agent Testing (Jan 2026)

Default QA Workflow

Determinism and Flake Control

Two-Layer Evaluation (2026)

Evaluation Dimensions (Score What Matters)

CI Economics

Robustness and Security Tests (Recommended)

Do / Avoid

Quick Reference

Decision Tree

Scoring and Gates

Navigation

Resources

Templates

External Resources

Related Skills

Quick Start

Fact-Checking

最新 Skills