mcp-builder by bobmatnyc/claude-mpm-skills
npx skills add https://github.com/bobmatnyc/claude-mpm-skills --skill mcp-builder
This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.
<evaluation>
<qa_pair>
<question>Your question here</question>
<answer>Single verifiable answer</answer>
</qa_pair>
</evaluation>
The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP servers to answer realistic and difficult questions.
Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
1. Questions MUST be independent
2. Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use
3. Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX
4. Questions must require deep exploration
5. Questions may require extensive paging
6. Questions must require deep understanding
7. Questions must not be solvable with straightforward keyword search
8. Questions should stress-test tool return values
9. Questions should MOSTLY reflect real human use cases
10. Questions may require dozens of tool calls
    * This challenges LLMs with limited context
    * Encourages MCP server tools to reduce information returned
11. Include ambiguous questions
    * May be ambiguous OR require difficult decisions on which tools to call
    * Force the LLM to potentially make mistakes or misinterpret
    * Ensure that despite AMBIGUITY, there is STILL A SINGLE VERIFIABLE ANSWER
12. Do not ask questions that rely on "current state", which is dynamic
    * For example, do not count:
        * Number of reactions to a post
        * Number of replies to a thread
        * Number of members in a channel
13. DO NOT let the MCP server RESTRICT the kinds of questions you create
    * Create challenging and complex questions
    * Some may not be solvable with the available MCP server tools
    * Questions may require specific output formats (datetime vs. epoch time, JSON vs. MARKDOWN)
    * Questions may require dozens of tool calls to complete
Answers must be STABLE/STATIONARY
Answers must be CLEAR and UNAMBIGUOUS
Answers must be DIVERSE
Answers must NOT be complex structures
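These answer constraints can be checked mechanically before spending an API run on a flawed evaluation. A heuristic lint sketch — the specific checks and thresholds are illustrative, not part of the provided harness:

```python
def lint_answer(answer: str) -> list[str]:
    """Flag answers that are likely ambiguous or too complex to verify."""
    problems = []
    if "\n" in answer:
        problems.append("answer spans multiple lines (not a simple value)")
    if answer.count(",") >= 3:
        problems.append("answer looks like a list (ambiguous format/ordering)")
    if not answer.strip():
        problems.append("answer is empty")
    return problems

# A single clear value passes; a comma-separated list is flagged.
assert lint_answer("Website Redesign") == []
assert lint_answer("repo1, repo2, repo3, data-pipeline, ml-tools") != []
```

Checks for stability (whether the answer can drift over time) cannot be automated this way and still require human review.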
1. Read the documentation of the target API to understand:
2. List the tools available in the MCP server:
3. Repeat steps 1 & 2 until you have a good understanding:
4. After understanding the API and tools, USE the MCP server tools:
    * Use the limit parameter to limit results (<10)

After inspecting the content, create 10 human-readable questions:
Each QA pair consists of a question and an answer. The output should be an XML file with this structure:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
<qa_pair>
<question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
<answer>7</answer>
</qa_pair>
<qa_pair>
<question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
<answer>data-pipeline</answer>
</qa_pair>
</evaluation>
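An evaluation file in this shape can also be generated programmatically. A sketch using Python's standard library — the sample pair is taken from the example above:

```python
import xml.etree.ElementTree as ET

def build_evaluation(qa_pairs):
    """Build an <evaluation> XML tree from (question, answer) tuples."""
    root = ET.Element("evaluation")
    for question, answer in qa_pairs:
        pair = ET.SubElement(root, "qa_pair")
        ET.SubElement(pair, "question").text = question
        ET.SubElement(pair, "answer").text = answer
    return root

pairs = [
    ("Find the repository with the most stars that was created before 2023. "
     "What is the repository name?", "data-pipeline"),
]
xml_text = ET.tostring(build_evaluation(pairs), encoding="unicode")
```

ElementTree escapes special characters in questions and answers automatically, which avoids hand-editing mistakes in the XML.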
Example 1: Multi-hop question requiring deep exploration (GitHub MCP)
<qa_pair>
<question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
<answer>Python</answer>
</qa_pair>
This question is good because:
Example 2: Requires understanding context without keyword matching (Project Management MCP)
<qa_pair>
<question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
<answer>Product Manager</answer>
</qa_pair>
This question is good because:
Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)
<qa_pair>
<question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
<answer>alex_eng</answer>
</qa_pair>
This question is good because:
Example 4: Requires synthesis across multiple data types (CRM MCP)
<qa_pair>
<question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
<answer>Healthcare</answer>
</qa_pair>
This question is good because:
Example 1: Answer changes over time
<qa_pair>
<question>How many open issues are currently assigned to the engineering team?</question>
<answer>47</answer>
</qa_pair>
This question is poor because:
Example 2: Too easy with keyword search
<qa_pair>
<question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
<answer>developer123</answer>
</qa_pair>
This question is poor because:
Example 3: Ambiguous answer format
<qa_pair>
<question>List all the repositories that have Python as their primary language.</question>
<answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
</qa_pair>
This question is poor because:
After creating evaluations:
* Remove any <qa_pair> that requires WRITE or DESTRUCTIVE operations
* Remember to parallelize solving tasks to avoid running out of context, then accumulate all answers and make changes to the file at the end
After creating the evaluation file, use the provided evaluation harness to test the MCP server.
1. Install Dependencies
pip install -r scripts/requirements.txt
Or install manually:
pip install anthropic mcp
2. Set API Key
export ANTHROPIC_API_KEY=your_api_key_here
Evaluation files use XML format with <qa_pair> elements:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
</evaluation>
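For tooling built around the harness, this format parses directly with Python's standard library. A sketch — the sample content is illustrative:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<evaluation>
  <qa_pair>
    <question>Which user closed the most issues?</question>
    <answer>sarah_dev</answer>
  </qa_pair>
</evaluation>"""

def parse_evaluation(xml_text):
    """Parse evaluation XML into a list of (question, answer) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (p.findtext("question", "").strip(), p.findtext("answer", "").strip())
        for p in root.findall("qa_pair")
    ]

pairs = parse_evaluation(SAMPLE)
```

Stripping whitespace matters here because pretty-printed XML carries indentation inside the text nodes.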
The evaluation script (scripts/evaluation.py) supports three transport types: stdio, sse, and http.
Important: for sse and http, start the server yourself before running the evaluation; for stdio, the script launches the server for you.
For locally-run MCP servers (script launches the server automatically):
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
evaluation.xml
With environment variables:
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
-e API_KEY=abc123 \
-e DEBUG=true \
evaluation.xml
For SSE-based MCP servers (start the server first):
python scripts/evaluation.py \
-t sse \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
-H "X-Custom-Header: value" \
evaluation.xml
For HTTP-based MCP servers (start the server first):
python scripts/evaluation.py \
-t http \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
evaluation.xml
usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
[-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
[-H HEADERS [HEADERS ...]] [-o OUTPUT]
eval_file
positional arguments:
eval_file Path to evaluation XML file
optional arguments:
-h, --help Show help message
-t, --transport Transport type: stdio, sse, or http (default: stdio)
-m, --model Claude model to use (default: claude-3-7-sonnet-20250219)
-o, --output Output file for report (default: print to stdout)
stdio options:
-c, --command Command to run MCP server (e.g., python, node)
-a, --args Arguments for the command (e.g., server.py)
-e, --env Environment variables in KEY=VALUE format
sse/http options:
-u, --url MCP server URL
-H, --header HTTP headers in 'Key: Value' format
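The help text above maps onto a standard argparse layout. A minimal sketch consistent with it — this is not the actual scripts/evaluation.py source, and with nargs='+' options the positional eval_file parses most reliably when placed first:

```python
import argparse

def build_parser():
    """Argument parser matching the documented evaluation.py interface."""
    parser = argparse.ArgumentParser(prog="evaluation.py")
    parser.add_argument("eval_file", help="Path to evaluation XML file")
    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"],
                        default="stdio")
    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219")
    parser.add_argument("-o", "--output")
    # stdio options
    parser.add_argument("-c", "--command")
    parser.add_argument("-a", "--args", nargs="+", default=[])
    parser.add_argument("-e", "--env", nargs="+", default=[], metavar="KEY=VALUE")
    # sse/http options
    parser.add_argument("-u", "--url")
    parser.add_argument("-H", "--header", nargs="+", default=[], dest="headers")
    return parser

opts = build_parser().parse_args(
    ["evaluation.xml", "-t", "stdio", "-c", "python", "-a", "server.py"]
)
```

Placing eval_file before -a avoids argparse's greedy nargs='+' consuming the positional argument.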
The evaluation script generates a detailed report including:
* Summary Statistics
* Per-Task Results
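The two report sections relate by simple aggregation. A sketch with hypothetical field names — the real report's fields may differ:

```python
# Hypothetical per-task results; the actual report schema may differ.
tasks = [
    {"question": "Q1", "correct": True},
    {"question": "Q2", "correct": False},
    {"question": "Q3", "correct": True},
]

# Summary statistics roll up the per-task outcomes.
summary = {
    "total": len(tasks),
    "passed": sum(t["correct"] for t in tasks),
}
summary["accuracy"] = summary["passed"] / summary["total"]
```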
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_server.py \
-o evaluation_report.md \
evaluation.xml
Here's a complete example of creating and running an evaluation:
1. Create your evaluation file (my_evaluation.xml):
<evaluation>
<qa_pair>
<question>Find the user who created the most issues in January 2024. What is their username?</question>
<answer>alice_developer</answer>
</qa_pair>
<qa_pair>
<question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
<answer>backend-api</answer>
</qa_pair>
<qa_pair>
<question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
<answer>127</answer>
</qa_pair>
</evaluation>
2. Install dependencies:
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=your_api_key
3. Run evaluation:
python scripts/evaluation.py \
-t stdio \
-c python \
-a github_mcp_server.py \
-e GITHUB_TOKEN=ghp_xxx \
-o github_eval_report.md \
my_evaluation.xml
4. Review the report in github_eval_report.md to:
* See which questions passed/failed
* Read the agent's feedback on your tools
* Identify areas for improvement
* Iterate on your MCP server design
Use these patterns when building or evaluating your MCP servers:
* Expose an mcp subcommand that runs the stdio server (e.g., mcp-vector-search mcp, mcp-ticketer mcp --path <repo>, mcp-browser mcp).
* Provide setup, install, or doctor commands to validate runtime dependencies and integrate with clients.
* Register .mcp.json entries with type: stdio, an explicit command, and minimal overrides.

Example .mcp.json entry:
{
"mcpServers": {
"mcp-vector-search": {
"type": "stdio",
"command": "uv",
"args": ["run", "mcp-vector-search", "mcp"],
"env": {
"MCP_ENABLE_FILE_WATCHING": "true"
}
}
}
}
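An entry like the one above can be sanity-checked before wiring it into a client. A sketch — the required keys follow the example above, and the specific checks are illustrative:

```python
import json

def check_mcp_entry(entry: dict) -> list[str]:
    """Return problems found in a single mcpServers entry."""
    problems = []
    if entry.get("type") != "stdio":
        problems.append("type should be 'stdio' for locally spawned servers")
    if not entry.get("command"):
        problems.append("missing 'command'")
    if not isinstance(entry.get("args", []), list):
        problems.append("'args' must be a list")
    return problems

config = json.loads("""{
  "mcpServers": {
    "mcp-vector-search": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "mcp-vector-search", "mcp"],
      "env": {"MCP_ENABLE_FILE_WATCHING": "true"}
    }
  }
}""")
problems = {name: check_mcp_entry(e)
            for name, e in config["mcpServers"].items()}
```

Running this against the example entry reports no problems, so the client should be able to spawn the server as configured.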
Operational notes:
* Pass adapter and storage configuration through environment variables (e.g., MCP_TICKETER_ADAPTER, KUZU_MEMORY_DB).

When connection errors occur:
If many evaluations fail:
If tasks are timing out:
* The model can be changed with -m (default: claude-3-7-sonnet-20250219)

Weekly Installs: 63
GitHub Stars: 22
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Fail
Installed on: claude-code (50), gemini-cli (47), opencode (47), codex (45), cursor (43), github-copilot (42)