mcp-builder by bobmatnyc/claude-mpm-skills
npx skills add https://github.com/bobmatnyc/claude-mpm-skills --skill mcp-builder
This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.
<evaluation>
<qa_pair>
<question>Your question here</question>
<answer>Single verifiable answer</answer>
</qa_pair>
</evaluation>
The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP servers to answer realistic and difficult questions.
Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
1. Questions MUST be independent
2. Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use
3. Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX
4. Questions must require deep exploration
5. Questions may require extensive paging
6. Questions must require deep understanding
7. Questions must not be solvable with straightforward keyword search
8. Questions should stress-test tool return values
9. Questions should MOSTLY reflect real human use cases
10. Questions may require dozens of tool calls
    * This challenges LLMs with limited context
    * Encourages MCP server tools to reduce information returned
11. Include ambiguous questions
    * May be ambiguous OR require difficult decisions on which tools to call
    * Force the LLM to potentially make mistakes or misinterpret
    * Ensure that despite AMBIGUITY, there is STILL A SINGLE VERIFIABLE ANSWER
12. Do not ask questions that rely on "current state", which is dynamic
    * For example, do not count:
        * Number of reactions to a post
        * Number of replies to a thread
        * Number of members in a channel
13. DO NOT let the MCP server RESTRICT the kinds of questions you create
    * Create challenging and complex questions
    * Some may not be solvable with the available MCP server tools
    * Questions may require specific output formats (datetime vs. epoch time, JSON vs. MARKDOWN)
    * Questions may require dozens of tool calls to complete
Answers must be STABLE/STATIONARY
Answers must be CLEAR and UNAMBIGUOUS
Answers must be DIVERSE
Answers must NOT be complex structures
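These answer constraints can be checked mechanically before spending an API run on a flawed evaluation. A heuristic lint sketch — the specific checks and thresholds are illustrative, not part of the provided harness:

```python
def lint_answer(answer: str) -> list[str]:
    """Flag answers that are likely ambiguous or too complex to verify."""
    problems = []
    if "\n" in answer:
        problems.append("answer spans multiple lines (not a simple value)")
    if answer.count(",") >= 3:
        problems.append("answer looks like a list (ambiguous format/ordering)")
    if not answer.strip():
        problems.append("answer is empty")
    return problems

# A single clear value passes; a comma-separated list is flagged.
assert lint_answer("Website Redesign") == []
assert lint_answer("repo1, repo2, repo3, data-pipeline, ml-tools") != []
```

Checks for stability (whether the answer can drift over time) cannot be automated this way and still require human review.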
1. Read the documentation of the target API to understand:
2. List the tools available in the MCP server:
3. Repeat steps 1 & 2 until you have a good understanding:
4. After understanding the API and tools, USE the MCP server tools:
    * Use the limit parameter to limit results (<10)

After inspecting the content, create 10 human-readable questions:
Each QA pair consists of a question and an answer. The output should be an XML file with this structure:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
<qa_pair>
<question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
<answer>7</answer>
</qa_pair>
<qa_pair>
<question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
<answer>data-pipeline</answer>
</qa_pair>
</evaluation>
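An evaluation file in this shape can also be generated programmatically. A sketch using Python's standard library — the sample pair is taken from the example above:

```python
import xml.etree.ElementTree as ET

def build_evaluation(qa_pairs):
    """Build an <evaluation> XML tree from (question, answer) tuples."""
    root = ET.Element("evaluation")
    for question, answer in qa_pairs:
        pair = ET.SubElement(root, "qa_pair")
        ET.SubElement(pair, "question").text = question
        ET.SubElement(pair, "answer").text = answer
    return root

pairs = [
    ("Find the repository with the most stars that was created before 2023. "
     "What is the repository name?", "data-pipeline"),
]
xml_text = ET.tostring(build_evaluation(pairs), encoding="unicode")
```

ElementTree escapes special characters in questions and answers automatically, which avoids hand-editing mistakes in the XML.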
Example 1: Multi-hop question requiring deep exploration (GitHub MCP)
<qa_pair>
<question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
<answer>Python</answer>
</qa_pair>
This question is good because:
Example 2: Requires understanding context without keyword matching (Project Management MCP)
<qa_pair>
<question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
<answer>Product Manager</answer>
</qa_pair>
This question is good because:
Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)
<qa_pair>
<question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
<answer>alex_eng</answer>
</qa_pair>
This question is good because:
Example 4: Requires synthesis across multiple data types (CRM MCP)
<qa_pair>
<question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
<answer>Healthcare</answer>
</qa_pair>
This question is good because:
Example 1: Answer changes over time
<qa_pair>
<question>How many open issues are currently assigned to the engineering team?</question>
<answer>47</answer>
</qa_pair>
This question is poor because:
Example 2: Too easy with keyword search
<qa_pair>
<question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
<answer>developer123</answer>
</qa_pair>
This question is poor because:
Example 3: Ambiguous answer format
<qa_pair>
<question>List all the repositories that have Python as their primary language.</question>
<answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
</qa_pair>
This question is poor because:
After creating evaluations:
* Remove any <qa_pair> that requires WRITE or DESTRUCTIVE operations
* Remember to parallelize solving tasks to avoid running out of context, then accumulate all answers and make changes to the file at the end
After creating the evaluation file, use the provided evaluation harness to test the MCP server.
1. Install Dependencies
pip install -r scripts/requirements.txt
Or install manually:
pip install anthropic mcp
2. Set API Key
export ANTHROPIC_API_KEY=your_api_key_here
Evaluation files use XML format with <qa_pair> elements:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
</evaluation>
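For tooling built around the harness, this format parses directly with Python's standard library. A sketch — the sample content is illustrative:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<evaluation>
  <qa_pair>
    <question>Which user closed the most issues?</question>
    <answer>sarah_dev</answer>
  </qa_pair>
</evaluation>"""

def parse_evaluation(xml_text):
    """Parse evaluation XML into a list of (question, answer) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (p.findtext("question", "").strip(), p.findtext("answer", "").strip())
        for p in root.findall("qa_pair")
    ]

pairs = parse_evaluation(SAMPLE)
```

Stripping whitespace matters here because pretty-printed XML carries indentation inside the text nodes.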
The evaluation script (scripts/evaluation.py) supports three transport types: stdio, sse, and http.
Important: for sse and http, start the server yourself before running the evaluation; for stdio, the script launches the server for you.
For locally-run MCP servers (script launches the server automatically):
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
evaluation.xml
With environment variables:
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
-e API_KEY=abc123 \
-e DEBUG=true \
evaluation.xml
For SSE-based MCP servers (start the server first):
python scripts/evaluation.py \
-t sse \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
-H "X-Custom-Header: value" \
evaluation.xml
For HTTP-based MCP servers (start the server first):
python scripts/evaluation.py \
-t http \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
evaluation.xml
usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
[-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
[-H HEADERS [HEADERS ...]] [-o OUTPUT]
eval_file
positional arguments:
eval_file Path to evaluation XML file
optional arguments:
-h, --help Show help message
-t, --transport Transport type: stdio, sse, or http (default: stdio)
-m, --model Claude model to use (default: claude-3-7-sonnet-20250219)
-o, --output Output file for report (default: print to stdout)
stdio options:
-c, --command Command to run MCP server (e.g., python, node)
-a, --args Arguments for the command (e.g., server.py)
-e, --env Environment variables in KEY=VALUE format
sse/http options:
-u, --url MCP server URL
-H, --header HTTP headers in 'Key: Value' format
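The help text above maps onto a standard argparse layout. A minimal sketch consistent with it — this is not the actual scripts/evaluation.py source, and with nargs='+' options the positional eval_file parses most reliably when placed first:

```python
import argparse

def build_parser():
    """Argument parser matching the documented evaluation.py interface."""
    parser = argparse.ArgumentParser(prog="evaluation.py")
    parser.add_argument("eval_file", help="Path to evaluation XML file")
    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"],
                        default="stdio")
    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219")
    parser.add_argument("-o", "--output")
    # stdio options
    parser.add_argument("-c", "--command")
    parser.add_argument("-a", "--args", nargs="+", default=[])
    parser.add_argument("-e", "--env", nargs="+", default=[], metavar="KEY=VALUE")
    # sse/http options
    parser.add_argument("-u", "--url")
    parser.add_argument("-H", "--header", nargs="+", default=[], dest="headers")
    return parser

opts = build_parser().parse_args(
    ["evaluation.xml", "-t", "stdio", "-c", "python", "-a", "server.py"]
)
```

Placing eval_file before -a avoids argparse's greedy nargs='+' consuming the positional argument.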
The evaluation script generates a detailed report including:
* Summary Statistics
* Per-Task Results
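The two report sections relate by simple aggregation. A sketch with hypothetical field names — the real report's fields may differ:

```python
# Hypothetical per-task results; the actual report schema may differ.
tasks = [
    {"question": "Q1", "correct": True},
    {"question": "Q2", "correct": False},
    {"question": "Q3", "correct": True},
]

# Summary statistics roll up the per-task outcomes.
summary = {
    "total": len(tasks),
    "passed": sum(t["correct"] for t in tasks),
}
summary["accuracy"] = summary["passed"] / summary["total"]
```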
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_server.py \
-o evaluation_report.md \
evaluation.xml
Here's a complete example of creating and running an evaluation:
1. Create your evaluation file (my_evaluation.xml):
<evaluation>
<qa_pair>
<question>Find the user who created the most issues in January 2024. What is their username?</question>
<answer>alice_developer</answer>
</qa_pair>
<qa_pair>
<question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
<answer>backend-api</answer>
</qa_pair>
<qa_pair>
<question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
<answer>127</answer>
</qa_pair>
</evaluation>
2. Install dependencies:
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=your_api_key
3. Run evaluation:
python scripts/evaluation.py \
-t stdio \
-c python \
-a github_mcp_server.py \
-e GITHUB_TOKEN=ghp_xxx \
-o github_eval_report.md \
my_evaluation.xml
4. Review the report in github_eval_report.md to:
* See which questions passed/failed
* Read the agent's feedback on your tools
* Identify areas for improvement
* Iterate on your MCP server design
Use these patterns when building or evaluating your MCP servers:
* Expose an mcp subcommand that runs the stdio server (e.g., mcp-vector-search mcp, mcp-ticketer mcp --path <repo>, mcp-browser mcp).
* Provide setup, install, or doctor commands to validate runtime dependencies and integrate with clients.
* Register .mcp.json entries with type: stdio, an explicit command, and minimal overrides.

Example .mcp.json entry:
{
"mcpServers": {
"mcp-vector-search": {
"type": "stdio",
"command": "uv",
"args": ["run", "mcp-vector-search", "mcp"],
"env": {
"MCP_ENABLE_FILE_WATCHING": "true"
}
}
}
}
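An entry like the one above can be sanity-checked before wiring it into a client. A sketch — the required keys follow the example above, and the specific checks are illustrative:

```python
import json

def check_mcp_entry(entry: dict) -> list[str]:
    """Return problems found in a single mcpServers entry."""
    problems = []
    if entry.get("type") != "stdio":
        problems.append("type should be 'stdio' for locally spawned servers")
    if not entry.get("command"):
        problems.append("missing 'command'")
    if not isinstance(entry.get("args", []), list):
        problems.append("'args' must be a list")
    return problems

config = json.loads("""{
  "mcpServers": {
    "mcp-vector-search": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "mcp-vector-search", "mcp"],
      "env": {"MCP_ENABLE_FILE_WATCHING": "true"}
    }
  }
}""")
problems = {name: check_mcp_entry(e)
            for name, e in config["mcpServers"].items()}
```

Running this against the example entry reports no problems, so the client should be able to spawn the server as configured.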
Operational notes:
* Pass adapter and storage configuration through environment variables (e.g., MCP_TICKETER_ADAPTER, KUZU_MEMORY_DB).

When connection errors occur:
If many evaluations fail:
If tasks are timing out:
* The model can be changed with -m (default: claude-3-7-sonnet-20250219)

Weekly Installs: 63
GitHub Stars: 22
First Seen: Jan 23, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Fail
Installed on: claude-code (50), gemini-cli (47), opencode (47), codex (45), cursor (43), github-copilot (42)