hugging-face-evaluation by huggingface/skills
npx skills add https://github.com/huggingface/skills --skill hugging-face-evaluation
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- uv integration to run evaluations directly on Hugging Face Jobs

Version: 1.3.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
This prevents spamming model repositories with duplicate evaluation PRs.
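The same check can be scripted. Below is a minimal sketch; `open_prs` is a hypothetical helper (not part of this skill) that filters discussion records, and the commented lines assume `huggingface_hub` is installed:

```python
from types import SimpleNamespace

def open_prs(discussions):
    """Keep only discussions that are open pull requests."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]

# With huggingface_hub installed, real discussions can be fetched via HfApi:
#   from huggingface_hub import HfApi
#   prs = open_prs(HfApi().get_repo_discussions(repo_id="username/model-name"))
#   for pr in prs:
#       print(pr.num, pr.title)

# Stand-in records for illustration:
sample = [
    SimpleNamespace(num=1, title="Add evals", is_pull_request=True, status="open"),
    SimpleNamespace(num=2, title="Old PR", is_pull_request=True, status="merged"),
    SimpleNamespace(num=3, title="Question", is_pull_request=False, status="open"),
]
print([d.num for d in open_prs(sample)])  # → [1]
```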
All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Use `--help` for the latest workflow guidance. Works with plain Python or `uv run`:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
- `get-prs` → check for existing open PRs first
- `inspect-tables` → find table numbers/columns
- `extract-readme --table N` → prints YAML by default
- `--apply` (push) or `--create-pr` to write changes

Flag notes:

- Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- Use `--table N` to extract from a specific table (required when multiple tables exist)
- Prefer `--model-column-index` (index from inspect output); use `--model-name-override` only with exact column header text
- `--task-type` sets the `task.type` field in model-index output (e.g., text-generation, summarization)

Run standard evaluations locally with the inspect-ai library.

⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.

Benefits: No need for the hf_jobs() MCP tool; scripts can run directly in the terminal.

When to use: The user is working directly on a local device and a GPU is available.
Use `nvidia-smi` to check whether a GPU is available, then run:

uv run scripts/train_sft_example.py
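The GPU check can be wrapped in a small helper. A sketch, assuming only that `nvidia-smi -L` exits cleanly and lists one line per GPU; `gpu_available` is a hypothetical name:

```python
import shutil
import subprocess

def gpu_available():
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

if gpu_available():
    print("GPU detected: local evaluation is possible")
else:
    print("No GPU: prefer HF Jobs instead")
```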
The skill includes Python scripts in scripts/ to perform operations.
- Run with `uv run` (the PEP 723 header auto-installs dependencies), or install manually: `uv pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- `HF_TOKEN` environment variable with a write-access token
- `AA_API_KEY` environment variable
- `.env` is loaded automatically if python-dotenv is installed

Recommended flow (matches `--help`):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
Validation checklist:
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
uv run scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format: Tasks use the format suite|task|num_fewshot:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format suite|task|num_fewshot|0 (the trailing 0 is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing 0) and pass it to the `--tasks` parameter. For example:
- `leaderboard|mmlu|0` → use `leaderboard|mmlu|0` (or change the 0 to 5 for 5-shot)
- `bigbench|abstract_narrative_understanding|0` → use `bigbench|abstract_narrative_understanding|0`
- `lighteval|wmt14:hi-en|0` → use `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
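The `suite|task|num_fewshot` convention is easy to parse and validate programmatically. A sketch; `parse_task` is a hypothetical helper, not part of lighteval:

```python
def parse_task(spec):
    """Split a lighteval task spec into (suite, task, num_fewshot).

    Accepts both the 4-field form from all_tasks.txt (with trailing
    version flag) and the 3-field form passed to --tasks.
    """
    parts = spec.split("|")
    if len(parts) == 4:      # e.g. "leaderboard|mmlu|5|0" from all_tasks.txt
        parts = parts[:3]    # drop the trailing version flag
    if len(parts) != 3:
        raise ValueError(f"Bad task spec: {spec!r}")
    suite, task, fewshot = parts
    return suite, task, int(fewshot)

print(parse_task("leaderboard|mmlu|5"))         # → ('leaderboard', 'mmlu', 5)
print(parse_task("lighteval|wmt14:hi-en|0|0"))  # → ('lighteval', 'wmt14:hi-en', 0)
```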
inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
Via HF Jobs:
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
Available inspect-ai Tasks:
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common-sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

The helper script auto-selects hardware and simplifies job submission:
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
Hardware Recommendations:
| Model Size | Recommended Hardware |
|---|---|
| < 3B params | t4-small |
| 3B - 13B | a10g-small |
| 13B - 34B | a10g-large |
| 34B+ | a100-large |
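The table above can be encoded as a simple lookup. A sketch of a hypothetical `pick_hardware` helper; the thresholds mirror the table:

```python
def pick_hardware(params_billions):
    """Map model size (billions of parameters) to a suggested HF Jobs flavor."""
    if params_billions < 3:
        return "t4-small"
    if params_billions < 13:
        return "a10g-small"
    if params_billions < 34:
        return "a10g-large"
    return "a100-large"

print(pick_hardware(1))   # → t4-small
print(pick_hardware(70))  # → a100-large
```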
Top-level help and version:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
Inspect Tables (start here):
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
Extract from README:
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
View / Validate:
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
Check Open PRs (ALWAYS run before --create-pr):
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"
or use the Python helper:
uv run scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."
Run vLLM Evaluation (Custom Models):
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
The generated model-index follows this structure:
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
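The structure above can be assembled in code before serializing. A minimal sketch; `build_model_index` is a hypothetical helper, and the names are placeholders:

```python
def build_model_index(model_name, metrics, source_name, source_url,
                      task_type="text-generation",
                      dataset_name="Benchmark Dataset",
                      dataset_type="benchmark_type"):
    """Build a model-index block from a {metric_name: value} mapping."""
    return {
        "model-index": [{
            "name": model_name,  # plain text only: no markdown, no URLs
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": [
                    {"name": n, "type": n.lower(), "value": v}
                    for n, v in metrics.items()
                ],
                "source": {"name": source_name, "url": source_url},
            }],
        }]
    }

block = build_model_index("Model Name", {"MMLU": 85.2, "HumanEval": 72.5},
                          "Source Name", "https://source-url.com")
# Serialize with PyYAML (already in the dependency list) when writing the card:
#   import yaml; print(yaml.safe_dump(block, sort_keys=False))
print(block["model-index"][0]["results"][0]["metrics"])
```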
- Always run `get-prs` before creating any new PR to avoid duplicates
- Start with `inspect-tables`: see the table structure and get the correct extraction command
- Use `--help` for guidance: run `inspect-tables --help` to see the complete workflow
- Preview the output before `--apply` or `--create-pr`
- Use `--table N` for multi-table READMEs: required when multiple evaluation tables exist
- Use `--model-name-override` for comparison tables: copy the exact column header from the `inspect-tables` output
- Use `--create-pr` when updating models you don't own

When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
- Markdown formatting is stripped (bold `**`, links `[]()`)
- `-` and `_` are replaced with spaces
- Example: "OLMo-3-32B" → {"olmo", "3", "32b"}, which matches "**Olmo 3 32B**" or "[Olmo-3-32B](...)"

For column-based tables (benchmarks as rows, models as columns):
For transposed tables (models as rows, benchmarks as columns):
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
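The matching rule described above can be sketched as follows; `normalize` and `same_model` are hypothetical helpers illustrating the behavior, not the script's actual code:

```python
import re

def normalize(name):
    """Normalize a model name to a token set: strip markdown, unify separators."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # unwrap [text](url) links
    name = name.replace("**", "")                          # strip bold markers
    name = name.replace("-", " ").replace("_", " ")        # separators → spaces
    return set(name.lower().split())

def same_model(a, b):
    """True when both names normalize to exactly the same token set."""
    return normalize(a) == normalize(b)

print(same_model("OLMo-3-32B", "**Olmo 3 32B**"))           # → True
print(same_model("OLMo-3-32B", "[Olmo-3-32B](https://x)"))  # → True
print(same_model("OLMo-3-32B", "OLMo-3-7B"))                # → False
```

Exact token-set equality is what keeps a base model from matching its checkpoints: "OLMo-3-32B-step1000" normalizes to a different token set and is rejected.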
Update Your Own Model:
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation" \
--apply
Update Someone Else's Model (Full Workflow):
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
Import Fresh Benchmarks:
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
Issue: "No evaluation tables found in README"

Issue: "Could not find model 'X' in transposed table"
- Use `--model-name-override` with the exact name from the list
- Example: `--model-name-override "**Olmo 3-32B**"`

Issue: "AA_API_KEY not set"

Issue: "Token does not have write access"

Issue: "Model not found in Artificial Analysis"

Issue: "Payment required for hardware"

Issue: "vLLM out of memory" or CUDA OOM
- Adjust `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

Issue: "Model architecture not supported by vLLM"
- Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) to run with HuggingFace Transformers

Issue: "Trust remote code required"
- Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

Issue: "Chat template not found"
- Add `--use-chat-template` for instruction-tuned models that include a chat template

Python Script Integration:
import subprocess

def update_model_evaluations(repo_id):
    """Update model card with evaluations from README."""
    result = subprocess.run(
        [
            "python", "scripts/evaluation_manager.py",
            "extract-readme",
            "--repo-id", repo_id,
            "--create-pr",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
Weekly Installs: 264
GitHub Stars: 9.9K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (229), gemini-cli (225), codex (224), github-copilot (213), cursor (206), claude-code (202)