hugging-face-evaluation by huggingface/skills
npx skills add https://github.com/huggingface/skills --skill hugging-face-evaluation
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- uv integration to run evaluations directly on Hugging Face Jobs

Version: 1.3.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
This prevents spamming model repositories with duplicate evaluation PRs.
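The same check can be scripted. Below is a minimal sketch; `open_prs` is a hypothetical helper (not part of this skill) that filters discussion records, and the commented lines assume `huggingface_hub` is installed:

```python
from types import SimpleNamespace

def open_prs(discussions):
    """Keep only discussions that are open pull requests."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]

# With huggingface_hub installed, real discussions can be fetched via HfApi:
#   from huggingface_hub import HfApi
#   prs = open_prs(HfApi().get_repo_discussions(repo_id="username/model-name"))
#   for pr in prs:
#       print(pr.num, pr.title)

# Stand-in records for illustration:
sample = [
    SimpleNamespace(num=1, title="Add evals", is_pull_request=True, status="open"),
    SimpleNamespace(num=2, title="Old PR", is_pull_request=True, status="merged"),
    SimpleNamespace(num=3, title="Question", is_pull_request=False, status="open"),
]
print([d.num for d in open_prs(sample)])  # → [1]
```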
All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Use `--help` for the latest workflow guidance. Works with plain Python or `uv run`:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
- `get-prs` → check for existing open PRs first
- `inspect-tables` → find table numbers/columns
- `extract-readme --table N` → prints YAML by default
- `--apply` (push) or `--create-pr` to write changes

Flag notes:

- Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- Use `--table N` to extract from a specific table (required when multiple tables exist)
- Prefer `--model-column-index` (index from inspect output); use `--model-name-override` only with exact column header text
- `--task-type` sets the `task.type` field in model-index output (e.g., text-generation, summarization)

Run standard evaluations locally with the inspect-ai library.

⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.

Benefits: No need for the hf_jobs() MCP tool; scripts can run directly in the terminal.

When to use: The user is working directly on a local device and a GPU is available.
Use `nvidia-smi` to check whether a GPU is available, then run:

uv run scripts/train_sft_example.py
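The GPU check can be wrapped in a small helper. A sketch, assuming only that `nvidia-smi -L` exits cleanly and lists one line per GPU; `gpu_available` is a hypothetical name:

```python
import shutil
import subprocess

def gpu_available():
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

if gpu_available():
    print("GPU detected: local evaluation is possible")
else:
    print("No GPU: prefer HF Jobs instead")
```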
The skill includes Python scripts in scripts/ to perform operations.
- Run with `uv run` (the PEP 723 header auto-installs dependencies), or install manually: `uv pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- `HF_TOKEN` environment variable with a write-access token
- `AA_API_KEY` environment variable
- `.env` is loaded automatically if python-dotenv is installed

Recommended flow (matches `--help`):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
Validation checklist:
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
uv run scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format: Tasks use the format suite|task|num_fewshot:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format suite|task|num_fewshot|0 (the trailing 0 is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing 0) and pass it to the `--tasks` parameter. For example:
- `leaderboard|mmlu|0` → use `leaderboard|mmlu|0` (or change the 0 to 5 for 5-shot)
- `bigbench|abstract_narrative_understanding|0` → use `bigbench|abstract_narrative_understanding|0`
- `lighteval|wmt14:hi-en|0` → use `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
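The `suite|task|num_fewshot` convention is easy to parse and validate programmatically. A sketch; `parse_task` is a hypothetical helper, not part of lighteval:

```python
def parse_task(spec):
    """Split a lighteval task spec into (suite, task, num_fewshot).

    Accepts both the 4-field form from all_tasks.txt (with trailing
    version flag) and the 3-field form passed to --tasks.
    """
    parts = spec.split("|")
    if len(parts) == 4:      # e.g. "leaderboard|mmlu|5|0" from all_tasks.txt
        parts = parts[:3]    # drop the trailing version flag
    if len(parts) != 3:
        raise ValueError(f"Bad task spec: {spec!r}")
    suite, task, fewshot = parts
    return suite, task, int(fewshot)

print(parse_task("leaderboard|mmlu|5"))         # → ('leaderboard', 'mmlu', 5)
print(parse_task("lighteval|wmt14:hi-en|0|0"))  # → ('lighteval', 'wmt14:hi-en', 0)
```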
inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
Via HF Jobs:
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
Available inspect-ai Tasks:
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common-sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

The helper script auto-selects hardware and simplifies job submission:
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
Hardware Recommendations:
| Model Size | Recommended Hardware |
|---|---|
| < 3B params | t4-small |
| 3B - 13B | a10g-small |
| 13B - 34B | a10g-large |
| 34B+ | a100-large |
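The table above can be encoded as a simple lookup. A sketch of a hypothetical `pick_hardware` helper; the thresholds mirror the table:

```python
def pick_hardware(params_billions):
    """Map model size (billions of parameters) to a suggested HF Jobs flavor."""
    if params_billions < 3:
        return "t4-small"
    if params_billions < 13:
        return "a10g-small"
    if params_billions < 34:
        return "a10g-large"
    return "a100-large"

print(pick_hardware(1))   # → t4-small
print(pick_hardware(70))  # → a100-large
```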
Top-level help and version:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
Inspect Tables (start here):
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
Extract from README:
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
View / Validate:
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
Check Open PRs (ALWAYS run before --create-pr):
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"
or use the Python helper:
uv run scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."
Run vLLM Evaluation (Custom Models):
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
The generated model-index follows this structure:
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
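The structure above can be assembled in code before serializing. A minimal sketch; `build_model_index` is a hypothetical helper, and the names are placeholders:

```python
def build_model_index(model_name, metrics, source_name, source_url,
                      task_type="text-generation",
                      dataset_name="Benchmark Dataset",
                      dataset_type="benchmark_type"):
    """Build a model-index block from a {metric_name: value} mapping."""
    return {
        "model-index": [{
            "name": model_name,  # plain text only: no markdown, no URLs
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": [
                    {"name": n, "type": n.lower(), "value": v}
                    for n, v in metrics.items()
                ],
                "source": {"name": source_name, "url": source_url},
            }],
        }]
    }

block = build_model_index("Model Name", {"MMLU": 85.2, "HumanEval": 72.5},
                          "Source Name", "https://source-url.com")
# Serialize with PyYAML (already in the dependency list) when writing the card:
#   import yaml; print(yaml.safe_dump(block, sort_keys=False))
print(block["model-index"][0]["results"][0]["metrics"])
```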
- Always run `get-prs` before creating any new PR to avoid duplicates
- Start with `inspect-tables`: see the table structure and get the correct extraction command
- Use `--help` for guidance: run `inspect-tables --help` to see the complete workflow
- Preview the output before `--apply` or `--create-pr`
- Use `--table N` for multi-table READMEs: required when multiple evaluation tables exist
- Use `--model-name-override` for comparison tables: copy the exact column header from the `inspect-tables` output
- Use `--create-pr` when updating models you don't own

When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
- Markdown formatting is stripped (bold `**`, links `[]()`)
- `-` and `_` are replaced with spaces
- Example: "OLMo-3-32B" → {"olmo", "3", "32b"}, which matches "**Olmo 3 32B**" or "[Olmo-3-32B](...)"

For column-based tables (benchmarks as rows, models as columns):
For transposed tables (models as rows, benchmarks as columns):
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
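The matching rule described above can be sketched as follows; `normalize` and `same_model` are hypothetical helpers illustrating the behavior, not the script's actual code:

```python
import re

def normalize(name):
    """Normalize a model name to a token set: strip markdown, unify separators."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # unwrap [text](url) links
    name = name.replace("**", "")                          # strip bold markers
    name = name.replace("-", " ").replace("_", " ")        # separators → spaces
    return set(name.lower().split())

def same_model(a, b):
    """True when both names normalize to exactly the same token set."""
    return normalize(a) == normalize(b)

print(same_model("OLMo-3-32B", "**Olmo 3 32B**"))           # → True
print(same_model("OLMo-3-32B", "[Olmo-3-32B](https://x)"))  # → True
print(same_model("OLMo-3-32B", "OLMo-3-7B"))                # → False
```

Exact token-set equality is what keeps a base model from matching its checkpoints: "OLMo-3-32B-step1000" normalizes to a different token set and is rejected.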
Update Your Own Model:
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation" \
--apply
Update Someone Else's Model (Full Workflow):
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
Import Fresh Benchmarks:
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
Issue: "No evaluation tables found in README"

Issue: "Could not find model 'X' in transposed table"
- Use `--model-name-override` with the exact name from the list
- Example: `--model-name-override "**Olmo 3-32B**"`

Issue: "AA_API_KEY not set"

Issue: "Token does not have write access"

Issue: "Model not found in Artificial Analysis"

Issue: "Payment required for hardware"

Issue: "vLLM out of memory" or CUDA OOM
- Adjust `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

Issue: "Model architecture not supported by vLLM"
- Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) to run with HuggingFace Transformers

Issue: "Trust remote code required"
- Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

Issue: "Chat template not found"
- Add `--use-chat-template` for instruction-tuned models that include a chat template

Python Script Integration:
import subprocess

def update_model_evaluations(repo_id):
    """Update model card with evaluations from README."""
    result = subprocess.run(
        [
            "python", "scripts/evaluation_manager.py",
            "extract-readme",
            "--repo-id", repo_id,
            "--create-pr",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
Weekly Installs: 264
GitHub Stars: 9.9K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (229), gemini-cli (225), codex (224), github-copilot (213), cursor (206), claude-code (202)