nemo-evaluator-sdk by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill nemo-evaluator-sdk
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:
pip install nemo-evaluator-launcher
Set API key and run evaluation:
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
View available tasks:
nemo-evaluator-launcher ls tasks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
Checklist:
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
Step 1: Configure API endpoint
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
For self-hosted endpoints (vLLM, TRT-LLM):
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
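Before pointing the launcher at a self-hosted endpoint, it can be worth confirming that the server actually speaks the OpenAI chat-completions protocol. A stdlib-only sketch that builds (but does not send) such a request; the URL and `model_id` mirror the config above, and the request shape is the standard OpenAI one rather than anything NeMo-specific:

```python
import json
import urllib.request

def build_chat_request(url, model_id, prompt, api_key=""):
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # local vLLM/TRT-LLM servers typically need no key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
    }).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_chat_request("http://localhost:8000/v1/chat/completions", "my-model", "ping")
# urllib.request.urlopen(req, timeout=10) would perform the actual smoke test
```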
Step 2: Select benchmarks
Add tasks to your config:
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN      # Some tasks need an HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
Step 3: Run evaluation
# Run with config file
nemo-evaluator-launcher run \
--config-dir . \
--config-name config
# Override output directory
nemo-evaluator-launcher run \
--config-dir . \
--config-name config \
-o execution.output_dir=./my_results
# Limit samples for quick testing
nemo-evaluator-launcher run \
--config-dir . \
--config-name config \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
Step 4: Check results
# Check job status
nemo-evaluator-launcher status <invocation_id>
# List all runs
nemo-evaluator-launcher ls runs
# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
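Result artifacts follow the layout `results/<invocation_id>/<task>/artifacts/results.yml`, so collecting them across runs is a plain filesystem walk. A small `pathlib` sketch (no NeMo API involved; the directory layout is taken from the command above):

```python
from pathlib import Path

def find_result_files(output_dir="./results"):
    """Yield (invocation_id, task, path) for each results.yml artifact."""
    for path in sorted(Path(output_dir).glob("*/*/artifacts/results.yml")):
        invocation_id, task = path.parts[-4], path.parts[-3]
        yield invocation_id, task, path

for inv, task, path in find_result_files():
    print(inv, task, path)
```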
Execute large-scale evaluation on HPC infrastructure.
Checklist:
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
Step 1: Configure Slurm settings
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
Step 2: Set up model deployment
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
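The deployment above pairs with the Slurm resources from Step 1: each data-parallel replica spans `tensor_parallel_size` GPUs, so TP=2 × DP=4 needs exactly the 8 GPUs of the single node. NeMo Evaluator exposes no such helper; this is just the sizing arithmetic, written out:

```python
def gpus_required(tensor_parallel: int, data_parallel: int) -> int:
    """Each data-parallel replica spans tensor_parallel GPUs."""
    return tensor_parallel * data_parallel

def fits(tp: int, dp: int, nodes: int, gpus_per_node: int) -> bool:
    """True if the deployment fits the allocated Slurm resources."""
    return gpus_required(tp, dp) <= nodes * gpus_per_node

# Values from the configs above: TP=2, DP=4 on 1 node with 8 GPUs
assert fits(2, 4, 1, 8)      # 2 * 4 = 8 GPUs, exactly the node's capacity
assert not fits(4, 4, 1, 8)  # 16 GPUs would not fit
```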
Step 3: Launch evaluation
nemo-evaluator-launcher run \
--config-dir . \
--config-name slurm_config
Step 4: Monitor job status
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>
# View detailed info
nemo-evaluator-launcher info <invocation_id>
# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
Benchmark multiple models on the same tasks for comparison.
Checklist:
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
Step 1: Create base config
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
Step 2: Run evaluations with model overrides
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
--config-dir . \
--config-name base_eval \
-o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
# Evaluate Mistral 7B
nemo-evaluator-launcher run \
--config-dir . \
--config-name base_eval \
-o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
Step 3: Export and compare
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow
# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json
# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
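After exporting two runs with `--dest local --format json`, the scores can be lined up side by side. The exported schema is not documented here, so the `{"tasks": {name: score}}` shape below is an illustrative assumption — adapt `load_scores` to the real file:

```python
import json

def load_scores(path):
    """Assumed shape: {"tasks": {"mmlu_pro": 0.61, ...}} -- adjust to the real export."""
    with open(path) as f:
        return json.load(f)["tasks"]

def compare(scores_a, scores_b, name_a="model_a", name_b="model_b"):
    """Render a plain-text side-by-side score table for two runs."""
    rows = [f"{'task':<20} {name_a:>10} {name_b:>10}"]
    for task in sorted(set(scores_a) | set(scores_b)):
        a = scores_a.get(task, float("nan"))
        b = scores_b.get(task, float("nan"))
        rows.append(f"{task:<20} {a:>10.3f} {b:>10.3f}")
    return "\n".join(rows)

print(compare({"ifeval": 0.82, "mmlu_pro": 0.61},
              {"ifeval": 0.79, "mmlu_pro": 0.58}))
```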
Evaluate models on safety benchmarks and VLM tasks.
Checklist:
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
Step 1: Configure safety tasks
evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing
Step 2: Configure VLM tasks
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench  # OCR evaluation
    - name: chartqa   # Chart understanding
    - name: mmmu      # Multimodal understanding
Use NeMo Evaluator when:
Use alternatives instead:
| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
Issue: Container pull fails
Ensure NGC credentials are configured:
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
Issue: Task requires environment variable
Some tasks need HF_TOKEN or JUDGE_API_KEY:
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # maps the task's HF_TOKEN to a host env var of the same name
Issue: Evaluation timeout
Increase parallelism or reduce samples:
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
Issue: Slurm job not starting
Check Slurm account and partition:
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need a specific QOS
Issue: Different results than expected
Verify configuration matches reported settings:
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check the paper's few-shot count
| Command | Description |
|---|---|
| run | Execute evaluation with config |
| status <id> | Check job status |
| info <id> | View detailed job info |
| ls tasks | List available benchmarks |
| ls runs | List all invocations |
| export <id> | Export results (mlflow/wandb/local) |
| kill <id> | Terminate running job |
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50
# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel
# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
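Because `-o` overrides are plain Hydra-style `key=value` pairs, model sweeps can be generated programmatically. A hypothetical helper (not part of the launcher) that assembles the argv list for `subprocess`:

```python
def launcher_cmd(config_name, overrides):
    """Assemble a nemo-evaluator-launcher invocation as an argv list."""
    cmd = ["nemo-evaluator-launcher", "run",
           "--config-dir", ".", "--config-name", config_name]
    for key, value in overrides.items():
        cmd += ["-o", f"{key}={value}"]
    return cmd

cmd = launcher_cmd("base_eval", {
    "target.api_endpoint.model_id": "my-model",
    "execution.output_dir": "./my_results",
})
# Run with subprocess.run(cmd) once the config file exists
```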
For programmatic evaluation without the CLI:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    ),
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
Multi-backend execution: see references/execution-backends.md
Configuration deep-dive: see references/configuration.md
Adapter and interceptor system: see references/adapter-system.md
Custom benchmark integration: see references/custom-benchmarks.md