nemo-evaluator-sdk by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill nemo-evaluator-sdk
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:
pip install nemo-evaluator-launcher
Set API key and run evaluation:
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
View available tasks:
nemo-evaluator-launcher ls tasks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
Checklist:
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
Step 1: Configure API endpoint
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
For self-hosted endpoints (vLLM, TRT-LLM):
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
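Before pointing the launcher at a self-hosted endpoint, it can be worth confirming that the server actually speaks the OpenAI chat-completions protocol. A stdlib-only sketch that builds (but does not send) such a request; the URL and `model_id` mirror the config above, and the request shape is the standard OpenAI one rather than anything NeMo-specific:

```python
import json
import urllib.request

def build_chat_request(url, model_id, prompt, api_key=""):
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # local vLLM/TRT-LLM servers typically need no key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
    }).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_chat_request("http://localhost:8000/v1/chat/completions", "my-model", "ping")
# urllib.request.urlopen(req, timeout=10) would perform the actual smoke test
```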
Step 2: Select benchmarks
Add tasks to your config:
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN      # Some tasks need an HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
Step 3: Run evaluation
# Run with config file
nemo-evaluator-launcher run \
--config-dir . \
--config-name config
# Override output directory
nemo-evaluator-launcher run \
--config-dir . \
--config-name config \
-o execution.output_dir=./my_results
# Limit samples for quick testing
nemo-evaluator-launcher run \
--config-dir . \
--config-name config \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
Step 4: Check results
# Check job status
nemo-evaluator-launcher status <invocation_id>
# List all runs
nemo-evaluator-launcher ls runs
# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
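Result artifacts follow the layout `results/<invocation_id>/<task>/artifacts/results.yml`, so collecting them across runs is a plain filesystem walk. A small `pathlib` sketch (no NeMo API involved; the directory layout is taken from the command above):

```python
from pathlib import Path

def find_result_files(output_dir="./results"):
    """Yield (invocation_id, task, path) for each results.yml artifact."""
    for path in sorted(Path(output_dir).glob("*/*/artifacts/results.yml")):
        invocation_id, task = path.parts[-4], path.parts[-3]
        yield invocation_id, task, path

for inv, task, path in find_result_files():
    print(inv, task, path)
```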
Execute large-scale evaluation on HPC infrastructure.
Checklist:
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
Step 1: Configure Slurm settings
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
Step 2: Set up model deployment
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
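The deployment above pairs with the Slurm resources from Step 1: each data-parallel replica spans `tensor_parallel_size` GPUs, so TP=2 × DP=4 needs exactly the 8 GPUs of the single node. NeMo Evaluator exposes no such helper; this is just the sizing arithmetic, written out:

```python
def gpus_required(tensor_parallel: int, data_parallel: int) -> int:
    """Each data-parallel replica spans tensor_parallel GPUs."""
    return tensor_parallel * data_parallel

def fits(tp: int, dp: int, nodes: int, gpus_per_node: int) -> bool:
    """True if the deployment fits the allocated Slurm resources."""
    return gpus_required(tp, dp) <= nodes * gpus_per_node

# Values from the configs above: TP=2, DP=4 on 1 node with 8 GPUs
assert fits(2, 4, 1, 8)      # 2 * 4 = 8 GPUs, exactly the node's capacity
assert not fits(4, 4, 1, 8)  # 16 GPUs would not fit
```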
Step 3: Launch evaluation
nemo-evaluator-launcher run \
--config-dir . \
--config-name slurm_config
Step 4: Monitor job status
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>
# View detailed info
nemo-evaluator-launcher info <invocation_id>
# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
Benchmark multiple models on the same tasks for comparison.
Checklist:
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
Step 1: Create base config
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
Step 2: Run evaluations with model overrides
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
--config-dir . \
--config-name base_eval \
-o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
# Evaluate Mistral 7B
nemo-evaluator-launcher run \
--config-dir . \
--config-name base_eval \
-o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
Step 3: Export and compare
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow
# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json
# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
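After exporting two runs with `--dest local --format json`, the scores can be lined up side by side. The exported schema is not documented here, so the `{"tasks": {name: score}}` shape below is an illustrative assumption — adapt `load_scores` to the real file:

```python
import json

def load_scores(path):
    """Assumed shape: {"tasks": {"mmlu_pro": 0.61, ...}} -- adjust to the real export."""
    with open(path) as f:
        return json.load(f)["tasks"]

def compare(scores_a, scores_b, name_a="model_a", name_b="model_b"):
    """Render a plain-text side-by-side score table for two runs."""
    rows = [f"{'task':<20} {name_a:>10} {name_b:>10}"]
    for task in sorted(set(scores_a) | set(scores_b)):
        a = scores_a.get(task, float("nan"))
        b = scores_b.get(task, float("nan"))
        rows.append(f"{task:<20} {a:>10.3f} {b:>10.3f}")
    return "\n".join(rows)

print(compare({"ifeval": 0.82, "mmlu_pro": 0.61},
              {"ifeval": 0.79, "mmlu_pro": 0.58}))
```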
Evaluate models on safety benchmarks and VLM tasks.
Checklist:
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
Step 1: Configure safety tasks
evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing
Step 2: Configure VLM tasks
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench  # OCR evaluation
    - name: chartqa   # Chart understanding
    - name: mmmu      # Multimodal understanding
Use NeMo Evaluator when:
Use alternatives instead:
| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
Issue: Container pull fails
Ensure NGC credentials are configured:
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
Issue: Task requires environment variable
Some tasks need HF_TOKEN or JUDGE_API_KEY:
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # maps the task's HF_TOKEN to a host env var of the same name
Issue: Evaluation timeout
Increase parallelism or reduce samples:
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
Issue: Slurm job not starting
Check Slurm account and partition:
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need a specific QOS
Issue: Different results than expected
Verify configuration matches reported settings:
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check the paper's few-shot count
| Command | Description |
|---|---|
| run | Execute evaluation with config |
| status <id> | Check job status |
| info <id> | View detailed job info |
| ls tasks | List available benchmarks |
| ls runs | List all invocations |
| export <id> | Export results (mlflow/wandb/local) |
| kill <id> | Terminate running job |
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50
# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel
# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
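Because `-o` overrides are plain Hydra-style `key=value` pairs, model sweeps can be generated programmatically. A hypothetical helper (not part of the launcher) that assembles the argv list for `subprocess`:

```python
def launcher_cmd(config_name, overrides):
    """Assemble a nemo-evaluator-launcher invocation as an argv list."""
    cmd = ["nemo-evaluator-launcher", "run",
           "--config-dir", ".", "--config-name", config_name]
    for key, value in overrides.items():
        cmd += ["-o", f"{key}={value}"]
    return cmd

cmd = launcher_cmd("base_eval", {
    "target.api_endpoint.model_id": "my-model",
    "execution.output_dir": "./my_results",
})
# Run with subprocess.run(cmd) once the config file exists
```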
For programmatic evaluation without the CLI:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    ),
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
Multi-backend execution: see references/execution-backends.md
Configuration deep-dive: see references/configuration.md
Adapter and interceptor system: see references/adapter-system.md
Custom benchmark integration: see references/custom-benchmarks.md