evaluating-llms-harness by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-llms-harness
lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
Installation:
pip install lm-eval
Evaluate any HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--device cuda:0 \
--batch_size 8
View available tasks:
lm_eval --tasks list
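The listing is long; to locate a specific benchmark among the 60+ tasks, it can be piped through grep (a shell sketch, assuming a Unix-like environment):
# Show only MMLU-related task names (the full suite plus subject and subset variants)
lm_eval --tasks list | grep -i mmlu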
Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).
Copy this checklist:
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
Step 1: Choose benchmark suite
Core reasoning benchmarks: mmlu, arc_challenge, hellaswag, gsm8k, and truthfulqa (the tasks in the standard suite below).
Code benchmarks: humaneval (requires --allow_code_execution; see Troubleshooting).
Standard suite (recommended for model releases):
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
Step 2: Configure model
HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu \
--device cuda:0 \
--batch_size auto # Auto-detect optimal batch size
Quantized model (4-bit/8-bit):
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
--tasks mmlu \
--device cuda:0
Custom checkpoint:
lm_eval --model hf \
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
--tasks mmlu \
--device cuda:0
Step 3: Run evaluation
# Full MMLU evaluation (57 subjects)
# 5-shot is the standard setting; --log_samples saves individual predictions
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/ \
--log_samples
# Multiple benchmarks at once
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/llama2-7b-eval.json
Step 4: Analyze results
Results are saved to results/llama2-7b-eval.json:
{
"results": {
"mmlu": {
"acc": 0.459,
"acc_stderr": 0.004
},
"gsm8k": {
"exact_match": 0.142,
"exact_match_stderr": 0.006
},
"hellaswag": {
"acc_norm": 0.765,
"acc_norm_stderr": 0.004
}
},
"config": {
"model": "hf",
"model_args": "pretrained=meta-llama/Llama-2-7b-hf",
"num_fewshot": 5
}
}
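To pull the headline numbers out of that file programmatically, a minimal Python sketch (assuming the JSON layout shown above):
import json

with open("results/llama2-7b-eval.json") as f:
    data = json.load(f)

# Print every reported metric with its standard error, where available
for task, metrics in data["results"].items():
    for name, value in metrics.items():
        if name.endswith("_stderr"):
            continue
        stderr = metrics.get(f"{name}_stderr")
        err = f" ± {stderr:.3f}" if stderr is not None else ""
        print(f"{task:12s} {name:15s} {value:.3f}{err}")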
Evaluate checkpoints during training.
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves
Step 1: Set up periodic evaluation
Evaluate every N training steps:
#!/bin/bash
# eval_checkpoint.sh: evaluate one training checkpoint
# Usage: ./eval_checkpoint.sh <checkpoint_dir> <step>
CHECKPOINT_DIR=$1
STEP=$2
# 0-shot for speed
lm_eval --model hf \
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
--tasks gsm8k,hellaswag \
--num_fewshot 0 \
--batch_size 16 \
--output_path results/step-$STEP.json
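The script is called once per saved checkpoint, for example (the step number is illustrative):
chmod +x eval_checkpoint.sh
./eval_checkpoint.sh checkpoints 1000   # evaluates checkpoints/checkpoint-1000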
Step 2: Choose quick benchmarks
Fast benchmarks for frequent evaluation: gsm8k and hellaswag at 0-shot, as in the script above.
Avoid for frequent eval (too slow): the full MMLU suite (57 subjects, roughly 2 hours on a 7B model with the HF backend).
Step 3: Automate evaluation
Integrate with the training script:
# In the training loop (step and eval_interval come from your trainer)
import os

if step % eval_interval == 0:
    # Save in the checkpoints/checkpoint-<step> layout expected by eval_checkpoint.sh
    model.save_pretrained(f"checkpoints/checkpoint-{step}")
    # Run evaluation on the saved checkpoint
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
Or use a PyTorch Lightning callback:
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"
        # Save a checkpoint at the current step
        # (note: the hf backend expects a HuggingFace-format checkpoint directory)
        trainer.save_checkpoint(checkpoint_path)
        # Run lm-eval on the saved checkpoint
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
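The callback is then registered with the Trainer; a minimal sketch, assuming your LightningModule is named model (the max_steps value is illustrative):
from pytorch_lightning import Trainer

trainer = Trainer(callbacks=[EvalHarnessCallback()], max_steps=10_000)
trainer.fit(model)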
Step 4: Plot learning curves
import glob
import json

import matplotlib.pyplot as plt

# Load all results, sorted numerically by step
# (a plain string sort would misorder step-1000 vs step-200)
result_files = sorted(glob.glob("results/step-*.json"),
                      key=lambda p: int(p.split("-")[1].split(".")[0]))
steps = []
scores = []
for file in result_files:
    with open(file) as f:
        data = json.load(f)
    steps.append(int(file.split("-")[1].split(".")[0]))
    # gsm8k matches the tasks evaluated by eval_checkpoint.sh above
    scores.append(data["results"]["gsm8k"]["exact_match"])

# Plot
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel("GSM8K Exact Match")
plt.title("Training Progress")
plt.savefig("training_curve.png")
Benchmark suite for model comparison.
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table
Step 1: Define model list
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
Step 2: Run evaluations
#!/bin/bash
# eval_all_models.sh: run the same benchmark suite over every model in models.txt
TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
while read -r model; do
  # Skip blank lines and comment lines
  [[ -z "$model" || "$model" == \#* ]] && continue
  echo "Evaluating $model"
  # Derive a filesystem-safe name for the output file
  model_name=$(echo "$model" | sed 's/\//-/g')
  lm_eval --model hf \
  --model_args pretrained=$model,dtype=bfloat16 \
  --tasks $TASKS \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/$model_name.json
done < models.txt
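Run it against the model list; the GPU selection below is illustrative:
chmod +x eval_all_models.sh
CUDA_VISIBLE_DEVICES=0 ./eval_all_models.sh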
Step 3: Generate comparison table
import json
import pandas as pd

# Hub model IDs; the output files use "/" replaced by "-" (see eval_all_models.sh)
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]
tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
results = []
for model in models:
    file_name = model.replace("/", "-")
    with open(f"results/{file_name}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get the primary metric reported for each task
        metrics = data["results"][task]
        for metric in ("acc", "acc_norm", "exact_match"):
            if metric in metrics:
                row[task.upper()] = f"{metrics[metric]:.3f}"
                break
    results.append(row)
df = pd.DataFrame(results)
print(df.to_markdown(index=False))
Output:
| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2 | 0.560 | 0.613 | 0.682 | 0.447 |
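The same DataFrame can also be persisted for later analysis (the output path is illustrative):
# Save alongside the per-model JSON results
df.to_csv("results/model_comparison.csv", index=False)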
Use the vLLM backend for 5-10x faster evaluation.
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation
Step 1: Install vLLM
pip install vllm
Step 2: Configure vLLM backend
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
--tasks mmlu \
--batch_size auto
Step 3: Run evaluation
vLLM is typically 5-10x faster than the standard HuggingFace backend:
# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8
# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto
Use lm-evaluation-harness when:
Use alternatives instead:
Issue: Evaluation too slow
Use vLLM backend:
lm_eval --model vllm \
--model_args pretrained=model-name,tensor_parallel_size=2
Or reduce fewshot examples:
--num_fewshot 0 # Instead of 5
Or evaluate subset of MMLU:
--tasks mmlu_stem # Only STEM subjects
Issue: Out of memory
Reduce batch size:
--batch_size 1 # Or --batch_size auto
Use quantization:
--model_args pretrained=model-name,load_in_8bit=True
Enable CPU offloading:
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
Issue: Different results than reported
Check fewshot count:
--num_fewshot 5 # Most papers use 5-shot
Check exact task name:
--tasks mmlu # Not mmlu_direct or mmlu_fewshot
Verify model and tokenizer match:
--model_args pretrained=model-name,tokenizer=same-model-name
Issue: HumanEval not executing code
Install execution dependencies:
pip install human-eval
Enable code execution:
lm_eval --model hf \
--model_args pretrained=model-name \
--tasks humaneval \
--allow_code_execution # Required for HumanEval
Benchmark descriptions: See references/benchmark-guide.md for detailed descriptions of all 60+ tasks, what they measure, and how to interpret scores.
Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.
API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.
Multi-GPU strategies: See references/distributed-eval.md for data-parallel and tensor-parallel evaluation.