evaluating-llms-harness by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-llms-harness
lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
Installation:
pip install lm-eval
Evaluate any HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--device cuda:0 \
--batch_size 8
View available tasks:
lm_eval --tasks list
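The listing is long; to locate a specific benchmark among the 60+ tasks, it can be piped through grep (a shell sketch, assuming a Unix-like environment):
# Show only MMLU-related task names (the full suite plus subject and subset variants)
lm_eval --tasks list | grep -i mmlu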
Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).
Copy this checklist:
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
Step 1: Choose benchmark suite
Core reasoning benchmarks: mmlu, arc_challenge, hellaswag, gsm8k, and truthfulqa (the tasks in the standard suite below).
Code benchmarks: humaneval (requires --allow_code_execution; see Troubleshooting).
Standard suite (recommended for model releases):
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
Step 2: Configure model
HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu \
--device cuda:0 \
--batch_size auto # Auto-detect optimal batch size
Quantized model (4-bit/8-bit):
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
--tasks mmlu \
--device cuda:0
Custom checkpoint:
lm_eval --model hf \
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
--tasks mmlu \
--device cuda:0
Step 3: Run evaluation
# Full MMLU evaluation (57 subjects)
# 5-shot is the standard setting; --log_samples saves individual predictions
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/ \
--log_samples
# Multiple benchmarks at once
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/llama2-7b-eval.json
Step 4: Analyze results
Results are saved to results/llama2-7b-eval.json:
{
"results": {
"mmlu": {
"acc": 0.459,
"acc_stderr": 0.004
},
"gsm8k": {
"exact_match": 0.142,
"exact_match_stderr": 0.006
},
"hellaswag": {
"acc_norm": 0.765,
"acc_norm_stderr": 0.004
}
},
"config": {
"model": "hf",
"model_args": "pretrained=meta-llama/Llama-2-7b-hf",
"num_fewshot": 5
}
}
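To pull the headline numbers out of that file programmatically, a minimal Python sketch (assuming the JSON layout shown above):
import json

with open("results/llama2-7b-eval.json") as f:
    data = json.load(f)

# Print every reported metric with its standard error, where available
for task, metrics in data["results"].items():
    for name, value in metrics.items():
        if name.endswith("_stderr"):
            continue
        stderr = metrics.get(f"{name}_stderr")
        err = f" ± {stderr:.3f}" if stderr is not None else ""
        print(f"{task:12s} {name:15s} {value:.3f}{err}")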
Evaluate checkpoints during training.
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves
Step 1: Set up periodic evaluation
Evaluate every N training steps:
#!/bin/bash
# eval_checkpoint.sh: evaluate one training checkpoint
# Usage: ./eval_checkpoint.sh <checkpoint_dir> <step>
CHECKPOINT_DIR=$1
STEP=$2
# 0-shot for speed
lm_eval --model hf \
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
--tasks gsm8k,hellaswag \
--num_fewshot 0 \
--batch_size 16 \
--output_path results/step-$STEP.json
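The script is called once per saved checkpoint, for example (the step number is illustrative):
chmod +x eval_checkpoint.sh
./eval_checkpoint.sh checkpoints 1000   # evaluates checkpoints/checkpoint-1000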
Step 2: Choose quick benchmarks
Fast benchmarks for frequent evaluation: gsm8k and hellaswag at 0-shot, as in the script above.
Avoid for frequent eval (too slow): the full MMLU suite (57 subjects, roughly 2 hours on a 7B model with the HF backend).
Step 3: Automate evaluation
Integrate with the training script:
# In the training loop (step and eval_interval come from your trainer)
import os

if step % eval_interval == 0:
    # Save in the checkpoints/checkpoint-<step> layout expected by eval_checkpoint.sh
    model.save_pretrained(f"checkpoints/checkpoint-{step}")
    # Run evaluation on the saved checkpoint
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
Or use a PyTorch Lightning callback:
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"
        # Save a checkpoint at the current step
        # (note: the hf backend expects a HuggingFace-format checkpoint directory)
        trainer.save_checkpoint(checkpoint_path)
        # Run lm-eval on the saved checkpoint
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
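The callback is then registered with the Trainer; a minimal sketch, assuming your LightningModule is named model (the max_steps value is illustrative):
from pytorch_lightning import Trainer

trainer = Trainer(callbacks=[EvalHarnessCallback()], max_steps=10_000)
trainer.fit(model)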
Step 4: Plot learning curves
import glob
import json

import matplotlib.pyplot as plt

# Load all results, sorted numerically by step
# (a plain string sort would misorder step-1000 vs step-200)
result_files = sorted(glob.glob("results/step-*.json"),
                      key=lambda p: int(p.split("-")[1].split(".")[0]))
steps = []
scores = []
for file in result_files:
    with open(file) as f:
        data = json.load(f)
    steps.append(int(file.split("-")[1].split(".")[0]))
    # gsm8k matches the tasks evaluated by eval_checkpoint.sh above
    scores.append(data["results"]["gsm8k"]["exact_match"])

# Plot
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel("GSM8K Exact Match")
plt.title("Training Progress")
plt.savefig("training_curve.png")
Benchmark suite for model comparison.
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table
Step 1: Define model list
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
Step 2: Run evaluations
#!/bin/bash
# eval_all_models.sh: run the same benchmark suite over every model in models.txt
TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
while read -r model; do
  # Skip blank lines and comment lines
  [[ -z "$model" || "$model" == \#* ]] && continue
  echo "Evaluating $model"
  # Derive a filesystem-safe name for the output file
  model_name=$(echo "$model" | sed 's/\//-/g')
  lm_eval --model hf \
  --model_args pretrained=$model,dtype=bfloat16 \
  --tasks $TASKS \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/$model_name.json
done < models.txt
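Run it against the model list; the GPU selection below is illustrative:
chmod +x eval_all_models.sh
CUDA_VISIBLE_DEVICES=0 ./eval_all_models.sh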
Step 3: Generate comparison table
import json
import pandas as pd

# Hub model IDs; the output files use "/" replaced by "-" (see eval_all_models.sh)
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]
tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
results = []
for model in models:
    file_name = model.replace("/", "-")
    with open(f"results/{file_name}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get the primary metric reported for each task
        metrics = data["results"][task]
        for metric in ("acc", "acc_norm", "exact_match"):
            if metric in metrics:
                row[task.upper()] = f"{metrics[metric]:.3f}"
                break
    results.append(row)
df = pd.DataFrame(results)
print(df.to_markdown(index=False))
Output:
| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2 | 0.560 | 0.613 | 0.682 | 0.447 |
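The same DataFrame can also be persisted for later analysis (the output path is illustrative):
# Save alongside the per-model JSON results
df.to_csv("results/model_comparison.csv", index=False)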
Use the vLLM backend for 5-10x faster evaluation.
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation
Step 1: Install vLLM
pip install vllm
Step 2: Configure vLLM backend
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
--tasks mmlu \
--batch_size auto
Step 3: Run evaluation
vLLM is typically 5-10x faster than the standard HuggingFace backend:
# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8
# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto
Use lm-evaluation-harness when:
Use alternatives instead:
Issue: Evaluation too slow
Use vLLM backend:
lm_eval --model vllm \
--model_args pretrained=model-name,tensor_parallel_size=2
Or reduce fewshot examples:
--num_fewshot 0 # Instead of 5
Or evaluate subset of MMLU:
--tasks mmlu_stem # Only STEM subjects
Issue: Out of memory
Reduce batch size:
--batch_size 1 # Or --batch_size auto
Use quantization:
--model_args pretrained=model-name,load_in_8bit=True
Enable CPU offloading:
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
Issue: Different results than reported
Check fewshot count:
--num_fewshot 5 # Most papers use 5-shot
Check exact task name:
--tasks mmlu # Not mmlu_direct or mmlu_fewshot
Verify model and tokenizer match:
--model_args pretrained=model-name,tokenizer=same-model-name
Issue: HumanEval not executing code
Install execution dependencies:
pip install human-eval
Enable code execution:
lm_eval --model hf \
--model_args pretrained=model-name \
--tasks humaneval \
--allow_code_execution # Required for HumanEval
Benchmark descriptions: See references/benchmark-guide.md for detailed descriptions of all 60+ tasks, what they measure, and how to interpret scores.
Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.
API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.
Multi-GPU strategies: See references/distributed-eval.md for data-parallel and tensor-parallel evaluation.