evaluating-llms-harness by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-llms-harness
lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
Installation:
pip install lm-eval
Evaluate any HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--device cuda:0 \
--batch_size 8
View available tasks:
lm_eval --tasks list
Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).
Copy this checklist:
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
Step 1: Choose benchmark suite
Core reasoning benchmarks:
Code benchmarks:
Standard suite (recommended for model releases):
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
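To budget runtime before launching the standard suite, it helps to know roughly how many examples each task contributes. The counts below are approximate, commonly cited test-set sizes (an assumption on my part; the harness task configs are the authoritative source):

```python
# Approximate example counts for the standard suite. These sizes are assumed
# from common usage of each benchmark; check the harness task configs for
# the exact split the task actually uses.
suite_sizes = {
    "mmlu": 14042,         # test split across 57 subjects
    "gsm8k": 1319,         # test split
    "hellaswag": 10042,    # validation split
    "truthfulqa": 817,     # multiple-choice split
    "arc_challenge": 1172, # test split
}

total = sum(suite_sizes.values())
print(f"Total examples in standard suite: ~{total}")
```

At ~27k examples, the suite is dominated by MMLU and HellaSwag, which is why the troubleshooting tips below suggest MMLU subsets when speed matters.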
Step 2: Configure model
HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu \
--device cuda:0 \
--batch_size auto # Auto-detect optimal batch size
Quantized model (4-bit/8-bit):
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
--tasks mmlu \
--device cuda:0
Custom checkpoint:
lm_eval --model hf \
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
--tasks mmlu \
--device cuda:0
Step 3: Run evaluation
# Full MMLU evaluation (57 subjects); 5-shot is the standard setting.
# Note: a comment after a trailing backslash breaks line continuation in bash,
# so per-flag comments are kept off the continued lines.
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/ \
--log_samples # Save individual predictions
# Multiple benchmarks at once
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/llama2-7b-eval.json
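When evaluations are launched from Python (as in the automation sections below), building the invocation as an argument list avoids shell-quoting problems. The flags here are the same CLI flags used above; the helper function itself is a sketch of ours, not part of lm-eval:

```python
import subprocess  # needed if you uncomment the launch line at the bottom

def build_lm_eval_cmd(model_args, tasks, num_fewshot, batch_size, output_path):
    """Assemble an lm_eval invocation as an argument list (no shell quoting)."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]

cmd = build_lm_eval_cmd(
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["mmlu", "gsm8k", "hellaswag", "truthfulqa", "arc_challenge"],
    num_fewshot=5,
    batch_size=8,
    output_path="results/llama2-7b-eval.json",
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the run
```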
Step 4: Analyze results
Results saved to results/llama2-7b-eval.json:
{
"results": {
"mmlu": {
"acc": 0.459,
"acc_stderr": 0.004
},
"gsm8k": {
"exact_match": 0.142,
"exact_match_stderr": 0.006
},
"hellaswag": {
"acc_norm": 0.765,
"acc_norm_stderr": 0.004
}
},
"config": {
"model": "hf",
"model_args": "pretrained=meta-llama/Llama-2-7b-hf",
"num_fewshot": 5
}
}
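A small sketch for summarizing such a results file: the primary metric for each task is whichever key has a matching `_stderr` companion, which handles `acc`, `acc_norm`, and `exact_match` uniformly. The payload is inlined here (copied from the JSON above) so the snippet runs standalone:

```python
import json

# Inlined copy of the results payload above; in practice load it instead:
# with open("results/llama2-7b-eval.json") as f:
#     data = json.load(f)
data = {
    "results": {
        "mmlu": {"acc": 0.459, "acc_stderr": 0.004},
        "gsm8k": {"exact_match": 0.142, "exact_match_stderr": 0.006},
        "hellaswag": {"acc_norm": 0.765, "acc_norm_stderr": 0.004},
    }
}

summary = {}
for task, metrics in data["results"].items():
    # A metric is "primary" if it has a matching "<name>_stderr" companion
    for name, value in metrics.items():
        if not name.endswith("_stderr") and f"{name}_stderr" in metrics:
            summary[task] = f"{value:.3f} ± {metrics[f'{name}_stderr']:.3f}"

for task, line in summary.items():
    print(f"{task:12s} {line}")
```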
Evaluate checkpoints during training.
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves
Step 1: Set up periodic evaluation
Evaluate every N training steps:
#!/bin/bash
# eval_checkpoint.sh
CHECKPOINT_DIR=$1
STEP=$2
# 0-shot keeps frequent checkpoint evaluations fast
lm_eval --model hf \
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
--tasks gsm8k,hellaswag \
--num_fewshot 0 \
--batch_size 16 \
--output_path results/step-$STEP.json
Step 2: Choose quick benchmarks
Fast benchmarks for frequent evaluation:
Avoid for frequent eval (too slow):
Step 3: Automate evaluation
Integrate with training script:
import os

# In training loop
if step % eval_interval == 0:
    model.save_pretrained(f"checkpoints/checkpoint-{step}")
    # eval_checkpoint.sh expects the checkpoint directory and the step number,
    # and looks for $CHECKPOINT_DIR/checkpoint-$STEP
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
Or use PyTorch Lightning callbacks:
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"
        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)
        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
Step 4: Plot learning curves
import glob
import json

import matplotlib.pyplot as plt

# Load all results
steps = []
scores = []
for file in glob.glob("results/step-*.json"):
    with open(file) as f:
        data = json.load(f)
    # "results/step-1500.json" -> 1500
    step = int(file.split("-")[1].split(".")[0])
    steps.append(step)
    # eval_checkpoint.sh ran gsm8k and hellaswag, so plot one of those
    scores.append(data["results"]["hellaswag"]["acc_norm"])

# Sort numerically by step (lexicographic filename order puts 1000 before 200)
pairs = sorted(zip(steps, scores))

# Plot
plt.plot([s for s, _ in pairs], [a for _, a in pairs])
plt.xlabel("Training Step")
plt.ylabel("HellaSwag Accuracy (acc_norm)")
plt.title("Training Progress")
plt.savefig("training_curve.png")
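Beyond plotting, the same step/score pairs answer the practical question of which checkpoint to keep. A minimal sketch, with synthetic placeholder numbers standing in for real results:

```python
# Given parallel lists of steps and scores (as collected for the learning
# curve), pick the checkpoint with the highest score.
# The numbers below are synthetic placeholders, not real results.
steps = [500, 1000, 1500, 2000]
scores = [0.42, 0.55, 0.61, 0.58]

best_step, best_score = max(zip(steps, scores), key=lambda pair: pair[1])
print(f"Best checkpoint: step {best_step} (score {best_score:.3f})")
```

With noisy benchmark scores, it can be safer to also require the best score to beat its neighbors by more than the reported stderr before promoting a checkpoint.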
Benchmark suite for model comparison.
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table
Step 1: Define model list
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
Step 2: Run evaluations
#!/bin/bash
# eval_all_models.sh
TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
while read -r model; do
echo "Evaluating $model"
# Extract model name for output file
model_name=$(echo $model | sed 's/\//-/g')
lm_eval --model hf \
--model_args pretrained=$model,dtype=bfloat16 \
--tasks $TASKS \
--num_fewshot 5 \
--batch_size auto \
--output_path results/$model_name.json
done < models.txt
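The script above derives each output filename by replacing "/" with "-" in the HF repo id. That mangling is lossy (repo ids themselves contain hyphens), so downstream tooling should map from the original id to the filename rather than trying to reverse it. A small helper sketch (the function name is ours):

```python
def result_path(model_id, results_dir="results"):
    """Mirror eval_all_models.sh: '/' in the HF repo id becomes '-' in the filename."""
    return f"{results_dir}/{model_id.replace('/', '-')}.json"

model_ids = [
    "meta-llama/Llama-2-7b-hf",
    "mistralai/Mistral-7B-v0.1",
]
for m in model_ids:
    print(m, "->", result_path(m))
```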
Step 3: Generate comparison table
import json

import pandas as pd

# List the original HF repo ids; eval_all_models.sh wrote each result file with
# "/" replaced by "-", so derive the filename the same way rather than trying
# to reverse the (lossy) mangling.
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]
tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
rows = []
for model in models:
    filename = model.replace("/", "-")
    with open(f"results/{filename}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Pick the primary metric for each task (hellaswag reports acc_norm)
        metrics = data["results"][task]
        for metric in ("acc", "acc_norm", "exact_match"):
            if metric in metrics:
                row[task.upper()] = f"{metrics[metric]:.3f}"
                break
    rows.append(row)
df = pd.DataFrame(rows)
print(df.to_markdown(index=False))
Output:
| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2 | 0.560 | 0.613 | 0.682 | 0.447 |
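For a quick ranking, an unweighted mean over the table's columns can be computed, with the caveat that it averages heterogeneous metrics (accuracy vs. exact match) and is only a rough aid, not a principled aggregate. Scores below are copied from the table above:

```python
# Scores copied from the comparison table above, in column order
# [MMLU, GSM8K, HELLASWAG, TRUTHFULQA]. A plain mean over mixed metric
# types is only a rough ranking aid.
table = {
    "meta-llama/Llama-2-7b":  [0.459, 0.142, 0.765, 0.391],
    "meta-llama/Llama-2-13b": [0.549, 0.287, 0.801, 0.430],
    "mistralai/Mistral-7B":   [0.626, 0.395, 0.812, 0.428],
    "microsoft/phi-2":        [0.560, 0.613, 0.682, 0.447],
}

averages = {m: sum(s) / len(s) for m, s in table.items()}
for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model:24s} {avg:.3f}")
```

Note how the plain mean rewards phi-2's strong GSM8K score; this sensitivity to a single benchmark is exactly why the mean should be read alongside the per-task columns.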
Use vLLM backend for 5-10x faster evaluation.
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation
Step 1: Install vLLM
pip install vllm
Step 2: Configure vLLM backend
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
--tasks mmlu \
--batch_size auto
Step 3: Run evaluation
vLLM is typically 5-10× faster than the standard HuggingFace backend:
# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8
# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto
Use lm-evaluation-harness when:
Use alternatives instead:
Issue: Evaluation too slow
Use vLLM backend:
lm_eval --model vllm \
--model_args pretrained=model-name,tensor_parallel_size=2
Or reduce fewshot examples:
--num_fewshot 0 # Instead of 5
Or evaluate subset of MMLU:
--tasks mmlu_stem # Only STEM subjects
Issue: Out of memory
Reduce batch size:
--batch_size 1 # Or --batch_size auto
Use quantization:
--model_args pretrained=model-name,load_in_8bit=True
Enable CPU offloading:
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
Issue: Different results than reported
Check fewshot count:
--num_fewshot 5 # Most papers use 5-shot
Check exact task name:
--tasks mmlu # Not mmlu_direct or mmlu_fewshot
Verify model and tokenizer match:
--model_args pretrained=model-name,tokenizer=same-model-name
Issue: HumanEval not executing code
Install execution dependencies:
pip install human-eval
Enable code execution:
lm_eval --model hf \
--model_args pretrained=model-name \
--tasks humaneval \
--allow_code_execution # Required for HumanEval
Benchmark descriptions: See references/benchmark-guide.md for detailed descriptions of all 60+ tasks, what they measure, and how to interpret them.
Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.
API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.
Multi-GPU strategies: See references/distributed-eval.md for data-parallel and tensor-parallel evaluation.
Weekly Installs
61
Repository
GitHub Stars
5.5K
First Seen
Feb 7, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on
codex: 52
cursor: 52
opencode: 52
claude-code: 51
gemini-cli: 51
github-copilot: 50