evaluating-code-models by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-code-models
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
Evaluate on HumanEval:
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
View available tasks:
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
Evaluate a model on core code benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
Step 1: Choose benchmark suite
Python code generation (most common): humaneval and mbpp, plus their stricter HumanEval+ / MBPP+ variants (see the benchmark table below).
Multi-language (18 languages): the MultiPL-E tasks, exposed as multiple-py, multiple-js, and other multiple-<lang> task names.
Advanced: APPS (competition-level), DS-1000 (data science), HumanEvalPack (synthesize/fix/explain), Mercury (efficiency).
Step 2: Configure model and generation
# Standard HuggingFace model
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--do_sample True \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution
# Quantized model (4-bit)
accelerate launch main.py \
--model codellama/CodeLlama-34b-hf \
--tasks humaneval \
--load_in_4bit \
--max_length_generation 512 \
--allow_code_execution
# Custom/private model
accelerate launch main.py \
--model /path/to/my-code-model \
--tasks humaneval \
--trust_remote_code \
--use_auth_token \
--allow_code_execution
Step 3: Run evaluation
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--temperature 0.8 \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution \
--save_generations \
--metric_output_path results/starcoder2-humaneval.json
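Before reading metrics, it is worth eyeballing a few samples: a HumanEval run with --n_samples 200 produces 164 × 200 = 32,800 candidate programs, and a bad prompt template silently ruins all of them. A minimal sketch, assuming the harness's usual layout of one list of candidates per problem (the exact filename, often suffixed with the task name, is printed at the end of the run):
import json

# Adjust the filename to whatever the run reports saving
with open("generations_humaneval.json") as f:
    gens = json.load(f)

print(f"{len(gens)} problems, {len(gens[0])} samples each")
print(gens[0][0])  # first candidate solution for the first problem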
Step 4: Analyze results
Results in results/starcoder2-humaneval.json:
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
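pass@k here is not the raw success rate over k tries; it is the standard unbiased estimator introduced with HumanEval, computed per problem from n samples of which c pass, then averaged over problems. A minimal sketch of the per-problem formula:
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a running
    # product for numerical stability (n samples drawn, c passing)
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(200, 71, 1))   # 0.355 — for k=1 this reduces to c/n
print(pass_at_k(200, 71, 10))  # grows with k: more tries, more chances

This is why --n_samples 200 matters: with n barely above k the estimator is noisy, which is a common reason local numbers disagree with published ones.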
Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
Step 1: Generate solutions on host
# Generate without execution (safe)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--max_length_generation 650 \
--temperature 0.8 \
--n_samples 50 \
--batch_size 50 \
--generation_only \
--save_generations \
--save_generations_path generations_multi.json
Step 2: Evaluate in Docker container
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Run evaluation inside container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
-it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--load_generations_path /app/generations.json \
--allow_code_execution \
--n_samples 50
Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
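Step 3: Compare across languages
The metrics file keys each MultiPL-E task by name, so a per-language table is a few lines of Python; a sketch, with a hypothetical metrics path (use whatever --metric_output_path you set):
import json

with open("results/starcoder2-multiple.json") as f:  # hypothetical path
    metrics = json.load(f)

# Task results are keyed by task name, e.g. "multiple-py"
for task, scores in sorted(metrics.items()):
    if task.startswith("multiple-"):
        lang = task.split("-", 1)[1]
        print(f"{lang:>6}  pass@1 = {scores['pass@1']:.3f}")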
Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
Step 1: Choose instruction tasks
Use instruct-humaneval for instruction-wrapped HumanEval, or the HumanEvalPack tasks (e.g. humanevalsynthesize-python) for synthesis, fixing, and explanation; both are configured below.
Step 2: Configure instruction tokens
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks instruct-humaneval \
--instruction_tokens "<s>[INST],</s>,[/INST]" \
--max_length_generation 512 \
--allow_code_execution
Step 3: HumanEvalPack for instruction models
# Test code synthesis across 6 languages
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python,humanevalsynthesize-js \
--prompt instruct \
--max_length_generation 512 \
--allow_code_execution
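If you are unsure which markers to pass to --instruction_tokens, one option (a sketch, assuming the model ships a chat template) is to render a dummy turn with transformers and read the wrapping off the output:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Reverse a string in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # the tokens around the user turn are what --instruction_tokens should mirror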
Benchmark suite for model comparison.
Step 1: Create evaluation script
#!/bin/bash
# eval_models.sh
MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"
  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
Step 2: Generate comparison table
import json
import pandas as pd

models = [
    "bigcode-starcoder2-7b",
    "codellama-CodeLlama-7b-hf",
    "deepseek-ai-deepseek-coder-6.7b-base",
]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
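Because every model in the loop shares the same temperature, n_samples, and tasks, the resulting columns are directly comparable; change any of those knobs for one model and the comparison is no longer apples-to-apples.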
Use BigCode Evaluation Harness when: you are benchmarking open-weights models (Hugging Face IDs or local checkpoints) and want reproducible pass@k numbers with locally executed tests.
Use alternatives when: the model is only reachable through an API, or you need agentic, repository-level evaluation beyond the single-function benchmarks listed below.
| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |
Issue: Different results than reported in papers
Check these factors:
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200
# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8
# 3. Verify task name matches exactly
--tasks humaneval # Not "human_eval" or "HumanEval"
# 4. Check max_length_generation
--max_length_generation 512 # Increase for longer problems
Issue: CUDA out of memory
# Use quantization
--load_in_8bit
# OR
--load_in_4bit
# Reduce batch size
--batch_size 1
# Set memory limit
--max_memory_per_gpu "20GiB"
Issue: Code execution hangs or times out
Use Docker for safe execution:
# Generate on host (no execution)
--generation_only --save_generations
# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
Issue: Low scores on instruction models
Ensure proper instruction formatting:
# Use instruction-specific tasks
--tasks instruct-humaneval
# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
Issue: MultiPL-E language failures
Use the dedicated Docker image:
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
| Argument | Default | Description |
|---|---|---|
| --model | - | HuggingFace model ID or local path |
| --tasks | - | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | - | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |
| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |