evaluating-code-models by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-code-models
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
Evaluate on HumanEval:
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
View available tasks:
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
Evaluate a model on core code benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
Step 1: Choose benchmark suite
Python code generation (most common): humaneval and mbpp, plus their stricter HumanEval+ / MBPP+ variants (see the benchmark table below).
Multi-language (18 languages): the MultiPL-E tasks, exposed as multiple-py, multiple-js, and other multiple-<lang> task names.
Advanced: APPS (competition-level), DS-1000 (data science), HumanEvalPack (synthesize/fix/explain), Mercury (efficiency).
Step 2: Configure model and generation
# Standard HuggingFace model
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--do_sample True \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution
# Quantized model (4-bit)
accelerate launch main.py \
--model codellama/CodeLlama-34b-hf \
--tasks humaneval \
--load_in_4bit \
--max_length_generation 512 \
--allow_code_execution
# Custom/private model
accelerate launch main.py \
--model /path/to/my-code-model \
--tasks humaneval \
--trust_remote_code \
--use_auth_token \
--allow_code_execution
Step 3: Run evaluation
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--temperature 0.8 \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution \
--save_generations \
--metric_output_path results/starcoder2-humaneval.json
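Before reading metrics, it is worth eyeballing a few samples: a HumanEval run with --n_samples 200 produces 164 × 200 = 32,800 candidate programs, and a bad prompt template silently ruins all of them. A minimal sketch, assuming the harness's usual layout of one list of candidates per problem (the exact filename, often suffixed with the task name, is printed at the end of the run):
import json

# Adjust the filename to whatever the run reports saving
with open("generations_humaneval.json") as f:
    gens = json.load(f)

print(f"{len(gens)} problems, {len(gens[0])} samples each")
print(gens[0][0])  # first candidate solution for the first problem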
Step 4: Analyze results
Results in results/starcoder2-humaneval.json:
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
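pass@k here is not the raw success rate over k tries; it is the standard unbiased estimator introduced with HumanEval, computed per problem from n samples of which c pass, then averaged over problems. A minimal sketch of the per-problem formula:
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a running
    # product for numerical stability (n samples drawn, c passing)
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(200, 71, 1))   # 0.355 — for k=1 this reduces to c/n
print(pass_at_k(200, 71, 10))  # grows with k: more tries, more chances

This is why --n_samples 200 matters: with n barely above k the estimator is noisy, which is a common reason local numbers disagree with published ones.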
Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
Step 1: Generate solutions on host
# Generate without execution (safe)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--max_length_generation 650 \
--temperature 0.8 \
--n_samples 50 \
--batch_size 50 \
--generation_only \
--save_generations \
--save_generations_path generations_multi.json
Step 2: Evaluate in Docker container
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Run evaluation inside container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
-it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--load_generations_path /app/generations.json \
--allow_code_execution \
--n_samples 50
Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
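Step 3: Compare across languages
The metrics file keys each MultiPL-E task by name, so a per-language table is a few lines of Python; a sketch, with a hypothetical metrics path (use whatever --metric_output_path you set):
import json

with open("results/starcoder2-multiple.json") as f:  # hypothetical path
    metrics = json.load(f)

# Task results are keyed by task name, e.g. "multiple-py"
for task, scores in sorted(metrics.items()):
    if task.startswith("multiple-"):
        lang = task.split("-", 1)[1]
        print(f"{lang:>6}  pass@1 = {scores['pass@1']:.3f}")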
Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
Step 1: Choose instruction tasks
Use instruct-humaneval for instruction-wrapped HumanEval, or the HumanEvalPack tasks (e.g. humanevalsynthesize-python) for synthesis, fixing, and explanation; both are configured below.
Step 2: Configure instruction tokens
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks instruct-humaneval \
--instruction_tokens "<s>[INST],</s>,[/INST]" \
--max_length_generation 512 \
--allow_code_execution
Step 3: HumanEvalPack for instruction models
# Test code synthesis across 6 languages
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python,humanevalsynthesize-js \
--prompt instruct \
--max_length_generation 512 \
--allow_code_execution
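If you are unsure which markers to pass to --instruction_tokens, one option (a sketch, assuming the model ships a chat template) is to render a dummy turn with transformers and read the wrapping off the output:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Reverse a string in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # the tokens around the user turn are what --instruction_tokens should mirror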
Benchmark suite for model comparison.
Step 1: Create evaluation script
#!/bin/bash
# eval_models.sh
MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"
  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
Step 2: Generate comparison table
import json
import pandas as pd

models = [
    "bigcode-starcoder2-7b",
    "codellama-CodeLlama-7b-hf",
    "deepseek-ai-deepseek-coder-6.7b-base",
]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
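Because every model in the loop shares the same temperature, n_samples, and tasks, the resulting columns are directly comparable; change any of those knobs for one model and the comparison is no longer apples-to-apples.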
Use BigCode Evaluation Harness when: you are benchmarking open-weights models (Hugging Face IDs or local checkpoints) and want reproducible pass@k numbers with locally executed tests.
Use alternatives when: the model is only reachable through an API, or you need agentic, repository-level evaluation beyond the single-function benchmarks listed below.
| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |
Issue: Different results than reported in papers
Check these factors:
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200
# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8
# 3. Verify task name matches exactly
--tasks humaneval # Not "human_eval" or "HumanEval"
# 4. Check max_length_generation
--max_length_generation 512 # Increase for longer problems
Issue: CUDA out of memory
# Use quantization
--load_in_8bit
# OR
--load_in_4bit
# Reduce batch size
--batch_size 1
# Set memory limit
--max_memory_per_gpu "20GiB"
Issue: Code execution hangs or times out
Use Docker for safe execution:
# Generate on host (no execution)
--generation_only --save_generations
# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
Issue: Low scores on instruction models
Ensure proper instruction formatting:
# Use instruction-specific tasks
--tasks instruct-humaneval
# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
Issue: MultiPL-E language failures
Use the dedicated Docker image:
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
| Argument | Default | Description |
|---|---|---|
| --model | - | HuggingFace model ID or local path |
| --tasks | - | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | - | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |
| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |