Megatron-Core Large-Scale LLM Training Guide: Parallel Training and Optimization for Models from 2B to 462B Parameters

training-llms-megatron by davila7/claude-code-templates

Install command:

npx skills add https://github.com/davila7/claude-code-templates --skill training-llms-megatron

Megatron-Core - Large-Scale LLM Training

Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

Installation:

# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core

Simple distributed training:

# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

Step 1: Choose parallelism configuration

Model size determines parallelism strategy:

Model Size	GPUs	Tensor Parallel	Pipeline Parallel	Data Parallel	Context Parallel
7B	8	1	1	8	1
13B	8	2	1	4	1
70B	64	4	4	4	1
405B	128	8	8	1	2
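A quick sanity check for any layout of this kind is that the parallel sizes multiply out to the GPU count. The sketch below uses a hypothetical helper, `check_layout`, which is not part of Megatron-Core. It assumes the world size is TP × PP × DP × CP and that tensor parallelism should stay inside one 8-GPU node.

```python
def check_layout(gpus, tp, pp, dp, cp, gpus_per_node=8):
    """Return a list of problems with a parallel layout (empty if none found)."""
    problems = []
    product = tp * pp * dp * cp
    if product != gpus:
        problems.append(f"TP*PP*DP*CP = {product}, expected {gpus} GPUs")
    if tp > gpus_per_node:
        problems.append(f"TP={tp} exceeds {gpus_per_node} GPUs per node")
    return problems


# The 7B, 13B and 70B rows from the table
print(check_layout(8, tp=1, pp=1, dp=8, cp=1))   # []
print(check_layout(8, tp=2, pp=1, dp=4, cp=1))   # []
print(check_layout(64, tp=4, pp=4, dp=4, cp=1))  # []
```

An empty list means the row is self-consistent; any mismatch is reported as a readable message rather than an exception, so several candidate layouts can be checked in one pass.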

Step 2: Configure training hyperparameters

#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

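# Multi-node runs also need torchrun rendezvous settings. NODE_RANK, MASTER_ADDR
# and MASTER_PORT are assumed to be exported by your launcher or job script.
torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  pretrain_gpt.py \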
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
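The batch settings in this script imply a fixed amount of gradient accumulation. As a rough sketch (`batch_plan` is a hypothetical helper, and it assumes the data-parallel size is whatever remains after TP × PP × CP), the arithmetic looks like this:

```python
def batch_plan(world_size, tp, pp, cp, micro_batch, global_batch, seq_length):
    """Derive data-parallel size and micro-batches per iteration from the script settings."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by TP*PP*CP")
    dp = world_size // model_parallel
    samples_per_pass = micro_batch * dp
    if global_batch % samples_per_pass:
        raise ValueError("global batch must be divisible by micro_batch * DP")
    return {
        "data_parallel": dp,
        "microbatches_per_iteration": global_batch // samples_per_pass,
        "tokens_per_iteration": global_batch * seq_length,
    }


plan = batch_plan(world_size=64, tp=4, pp=4, cp=1,
                  micro_batch=1, global_batch=1024, seq_length=4096)
print(plan)
# {'data_parallel': 4, 'microbatches_per_iteration': 256, 'tokens_per_iteration': 4194304}
```

With these values each data-parallel replica processes 256 micro-batches per optimizer step, which also keeps the pipeline well fed.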

Step 3: Launch distributed training

# Run on every node (the script sets NNODES=8, i.e. 64 GPUs; TP=4 × PP=4 does not fit on a single 8-GPU node)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh

Step 4: Monitor performance metrics

Key metrics to track:

Model FLOP Utilization (MFU): Target >40% on H100
Throughput: Tokens/sec/GPU
Memory usage: <80GB per GPU for 70B model
Loss: Should decrease steadily
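MFU can be estimated from measured throughput. The sketch below assumes the common approximation of 6 FLOPs per parameter per token for a forward and backward pass (it ignores attention FLOPs), and a dense BF16 peak of 989 TFLOP/s for an H100 SXM. Both are assumptions to replace with your own figures, and the throughput value is made up for illustration.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak; check your GPU's datasheet


def estimate_mfu(tokens_per_sec_per_gpu, n_params, peak_flops=H100_BF16_PEAK_FLOPS):
    """Approximate MFU as achieved training FLOP/s divided by peak FLOP/s."""
    achieved = 6 * n_params * tokens_per_sec_per_gpu
    return achieved / peak_flops


# Hypothetical measurement: a 70B model at 1000 tokens/sec/GPU
mfu = estimate_mfu(tokens_per_sec_per_gpu=1000, n_params=70e9)
print(f"{mfu:.1%}")  # 42.5%
```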

Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

Step 1: Configure expert parallelism

# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

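TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
# Caveat: Megatron-Core generally forms expert-parallel groups out of the data-parallel ranks,
# so EP may not multiply the world size; the Step 2 launch uses 8 GPUs (TP=2, DP=4, EP=4).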

Step 2: Set MoE hyperparameters

torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
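To make `--moe-router-topk 2` concrete, here is a generic illustration of top-k routing for a single token. It is not Megatron-Core's router code: `route_top_k` is a hypothetical function, the logits are made up, and the auxiliary load-balancing loss is left out.

```python
import math


def route_top_k(logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # subtract max for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# One router score per expert (8 experts), for a single token
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
choice = route_top_k(logits, k=2)
for expert, weight in choice:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the two selected experts run for this token, and their outputs are combined using the printed weights.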

Step 3: Launch training with EP

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

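Expert parameters per GPU without EP: 8 experts × 7B (nominal) = 56B
Expert parameters per GPU with EP=4: 2 experts × 7B (nominal) = 14B
Savings: 75% fewer expert parameters per GPU
These are parameter counts, not bytes: actual memory depends on precision (2 bytes per parameter in BF16) plus gradients and optimizer state, and EP does not shrink the non-expert layers.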
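The 75% figure follows from how many experts each GPU holds. A minimal sketch, assuming experts are split evenly across the expert-parallel group (`experts_per_gpu` is a hypothetical helper):

```python
def experts_per_gpu(num_experts, ep_size):
    """Number of experts stored on each GPU under expert parallelism."""
    if num_experts % ep_size:
        raise ValueError("num_experts must be divisible by the EP size")
    return num_experts // ep_size


num_experts = 8
local = experts_per_gpu(num_experts, ep_size=4)
reduction = 1 - local / num_experts
print(local, f"{reduction:.0%}")  # 2 75%
```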

Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

Step 1: Enable Flash Attention and related optimizations

--use-flash-attn  # Use Flash Attention
--use-mcore-models  # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel  # Reduce activation memory (use with TP)

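Step 2: Use FP8 precision (Hopper or newer, e.g. H100)

--fp8-hybrid  # FP8 mixed precision training (newer Megatron-LM releases spell this --fp8-format hybrid)
# Transformer Engine handles FP8 automatically

Result: up to a 1.5-2x speedup over BF16 on H100, depending on the model and configuration.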

Step 3: Optimize micro-batch size

Find largest micro-batch that fits in memory:

# Start with 1 and double until a run fails (e.g. with OOM)
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS || break  # stop at the first failing size
done

Typical values:

7B model: 4-8
70B model: 1-2
405B model: 1
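The same search can be driven from Python if trial runs are wrapped in a function. This is only a sketch: `largest_fitting` is a hypothetical helper, and `fake_trial` stands in for a real short training run that reports whether it finished without running out of memory.

```python
def largest_fitting(candidates, fits):
    """Return the largest candidate for which fits(candidate) is true, or None."""
    best = None
    for mbs in sorted(candidates):
        if not fits(mbs):
            break  # larger sizes will not fit either
        best = mbs
    return best


# Stand-in for a real trial run: pretend anything above 4 runs out of memory.
def fake_trial(mbs):
    return mbs <= 4


print(largest_fitting([1, 2, 4, 8], fake_trial))  # 4
```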

Step 4: Tune parallelism degrees

Rules of thumb:

Tensor Parallel: Use ≤8 (limited by NVLink within node)
Pipeline Parallel: Use for >70B models
Context Parallel: Use for sequences >8K tokens
Data Parallel: Fill remaining GPUs

Example 405B on 128 H100s:

TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
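The rule "fill remaining GPUs with data parallelism" can be written down directly. The sketch below uses a hypothetical helper, `replica_footprint`, and assumes 8 GPUs per node; it reports how many GPUs and nodes one model replica occupies and how many replicas fit.

```python
def replica_footprint(total_gpus, tp, pp, cp, gpus_per_node=8):
    """Size of one model replica and the data-parallel degree left over."""
    gpus_per_replica = tp * pp * cp
    if total_gpus % gpus_per_replica:
        raise ValueError("total GPUs must be divisible by TP*PP*CP")
    return {
        "gpus_per_replica": gpus_per_replica,
        "nodes_per_replica": -(-gpus_per_replica // gpus_per_node),  # ceiling division
        "data_parallel": total_gpus // gpus_per_replica,
    }


layout = replica_footprint(total_gpus=128, tp=8, pp=8, cp=2)
print(layout)
# {'gpus_per_replica': 128, 'nodes_per_replica': 16, 'data_parallel': 1}
```

Doubling the cluster to 256 GPUs with the same TP, PP and CP would give a data-parallel degree of 2.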

When to use vs alternatives

Use Megatron-Core when:

Training models >10B parameters
Need maximum efficiency (target >40% MFU)
Using NVIDIA GPUs (A100, H100)
Production training at scale
Want fine-grained parallelism control

Use alternatives instead:

PyTorch FSDP: Models <70B, simpler API, PyTorch native
DeepSpeed: Easier setup, good for <100B models
HuggingFace Accelerate: Prototyping, simpler workflows
LitGPT: Educational, single-file implementations

Common issues

Issue: Low GPU utilization (<30% MFU)

Causes:

Micro-batch too small
Too much parallelism overhead
Not using Flash Attention

Fixes:

# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16

Issue: Out of memory

Reduce memory with:

--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full  # Gradient checkpointing
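--recompute-method block  # Recompute a fixed number of layers per pipeline stage
--recompute-num-layers 1  # Layers to recompute per stage; raise it to save more memory (with --recompute-method uniform, 1 checkpoints every layer)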

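Or offload optimizer state to the CPU (flag names differ across Megatron-LM versions and forks, so check `pretrain_gpt.py --help` for yours):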

--cpu-optimizer  # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant

Issue: Training slower than expected

Check:

Network bottleneck: Ensure InfiniBand/NVLink is enabled
Pipeline bubbles: Use the interleaved pipeline schedule
```
--num-layers-per-virtual-pipeline-stage 2
```
Data loading: Use a fast data loader
```
--dataloader-type cyclic
```
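To see why interleaving helps with pipeline bubbles, the sketch below uses the bubble-fraction estimate from the 2021 Megatron-LM paper listed under Resources: (p − 1) / m for p pipeline stages and m micro-batches, divided by the number of virtual stages v when interleaving. The function name and the example sizes (80 layers, PP=4, 256 micro-batches) are illustrative assumptions.

```python
def bubble_fraction(pp, num_microbatches, virtual_stages=1):
    """Pipeline bubble time as a fraction of ideal compute time."""
    return (pp - 1) / (virtual_stages * num_microbatches)


num_layers, pp, microbatches = 80, 4, 256
layers_per_virtual_stage = 2  # value passed to --num-layers-per-virtual-pipeline-stage
virtual_stages = (num_layers // pp) // layers_per_virtual_stage  # 20 // 2 = 10

plain = bubble_fraction(pp, microbatches)
interleaved = bubble_fraction(pp, microbatches, virtual_stages)
print(f"{plain:.2%} -> {interleaved:.2%}")  # 1.17% -> 0.12%
```

The estimate covers idle time only; interleaving also adds communication, so the measured gain is usually smaller than the formula suggests.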

Issue: Diverging loss

Stabilize training:

--lr-warmup-iters 2000  # Longer warmup
--clip-grad 1.0  # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0  # No dropout in FFN
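A longer warmup matters because the learning rate ramps up more gently before decay begins. The following is a generic sketch of linear warmup followed by cosine decay, using the peak rate and iteration counts from the Workflow 1 script. It is not Megatron-Core's scheduler, and `min_lr=0.0` is an assumption.

```python
import math


def lr_at(step, max_lr=3e-4, warmup_iters=2000, total_iters=100_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_iters:
        return max_lr * step / warmup_iters
    progress = min((step - warmup_iters) / (total_iters - warmup_iters), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

Halfway through warmup (step 1000) the rate is half the peak, it reaches the peak at step 2000, and it falls to the minimum by the final iteration.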

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for a detailed comparison of TP/PP/DP/CP/EP, with performance analysis and guidance on when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

GPU: NVIDIA Ampere or newer (A100, H100, B200)
- Turing works but is slower
- FP8 requires Hopper, Ada, or Blackwell
Network: InfiniBand or 400Gb+ Ethernet for multi-node
Memory per GPU:
- 7B model: 40GB+
- 70B model: 80GB (with TP=4)
- 405B model: 80GB (with TP=8, PP=8)
Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)
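For a first-order feel for per-GPU memory, the sketch below divides model state evenly across tensor- and pipeline-parallel ranks. It assumes roughly 16 bytes per parameter, a common rule of thumb for mixed-precision Adam (weights, gradients, FP32 master weights and two optimizer moments), and it ignores activations, communication buffers and any optimizer sharding. `state_gb_per_gpu` is a hypothetical helper, and the PP=4 value is taken from the Workflow 1 script.

```python
def state_gb_per_gpu(n_params, tp, pp, bytes_per_param=16):
    """Rough weights + gradients + optimizer state per GPU, in GB (no activations)."""
    return n_params * bytes_per_param / (tp * pp) / 1e9


# 70B model with TP=4 and PP=4
print(f"{state_gb_per_gpu(70e9, tp=4, pp=4):.0f} GB")  # 70 GB
```

The result leaves little headroom on an 80GB GPU, which is why activation recomputation and small micro-batches are typical at this scale.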

Resources

Docs: https://docs.nvidia.com/megatron-core/
GitHub: https://github.com/NVIDIA/Megatron-LM
Papers:
- "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
- "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)
