distributed-llm-pretraining-torchtitan by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill distributed-llm-pretraining-torchtitan
TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
Installation:
# From PyPI (stable)
pip install torchtitan
# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
Download tokenizer:
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
Start training on 8 GPUs:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
Copy this checklist:
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
Step 1: Download tokenizer
python scripts/download_hf_assets.py \
--repo_id meta-llama/Llama-3.1-8B \
--assets tokenizer \
--hf_token=YOUR_HF_TOKEN
Step 2: Configure training
Edit or create a TOML config file:
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"
[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"
[optimizer]
name = "AdamW"
lr = 3e-4
[lr_scheduler]
warmup_steps = 200
[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"
[parallelism]
data_parallel_shard_degree = -1 # Use all GPUs for FSDP
[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
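As a sanity check on the config above: under pure FSDP, every data-parallel rank consumes its own local batch, so tokens per optimizer step are local_batch_size × seq_len × number of GPUs. A minimal sketch, assuming the 8-GPU single-node run (plain arithmetic, not TorchTitan output):

```python
# Tokens per optimizer step under pure FSDP (data parallelism):
# every data-parallel rank processes its own local batch.
local_batch_size = 2   # from [training] above
seq_len = 8192         # from [training] above
num_gpus = 8           # single-node run

tokens_per_step = local_batch_size * seq_len * num_gpus
print(tokens_per_step)  # 131072 (~131K) tokens per optimizer step
```

With steps = 1000, that works out to roughly 131M tokens for the whole run, which is useful when budgeting dataset size.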
Step 3: Launch training
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_8b_custom.toml
Step 4: Monitor and checkpoint
TensorBoard logs are saved to ./outputs/tb/:
tensorboard --logdir ./outputs/tb
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint
Step 1: Configure parallelism for scale
For 70B model on 256 GPUs (32 nodes):
[parallelism]
data_parallel_shard_degree = 32 # FSDP across 32 ranks
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 1 # No PP for 70B
context_parallel_degree = 1 # Increase for long sequences
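The four degrees must multiply out to the total world size, or the job will fail at launch. A quick check for the 256-GPU layout above (just arithmetic, not a TorchTitan API):

```python
# Parallelism degrees must factor the world size exactly:
# dp_shard * tp * pp * cp == total number of GPUs.
dp_shard, tp, pp, cp = 32, 8, 1, 1   # from the [parallelism] block above
world_size = 32 * 8                  # 32 nodes x 8 GPUs per node

product = dp_shard * tp * pp * cp
assert product == world_size         # 256 == 256
print(product)
```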
Step 2: Set up SLURM script
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
srun torchrun \
--nnodes=32 \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m torchtitan.train \
--job.config_file ./llama3_70b.toml
Step 3: Submit job
sbatch multinode_trainer.slurm
Step 4: Resume from checkpoint
Training resumes automatically if a checkpoint exists in the configured folder.
Float8 provides 30-50% speedup on H100 GPUs.
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile
Step 1: Install torchao
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
Step 2: Configure Float8
Add to your TOML config:
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"] # Exclude output layer
[compile]
enable = true
components = ["model", "loss"]
Step 3: Launch with compile
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
--model.converters="quantize.linear.float8" \
--quantize.linear.float8.enable_fsdp_float8_all_gather \
--compile.enable
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs
Step 1: Create seed checkpoint
Required for consistent initialization across PP stages:
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
--checkpoint.enable \
--checkpoint.create_seed_checkpoint \
--parallelism.data_parallel_shard_degree 1 \
--parallelism.tensor_parallel_degree 1 \
--parallelism.pipeline_parallel_degree 1
Step 2: Configure 4D parallelism
[parallelism]
data_parallel_shard_degree = 8 # FSDP
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 8 # PP across nodes
context_parallel_degree = 1 # CP for long sequences
[training]
local_batch_size = 32
seq_len = 8192
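Note that only the data-parallel dimension contributes distinct samples; TP, PP, and CP ranks cooperate on the same batch. For the layout above, the world size and tokens per optimizer step work out as follows (plain arithmetic, not TorchTitan output):

```python
dp_shard, tp, pp, cp = 8, 8, 8, 1     # from [parallelism] above
local_batch_size, seq_len = 32, 8192  # from [training] above

# Total GPUs is the product of all four degrees.
world_size = dp_shard * tp * pp * cp          # 512 GPUs

# Only data-parallel replicas see distinct batches.
tokens_per_step = dp_shard * local_batch_size * seq_len
print(world_size, tokens_per_step)  # 512, 2097152 (~2M tokens/step)
```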
Step 3: Launch on 512 GPUs
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_405b.toml
Use TorchTitan when:
Use alternatives when:
Issue: Out of memory on large models
Enable activation checkpointing and reduce batch size:
[activation_checkpoint]
mode = "full" # Instead of "selective"
[training]
local_batch_size = 1
Or use gradient accumulation:
[training]
local_batch_size = 1
global_batch_size = 32 # Accumulates gradients
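With both batch sizes set, the number of accumulation micro-steps is typically global_batch_size / (local_batch_size × data-parallel degree). A sketch assuming an 8-rank FSDP run (the exact formula TorchTitan uses internally may differ):

```python
local_batch_size = 1    # from [training] above
global_batch_size = 32  # from [training] above
dp_degree = 8           # assumed: 8 data-parallel (FSDP) ranks

# Gradients are accumulated over micro-batches until the
# global batch is reached, then the optimizer steps once.
accum_steps = global_batch_size // (local_batch_size * dp_degree)
print(accum_steps)  # 4 micro-batches per optimizer step
```

This trades memory for throughput: activations only need to fit one local batch of size 1 at a time.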
Issue: TP causes high memory with async collectives
Set environment variable:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
Issue: Float8 training not faster
Float8 only benefits large GEMMs. Filter small layers:
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
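The auto_filter_small_kn entry is meant to skip layers whose GEMM dimensions are too small to amortize the Float8 scaling overhead. A rough, illustrative heuristic for that idea (the threshold and function name here are assumptions for exposition, not TorchTitan's actual filter logic):

```python
def is_small_gemm(k: int, n: int, threshold: int = 4096) -> bool:
    """Illustrative heuristic: treat a linear layer's GEMM as 'small'
    (not worth Float8 conversion) if either inner dimension falls
    below the threshold. Not TorchTitan's real filter implementation."""
    return k < threshold or n < threshold

# A narrow projection stays in bf16; a wide MLP projection converts.
print(is_small_gemm(1024, 4096))   # True  -> keep in bf16
print(is_small_gemm(4096, 14336))  # False -> convert to Float8
```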
Issue: Checkpoint loading fails after parallelism change
Use DCP's resharding capability:
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
dcp_to_torch checkpoint/step-1000 checkpoint.pt
Issue: Pipeline parallelism fails at initialization
Create seed checkpoint first (see Workflow 4, Step 1).
| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |
| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
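The 8B speedup figure in the table can be recomputed directly from the two TPS numbers:

```python
baseline_tps = 5762  # Llama 8B, FSDP only, from the table
fp8_tps = 8532       # Llama 8B, FSDP + compile + FP8, from the table

speedup_pct = round((fp8_tps / baseline_tps - 1) * 100)
print(f"+{speedup_pct}%")  # +48%, matching the table
```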
FSDP2 configuration: See references/fsdp.md for a detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for the TrainSpec protocol.
Weekly installs: 59
GitHub stars: 22.6K
First seen: Jan 29, 2026
Security audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Fail
Installed on: opencode (56), github-copilot (54), codex (53), gemini-cli (52), cursor (51), amp (50)