distributed-llm-pretraining-torchtitan by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill distributed-llm-pretraining-torchtitan
TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
Installation:
# From PyPI (stable)
pip install torchtitan
# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
Download tokenizer:
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
Start training on 8 GPUs:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
Copy this checklist:
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
Step 1: Download tokenizer
python scripts/download_hf_assets.py \
--repo_id meta-llama/Llama-3.1-8B \
--assets tokenizer \
--hf_token=YOUR_HF_TOKEN
Step 2: Configure training
Edit or create a TOML config file:
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"
[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"
[optimizer]
name = "AdamW"
lr = 3e-4
[lr_scheduler]
warmup_steps = 200
[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"
[parallelism]
data_parallel_shard_degree = -1 # Use all GPUs for FSDP
[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
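As a sanity check on the config above: under pure FSDP, every data-parallel rank consumes its own local batch, so tokens per optimizer step are local_batch_size × seq_len × number of GPUs. A minimal sketch, assuming the 8-GPU single-node run (plain arithmetic, not TorchTitan output):

```python
# Tokens per optimizer step under pure FSDP (data parallelism):
# every data-parallel rank processes its own local batch.
local_batch_size = 2   # from [training] above
seq_len = 8192         # from [training] above
num_gpus = 8           # single-node run

tokens_per_step = local_batch_size * seq_len * num_gpus
print(tokens_per_step)  # 131072 (~131K) tokens per optimizer step
```

With steps = 1000, that works out to roughly 131M tokens for the whole run, which is useful when budgeting dataset size.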
Step 3: Launch training
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_8b_custom.toml
Step 4: Monitor and checkpoint
TensorBoard logs are saved to ./outputs/tb/:
tensorboard --logdir ./outputs/tb
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint
Step 1: Configure parallelism for scale
For 70B model on 256 GPUs (32 nodes):
[parallelism]
data_parallel_shard_degree = 32 # FSDP across 32 ranks
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 1 # No PP for 70B
context_parallel_degree = 1 # Increase for long sequences
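The four degrees must multiply out to the total world size, or the job will fail at launch. A quick check for the 256-GPU layout above (just arithmetic, not a TorchTitan API):

```python
# Parallelism degrees must factor the world size exactly:
# dp_shard * tp * pp * cp == total number of GPUs.
dp_shard, tp, pp, cp = 32, 8, 1, 1   # from the [parallelism] block above
world_size = 32 * 8                  # 32 nodes x 8 GPUs per node

product = dp_shard * tp * pp * cp
assert product == world_size         # 256 == 256
print(product)
```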
Step 2: Set up SLURM script
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
srun torchrun \
--nnodes=32 \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m torchtitan.train \
--job.config_file ./llama3_70b.toml
Step 3: Submit job
sbatch multinode_trainer.slurm
Step 4: Resume from checkpoint
Training resumes automatically if a checkpoint exists in the configured folder.
Float8 provides 30-50% speedup on H100 GPUs.
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile
Step 1: Install torchao
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
Step 2: Configure Float8
Add to your TOML config:
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"] # Exclude output layer
[compile]
enable = true
components = ["model", "loss"]
Step 3: Launch with compile
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
--model.converters="quantize.linear.float8" \
--quantize.linear.float8.enable_fsdp_float8_all_gather \
--compile.enable
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs
Step 1: Create seed checkpoint
Required for consistent initialization across PP stages:
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
--checkpoint.enable \
--checkpoint.create_seed_checkpoint \
--parallelism.data_parallel_shard_degree 1 \
--parallelism.tensor_parallel_degree 1 \
--parallelism.pipeline_parallel_degree 1
Step 2: Configure 4D parallelism
[parallelism]
data_parallel_shard_degree = 8 # FSDP
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 8 # PP across nodes
context_parallel_degree = 1 # CP for long sequences
[training]
local_batch_size = 32
seq_len = 8192
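Note that only the data-parallel dimension contributes distinct samples; TP, PP, and CP ranks cooperate on the same batch. For the layout above, the world size and tokens per optimizer step work out as follows (plain arithmetic, not TorchTitan output):

```python
dp_shard, tp, pp, cp = 8, 8, 8, 1     # from [parallelism] above
local_batch_size, seq_len = 32, 8192  # from [training] above

# Total GPUs is the product of all four degrees.
world_size = dp_shard * tp * pp * cp          # 512 GPUs

# Only data-parallel replicas see distinct batches.
tokens_per_step = dp_shard * local_batch_size * seq_len
print(world_size, tokens_per_step)  # 512, 2097152 (~2M tokens/step)
```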
Step 3: Launch on 512 GPUs
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_405b.toml
Use TorchTitan when:
Use alternatives when:
Issue: Out of memory on large models
Enable activation checkpointing and reduce batch size:
[activation_checkpoint]
mode = "full" # Instead of "selective"
[training]
local_batch_size = 1
Or use gradient accumulation:
[training]
local_batch_size = 1
global_batch_size = 32 # Accumulates gradients
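With both batch sizes set, the number of accumulation micro-steps is typically global_batch_size / (local_batch_size × data-parallel degree). A sketch assuming an 8-rank FSDP run (the exact formula TorchTitan uses internally may differ):

```python
local_batch_size = 1    # from [training] above
global_batch_size = 32  # from [training] above
dp_degree = 8           # assumed: 8 data-parallel (FSDP) ranks

# Gradients are accumulated over micro-batches until the
# global batch is reached, then the optimizer steps once.
accum_steps = global_batch_size // (local_batch_size * dp_degree)
print(accum_steps)  # 4 micro-batches per optimizer step
```

This trades memory for throughput: activations only need to fit one local batch of size 1 at a time.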
Issue: TP causes high memory with async collectives
Set environment variable:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
Issue: Float8 training not faster
Float8 only benefits large GEMMs. Filter small layers:
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
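The auto_filter_small_kn entry is meant to skip layers whose GEMM dimensions are too small to amortize the Float8 scaling overhead. A rough, illustrative heuristic for that idea (the threshold and function name here are assumptions for exposition, not TorchTitan's actual filter logic):

```python
def is_small_gemm(k: int, n: int, threshold: int = 4096) -> bool:
    """Illustrative heuristic: treat a linear layer's GEMM as 'small'
    (not worth Float8 conversion) if either inner dimension falls
    below the threshold. Not TorchTitan's real filter implementation."""
    return k < threshold or n < threshold

# A narrow projection stays in bf16; a wide MLP projection converts.
print(is_small_gemm(1024, 4096))   # True  -> keep in bf16
print(is_small_gemm(4096, 14336))  # False -> convert to Float8
```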
Issue: Checkpoint loading fails after parallelism change
Use DCP's resharding capability:
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
dcp_to_torch checkpoint/step-1000 checkpoint.pt
Issue: Pipeline parallelism fails at initialization
Create seed checkpoint first (see Workflow 4, Step 1).
| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |
| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
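The 8B speedup figure in the table can be recomputed directly from the two TPS numbers:

```python
baseline_tps = 5762  # Llama 8B, FSDP only, from the table
fp8_tps = 8532       # Llama 8B, FSDP + compile + FP8, from the table

speedup_pct = round((fp8_tps / baseline_tps - 1) * 100)
print(f"+{speedup_pct}%")  # +48%, matching the table
```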
FSDP2 configuration: See references/fsdp.md for a detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for the TrainSpec protocol.
Weekly installs: 59
GitHub stars: 22.6K
First seen: Jan 29, 2026
Security audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Fail
Installed on: opencode (56), github-copilot (54), codex (53), gemini-cli (52), cursor (51), amp (50)