pytorch-fsdp2 by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill pytorch-fsdp2
Use FSDP2 (fully_shard) correctly in a training script

This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.
FSDP2 in PyTorch is exposed primarily via torch.distributed.fsdp.fully_shard and the FSDPModule methods it adds in-place to modules. See: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.
Use FSDP2 when:
Avoid (or be careful) if:
Reference: references/pytorch_ddp_notes.md, references/pytorch_fsdp1_api.md.
- Launch with torchrun and set the CUDA device per process (usually via LOCAL_RANK).
- Apply fully_shard() bottom-up, i.e., shard submodules (e.g., Transformer blocks) before the root module.
- Call model(input), not model.forward(input), so the FSDP2 hooks run (unless you explicitly unshard() or register the forward method).
- Create the optimizer after sharding (after fully_shard).
- Do not torch.save(model.state_dict()) unless you deliberately gather to full tensors.

(Each of these rules is directly described in the official API docs/tutorial; see references.)
Launch with torchrun --nproc_per_node <gpus_per_node> ... and ensure RANK, WORLD_SIZE, LOCAL_RANK are visible.

Reference: references/pytorch_fsdp2_tutorial.md (launch commands and setup), references/pytorch_fully_shard_api.md (user contract).
Minimal, correct pattern:
- dist.init_process_group(backend="nccl")
- torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
- Create a DeviceMesh to describe the data-parallel group(s)

Reference: references/pytorch_device_mesh_tutorial.md (why DeviceMesh exists & how it manages process groups).
For big models, initialize on meta, apply sharding, then materialize weights on GPU:
- with torch.device("meta"): model = ...
- fully_shard(...) on submodules, then fully_shard(model)
- model.to_empty(device="cuda")
- model.reset_parameters() (or your init routine)

Reference: references/pytorch_fsdp2_tutorial.md (migration guide shows this flow explicitly).
Apply fully_shard() bottom-up (wrapping policy = “apply where needed”). Do not only call fully_shard on the topmost module.
Recommended sharding pattern for transformer-like models:
- For each submodule: if isinstance(m, TransformerBlock): fully_shard(m, ...)
- Then the root: fully_shard(model, ...)

Why:
fully_shard forms “parameter groups” for collective efficiency and excludes params already grouped by earlier calls. Bottom-up gives better overlap and lower peak memory.

Reference: references/pytorch_fully_shard_api.md (bottom-up requirement and why).
Tune reshard_after_forward for memory/perf trade-offs.

Default behavior:
- None means True for non-root modules and False for root modules (good default).

Heuristics:
- Memory-bound: set True on many blocks.
- Throughput-bound (and memory permits): keep parameters gathered after forward (False).
- Pass an int to reshard to a smaller mesh after forward (e.g., intra-node) if it’s a meaningful divisor.

Reference: references/pytorch_fully_shard_api.md (full semantics).
FSDP2 uses:
- mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...)
- offload_policy=CPUOffloadPolicy() if you want CPU offload

Rules of thumb:
- Keep reduce_dtype aligned with your gradient reduction expectations.

Reference: references/pytorch_fully_shard_api.md (MixedPrecisionPolicy / OffloadPolicy classes).
For gradient accumulation, toggle gradient synchronization (set_requires_gradient_sync) instead of FSDP1’s no_sync().

Gradient clipping:
Reference: references/pytorch_fsdp2_tutorial.md.
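Gradient accumulation with sync toggling might look like this (a sketch; assumes model was wrapped by fully_shard, which adds the FSDPModule methods in-place):

```python
import torch


def accumulate_step(model, optimizer, microbatches, loss_fn):
    for i, (inputs, targets) in enumerate(microbatches):
        is_last = i == len(microbatches) - 1
        # Skip the gradient reduce-scatter on all but the final microbatch.
        model.set_requires_gradient_sync(is_last)
        loss = loss_fn(model(inputs), targets)  # model(inputs), never model.forward(...)
        loss.backward()
    # Clipping operates on the DTensor gradients (see the FSDP2 tutorial).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```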
Two recommended approaches:
A) Distributed Checkpoint (DCP) — best default
B) Distributed state dict helpers
- get_model_state_dict / set_model_state_dict with StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...)
- get_optimizer_state_dict / set_optimizer_state_dict

Avoid:
- Saving DTensor state dicts with plain torch.save unless you intentionally convert with DTensor.full_tensor() and manage memory carefully.

References:
- references/pytorch_dcp_overview.md (DCP behavior and caveats)
- references/pytorch_dcp_recipe.md and references/pytorch_dcp_async_recipe.md (end-to-end usage)
- references/pytorch_fsdp2_tutorial.md (DTensor vs DCP state-dict flows)
- references/pytorch_examples_fsdp2.md (working checkpoint scripts)

Quickstart checklist:
- Launch with torchrun and initialize the process group.
- Set the CUDA device from LOCAL_RANK; create a DeviceMesh if you need multi-dim parallelism.
- Build the model (on meta if needed), apply fully_shard bottom-up, then fully_shard(model).
- Call model(inputs) so hooks run; use set_requires_gradient_sync for accumulation.
- Add DCP save/load with torch.distributed.checkpoint helpers.

Reference: references/pytorch_fsdp2_tutorial.md, references/pytorch_fully_shard_api.md, references/pytorch_device_mesh_tutorial.md, references/pytorch_dcp_recipe.md.
- Wrap state in Stateful objects or assemble state via get_state_dict.
- Call dcp.save(...) from all ranks to a shared path.
- On load, call dcp.load(...) and restore with set_state_dict.

Reference: references/pytorch_dcp_recipe.md.
Troubleshooting:
- Check torch.cuda.set_device(LOCAL_RANK) and your torchrun flags.
- Calling forward() directly? Use model(input) or explicitly unshard() / register forward.
- Is fully_shard() applied bottom-up?
- Don’t mix DTensor state dicts with torch.save unless you understand conversions.

Common fixes:
- Use model(inputs) (or unshard() explicitly) instead of model.forward(...).
- Create the optimizer after the fully_shard calls.
- Apply fully_shard bottom-up on submodules before the root.
- Set reshard_after_forward=True for more modules.
- Use set_requires_gradient_sync instead of FSDP1’s no_sync().

Reference: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.
The coding agent should implement a script with these labeled blocks:
- init_distributed(): init process group, set device
- build_model_meta(): model on meta, apply fully_shard, materialize weights
- build_optimizer(): optimizer created after sharding
- train_step(): forward/backward/step with model(inputs) and DTensor-aware patterns
- checkpoint_save/load(): DCP or distributed state dict helpers

Concrete examples live in references/pytorch_examples_fsdp2.md and the official tutorial references.
References:
- references/pytorch_fsdp2_tutorial.md
- references/pytorch_fully_shard_api.md
- references/pytorch_ddp_notes.md
- references/pytorch_fsdp1_api.md
- references/pytorch_device_mesh_tutorial.md
- references/pytorch_tp_tutorial.md
- references/pytorch_dcp_overview.md
- references/pytorch_dcp_recipe.md
- references/pytorch_dcp_async_recipe.md
- references/pytorch_examples_fsdp2.md
- references/torchtitan_fsdp_notes.md (optional, production notes)
- references/ray_train_fsdp2_example.md (optional, integration example)

Weekly Installs: 64
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
Installed on: opencode (55), codex (54), cursor (54), gemini-cli (53), claude-code (53), github-copilot (52)