nanochat-llm-training by aradotso/trending-skills
npx skills add https://github.com/aradotso/trending-skills --skill nanochat-llm-training
Skill by ara.so — Daily 2026 Skills collection.
nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with KV cache, and a ChatGPT-like web UI. A single complexity dial (--depth) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2 capability (~$43,000 in 2019) for ~$48 on an 8×H100 node (~2 hours).
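A toy sketch of what a single depth dial can derive. The specific ratios below (64 dims per layer, 128-dim heads, a learning rate that shrinks with width) are illustrative assumptions chosen only to show the pattern, not nanochat's actual formulas:

```python
# Illustrative "depth dial": one knob derives the other hyperparameters.
# The ratios here are assumptions, not nanochat's internals.
from dataclasses import dataclass

@dataclass
class DerivedConfig:
    depth: int
    model_dim: int
    n_heads: int
    lr: float

def config_from_depth(depth: int) -> DerivedConfig:
    model_dim = depth * 64                 # width grows linearly with depth
    n_heads = max(1, model_dim // 128)     # fixed 128-dim attention heads
    lr = 0.02 / (model_dim / 768) ** 0.5   # LR shrinks as the model widens
    return DerivedConfig(depth, model_dim, n_heads, lr)

cfg = config_from_depth(26)
print(cfg.model_dim, cfg.n_heads)  # 1664 13
```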
nanochat uses uv for dependency management:
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv and install deps
uv sync
source .venv/bin/activate
# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=26 \
--run="d26_run" \
--model-tag="d26"
python -m scripts.base_train -- \
--depth=26 \
--run="d26_single"
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run="d12_exp" \
--model-tag="d12" \
--core-metric-every=999999 \
--sample-every=-1 \
--save-every=-1
bash runs/runcpu.sh
# After training completes
source .venv/bin/activate
python -m scripts.chat_web
# Visit http://<your-server-ip>:8000/
python -m scripts.chat_cli -p "hello"
bash runs/scaling_laws.sh # sweep depths for scaling law data
bash runs/miniseries.sh # train full compute-optimal miniseries
The single most important parameter. Everything else is derived automatically:
| --depth | Approximate model scale | Notes |
|---|---|---|
| 6–8 | Tiny (toy) | CPU/MPS feasible |
| 12 | GPT-1 size | ~5 min on 8×H100, great for research iteration |
| 16 | Medium | ~15 min on 8×H100 |
| 24–26 | GPT-2 size | ~2 hrs on 8×H100, ~$48 |
# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"
# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
nanochat uses explicit dtype management via COMPUTE_DTYPE in nanochat/common.py. No torch.amp.autocast.
| Hardware | Default | Override |
|---|---|---|
| CUDA SM 80+ (A100, H100) | bfloat16 | NANOCHAT_DTYPE=float32 |
| CUDA SM < 80 (V100, T4) | float32 | NANOCHAT_DTYPE=float16 |
| CPU / MPS | float32 | — |
# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train
# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train
How it works: weights are stored in fp32 (optimizer precision); the custom Linear casts them to COMPUTE_DTYPE in the forward pass; embeddings are stored directly in COMPUTE_DTYPE to save memory.
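The cast-in-forward pattern can be sketched as follows. `CastLinear` and the hard-coded `COMPUTE_DTYPE` are illustrative stand-ins, not nanochat's actual module:

```python
# Sketch of cast-in-forward: master weights live in fp32 for the
# optimizer, and a custom Linear casts to the compute dtype on the fly.
import torch
import torch.nn as nn
import torch.nn.functional as F

COMPUTE_DTYPE = torch.bfloat16  # nanochat picks this from hardware/env

class CastLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parameters stay fp32; only the forward computation runs in
        # the lower-precision compute dtype.
        w = self.weight.to(COMPUTE_DTYPE)
        b = self.bias.to(COMPUTE_DTYPE) if self.bias is not None else None
        return F.linear(x.to(COMPUTE_DTYPE), w, b)

layer = CastLinear(8, 4)
out = layer(torch.randn(2, 8))
print(layer.weight.dtype, out.dtype)  # torch.float32 torch.bfloat16
```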
nanochat/
├── gpt.py # GPT nn.Module Transformer
├── engine.py # Inference with KV Cache
├── dataloader.py # Tokenizing Distributed Data Loader
├── dataset.py # Download/read utils for pretraining data
├── optim.py # AdamW + Muon optimizer (1GPU and distributed)
├── core_eval.py # DCLM CORE score evaluation
├── loss_eval.py # Bits-per-byte evaluation
├── checkpoint_manager.py # Save/Load checkpoints
├── common.py # Utilities, COMPUTE_DTYPE
└── execution.py # Python code execution tool for LLM
scripts/
├── base_train.py # Pretraining entry point
├── chat_web.py # Web chat UI server
└── chat_cli.py # CLI chat interface
runs/
├── speedrun.sh # Reference full pipeline (GPT-2 speedrun)
├── scaling_laws.sh # Scaling law sweeps
├── miniseries.sh # Full compute-optimal miniseries
└── runcpu.sh # CPU/MPS example
from nanochat.engine import InferenceEngine
from nanochat.checkpoint_manager import CheckpointManager
# Load checkpoint
ckpt_manager = CheckpointManager("checkpoints/d26")
model, config = ckpt_manager.load()
model.eval()
# Run inference with KV cache
engine = InferenceEngine(model)
output = engine.generate(
prompt="Once upon a time",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
)
print(output)
import os
import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):
    """Launch a compute-optimal training run for the given depth."""
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--run={run_name}",
        f"--model-tag={run_name}",
    ]
    # Put our override after the spread so OMP_NUM_THREADS=1 always wins
    subprocess.run(cmd, check=True, env={**os.environ, "OMP_NUM_THREADS": "1"})
# Quick research iteration
train_model(depth=12, run_name="my_experiment_d12")
# Full GPT-2 grade
train_model(depth=26, run_name="my_gpt2_repro")
# Default device_batch_size=32 needs ~80GB VRAM per GPU
# Reduce for smaller GPUs (gradient accumulation handles the rest)
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \
--depth=12 \
--device_batch_size=16 \
--run="low_vram_run"
# Even smaller
python -m scripts.base_train -- \
--depth=8 \
--device_batch_size=4 \
--run="single_gpu_small"
# nanochat logs to wandb automatically. Key metrics to watch:
# - val_bpb: validation loss in bits-per-byte (vocab-size-invariant)
# as a function of step, total_training_time, total_training_flops
# - core_metric: DCLM CORE score (target > 0.2565 to beat GPT-2)
# - train/mfu: Model FLOPS utilization
# - train/tok_per_sec: Training throughput
# Set wandb project via env var before training
import os
os.environ["WANDB_PROJECT"] = "my-nanochat-runs"
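The bits-per-byte conversion behind val_bpb can be sketched directly: divide the per-token cross-entropy (in nats) by ln 2 to get bits per token, then by the dataset's bytes-per-token ratio. `bits_per_byte` here is an illustrative helper, not nanochat's implementation:

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte.

    Normalizing by bytes rather than tokens removes the tokenizer's
    vocabulary size from the comparison.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes

# e.g. loss of 2.77 nats/token on text averaging 4 bytes/token:
print(round(bits_per_byte(2.77, 1000, 4000), 3))  # ~0.999
```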
# dev/gen_synthetic_data.py — generate identity/personality data
# Then mix into SFT stage per the guide:
# https://github.com/karpathy/nanochat/discussions/139
# Example: generate data and point SFT to it
python dev/gen_synthetic_data.py --output data/identity_sft.jsonl
# Then reference in your SFT script configuration
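A minimal sketch of producing identity rows, assuming a chat-style `{"messages": [...]}` JSONL schema; that schema is an assumption here, so check what format your SFT configuration actually expects:

```python
# Write identity/personality SFT rows as JSONL (hypothetical schema).
import json

IDENTITY_ROWS = [
    ("What is your name?", "I'm NanoBot, a small model trained with nanochat."),
    ("Who made you?", "I was trained by my operator using the nanochat harness."),
]

def write_identity_jsonl(path: str) -> int:
    with open(path, "w") as f:
        for user, assistant in IDENTITY_ROWS:
            row = {"messages": [
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]}
            f.write(json.dumps(row) + "\n")
    return len(IDENTITY_ROWS)

print(write_identity_jsonl("identity_sft.jsonl"))  # 2
```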
# 1. Make a code change in nanochat/
# 2. Run quick d12 to validate
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 --run="test_my_change" \
--core-metric-every=999999 --sample-every=-1 --save-every=-1
# 3. Check wandb: val_bpb vs step/time/flops
# 4. If promising, test at d16 or d26
# FP8 is used in the speedrun for additional speedup
# See runs/speedrun.sh for the exact invocation
bash runs/speedrun.sh
python -m nanochat.core_eval --checkpoint checkpoints/d26/latest
# On remote machine after training:
source .venv/bin/activate
python -m scripts.chat_web
# Access via: http://<PUBLIC_IP>:8000/
# Use `screen` or `tmux` to keep alive
screen -S nanochat
python -m scripts.chat_web
# Ctrl+A, D to detach
# Reduce --device_batch_size (default 32)
# Code uses gradient accumulation to maintain effective batch size
--device_batch_size=16 # Try 16, 8, 4, 2, 1
This is expected. Omit torchrun and use python -m scripts.base_train directly. Gradient accumulation kicks in automatically to maintain equivalent total batch size.
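The batch-size bookkeeping reduces to a single formula: fewer devices or a smaller device batch just means more accumulation micro-steps per optimizer update. The helper below is illustrative, with made-up numbers:

```python
# How gradient accumulation preserves the effective batch size.
def accumulation_steps(target_batch: int, device_batch: int, world_size: int) -> int:
    per_step = device_batch * world_size
    assert target_batch % per_step == 0, "target batch must divide evenly"
    return target_batch // per_step

# 8 GPUs at batch 32 per device -> update every micro-step
print(accumulation_steps(target_batch=256, device_batch=32, world_size=8))  # 1
# 1 GPU at batch 16 -> accumulate 16 micro-batches per update
print(accumulation_steps(target_batch=256, device_batch=16, world_size=1))  # 16
```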
# MPS (Apple Silicon) or CPU — use runcpu.sh as template
bash runs/runcpu.sh
# Results will be weak; this is for development/debugging only
# nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
# Note: RL scripts do NOT support float16 (SFT and base_train do)
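The GradScaler pattern looks roughly like the standard PyTorch loop below; this is a sketch, nanochat's internals differ, and `use_fp16` is set to False here so the snippet runs without a GPU:

```python
# fp16 training-loop sketch: GradScaler scales the loss to avoid fp16
# gradient underflow, then unscales before the optimizer step.
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
use_fp16 = False  # illustration only; True on fp16 CUDA training runs
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()   # scaled backward (pass-through if disabled)
scaler.step(opt)                # unscales grads; skips step on inf/nan
scaler.update()                 # adjusts the scale factor for next step
```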
# Default falls back to float32; optionally use float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
# Ensure the port (default 8000) is open in your cloud provider's firewall/security group
# Use the public IP, not localhost:
# http://<PUBLIC_IP>:8000/
Community: #nanochat channel on Karpathy's Discord; leaderboard at dev/LEADERBOARD.md

Weekly Installs: 283
GitHub Stars: 10
First Seen: 7 days ago
Security Audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Fail
Installed on: codex (282), gemini-cli (281), github-copilot (281), amp (281), cline (281), kimi-cli (281)