nanogpt by davila7/claude-code-templates

npx skills add https://github.com/davila7/claude-code-templates --skill nanogpt
nanoGPT is a simplified GPT implementation designed for learning and experimentation.
Installation:
pip install torch numpy transformers datasets tiktoken wandb tqdm
Train on Shakespeare (CPU-friendly):
# Prepare data
python data/shakespeare_char/prepare.py
# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py
# Generate text
python sample.py --out_dir=out-shakespeare-char
Output:
ROMEO:
What say'st thou? Shall I speak, and be a man?
JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
Complete training pipeline:
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py
# Step 2: Train small model
python train.py config/train_shakespeare_char.py
# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
Config (config/train_shakespeare_char.py):
# Model config
n_layer = 6 # 6 transformer layers
n_head = 6 # 6 attention heads
n_embd = 384 # 384-dim embeddings
block_size = 256 # 256 char context
# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500
# Hardware
device = 'cpu' # Or 'cuda'
compile = False # Set True for PyTorch 2.0
Training time: ~5 minutes (CPU), ~1 minute (GPU)
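The config files above are plain Python: train.py defines defaults at the top of the file and then lets a config file or --key=value command-line flags override them (via nanoGPT's configurator.py). A minimal sketch of that override pattern, with illustrative default values:

```python
# Defaults, as they would appear at the top of train.py (values here are illustrative).
config = {"device": "cpu", "batch_size": 64, "learning_rate": 1e-3, "compile": False}

def apply_overrides(config, argv):
    """Apply --key=value overrides, keeping each value's original type."""
    for arg in argv:
        if not arg.startswith("--"):
            continue  # in nanoGPT, a bare argument is treated as a config file path
        key, _, raw = arg[2:].partition("=")
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        old = config[key]
        if isinstance(old, bool):                # check bool first: bool is a subclass of int
            config[key] = raw.lower() in ("1", "true")
        else:
            config[key] = type(old)(raw)         # cast to the default's type
    return config

apply_overrides(config, ["--device=cuda", "--batch_size=12", "--compile=True"])
print(config["device"], config["batch_size"], config["compile"])  # cuda 12 True
```

This is why `python train.py config/train_shakespeare_char.py --device=cuda` works: the config file sets the experiment, and the flags tweak individual keys on top of it.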
Multi-GPU training on OpenWebText:
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py
# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
train.py config/train_gpt2.py
# Step 3: Sample from trained model
python sample.py --out_dir=out
Config (config/train_gpt2.py):
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0
# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8 # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000
# System
compile = True # PyTorch 2.0
Training time: ~4 days (8× A100)
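The "~0.5M tokens" comment in the config above follows from batch_size × block_size × gradient_accumulation_steps. A quick check of the arithmetic:

```python
batch_size = 12        # sequences per GPU per micro-step
block_size = 1024      # tokens per sequence
grad_accum = 5 * 8     # micro-steps accumulated, across 8 GPUs

tokens_per_step = batch_size * block_size * grad_accum
print(tokens_per_step)  # 491520, i.e. ~0.5M tokens per optimizer step
```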
Start from OpenAI checkpoint:
# In train.py or config
init_from = 'gpt2' # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
# Model loads OpenAI weights automatically
python train.py config/finetune_shakespeare.py
Example config (config/finetune_shakespeare.py):
# Start from GPT-2
init_from = 'gpt2'
# Dataset
dataset = 'shakespeare'  # BPE-tokenized; char-level ids don't match the GPT-2 vocab
batch_size = 1
block_size = 1024
# Fine-tuning
learning_rate = 3e-5 # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100
# Regularization
weight_decay = 1e-1
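The warmup_iters setting above feeds nanoGPT's learning-rate schedule: linear warmup, then cosine decay down to a minimum LR (the floor is conventionally learning_rate/10). A sketch of that schedule under those assumptions, using this config's numbers as defaults:

```python
import math

def get_lr(it, learning_rate=3e-5, warmup_iters=100, lr_decay_iters=2000, min_lr=3e-6):
    # 1) Linear warmup up to warmup_iters.
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) Past the decay horizon, hold the floor.
    if it > lr_decay_iters:
        return min_lr
    # 3) Cosine decay from learning_rate down to min_lr in between.
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(100), get_lr(2000))
```

The fine-tuning rule of thumb is visible in the defaults: a short warmup and a peak LR (3e-5) far below the from-scratch value (6e-4).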
Train on your own text:
# data/custom/prepare.py
import numpy as np
# Load your data
with open('my_data.txt', 'r') as f:
text = f.read()
# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
# Split train/val
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]
# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
Train:
python data/custom/prepare.py
python train.py --dataset=custom
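At train time, nanoGPT reads these .bin files back with np.memmap and samples random (x, y) windows, where y is x shifted by one token. A NumPy-only sketch of that batch loader (nanoGPT returns torch tensors; the temp-file setup here is just for the demo):

```python
import os
import tempfile
import numpy as np

def get_batch(path, batch_size=4, block_size=8, seed=0):
    """Sample input/target windows from a uint16 token file, nanoGPT-style."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    rng = np.random.default_rng(seed)
    ix = rng.integers(0, len(data) - block_size, size=batch_size)  # random offsets
    x = np.stack([data[i : i + block_size] for i in ix])           # inputs
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix])   # targets, shifted by one
    return x.astype(np.int64), y.astype(np.int64)

# Write a tiny fake token file, then sample from it.
path = os.path.join(tempfile.mkdtemp(), "train.bin")
np.arange(1000, dtype=np.uint16).tofile(path)
x, y = get_batch(path)
assert (y == x + 1).all()  # each target is the next token after its input
```

The memmap keeps even multi-GB token files out of RAM, which is why prepare.py writes raw uint16 binaries rather than, say, pickled Python lists.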
Use nanoGPT when: you want a minimal, hackable codebase for training, fine-tuning, or studying GPT-style models.
Simplicity advantages: the whole model fits in a single model.py and the whole training loop in a single train.py, so both are easy to read and modify.
Use alternatives instead: when you need production features such as large-scale model parallelism, tokenizer tooling, or serving infrastructure.
Issue: CUDA out of memory
Reduce batch size or context length:
batch_size = 1 # Reduce from 12
block_size = 512 # Reduce from 1024
gradient_accumulation_steps = 40 # Increase to maintain effective batch
Issue: Training too slow
Enable compilation (PyTorch 2.0+):
compile = True # 2× speedup
Use mixed precision:
dtype = 'bfloat16' # Or 'float16'
Issue: Poor generation quality
Train longer:
max_iters = 10000 # Increase from 5000
Lower temperature:
# In sample.py
temperature = 0.7 # Lower from 1.0
top_k = 200 # Add top-k sampling
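Temperature divides the logits before the softmax (lower = sharper, more conservative), and top-k masks everything outside the k most likely tokens. A NumPy sketch of that sampling step (sample.py does this with torch tensors; the tiny logit vector here is made up):

```python
import numpy as np

def sample_next(logits, temperature=0.7, top_k=2, seed=0):
    """Sample one token id from logits with temperature and top-k filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature   # sharpen/flatten
    if top_k is not None and top_k < len(logits):
        kth = np.sort(logits)[-top_k]                      # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)   # mask everything below it
    probs = np.exp(logits - logits.max())                  # stable softmax
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(len(probs), p=probs)

# Four-token toy vocabulary: top_k=2 keeps only ids 2 and 3.
token_id = sample_next([0.1, 0.2, 1.5, 2.0])
assert token_id in (2, 3)
```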
Issue: Can't load GPT-2 weights
Install transformers:
pip install transformers
Check model name:
init_from = 'gpt2' # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
Model architecture: See references/architecture.md for GPT block structure, multi-head attention, and MLP layers explained simply.
Training loop: See references/training.md for learning rate schedule, gradient accumulation, and distributed data parallel setup.
Data preparation: See references/data.md for tokenization strategies (character-level vs BPE) and binary format details.
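The two tokenization strategies trade vocabulary size for sequence length: character-level (used by shakespeare_char) maps each distinct character to an id, giving a tiny vocab but long sequences, while BPE (used by the OpenWebText/GPT-2 configs, via tiktoken's gpt2 encoding) uses a 50,257-token vocab and much shorter sequences. A self-contained sketch of the char-level side:

```python
# Char-level encode/decode round trip, mirroring data/shakespeare_char/prepare.py.
text = "to be or not to be"
chars = sorted(set(text))                         # the tiny character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}      # char -> id
itos = {i: ch for i, ch in enumerate(chars)}      # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text                        # lossless round trip
print(len(chars), len(ids))  # 7 18: vocab size vs sequence length
```

With BPE the same string would tokenize to far fewer ids against a ~50K vocab, which is why the GPT-2 configs can afford block_size = 1024.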
Shakespeare (char-level):
GPT-2 (124M):
GPT-2 Medium (350M):
Performance:
compile=True: 2× speedup
dtype='bfloat16': 50% memory reduction

Weekly Installs: 155
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: opencode (124), claude-code (123), gemini-cli (118), cursor (110), codex (104), antigravity (102)