nanogpt by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill nanogpt
nanoGPT is a simplified GPT implementation designed for learning and experimentation.
Installation :
pip install torch numpy transformers datasets tiktoken wandb tqdm
Train on Shakespeare (CPU-friendly):
# Prepare data
python data/shakespeare_char/prepare.py
# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py
# Generate text
python sample.py --out_dir=out-shakespeare-char
Output :
ROMEO:
What say'st thou? Shall I speak, and be a man?
JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
Complete training pipeline :
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py
# Step 2: Train small model
python train.py config/train_shakespeare_char.py
# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
Config (config/train_shakespeare_char.py):
# Model config
n_layer = 6 # 6 transformer layers
n_head = 6 # 6 attention heads
n_embd = 384 # 384-dim embeddings
block_size = 256 # 256 char context
# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500
# Hardware
device = 'cpu' # Or 'cuda'
compile = False # Set True on PyTorch 2.0+
Training time : ~5 minutes (CPU), ~1 minute (GPU)
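As a sanity check on model size, the config above implies a model of roughly 10.7M parameters (nanoGPT's own log reports ~10.65M, counting slightly differently). A back-of-envelope estimate, ignoring layer norms and bias terms:

```python
# Rough parameter count for the char-level config above
# (estimate only; ignores layer-norm and bias parameters).
n_layer, n_embd = 6, 384
block_size, vocab_size = 256, 65  # the Shakespeare char vocab has 65 symbols

per_layer = 12 * n_embd ** 2                    # 4*d^2 attention + 8*d^2 MLP
embeddings = (vocab_size + block_size) * n_embd  # token + position embeddings
total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")         # ~10.7M
```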
Multi-GPU training on OpenWebText :
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py
# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
train.py config/train_gpt2.py
# Step 3: Sample from trained model
python sample.py --out_dir=out
Config (config/train_gpt2.py):
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0
# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8 # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000
# System
compile = True # PyTorch 2.0
Training time : ~4 days (8× A100)
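The gradient-accumulation comment can be verified with quick arithmetic: each optimizer step consumes batch_size × block_size × gradient_accumulation_steps tokens.

```python
# Effective batch size in tokens per optimizer step, assuming
# gradient_accumulation_steps = 5 * 8 already spans all 8 GPUs
# (5 micro-steps per GPU).
batch_size, block_size = 12, 1024
grad_accum = 5 * 8
tokens_per_step = batch_size * block_size * grad_accum
print(tokens_per_step)  # 491520, i.e. ~0.5M tokens
```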
Start from OpenAI checkpoint :
# In train.py or config
init_from = 'gpt2' # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
# Model loads OpenAI weights automatically
python train.py config/finetune_shakespeare.py
Example config (config/finetune_shakespeare.py):
# Start from GPT-2
init_from = 'gpt2'
# Dataset
dataset = 'shakespeare'  # BPE-tokenized; GPT-2 checkpoints use the GPT-2 BPE vocab, so the char-level dataset won't load
batch_size = 1
block_size = 1024
# Fine-tuning
learning_rate = 3e-5 # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100
# Regularization
weight_decay = 1e-1
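nanoGPT schedules the learning rate as linear warmup followed by cosine decay down to a floor. A standalone sketch using the fine-tuning values above (the `min_lr` floor here is an assumed value, not taken from the config):

```python
import math

# Sketch of a nanoGPT-style LR schedule: linear warmup, then cosine decay.
learning_rate, min_lr = 3e-5, 3e-6   # min_lr is an assumption for illustration
warmup_iters, lr_decay_iters = 100, 2000

def get_lr(it):
    if it < warmup_iters:                 # 1) linear warmup
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:               # 2) constant floor after decay ends
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 3) cosine from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)
```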
Train on your own text :
# data/custom/prepare.py
import pickle
import numpy as np
# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()
# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
# Split train/val (90/10)
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]
# Save binary token streams
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
# Save the vocab so train.py picks up vocab_size and sample.py can decode
meta = {'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
Train :
python data/custom/prepare.py
python train.py --dataset=custom
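Before training on your own data, it can help to confirm the character encoding round-trips losslessly. A standalone sketch mirroring the mappings built in prepare.py (not part of nanoGPT itself):

```python
import numpy as np

# Sanity-check the char-level encode/decode round trip.
text = "To be, or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
decoded = ''.join(itos[int(i)] for i in data)
assert decoded == text  # lossless round trip
```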
Use nanoGPT when : you want a minimal, hackable GPT codebase for learning, research experiments, or training small models from scratch.
Simplicity advantages : the model definition lives in a single model.py and the entire training loop in a single train.py, so the whole codebase can be read in one sitting.
Use alternatives instead : when you need production-scale features (tokenizer tooling, inference serving, very large multi-node training) beyond nanoGPT's scope.
Issue: CUDA out of memory
Reduce batch size or context length:
batch_size = 1 # Reduce from 12
block_size = 512 # Reduce from 1024
gradient_accumulation_steps = 40 # Increase to maintain effective batch
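Note that with these reduced values the effective token batch still shrinks; quick arithmetic shows by how much, in case you want to raise gradient accumulation further to compensate fully:

```python
# Effective batch in tokens = batch_size * block_size * grad_accum.
before = 12 * 1024 * 40   # original config: 491,520 tokens per step
after = 1 * 512 * 40      # reduced config: 20,480 tokens per step
scale = before // after   # 24: grad_accum would need ~24x more to fully match
print(before, after, scale)
```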
Issue: Training too slow
Enable compilation (PyTorch 2.0+):
compile = True # 2× speedup
Use mixed precision:
dtype = 'bfloat16' # Or 'float16'
Issue: Poor generation quality
Train longer:
max_iters = 10000 # Increase from 5000
Lower temperature:
# In sample.py
temperature = 0.7 # Lower from 1.0
top_k = 200 # Add top-k sampling
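Temperature scaling and top-k filtering are easy to illustrate in isolation. This is a standalone sketch of the idea, not nanoGPT's actual sampling code:

```python
import numpy as np

# Minimal sketch of temperature + top-k sampling over a logit vector.
def sample(logits, temperature=0.7, top_k=200, rng=np.random.default_rng(0)):
    logits = np.asarray(logits, dtype=np.float64) / temperature  # sharpen
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]                 # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)  # mask the rest
    probs = np.exp(logits - logits.max())                # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2)
# With top_k=2, only the two highest-logit tokens (indices 0 and 1) can win.
```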
Issue: Can't load GPT-2 weights
Install transformers:
pip install transformers
Check model name:
init_from = 'gpt2' # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
Model architecture : See references/architecture.md for GPT block structure, multi-head attention, and MLP layers explained simply.
Training loop : See references/training.md for learning rate schedule, gradient accumulation, and distributed data parallel setup.
Data preparation : See references/data.md for tokenization strategies (character-level vs BPE) and binary format details.
Shakespeare (char-level) :
GPT-2 (124M) :
GPT-2 Medium (350M) :
Performance :
compile=True: 2× speedup
dtype=bfloat16: 50% memory reduction