implementing-llms-litgpt by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill implementing-llms-litgpt
LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.
Installation:
pip install 'litgpt[extra]'
Load and use any model:
from litgpt import LLM
# Load pretrained model
llm = LLM.load("microsoft/phi-2")
# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)
List available models:
litgpt download list
Copy this checklist:
Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning
Step 1: Download pretrained model
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B
# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2
# Download Gemma 2B
litgpt download google/gemma-2b
Models are saved to the checkpoints/ directory.
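A quick way to see what is already downloaded locally, assuming the checkpoints/&lt;org&gt;/&lt;model&gt; layout described here (adjust the path if your checkpoints live elsewhere):
# List locally downloaded models stored under checkpoints/<org>/<model>.
from pathlib import Path

for model_dir in sorted(Path("checkpoints").glob("*/*")):
    print(model_dir.relative_to("checkpoints"))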
Step 2: Prepare dataset
LitGPT supports multiple formats:
Alpaca format (instruction-response):
[
{
"instruction": "What is the capital of France?",
"input": "",
"output": "The capital of France is Paris."
},
{
"instruction": "Translate to Spanish: Hello, how are you?",
"input": "",
"output": "Hola, ¿cómo estás?"
}
]
Save as data/my_dataset.json.
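If your examples start out in Python rather than JSON, a short script can write them in this format; the records below are placeholders, and only the instruction/input/output keys matter:
# Write an Alpaca-style dataset to data/my_dataset.json (placeholder records).
import json
from pathlib import Path

records = [
    {
        "instruction": "Summarize in one sentence.",
        "input": "LitGPT provides readable implementations of 20+ LLMs.",
        "output": "LitGPT is a library of clean, readable LLM implementations.",
    },
]

Path("data").mkdir(exist_ok=True)
Path("data/my_dataset.json").write_text(
    json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
)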
Step 3: Configure training
# Full fine-tuning (requires 40GB+ GPU memory for 7-8B models)
litgpt finetune \
meta-llama/Meta-Llama-3-8B \
--data JSON \
--data.json_path data/my_dataset.json \
--train.max_steps 1000 \
--train.learning_rate 2e-5 \
--train.micro_batch_size 1 \
--train.global_batch_size 16
# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--train.max_steps 1000 \
--train.learning_rate 1e-4
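To relate --train.max_steps to passes over your data: each optimizer step consumes global_batch_size examples, so steps per epoch is roughly the dataset size divided by the global batch size. A quick check with made-up numbers:
# Rough sanity check for --train.max_steps (dataset_size is an example value).
dataset_size = 5000        # number of instruction examples
global_batch_size = 16     # from --train.global_batch_size
max_steps = 1000           # from --train.max_steps

steps_per_epoch = dataset_size / global_batch_size
print(f"{steps_per_epoch:.0f} steps per epoch; "
      f"{max_steps / steps_per_epoch:.1f} epochs at max_steps={max_steps}")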
Step 4: Run fine-tuning
Training saves checkpoints to out/finetune/ automatically.
Monitor training:
# View logs
tail -f out/finetune/logs.txt
# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
LoRA is the most memory-efficient fine-tuning option.
LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)
Step 1: Choose base model
For limited GPU memory (12-16GB), start from a smaller base model such as microsoft/phi-2 or google/gemma-2b from the download step above.
Step 2: Configure LoRA parameters
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_query true \
--lora_key false \
--lora_value true \
--lora_projection true \
--lora_mlp false \
--lora_head false
What each flag controls:
--lora_r 16: LoRA rank (8-64; higher = more capacity)
--lora_alpha 32: LoRA scaling factor (typically 2×r)
--lora_dropout 0.05: dropout to prevent overfitting
--lora_query, --lora_value, --lora_projection: apply LoRA to the query, value, and output projections
--lora_key, --lora_mlp, --lora_head: usually not needed
LoRA rank guide:
r=8: Lightweight, 2-4 MB adapters
r=16: Standard, good quality
r=32: High capacity, use for complex tasks
r=64: Maximum quality, 4× larger adapters
Step 3: Train with LoRA
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--train.epochs 3 \
--train.learning_rate 1e-4 \
--train.micro_batch_size 4 \
--train.global_batch_size 32 \
--out_dir out/phi2-lora
# Memory usage: ~8-12GB for Phi-2 with LoRA
Step 4: Merge LoRA weights (optional)
Merge LoRA adapters into base model for deployment:
litgpt merge_lora \
out/phi2-lora/final \
--out_dir out/phi2-merged
Now use merged model:
from litgpt import LLM
llm = LLM.load("out/phi2-merged")
Train a new model on your domain data.
Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining
Step 1: Prepare pretraining dataset
LitGPT expects tokenized data. Use prepare_dataset.py:
python scripts/prepare_dataset.py \
--source_path data/my_corpus.txt \
--checkpoint_dir checkpoints/tokenizer \
--destination_path data/pretrain \
--split train,val
Step 2: Configure model architecture
Edit a config file or use an existing one:
# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
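You can sanity-check the "160m" in the name against these fields: the embedding and output head contribute about 2 × vocab_size × n_embd parameters (assuming an untied head, as in Pythia), and each transformer block roughly 12 × n_embd², ignoring biases and layer norms. A rough estimate:
# Approximate parameter count from the config above (ignores biases and LayerNorms;
# assumes an untied output head, as in Pythia).
vocab_size, n_layer, n_embd = 50304, 12, 768

embed_and_head = 2 * vocab_size * n_embd   # token embedding + LM head
per_block = 12 * n_embd ** 2               # ~4*d^2 attention + ~8*d^2 MLP
total = embed_and_head + n_layer * per_block
print(f"~{total / 1e6:.0f}M parameters")   # ~162M, consistent with pythia-160m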
Step 3: Set up multi-GPU training
# Single GPU
litgpt pretrain \
--config config/pythia-160m.yaml \
--data.data_dir data/pretrain \
--train.max_tokens 10_000_000_000
# Multi-GPU with FSDP
litgpt pretrain \
--config config/pythia-1b.yaml \
--data.data_dir data/pretrain \
--devices 8 \
--train.max_tokens 100_000_000_000
Step 4: Launch pretraining
For large-scale pretraining on a cluster:
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 \
pretrain_script.sh
# pretrain_script.sh content:
litgpt pretrain \
--config config/pythia-1b.yaml \
--data.data_dir /shared/data/pretrain \
--devices 8 \
--num_nodes 8 \
--train.global_batch_size 512 \
--train.max_tokens 300_000_000_000
Export LitGPT models for production.
Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API
Step 1: Test inference locally
from litgpt import LLM
llm = LLM.load("out/phi2-lora/final")
# Single generation
print(llm.generate("What is machine learning?"))
# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)
# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
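For a rough feel of throughput, the batch loop above can be timed; this sequential sketch uses placeholder prompts and an arbitrary token budget:
# Time sequential batch inference (illustrative prompts and token budget).
import time
from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")
prompts = ["Hello", "Goodbye", "Thank you"]

start = time.perf_counter()
results = [llm.generate(p, max_new_tokens=50) for p in prompts]
elapsed = time.perf_counter() - start
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({elapsed / len(prompts):.2f}s per prompt)")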
Step 2: Quantize model (optional)
Reduce model size with minimal quality loss:
# 4-bit NF4 quantization (~75% size reduction)
litgpt convert_lit_checkpoint \
out/phi2-lora/final \
--dtype bfloat16 \
--quantize bnb.nf4
# 4-bit NF4 with double quantization (further size reduction)
litgpt convert_lit_checkpoint \
out/phi2-lora/final \
--quantize bnb.nf4-dq # Double quantization
Step 3: Convert to GGUF (for llama.cpp)
python scripts/convert_lit_checkpoint.py \
--checkpoint_path out/phi2-lora/final \
--output_path models/phi2.gguf \
--model_name microsoft/phi-2
Step 4: Deploy with API
from fastapi import FastAPI
from litgpt import LLM
app = FastAPI()
llm = LLM.load("out/phi2-lora/final")
@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
result = llm.generate(
prompt,
max_new_tokens=max_tokens,
temperature=0.7
)
return {"response": result}
# Run: uvicorn api:app --host 0.0.0.0 --port 8000
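Because prompt and max_tokens are plain scalar parameters on the route above, FastAPI exposes them as query parameters, so a client passes them via params rather than a JSON body. A minimal client sketch (the URL is a placeholder):
# Minimal client for the /generate endpoint above (placeholder URL).
# Scalar parameters on a FastAPI POST route are query parameters, hence params=.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What is LitGPT?", "max_tokens": 80},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])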
Use LitGPT when:
Use alternatives instead:
Issue: Out of memory during fine-tuning
Use LoRA instead of full fine-tuning:
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora # Only needs 12-16GB
Or lower per-step memory with gradient accumulation:
litgpt finetune_lora \
... \
--train.gradient_accumulation_iters 4 # Accumulate gradients
Issue: Training too slow
Enable Flash Attention (built-in, automatic on compatible hardware):
# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed
Use smaller micro-batch and accumulate:
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32 # Effective batch=32
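These flags are linked: the effective (global) batch size equals micro_batch_size × gradient_accumulation_iters × number of devices, so on a single GPU the values above are consistent:
# Effective batch size = micro_batch_size * gradient_accumulation_iters * devices.
micro_batch_size, grad_accum_iters, devices = 1, 32, 1
print(micro_batch_size * grad_accum_iters * devices)  # 32, matching --train.global_batch_size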
Issue: Model not loading
Check model name:
# List all available models
litgpt download list
# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B
Verify checkpoints directory:
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
Issue: LoRA adapters too large
Reduce LoRA rank:
--lora_r 8 # Instead of 16 or 32
Apply LoRA to fewer layers:
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
# Disabling lora_projection and lora_mlp keeps the adapter small
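Both knobs act on the same quantity: each adapted weight matrix of shape (d_out, d_in) adds r × (d_in + d_out) LoRA parameters, so adapter size scales linearly with the rank and with the number of adapted matrices. A back-of-the-envelope estimate (the layer shapes are illustrative, not read from a real checkpoint):
# LoRA adapter size: sum over adapted matrices of r * (d_in + d_out) parameters.
# Shapes below are illustrative (a hypothetical 24-layer model, hidden size 2048).
def adapter_mb(r, shapes, bytes_per_param=2):  # 2 bytes for bf16/fp16
    params = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return params * bytes_per_param / 1e6

qv_only = [(2048, 2048)] * 2 * 24  # query + value projections in every layer
with_proj_mlp = qv_only + [(2048, 2048)] * 24 + [(8192, 2048)] * 24  # + output proj + one MLP matrix per layer

for r in (8, 16, 32):
    print(f"r={r}: q+v only ~{adapter_mb(r, qv_only):.1f} MB, "
          f"+projection/MLP ~{adapter_mb(r, with_proj_mlp):.1f} MB")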
Supported architectures: See references/supported-models.md for a complete list of 20+ model families with sizes and capabilities.
Training recipes: See references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.
FSDP configuration: See references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.
Custom architectures: See references/custom-models.md for implementing new model architectures in LitGPT style.