lambda-labs-gpu-cloud by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill lambda-labs-gpu-cloud
Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.
Use Lambda Labs when:
Key features:
Use alternatives instead:
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>
# Or with specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
| GPU | VRAM | Price/GPU/hr | Best For |
|---|---|---|---|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |
8x GPU: Best for distributed training (DDP, FSDP)
4x GPU: Large models, multi-GPU training
2x GPU: Medium workloads
1x GPU: Fine-tuning, inference, development
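Since billing is per GPU per hour, a run's cost is just price × GPU count × hours. A quick sketch using the on-demand prices from the table above (prices change; treat these as illustrative):

```python
# Rough on-demand cost estimate from the per-GPU-hour prices above.
PRICE_PER_GPU_HR = {
    "H100_SXM": 3.29,   # upper bound of the listed $2.99-3.29 range
    "A100_40GB": 1.29,
    "A10": 0.75,
}

def run_cost(gpu: str, num_gpus: int, hours: float) -> float:
    """Total on-demand cost for num_gpus GPUs running for `hours` hours."""
    return PRICE_PER_GPU_HR[gpu] * num_gpus * hours

# e.g. a 24-hour run on an 8x H100 SXM instance
print(f"${run_cost('H100_SXM', 8, 24):.2f}")  # → $631.68
```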
All instances come with Lambda Stack pre-installed:
# Included software
- Ubuntu 22.04 LTS
- NVIDIA drivers (latest)
- CUDA 12.x
- cuDNN 8.x
- NCCL (for multi-GPU)
- PyTorch (latest)
- TensorFlow (latest)
- JAX
- JupyterLab
# Check GPU
nvidia-smi
# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"
# Check CUDA version
nvcc --version
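For scripted checks, `nvidia-smi` can emit machine-readable CSV via its `--query-gpu`/`--format` options. A small sketch (the dict layout is my own choice, not a Lambda convention):

```python
import subprocess

def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'index, name, memory.total' CSV lines from nvidia-smi."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, mem = [f.strip() for f in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": mem})
    return gpus

def gpu_inventory() -> list[dict]:
    """List the instance's GPUs via nvidia-smi's CSV output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```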
pip install lambda-cloud-client
import os
import lambda_cloud_client
# Configure with API key
configuration = lambda_cloud_client.Configuration(
host="https://cloud.lambdalabs.com/api/v1",
access_token=os.environ["LAMBDA_API_KEY"]
)
with lambda_cloud_client.ApiClient(configuration) as api_client:
api = lambda_cloud_client.DefaultApi(api_client)
# Get available instance types
types = api.instance_types()
for name, info in types.data.items():
print(f"{name}: {info.instance_type.description}")
from lambda_cloud_client.models import LaunchInstanceRequest
request = LaunchInstanceRequest(
region_name="us-west-1",
instance_type_name="gpu_1x_h100_sxm5",
ssh_key_names=["my-ssh-key"],
file_system_names=["my-filesystem"], # Optional
name="training-job"
)
response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
instances = api.list_instances()
for instance in instances.data:
print(f"{instance.name}: {instance.ip} ({instance.status})")
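Newly launched instances take a few minutes to become reachable, so scripts usually poll until the status reported by the API flips to active. A generic polling helper (the `active` status value and the instance `id` field are assumptions about the API, not confirmed here):

```python
import time

def wait_for_status(get_status, target="active", timeout=900, interval=15):
    """Poll get_status() until it returns `target` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        time.sleep(interval)
    raise TimeoutError(f"instance not {target!r} after {timeout}s")

# Possible usage with the SDK, assuming instance objects expose an `id`:
# wait_for_status(
#     lambda: next(i for i in api.list_instances().data
#                  if i.id == instance_id).status)
```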
from lambda_cloud_client.models import TerminateInstanceRequest
request = TerminateInstanceRequest(
instance_ids=[instance_id]
)
api.terminate_instance(request)
from lambda_cloud_client.models import AddSshKeyRequest
# Add SSH key
request = AddSshKeyRequest(
name="my-key",
public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)
# List keys
keys = api.list_ssh_keys()
# Delete a key by ID (IDs come from the list_ssh_keys() response)
api.delete_ssh_key(key_id)
curl -u $LAMBDA_API_KEY: \
https://cloud.lambdalabs.com/api/v1/instance-types | jq
curl -u $LAMBDA_API_KEY: \
-X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
-H "Content-Type: application/json" \
-d '{
"region_name": "us-west-1",
"instance_type_name": "gpu_1x_h100_sxm5",
"ssh_key_names": ["my-key"]
}' | jq
curl -u $LAMBDA_API_KEY: \
-X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
-H "Content-Type: application/json" \
-d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
Filesystems persist data across instance restarts:
# Mount location
/lambda/nfs/<FILESYSTEM_NAME>
# Example: save checkpoints
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
Filesystems must be attached at instance launch time via the `file_system_names` field of the launch request.
# Store on filesystem (persists)
/lambda/nfs/storage/
├── datasets/
├── checkpoints/
├── models/
└── outputs/
# Local SSD (faster, ephemeral)
/home/ubuntu/
└── working/ # Temporary files
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key
# Add public key to Lambda console
# Or via API
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
# On instance
ssh-import-id gh:username
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>
# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
# On instance
jupyter lab --ip=0.0.0.0 --port=8888
# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Open http://localhost:8888
# SSH to instance
ssh ubuntu@<IP>
# Clone repo
git clone https://github.com/user/project
cd project
# Install dependencies
pip install -r requirements.txt
# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
# train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets the env vars that init_process_group reads
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)
    model = MyModel().to(device)  # MyModel: your nn.Module subclass
    model = DDP(model, device_ids=[device])
    # Training loop...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
# Launch with torchrun (8 GPUs)
torchrun --nproc_per_node=8 train_ddp.py
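With DDP, each rank must see a disjoint shard of the dataset; in PyTorch you get this by passing a `DistributedSampler` to the `DataLoader` and calling `sampler.set_epoch(epoch)` each epoch. The core idea is a round-robin split of indices, sketched here in plain Python (real `DistributedSampler` additionally pads shards to equal length):

```python
def shard_indices(n: int, rank: int, world_size: int) -> list[int]:
    """Round-robin split of dataset indices across ranks:
    rank r gets indices r, r + world_size, r + 2*world_size, ..."""
    return list(range(rank, n, world_size))

# 10 samples across 4 ranks: every index lands on exactly one rank
shards = [shard_indices(10, r, 4) for r in range(4)]
# shards[0] == [0, 4, 8], shards[1] == [1, 5, 9]
```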
import os
checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)
# Save checkpoint
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
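Because filesystems persist across instance restarts, a relaunched job can resume from the newest checkpoint. One way to locate it, matching the `checkpoint_<epoch>.pt` naming above (the resume snippet in the comment assumes the checkpoint dict saved earlier):

```python
import os
import re

def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-epoch checkpoint_<epoch>.pt, or None."""
    best_epoch, best_path = -1, None
    for fname in os.listdir(checkpoint_dir):
        m = re.fullmatch(r"checkpoint_(\d+)\.pt", fname)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best_path = os.path.join(checkpoint_dir, fname)
    return best_path

# Resuming:
# path = latest_checkpoint("/lambda/nfs/my-storage/checkpoints")
# if path:
#     ckpt = torch.load(path)
#     model.load_state_dict(ckpt["model_state_dict"])
#     optimizer.load_state_dict(ckpt["optimizer_state_dict"])
#     start_epoch = ckpt["epoch"] + 1
```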
1-Click Clusters provide high-performance Slurm clusters for multi-node training:
# On Slurm cluster
srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 \
torchrun --nnodes=4 --nproc_per_node=8 \
--rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
train.py
# Find private IP
ip addr show | grep 'inet '
# 1. Launch 8x H100 instance with filesystem
# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft
# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"
# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
--model_path /lambda/nfs/storage/models/llama-2-7b \
--output_dir /lambda/nfs/storage/outputs \
--checkpoint_dir /lambda/nfs/storage/checkpoints
# 1. Launch A10 instance (cost-effective for inference)
# 2. Run inference
python inference.py \
--model /lambda/nfs/storage/models/fine-tuned \
--input /lambda/nfs/storage/data/inputs.jsonl \
--output /lambda/nfs/storage/data/outputs.jsonl
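The `inference.py` above is not part of Lambda's tooling; a minimal JSONL batch-inference skeleton might look like this, with the model call abstracted behind a `generate` callable (e.g. a transformers pipeline) so the I/O loop stays model-agnostic:

```python
import json

def run_batch(in_path, out_path, generate):
    """Read one JSON record per line from in_path, attach the model's
    completion under "output", and write the records to out_path.
    `generate` is any callable mapping a prompt string to output text."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            record["output"] = generate(record["input"])
            fout.write(json.dumps(record) + "\n")
```

Plugging in a real model is then one line, e.g. a `transformers` text-generation pipeline loaded from the fine-tuned checkpoint on the filesystem.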
| Task | Recommended GPU |
|---|---|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |
| Issue | Solution |
|---|---|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |
Weekly installs: 147 · GitHub stars: 22.6K · First seen: Jan 21, 2026
Repository: davila7/claude-code-templates
Security audits: Gen Agent Trust Hub — Pass · Socket — Pass · Snyk — Warn
Installed on: claude-code (116), opencode (114), gemini-cli (111), cursor (108), codex (98), antigravity (95)