skypilot-multi-cloud-orchestration by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill skypilot-multi-cloud-orchestration
Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.
Use SkyPilot when:
Key features:
Use alternatives instead:
pip install "skypilot[aws,gcp,azure,kubernetes]"
# Verify cloud credentials
sky check
Create hello.yaml:
resources:
accelerators: T4:1
run: |
nvidia-smi
echo "Hello from SkyPilot!"
Launch:
sky launch -c hello hello.yaml
# SSH to cluster
ssh hello
# Terminate
sky down hello
# Task name (optional)
name: my-task
# Resource requirements
resources:
cloud: aws # Optional: auto-select if omitted
region: us-west-2 # Optional: auto-select if omitted
accelerators: A100:4 # GPU type and count
cpus: 8+ # Minimum CPUs
memory: 32+ # Minimum memory (GB)
use_spot: true # Use spot instances
disk_size: 256 # Disk size (GB)
# Number of nodes for distributed training
num_nodes: 2
# Working directory (synced to ~/sky_workdir)
workdir: .
# Setup commands (run once)
setup: |
pip install -r requirements.txt
# Run commands
run: |
python train.py
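Assembled, the fields above form one complete task file. A minimal sketch (train.py and requirements.txt are placeholder names), written out and sanity-checked locally:

```shell
# Compose the fields above into a single task.yaml (script names are hypothetical).
cat > /tmp/task.yaml <<'EOF'
name: my-task

resources:
  accelerators: A100:4
  cpus: 8+
  memory: 32+
  use_spot: true

num_nodes: 2
workdir: .

setup: |
  pip install -r requirements.txt

run: |
  python train.py
EOF
echo "task.yaml ready"
```

This is the file you would pass to `sky launch -c mycluster /tmp/task.yaml`.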
| Command | Purpose |
|---|---|
| sky launch | Launch cluster and run task |
| sky exec | Run task on existing cluster |
| sky status | Show cluster status |
| sky stop | Stop cluster (preserve state) |
| sky down | Terminate cluster |
| sky logs | View task logs |
| sky queue | Show job queue |
| sky jobs launch | Launch managed job |
| sky serve up | Deploy serving endpoint |
# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8
# Cloud-specific
accelerators: V100:4 # AWS/GCP
accelerators: TPU-v4-8 # GCP TPUs
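The accelerators value is a TYPE:COUNT pair. When scripting around task files, it can be split with standard shell parameter expansion:

```shell
# Split an accelerators spec into GPU type and count.
spec="A100-80GB:8"
gpu_type=${spec%%:*}   # everything before the colon
gpu_count=${spec##*:}  # everything after the colon
echo "type=$gpu_type count=$gpu_count"
```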
resources:
accelerators:
H100: 8
A100-80GB: 8
A100: 8
any_of:
- cloud: gcp
- cloud: aws
- cloud: azure
resources:
accelerators: A100:8
use_spot: true
spot_recovery: FAILOVER # Auto-recover on preemption
# Launch new cluster
sky launch -c mycluster task.yaml
# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml
# Interactive SSH
ssh mycluster
# Stream logs
sky logs mycluster
resources:
accelerators: A100:4
autostop:
idle_minutes: 30
down: true # Terminate instead of stop
# Set autostop via CLI
sky autostop mycluster -i 30 --down
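A rough illustration of why autostop matters: assuming an on-demand rate of $32.77/hr for an 8x A100 VM (a made-up figure for this sketch), a cluster forgotten overnight adds up quickly:

```shell
# Back-of-envelope cost of an idle cluster left running overnight.
# Price is illustrative, not a quote; kept in cents for integer math.
price_cents_per_hr=3277
idle_hours=12
cost=$(( price_cents_per_hr * idle_hours / 100 ))
echo "left running overnight: ~\$$cost"
```

With `-i 30 --down`, the same cluster terminates after 30 idle minutes instead.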
# All clusters
sky status
# Detailed view
sky status -a
resources:
accelerators: A100:8
num_nodes: 4 # 4 nodes × 8 GPUs = 32 GPUs total
setup: |
pip install torch torchvision
run: |
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
--master_port=12355 \
train.py
| Variable | Description |
|---|---|
| SKYPILOT_NODE_RANK | Node index (0 to num_nodes-1) |
| SKYPILOT_NODE_IPS | Newline-separated IP addresses |
| SKYPILOT_NUM_NODES | Total number of nodes |
| SKYPILOT_NUM_GPUS_PER_NODE | GPUs per node |
run: |
if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
python orchestrate.py
fi
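The torchrun flags above can be derived locally to see how the variables fit together. The IP list, rank, and GPU count below are made-up stand-ins for what SkyPilot injects at runtime:

```shell
# Stand-in values, for illustration only; SkyPilot sets these on each node.
SKYPILOT_NODE_IPS=$'10.0.0.4\n10.0.0.5\n10.0.0.6\n10.0.0.7'
SKYPILOT_NODE_RANK=0
SKYPILOT_NUM_GPUS_PER_NODE=8

# First IP in the list is the head node; node count falls out of the list.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
NUM_NODES=$(echo "$SKYPILOT_NODE_IPS" | wc -l)
WORLD_SIZE=$(( NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE ))

echo "master=$MASTER_ADDR nodes=$NUM_NODES world_size=$WORLD_SIZE"
if [ "$SKYPILOT_NODE_RANK" -eq 0 ]; then
  echo "rank 0 acts as coordinator"
fi
```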
# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml
name: training-job
file_mounts:
/checkpoints:
name: my-checkpoints
store: s3
mode: MOUNT
resources:
accelerators: A100:8
use_spot: true
run: |
python train.py \
--checkpoint-dir /checkpoints \
--resume-from-latest
# List jobs
sky jobs queue
# View logs
sky jobs logs my-job
# Cancel job
sky jobs cancel my-job
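Note that --resume-from-latest belongs to the user's train.py, not to SkyPilot; the job spec only guarantees that /checkpoints survives preemption. One common implementation is "newest file wins", sketched here with throwaway checkpoint files:

```shell
# Sketch of a resume-from-latest policy: pick the most recently
# written checkpoint, if any. File names are hypothetical.
ckpt_dir=$(mktemp -d)
touch "$ckpt_dir/ckpt-100.pt"
sleep 1   # ensure distinct mtimes
touch "$ckpt_dir/ckpt-200.pt"

# ls -t sorts by modification time, newest first.
latest=$(ls -t "$ckpt_dir"/ckpt-*.pt 2>/dev/null | head -n1)
if [ -n "$latest" ]; then
  echo "resume from $(basename "$latest")"
else
  echo "no checkpoint, start fresh"
fi
```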
workdir: ./my-project # Synced to ~/sky_workdir
file_mounts:
/data/config.yaml: ./config.yaml
~/.vimrc: ~/.vimrc
file_mounts:
# Mount S3 bucket
/datasets:
source: s3://my-bucket/datasets
mode: MOUNT # Stream from S3
# Copy GCS bucket
/models:
source: gs://my-bucket/models
mode: COPY # Pre-fetch to disk
# Cached mount (fast writes)
/outputs:
name: my-outputs
store: s3
mode: MOUNT_CACHED
| Mode | Description | Best For |
|---|---|---|
| MOUNT | Stream from cloud | Large datasets, read-heavy |
| COPY | Pre-fetch to disk | Small files, random access |
| MOUNT_CACHED | Cache with async upload | Checkpoints, outputs |
# service.yaml
service:
readiness_probe: /health
replica_policy:
min_replicas: 1
max_replicas: 10
target_qps_per_replica: 2.0
resources:
accelerators: A100:1
run: |
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--port 8000
# Deploy
sky serve up -n my-service service.yaml
# Check status
sky serve status
# Get endpoint
sky serve status my-service
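The readiness_probe in service.yaml means each replica must answer GET /health with HTTP 200 before it receives traffic. A local stand-in server (Python stdlib, run from shell) illustrates that contract; vLLM is not involved here:

```shell
# Minimal stand-in for a replica's /health endpoint, to show what the
# readiness probe checks. Not SkyPilot code; illustration only.
probe=$(python3 - <<'EOF'
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading, urllib.request

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /health counts as ready; anything else is a 404.
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()
    def log_message(self, *args):
        pass  # silence per-request logging

srv = HTTPServer(("127.0.0.1", 0), Health)  # ephemeral port
threading.Thread(target=srv.serve_forever, daemon=True).start()
port = srv.server_address[1]
print(urllib.request.urlopen(f"http://127.0.0.1:{port}/health").status)
srv.shutdown()
EOF
)
echo "probe status: $probe"
```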
service:
replica_policy:
min_replicas: 1
max_replicas: 10
target_qps_per_replica: 2.0
upscale_delay_seconds: 60
downscale_delay_seconds: 300
load_balancing_policy: round_robin
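target_qps_per_replica implies a proportional scaling rule. The exact autoscaler logic is internal to SkyPilot, but the intended relationship is roughly ceil(observed_qps / target), clamped to the min/max bounds:

```shell
# Assumed scaling rule (sketch, not SkyPilot's actual implementation):
# replicas ≈ ceil(qps / target_qps_per_replica), clamped to [min, max].
qps=9; target=2; min_r=1; max_r=10
want=$(( (qps + target - 1) / target ))   # integer ceiling division
[ "$want" -lt "$min_r" ] && want=$min_r
[ "$want" -gt "$max_r" ] && want=$max_r
echo "replicas=$want"
```

So with the policy above, a sustained 9 QPS would hold the service at about 5 replicas.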
# SkyPilot finds cheapest option
resources:
accelerators: A100:8
# No cloud specified - auto-select cheapest
# Show optimizer decision
sky launch task.yaml --dryrun
resources:
accelerators: A100:8
any_of:
- cloud: gcp
region: us-central1
- cloud: aws
region: us-east-1
- cloud: azure
envs:
HF_TOKEN: $HF_TOKEN # Inherited from local env
WANDB_API_KEY: $WANDB_API_KEY
# Or use secrets
secrets:
- HF_TOKEN
- WANDB_API_KEY
name: llm-finetune
file_mounts:
/checkpoints:
name: finetune-checkpoints
store: s3
mode: MOUNT_CACHED
resources:
accelerators: A100:8
use_spot: true
setup: |
pip install transformers accelerate
run: |
python train.py \
--checkpoint-dir /checkpoints \
--resume
name: hp-sweep-${RUN_ID}
envs:
RUN_ID: 0
LEARNING_RATE: 1e-4
BATCH_SIZE: 32
resources:
accelerators: A100:1
use_spot: true
run: |
python train.py \
--lr $LEARNING_RATE \
--batch-size $BATCH_SIZE \
--run-id $RUN_ID
# Launch multiple jobs
for i in {1..10}; do
sky jobs launch sweep.yaml \
--env RUN_ID=$i \
--env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
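The inline sampler in the loop draws learning rates log-uniformly from [1e-5, 1e-3]. Seeded and run once in isolation so the draw is reproducible:

```shell
# Same log-uniform sampler as in the sweep loop, with a fixed seed
# so the drawn value can be inspected.
lr=$(python3 -c "import random; random.seed(0); print(10**random.uniform(-5, -3))")
echo "sampled lr=$lr"
```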
# SSH to cluster
ssh mycluster
# View logs
sky logs mycluster
# Check job queue
sky queue mycluster
# View managed job logs
sky jobs logs my-job
| Issue | Solution |
|---|---|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use sky jobs launch for auto-recovery |
| Slow file sync | Use MOUNT_CACHED mode for outputs |
| GPU not available | Use any_of for fallback clouds |
Weekly Installs: 169
GitHub Stars: 23.4K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: claude-code (142), opencode (140), gemini-cli (131), cursor (129), codex (119), antigravity (114)