runpod-deployment by scientiacapital/skills

npx skills add https://github.com/scientiacapital/skills --skill runpod-deployment
Key deliverables:
<quick_start> Minimal Serverless Handler (v1.8.1):
import runpod

def handler(job):
    """Basic handler - receives job, returns result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    # Your inference logic here
    result = process(prompt)
    return {"output": result}

runpod.serverless.start({"handler": handler})
Streaming Handler:
import runpod

def streaming_handler(job):
    """Generator for streaming responses."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

runpod.serverless.start({
    "handler": streaming_handler,
    "return_aggregate_stream": True
})
vLLM OpenAI-Compatible Client:
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
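The minimal handler can be smoke-tested without RunPod by calling it directly with a fake job payload. A sketch, where `process` is a stand-in for your own inference function:

```python
# Local smoke test for the minimal handler above; runpod itself is not
# needed because we call the handler directly.
def process(prompt):
    return prompt.upper()  # placeholder "inference"

def handler(job):
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    result = process(prompt)
    return {"output": result}

print(handler({"input": {"prompt": "hello"}}))  # → {'output': 'HELLO'}
```

The same dict-in, dict-out contract is what RunPod invokes in production, so a passing local call is a quick sanity check before building the image.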
</quick_start>
<success_criteria> A RunPod deployment is successful when:
<m1_mac_critical>
Images built on ARM (Apple Silicon) will not run on RunPod's x86 GPU hosts.
Solution: let GitHub Actions build for you:
# Push code - Actions builds x86 image
git add . && git commit -m "Deploy" && git push
See reference/cicd.md for the complete GitHub Actions workflow.
Never run docker build locally for RunPod on Apple Silicon. </m1_mac_critical>
<gpu_selection>
| GPU | VRAM | Secure $/hr | Spot $/hr | Best For |
|---|---|---|---|---|
| RTX A4000 | 16GB | $0.36 | $0.18 | Embeddings, small models |
| RTX 4090 | 24GB | $0.44 | $0.22 | 7B-8B inference |
| A40 | 48GB | $0.65 | $0.39 | 13B-30B, fine-tuning |
| A100 80GB | 80GB | $1.89 | $0.89 | 70B models, production |
| H100 80GB | 80GB | $4.69 | $1.88 | 70B+ training |
Quick Selection:
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"   # $0.36/hr
    if effective <= 8: return "RTX_4090"    # $0.44/hr
    if effective <= 30: return "A40"        # $0.65/hr
    if effective <= 70: return "A100_80GB"  # $1.89/hr
    return "H100_80GB"                      # $4.69/hr
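For example (the selector is repeated here so the snippet runs standalone):

```python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    # Repeated from above so this snippet is self-contained.
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"
    if effective <= 8: return "RTX_4090"
    if effective <= 30: return "A40"
    if effective <= 70: return "A100_80GB"
    return "H100_80GB"

print(select_gpu(8))                   # → RTX_4090 (8B inference)
print(select_gpu(13))                  # → A40
print(select_gpu(70, quantized=True))  # → A100_80GB (70B halved by quantization)
```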
See reference/cost-optimization.md for detailed pricing and budget controls. </gpu_selection>
<handler_patterns>
import runpod

def long_task_handler(job):
    total_steps = job["input"].get("steps", 10)
    for step in range(total_steps):
        process_step(step)
        runpod.serverless.progress_update(
            job_id=job["id"],
            progress=int((step + 1) / total_steps * 100)
        )
    return {"status": "complete", "steps": total_steps}

runpod.serverless.start({"handler": long_task_handler})
import runpod
import torch
import traceback

def safe_handler(job):
    try:
        # Validate input
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        result = process(job["input"])
        return {"output": result}
    except torch.cuda.OutOfMemoryError:
        return {"error": "GPU OOM - reduce input size", "retry": False}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

runpod.serverless.start({"handler": safe_handler})
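The error-dict contract above can be exercised locally. This variant drops the CUDA-specific clause so it runs without a GPU; `process` is a stand-in for your own inference function:

```python
import traceback

def process(job_input):
    return job_input["prompt"][::-1]  # stand-in for real inference

def safe_handler(job):
    # Same contract as the handler above, minus the torch-specific
    # except clause so it runs anywhere.
    try:
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        return {"output": process(job["input"])}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

print(safe_handler({"input": {}}))                 # → error dict
print(safe_handler({"input": {"prompt": "abc"}}))  # → {'output': 'cba'}
```

Returning an error dict (rather than raising) keeps the job result inspectable by the caller instead of surfacing as an opaque worker failure.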
See reference/serverless-workers.md for async patterns, batching, and advanced handlers. </handler_patterns>
<vllm_deployment>
Note: vLLM uses the OpenAI-compatible API format but connects to your RunPod endpoint, not OpenAI's servers. Models run on your GPU (Llama, Qwen, Mistral, etc.).
vllm_env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-70B-Instruct",
    "HF_TOKEN": "${HF_TOKEN}",
    "TENSOR_PARALLEL_SIZE": "2",       # Multi-GPU
    "MAX_MODEL_LEN": "16384",
    "GPU_MEMORY_UTILIZATION": "0.95",
    "QUANTIZATION": "awq",             # Optional: awq, gptq
}
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}

response = requests.post(url, headers=headers, json={
    "input": {"prompt": "Hello", "stream": True}
})
job_id = response.json()["id"]

# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
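Besides streaming, a submitted job can be polled to completion. A sketch assuming RunPod's /status/{job_id} route and its COMPLETED/FAILED/CANCELLED states; the HTTP getter is injected so the loop can be tested offline:

```python
import time

def poll_status(endpoint_id, job_id, api_key, get, interval_s=2.0, max_polls=60):
    """Poll the endpoint's /status route until the job settles.

    `get` is injected (pass requests.get in real use) so the loop is
    testable without a network. Route and state names follow RunPod's
    serverless API; adjust if your endpoint differs.
    """
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    for _ in range(max_polls):
        data = get(url, headers=headers).json()
        if data.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
            return data
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In production: `poll_status("ENDPOINT_ID", job_id, "RUNPOD_API_KEY", requests.get)`.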
See reference/model-deployment.md for HuggingFace, TGI, and custom model patterns. </vllm_deployment>
<auto_scaling>
| Type | Best For | Config |
|---|---|---|
| QUEUE_DELAY | Variable traffic | scaler_value=2 (2s target) |
| REQUEST_COUNT | Predictable load | scaler_value=5 (5 req/worker) |
configs = {
    "interactive_api": {
        "workers_min": 1,        # Always warm
        "workers_max": 5,
        "idle_timeout": 120,
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 1,       # 1s latency target
    },
    "batch_processing": {
        "workers_min": 0,
        "workers_max": 20,
        "idle_timeout": 30,
        "scaler_type": "REQUEST_COUNT",
        "scaler_value": 5,
    },
    "cost_optimized": {
        "workers_min": 0,
        "workers_max": 3,
        "idle_timeout": 15,      # Aggressive scale-down
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 5,
    },
}
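Configs like the ones above can be sanity-checked before deployment. An illustrative helper (not part of the RunPod SDK) that catches the common mistakes:

```python
def validate_scaling_config(cfg):
    """Return a list of problems with a scaling config dict; empty means OK.
    Illustrative helper, not part of the RunPod SDK."""
    errors = []
    if cfg["workers_min"] < 0:
        errors.append("workers_min must be >= 0")
    if cfg["workers_max"] < max(cfg["workers_min"], 1):
        errors.append("workers_max must be >= workers_min and >= 1")
    if cfg["scaler_type"] not in ("QUEUE_DELAY", "REQUEST_COUNT"):
        errors.append("unknown scaler_type")
    if cfg["scaler_value"] <= 0:
        errors.append("scaler_value must be positive")
    return errors

# The always-warm interactive config above passes cleanly.
assert validate_scaling_config({
    "workers_min": 1, "workers_max": 5, "idle_timeout": 120,
    "scaler_type": "QUEUE_DELAY", "scaler_value": 1,
}) == []
```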
See reference/pod-management.md for pod lifecycle and scaling details. </auto_scaling>
<health_monitoring>
import runpod

async def check_health(endpoint_id: str):
    endpoint = runpod.Endpoint(endpoint_id)
    health = await endpoint.health()
    return {
        "status": health.status,
        "workers_ready": health.workers.ready,
        "queue_depth": health.queue.in_queue,
        "avg_latency_ms": health.metrics.avg_execution_time,
    }
query GetEndpoint($id: String!) {
  endpoint(id: $id) {
    status
    workers { ready running pending throttled }
    queue { inQueue inProgress completed failed }
    metrics {
      requestsPerMinute
      avgExecutionTimeMs
      p95ExecutionTimeMs
      successRate
    }
  }
}
See reference/monitoring.md for structured logging, alerts, and dashboards. </health_monitoring>
<dockerfile_pattern>
FROM runpod/pytorch:2.1.0-py3.10-cuda12.1.1-devel-ubuntu22.04
WORKDIR /app
# Install dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# RunPod entrypoint
CMD ["python", "-u", "handler.py"]
See reference/templates.md for runpod.toml and requirements.txt patterns. </dockerfile_pattern>
<file_locations>
Core Patterns:
reference/serverless-workers.md - Handler patterns, streaming, async
reference/model-deployment.md - vLLM, TGI, HuggingFace deployment
reference/pod-management.md - GPU types, scaling, lifecycle

Operations:
reference/cost-optimization.md - Budget controls, right-sizing
reference/monitoring.md - Health checks, logging, GraphQL
reference/troubleshooting.md - Common issues and solutions

DevOps:
reference/cicd.md - GitHub Actions for M1 Mac builds
reference/templates.md - Dockerfile, runpod.toml configs
templates/runpod-worker.py - Production handler template </file_locations>

User wants serverless deployment: → Provide handler pattern, Dockerfile, deployment steps → Reference: reference/serverless-workers.md
User wants vLLM endpoint: → Provide vLLM env config, OpenAI client setup → Reference: reference/model-deployment.md
User wants cost optimization: → Provide GPU selection matrix, spot pricing, budget controls → Reference: reference/cost-optimization.md
User on M1/M2 Mac: → CRITICAL: Must use GitHub Actions for builds → Reference: reference/cicd.md
User has deployment issues: → Check health endpoint, review logs → Reference: reference/troubleshooting.md
<cost_quick_ref>
Cost Estimation:
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s):
    rates = {"RTX_4090": 0.44, "A40": 0.65, "A100_80GB": 1.89}
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * rates.get(gpu_type, 1.0)
# Example: 1000 requests/day, 5s each, RTX 4090
# = (1000 * 5) / 3600 * 30 * 0.44 = $18.33/month
</cost_quick_ref>
<cli_commands>
# Install
pip install runpod
# Deploy endpoint
runpodctl project deploy --name my-endpoint --gpu-type "NVIDIA RTX 4090"
# Health check
runpod endpoint health ENDPOINT_ID
# View logs
runpod endpoint logs ENDPOINT_ID
# Scale workers
runpod endpoint scale ENDPOINT_ID --min 1 --max 10
# Local testing
python handler.py --rp_serve_api
</cli_commands>
<example_session>
User: "Deploy a Llama 3.1 8B model to RunPod serverless"
Claude:
1. Select a GPU (an 8B model fits a 24GB card):
gpu_type = "NVIDIA GeForce RTX 4090"
2. Configure the vLLM environment:
env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_MODEL_LEN": "8192",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
3. Access via OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
4. Cost estimate: ~$0.44/hr compute, scale-to-zero when idle. </example_session>
As the final step, write to ~/.claude/skill-analytics/last-outcome-runpod-deployment.json:
{"ts":"[UTC ISO8601]","skill":"runpod-deployment","version":"1.0.0","variant":"default",
"status":"[success|partial|error]","runtime_ms":[estimated ms from start],
"metrics":{"pods_configured":[n],"deployments_created":[n]},
"error":null,"session_id":"[YYYY-MM-DD]"}
Use status "partial" if some stages failed but results were produced. Use "error" only if no output was generated.
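One way to produce that record is a small helper; the path and schema come from the instructions above, while `write_outcome` itself and its injectable `base` directory are hypothetical, added so the sketch can be tested against a temporary path:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_outcome(status, pods=0, deployments=0, runtime_ms=0, error=None,
                  base=Path.home() / ".claude" / "skill-analytics"):
    """Write the last-outcome record described above.

    `base` is injectable so tests can point at a temp directory instead
    of the real ~/.claude/skill-analytics location.
    """
    base.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(),
        "skill": "runpod-deployment",
        "version": "1.0.0",
        "variant": "default",
        "status": status,  # success | partial | error
        "runtime_ms": runtime_ms,
        "metrics": {"pods_configured": pods, "deployments_created": deployments},
        "error": error,
        "session_id": now.strftime("%Y-%m-%d"),
    }
    path = base / "last-outcome-runpod-deployment.json"
    path.write_text(json.dumps(record))
    return path
```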
Weekly Installs: 67
GitHub Stars: 6
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Fail
Installed on:
codex: 63
opencode: 63
claude-code: 59
gemini-cli: 58
github-copilot: 54
cursor: 53