runpod-deployment by scientiacapital/skills

npx skills add https://github.com/scientiacapital/skills --skill runpod-deployment
Key deliverables:
<quick_start> Minimal Serverless Handler (v1.8.1):
import runpod

def handler(job):
    """Basic handler - receives job, returns result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    # Your inference logic here
    result = process(prompt)
    return {"output": result}

runpod.serverless.start({"handler": handler})
Streaming Handler:
import runpod

def streaming_handler(job):
    """Generator for streaming responses."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

runpod.serverless.start({
    "handler": streaming_handler,
    "return_aggregate_stream": True
})
vLLM OpenAI-Compatible Client:
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
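The minimal handler can be smoke-tested without RunPod by calling it directly with a fake job payload. A sketch, where `process` is a stand-in for your own inference function:

```python
# Local smoke test for the minimal handler above; runpod itself is not
# needed because we call the handler directly.
def process(prompt):
    return prompt.upper()  # placeholder "inference"

def handler(job):
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    result = process(prompt)
    return {"output": result}

print(handler({"input": {"prompt": "hello"}}))  # → {'output': 'HELLO'}
```

The same dict-in, dict-out contract is what RunPod invokes in production, so a passing local call is a quick sanity check before building the image.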
</quick_start>
<success_criteria> A RunPod deployment is successful when:
<m1_mac_critical>
Images built on ARM (Apple Silicon) will not run on RunPod's x86 GPU hosts.
Solution: let GitHub Actions build for you:
# Push code - Actions builds x86 image
git add . && git commit -m "Deploy" && git push
See reference/cicd.md for the complete GitHub Actions workflow.
Never run docker build locally for RunPod on Apple Silicon. </m1_mac_critical>
<gpu_selection>
| GPU | VRAM | Secure $/hr | Spot $/hr | Best For |
|---|---|---|---|---|
| RTX A4000 | 16GB | $0.36 | $0.18 | Embeddings, small models |
| RTX 4090 | 24GB | $0.44 | $0.22 | 7B-8B inference |
| A40 | 48GB | $0.65 | $0.39 | 13B-30B, fine-tuning |
| A100 80GB | 80GB | $1.89 | $0.89 | 70B models, production |
| H100 80GB | 80GB | $4.69 | $1.88 | 70B+ training |
Quick Selection:
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"   # $0.36/hr
    if effective <= 8: return "RTX_4090"    # $0.44/hr
    if effective <= 30: return "A40"        # $0.65/hr
    if effective <= 70: return "A100_80GB"  # $1.89/hr
    return "H100_80GB"                      # $4.69/hr
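For example (the selector is repeated here so the snippet runs standalone):

```python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    # Repeated from above so this snippet is self-contained.
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"
    if effective <= 8: return "RTX_4090"
    if effective <= 30: return "A40"
    if effective <= 70: return "A100_80GB"
    return "H100_80GB"

print(select_gpu(8))                   # → RTX_4090 (8B inference)
print(select_gpu(13))                  # → A40
print(select_gpu(70, quantized=True))  # → A100_80GB (70B halved by quantization)
```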
See reference/cost-optimization.md for detailed pricing and budget controls. </gpu_selection>
<handler_patterns>
import runpod

def long_task_handler(job):
    total_steps = job["input"].get("steps", 10)
    for step in range(total_steps):
        process_step(step)
        runpod.serverless.progress_update(
            job_id=job["id"],
            progress=int((step + 1) / total_steps * 100)
        )
    return {"status": "complete", "steps": total_steps}

runpod.serverless.start({"handler": long_task_handler})
import runpod
import torch
import traceback

def safe_handler(job):
    try:
        # Validate input
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        result = process(job["input"])
        return {"output": result}
    except torch.cuda.OutOfMemoryError:
        return {"error": "GPU OOM - reduce input size", "retry": False}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

runpod.serverless.start({"handler": safe_handler})
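The error-dict contract above can be exercised locally. This variant drops the CUDA-specific clause so it runs without a GPU; `process` is a stand-in for your own inference function:

```python
import traceback

def process(job_input):
    return job_input["prompt"][::-1]  # stand-in for real inference

def safe_handler(job):
    # Same contract as the handler above, minus the torch-specific
    # except clause so it runs anywhere.
    try:
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        return {"output": process(job["input"])}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

print(safe_handler({"input": {}}))                 # → error dict
print(safe_handler({"input": {"prompt": "abc"}}))  # → {'output': 'cba'}
```

Returning an error dict (rather than raising) keeps the job result inspectable by the caller instead of surfacing as an opaque worker failure.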
See reference/serverless-workers.md for async patterns, batching, and advanced handlers. </handler_patterns>
<vllm_deployment>
Note: vLLM uses the OpenAI-compatible API format but connects to your RunPod endpoint, not OpenAI's servers. Models run on your GPU (Llama, Qwen, Mistral, etc.).
vllm_env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-70B-Instruct",
    "HF_TOKEN": "${HF_TOKEN}",
    "TENSOR_PARALLEL_SIZE": "2",       # Multi-GPU
    "MAX_MODEL_LEN": "16384",
    "GPU_MEMORY_UTILIZATION": "0.95",
    "QUANTIZATION": "awq",             # Optional: awq, gptq
}
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}

response = requests.post(url, headers=headers, json={
    "input": {"prompt": "Hello", "stream": True}
})
job_id = response.json()["id"]

# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
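Besides streaming, a submitted job can be polled to completion. A sketch assuming RunPod's /status/{job_id} route and its COMPLETED/FAILED/CANCELLED states; the HTTP getter is injected so the loop can be tested offline:

```python
import time

def poll_status(endpoint_id, job_id, api_key, get, interval_s=2.0, max_polls=60):
    """Poll the endpoint's /status route until the job settles.

    `get` is injected (pass requests.get in real use) so the loop is
    testable without a network. Route and state names follow RunPod's
    serverless API; adjust if your endpoint differs.
    """
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    for _ in range(max_polls):
        data = get(url, headers=headers).json()
        if data.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
            return data
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In production: `poll_status("ENDPOINT_ID", job_id, "RUNPOD_API_KEY", requests.get)`.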
See reference/model-deployment.md for HuggingFace, TGI, and custom model patterns. </vllm_deployment>
<auto_scaling>
| Type | Best For | Config |
|---|---|---|
| QUEUE_DELAY | Variable traffic | scaler_value=2 (2s target) |
| REQUEST_COUNT | Predictable load | scaler_value=5 (5 req/worker) |
configs = {
    "interactive_api": {
        "workers_min": 1,        # Always warm
        "workers_max": 5,
        "idle_timeout": 120,
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 1,       # 1s latency target
    },
    "batch_processing": {
        "workers_min": 0,
        "workers_max": 20,
        "idle_timeout": 30,
        "scaler_type": "REQUEST_COUNT",
        "scaler_value": 5,
    },
    "cost_optimized": {
        "workers_min": 0,
        "workers_max": 3,
        "idle_timeout": 15,      # Aggressive scale-down
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 5,
    },
}
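Configs like the ones above can be sanity-checked before deployment. An illustrative helper (not part of the RunPod SDK) that catches the common mistakes:

```python
def validate_scaling_config(cfg):
    """Return a list of problems with a scaling config dict; empty means OK.
    Illustrative helper, not part of the RunPod SDK."""
    errors = []
    if cfg["workers_min"] < 0:
        errors.append("workers_min must be >= 0")
    if cfg["workers_max"] < max(cfg["workers_min"], 1):
        errors.append("workers_max must be >= workers_min and >= 1")
    if cfg["scaler_type"] not in ("QUEUE_DELAY", "REQUEST_COUNT"):
        errors.append("unknown scaler_type")
    if cfg["scaler_value"] <= 0:
        errors.append("scaler_value must be positive")
    return errors

# The always-warm interactive config above passes cleanly.
assert validate_scaling_config({
    "workers_min": 1, "workers_max": 5, "idle_timeout": 120,
    "scaler_type": "QUEUE_DELAY", "scaler_value": 1,
}) == []
```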
See reference/pod-management.md for pod lifecycle and scaling details. </auto_scaling>
<health_monitoring>
import runpod

async def check_health(endpoint_id: str):
    endpoint = runpod.Endpoint(endpoint_id)
    health = await endpoint.health()
    return {
        "status": health.status,
        "workers_ready": health.workers.ready,
        "queue_depth": health.queue.in_queue,
        "avg_latency_ms": health.metrics.avg_execution_time,
    }
query GetEndpoint($id: String!) {
  endpoint(id: $id) {
    status
    workers { ready running pending throttled }
    queue { inQueue inProgress completed failed }
    metrics {
      requestsPerMinute
      avgExecutionTimeMs
      p95ExecutionTimeMs
      successRate
    }
  }
}
See reference/monitoring.md for structured logging, alerts, and dashboards. </health_monitoring>
<dockerfile_pattern>
FROM runpod/pytorch:2.1.0-py3.10-cuda12.1.1-devel-ubuntu22.04
WORKDIR /app
# Install dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# RunPod entrypoint
CMD ["python", "-u", "handler.py"]
See reference/templates.md for runpod.toml and requirements.txt patterns. </dockerfile_pattern>
<file_locations>
Core Patterns:
reference/serverless-workers.md - Handler patterns, streaming, async
reference/model-deployment.md - vLLM, TGI, HuggingFace deployment
reference/pod-management.md - GPU types, scaling, lifecycle

Operations:
reference/cost-optimization.md - Budget controls, right-sizing
reference/monitoring.md - Health checks, logging, GraphQL
reference/troubleshooting.md - Common issues and solutions

DevOps:
reference/cicd.md - GitHub Actions for M1 Mac builds
reference/templates.md - Dockerfile, runpod.toml configs
templates/runpod-worker.py - Production handler template </file_locations>

User wants serverless deployment: → Provide handler pattern, Dockerfile, deployment steps → Reference: reference/serverless-workers.md
User wants vLLM endpoint: → Provide vLLM env config, OpenAI client setup → Reference: reference/model-deployment.md
User wants cost optimization: → Provide GPU selection matrix, spot pricing, budget controls → Reference: reference/cost-optimization.md
User on M1/M2 Mac: → CRITICAL: Must use GitHub Actions for builds → Reference: reference/cicd.md
User has deployment issues: → Check health endpoint, review logs → Reference: reference/troubleshooting.md
<cost_quick_ref>
Cost Estimation:
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s):
    rates = {"RTX_4090": 0.44, "A40": 0.65, "A100_80GB": 1.89}
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * rates.get(gpu_type, 1.0)
# Example: 1000 requests/day, 5s each, RTX 4090
# = (1000 * 5) / 3600 * 30 * 0.44 = $18.33/month
</cost_quick_ref>
<cli_commands>
# Install
pip install runpod
# Deploy endpoint
runpodctl project deploy --name my-endpoint --gpu-type "NVIDIA RTX 4090"
# Health check
runpod endpoint health ENDPOINT_ID
# View logs
runpod endpoint logs ENDPOINT_ID
# Scale workers
runpod endpoint scale ENDPOINT_ID --min 1 --max 10
# Local testing
python handler.py --rp_serve_api
</cli_commands>
<example_session>
User: "Deploy a Llama 3.1 8B model to RunPod serverless"
Claude:
1. Select a GPU (an 8B model fits a 24GB card):
gpu_type = "NVIDIA GeForce RTX 4090"
2. Configure the vLLM environment:
env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_MODEL_LEN": "8192",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
3. Access via OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
4. Cost estimate: ~$0.44/hr compute, scale-to-zero when idle. </example_session>
As the final step, write to ~/.claude/skill-analytics/last-outcome-runpod-deployment.json:
{"ts":"[UTC ISO8601]","skill":"runpod-deployment","version":"1.0.0","variant":"default",
"status":"[success|partial|error]","runtime_ms":[estimated ms from start],
"metrics":{"pods_configured":[n],"deployments_created":[n]},
"error":null,"session_id":"[YYYY-MM-DD]"}
Use status "partial" if some stages failed but results were produced. Use "error" only if no output was generated.
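One way to produce that record is a small helper; the path and schema come from the instructions above, while `write_outcome` itself and its injectable `base` directory are hypothetical, added so the sketch can be tested against a temporary path:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_outcome(status, pods=0, deployments=0, runtime_ms=0, error=None,
                  base=Path.home() / ".claude" / "skill-analytics"):
    """Write the last-outcome record described above.

    `base` is injectable so tests can point at a temp directory instead
    of the real ~/.claude/skill-analytics location.
    """
    base.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(),
        "skill": "runpod-deployment",
        "version": "1.0.0",
        "variant": "default",
        "status": status,  # success | partial | error
        "runtime_ms": runtime_ms,
        "metrics": {"pods_configured": pods, "deployments_created": deployments},
        "error": error,
        "session_id": now.strftime("%Y-%m-%d"),
    }
    path = base / "last-outcome-runpod-deployment.json"
    path.write_text(json.dumps(record))
    return path
```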
Weekly Installs: 67
GitHub Stars: 6
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Fail
Installed on:
codex: 63
opencode: 63
claude-code: 59
gemini-cli: 58
github-copilot: 54
cursor: 53