modal-serverless-gpu by davila7/claude-code-templates
```bash
npx skills add https://github.com/davila7/claude-code-templates --skill modal-serverless-gpu
```

Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.
Use Modal when:
Key features:
Use alternatives instead:
```bash
pip install modal
modal setup  # Opens a browser for authentication
```
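For CI machines or other headless environments where a browser can't open, credentials can be set directly. A sketch using Modal's token CLI and its standard environment variables; check `modal token --help` on your version for the exact flags:

```bash
# Set credentials without a browser
modal token set --token-id ak-... --token-secret as-...

# Or export them as environment variables (e.g. in CI)
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```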
```python
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())
```
Run: `modal run hello_gpu.py`
```python
import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
```
| Component | Purpose |
|---|---|
| App | Container for functions and resources |
| Function | Serverless function with compute specs |
| Cls | Class-based functions with lifecycle hooks |
| Image | Container image definition |
| Volume | Persistent storage for models/data |
| Secret | Secure credential storage |
| Command | Description |
|---|---|
| modal run script.py | Execute and exit |
| modal serve script.py | Development mode with live reload |
| modal deploy script.py | Persistent cloud deployment |
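Once deployed, a function can be called from any Python process, not just via `modal run`. A minimal sketch, assuming the hello-gpu app above has already been deployed; on older Modal releases the lookup API is `modal.Function.lookup` rather than `from_name`:

```python
import modal

# Attach to a function on the already-deployed "hello-gpu" app
gpu_info = modal.Function.from_name("hello-gpu", "gpu_info")
print(gpu_info.remote())  # executes in the cloud, returns the result locally
```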
| GPU | VRAM | Best For |
|---|---|---|
| T4 | 16GB | Budget inference, small models |
| L4 | 24GB | Inference, Ada Lovelace arch |
| A10G | 24GB | Training/inference, 3.3x faster than T4 |
| L40S | 48GB | Recommended for inference (best cost/perf) |
| A100-40GB | 40GB | Large model training |
| A100-80GB | 80GB | Very large models |
| H100 | 80GB | Fastest, FP8 + Transformer Engine |
| H200 | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| B200 | 192GB | Newest; Blackwell architecture |
```python
# Single GPU
@app.function(gpu="A100")

# Specific memory variant
@app.function(gpu="A100-80GB")

# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")

# GPU with fallbacks, in order of preference
@app.function(gpu=["H100", "A100", "L40S"])

# Any available GPU
@app.function(gpu="any")
```
```python
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)

# From a CUDA base image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11",
).pip_install("torch", "transformers")

# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
```
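Builder steps chain, so a single image definition can mix system packages, Python dependencies, environment variables, and arbitrary build commands. A sketch assuming Modal's `.env()` and `.run_commands()` builder methods; `HF_HOME` here is just an illustrative cache location:

```python
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("torch==2.1.0", "transformers==4.36.0")
    .env({"HF_HOME": "/models"})  # point Hugging Face caches at a volume mount
    .run_commands("python -c 'import torch; print(torch.__version__)'")  # build-time sanity check
)
```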
```python
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()           # placeholder for your download logic
        model.save_pretrained(model_path)
        volume.commit()                    # persist changes
    return load_from_path(model_path)      # placeholder for your loading logic
```
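Commits made by one container become visible to others only after a reload. A sketch of a reader on the same volume:

```python
@app.function(volumes={"/models": volume})
def list_models():
    import os
    volume.reload()  # pick up commits made since this container started
    return os.listdir("/models")
```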
```python
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}  # `model` stands in for your loaded model
```
```python
from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
```
| Decorator | Use Case |
|---|---|
| @modal.fastapi_endpoint() | Simple function → API |
| @modal.asgi_app() | Full FastAPI/Starlette apps |
| @modal.wsgi_app() | Django/Flask apps |
| @modal.web_server(port) | Arbitrary HTTP servers |
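After `modal deploy`, each endpoint gets a stable URL, printed in the deploy output. Note that FastAPI treats a bare `str` parameter as a query parameter. The URL shape below is illustrative only; use the one Modal prints:

```bash
curl -X POST "https://your-workspace--app-predict.modal.run?text=hello"
```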
```python
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs are batched automatically across concurrent callers
    return model.batch_predict(inputs)
```
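Callers still send single items; Modal groups up to `max_batch_size` of them (waiting at most `wait_ms`) into one invocation and fans the results back out:

```python
# Each caller passes one input and receives one result
result = batch_predict.remote("a single input")
```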
```bash
# Create the secret once via the CLI
modal secret create huggingface HF_TOKEN=hf_xxx
```

```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
```
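For quick local experiments, a secret can also be built inline from the calling environment; a sketch using `modal.Secret.from_dict` (named secrets remain the better choice for anything shared or deployed):

```python
import os

dev_secret = modal.Secret.from_dict({"HF_TOKEN": os.environ["HF_TOKEN"]})

@app.function(secrets=[dev_secret])
def dev_download():
    ...
```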
```python
@app.function(schedule=modal.Cron("0 0 * * *"))  # daily at midnight (UTC)
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
```
```python
@app.function(
    container_idle_timeout=300,  # keep containers warm for 5 minutes
    allow_concurrent_inputs=10,  # handle concurrent requests per container
)
def inference():
    pass
```
@app.cls(gpu="A100")
class Model:
@modal.enter() # 容器启动时运行一次
def load(self):
self.model = load_model() # 在预热期间加载
@modal.method()
def predict(self, x):
return self.model(x)
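A matching `@modal.exit()` hook runs once at container shutdown, handy for flushing buffers or closing connections; a minimal sketch extending the class above:

```python
@app.cls(gpu="A100")
class ModelWithCleanup:
    @modal.enter()
    def load(self):
        self.model = load_model()  # placeholder, as above

    @modal.method()
    def predict(self, x):
        return self.model(x)

    @modal.exit()
    def teardown(self):
        pass  # flush metrics, close connections, etc.
```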
```python
@app.function()
def process_item(item):
    return expensive_computation(item)  # placeholder for real work

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out across parallel containers
    results = list(process_item.map(items))
    return results
```
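For fire-and-forget work, `spawn` returns a handle immediately instead of blocking; a sketch, run from another Modal function or a local entrypoint:

```python
call = process_item.spawn(42)  # returns a FunctionCall handle right away
# ... do other work ...
print(call.get(timeout=60))    # block for the result only when needed
```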
```python
@app.function(
    gpu="A100",
    memory=32768,                # 32GB RAM
    cpu=4,                       # 4 CPU cores
    timeout=3600,                # 1 hour max
    container_idle_timeout=120,  # keep warm for 2 minutes
    retries=3,                   # retry on failure
    concurrency_limit=10,        # max concurrent containers
)
def my_function():
    pass
```
```python
# Test locally without the cloud
if __name__ == "__main__":
    result = my_function.local()

# View logs:
#   modal app logs my-app
```
| Issue | Solution |
|---|---|
| Cold start latency | Increase container_idle_timeout, use @modal.enter() |
| GPU out of memory | Use a larger GPU (A100-80GB), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase timeout, add checkpointing |