vLLM 高性能大语言模型服务指南：部署、优化与批量推理

serving-llms-vllm by orchestra-research/ai-research-skills

88 周安装量

5,500 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/orchestra-research/ai-research-skills --skill serving-llms-vllm

AI/机器学习性能优化部署策略

🇨🇳中文介绍

vLLM - 高性能大语言模型服务

快速开始

vLLM 通过分页注意力机制（基于块的 KV 缓存）和连续批处理（混合预填充/解码请求），实现了比标准 transformers 高 24 倍的吞吐量。

安装：

pip install vllm

基础离线推理：

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)

OpenAI 兼容服务器：

vllm serve meta-llama/Llama-3-8B-Instruct

# 使用 OpenAI SDK 查询
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

常见工作流

工作流 1：生产环境 API 部署

复制此清单并跟踪进度：

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

工作流 2：离线批量推理

用于处理大型数据集，无需服务器开销。

Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results

步骤 1：准备输入数据

# 从文件加载提示词
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")

步骤 2：配置 LLM 引擎

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # 使用 2 个 GPU
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)

步骤 3：运行批量推理

vLLM 自动批处理请求以提高效率：

# 在一次调用中处理所有提示词
outputs = llm.generate(prompts, sampling)

# vLLM 内部处理批处理
# 无需手动分块提示词

步骤 4：处理结果

# 提取生成的文本
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# 保存到文件
import json
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")

工作流 3：量化模型服务

在有限的 GPU 内存中部署大型模型。

Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

步骤 1：选择量化方法

AWQ ：最适合 70B 模型，精度损失最小
GPTQ ：广泛的模型支持，良好的压缩率
FP8 ：在 H100 GPU 上速度最快

步骤 2：查找或创建量化模型

使用 HuggingFace 上的预量化模型：

# 搜索 AWQ 模型
# 示例：TheBloke/Llama-2-70B-AWQ

步骤 3：使用量化标志启动

# 使用预量化模型
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# 结果：70B 模型占用约 40GB VRAM

步骤 4：验证准确性

测试输出是否符合预期质量：

# 比较量化与非量化模型的响应
# 验证特定任务的性能未发生变化

何时使用 vs 替代方案

在以下情况使用 vLLM：

部署生产环境 LLM API（100+ 请求/秒）
提供 OpenAI 兼容的端点
GPU 内存有限但需要大型模型
多用户应用程序（聊天机器人、助手）
需要低延迟和高吞吐量

在以下情况使用替代方案：

llama.cpp ：CPU/边缘推理，单用户
HuggingFace transformers ：研究、原型设计、一次性生成
TensorRT-LLM ：仅限 NVIDIA，需要绝对最高性能
Text-Generation-Inference ：已处于 HuggingFace 生态系统

问题：模型加载期间内存不足

减少内存使用：

vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096

vllm serve MODEL --quantization awq

问题：首个令牌生成慢（TTFT > 1 秒）

为重复提示词启用前缀缓存：

vllm serve MODEL --enable-prefix-caching

对于长提示词，启用分块预填充：

vllm serve MODEL --enable-chunked-prefill

问题：模型未找到错误

对于自定义模型，使用 --trust-remote-code：

vllm serve MODEL --trust-remote-code

问题：吞吐量低（<50 请求/秒）

增加并发序列数：

vllm serve MODEL --max-num-seqs 512

使用 nvidia-smi 检查 GPU 利用率 - 应 >80%。

问题：推理速度慢于预期

验证张量并行使用了 2 的幂次方个 GPU：

vllm serve MODEL --tensor-parallel-size 4  # 不是 3

启用推测解码以加速生成：

vllm serve MODEL --speculative-model DRAFT_MODEL

服务器部署模式 ：有关 Docker、Kubernetes 和负载均衡配置，请参阅 references/server-deployment.md。

性能优化 ：有关分页注意力机制调优、连续批处理细节和基准测试结果，请参阅 references/optimization.md。

量化指南 ：有关 AWQ/GPTQ/FP8 设置、模型准备和准确性比较，请参阅 references/quantization.md。

故障排除 ：有关详细的错误消息、调试步骤和性能诊断，请参阅 references/troubleshooting.md。

小型模型（7B-13B） ：1x A10（24GB）或 A100（40GB）
中型模型（30B-40B） ：2x A100（40GB）带张量并行
大型模型（70B+） ：4x A100（40GB）或 2x A100（80GB），使用 AWQ/GPTQ

支持的平台：NVIDIA（主要）、AMD ROCm、Intel GPU、TPU

官方文档：https://docs.vllm.ai
GitHub：https://github.com/vllm-project/vllm
论文："Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
社区：https://discuss.vllm.ai

🇺🇸English

vLLM - High-Performance LLM Serving

Quick start

vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

Installation :

pip install vllm

Basic offline inference :

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)

OpenAI-compatible server :

vllm serve meta-llama/Llama-3-8B-Instruct

# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

Common workflows

Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

Step 1: Configure server settings

Choose configuration based on your model size:

# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0

Step 2: Test with limited traffic

Run load test before production:

# Install load testing tool
pip install locust

# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000

Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.

Step 3: Enable monitoring

vLLM exposes Prometheus metrics on port 9090:

curl http://localhost:9090/metrics | grep vllm

Key metrics to monitor:

vllm:time_to_first_token_seconds - Latency
vllm:num_requests_running - Active requests
vllm:gpu_cache_usage_perc - KV cache utilization

Step 4: Deploy to production

Use Docker for consistent deployment:

# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching

Step 5: Verify performance metrics

Check that deployment meets targets:

TTFT < 500ms (for short prompts)
Throughput > target req/sec
GPU utilization > 80%
No OOM errors in logs

Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results

Step 1: Prepare input data

# Load prompts from file
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")

Step 2: Configure LLM engine

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)

Step 3: Run batch inference

vLLM automatically batches requests for efficiency:

# Process all prompts in one call
outputs = llm.generate(prompts, sampling)

# vLLM handles batching internally
# No need to manually chunk prompts

Step 4: Process results

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
import json
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")

Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

Step 1: Choose quantization method

AWQ : Best for 70B models, minimal accuracy loss
GPTQ : Wide model support, good compression
FP8 : Fastest on H100 GPUs

Step 2: Find or create quantized model

Use pre-quantized models from HuggingFace:

# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ

Step 3: Launch with quantization flag

# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# Results: 70B model in ~40GB VRAM

Step 4: Verify accuracy

Test outputs match expected quality:

# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged

When to use vs alternatives

Use vLLM when:

Deploying production LLM APIs (100+ req/sec)
Serving OpenAI-compatible endpoints
Limited GPU memory but need large models
Multi-user applications (chatbots, assistants)
Need low latency with high throughput

Use alternatives instead:

llama.cpp : CPU/edge inference, single-user
HuggingFace transformers : Research, prototyping, one-off generation
TensorRT-LLM : NVIDIA-only, need absolute maximum performance
Text-Generation-Inference : Already in HuggingFace ecosystem

Common issues

Issue: Out of memory during model loading

Reduce memory usage:

vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096

Or use quantization:

vllm serve MODEL --quantization awq

Issue: Slow first token (TTFT > 1 second)

Enable prefix caching for repeated prompts:

vllm serve MODEL --enable-prefix-caching

For long prompts, enable chunked prefill:

vllm serve MODEL --enable-chunked-prefill

Issue: Model not found error

Use --trust-remote-code for custom models:

vllm serve MODEL --trust-remote-code

Issue: Low throughput ( <50 req/sec)

Increase concurrent sequences:

vllm serve MODEL --max-num-seqs 512

Check GPU utilization with nvidia-smi - should be >80%.

Issue: Inference slower than expected

Verify tensor parallelism uses power of 2 GPUs:

vllm serve MODEL --tensor-parallel-size 4  # Not 3

Enable speculative decoding for faster generation:

vllm serve MODEL --speculative-model DRAFT_MODEL

Advanced topics

Server deployment patterns : See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.

Performance optimization : See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.

Quantization guide : See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

Troubleshooting : See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.

Hardware requirements

Small models (7B-13B) : 1x A10 (24GB) or A100 (40GB)
Medium models (30B-40B) : 2x A100 (40GB) with tensor parallelism
Large models (70B+) : 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs

Resources

Official docs: https://docs.vllm.ai
GitHub: https://github.com/vllm-project/vllm
Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
Community: https://discuss.vllm.ai

Weekly Installs

Repository

orchestra-resea…h-skills

GitHub Stars

5.5K

First Seen

Feb 7, 2026

Security Audits

Gen Agent Trust HubWarn SocketPass SnykWarn

Installed on

cursor70

claude-code65

opencode64

codex53

gemini-cli52

github-copilot51

AI 代码实施计划编写技能 | 自动化开发任务分解与 TDD 流程规划工具

49,800 周安装