serving-llms-vllm by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill serving-llms-vllm
vLLM achieves up to 24x higher throughput than standard Hugging Face transformers through PagedAttention (block-based KV cache management) and continuous batching (mixing prefill and decode requests in the same batch).
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with prefix caching; Prometheus metrics are served at /metrics on the same port
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--port 8000 \
--host 0.0.0.0
Step 2: Test with limited traffic
Run a load test before going to production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
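As a lighter-weight alternative to locust, TTFT can be estimated with a stdlib-only script. This is a sketch under assumptions: the server from Step 1 is running on localhost:8000, and `measure_ttft` and `percentile` are helper names chosen here, not vLLM APIs.

```python
import json
import time
import urllib.request


def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, int(round(p / 100 * len(ordered))) - 1))
    return ordered[idx]


def measure_ttft(base_url="http://localhost:8000", n=20):
    """Send n streaming chat requests and time the first response chunk of each."""
    body = json.dumps({
        "model": "meta-llama/Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    }).encode()
    latencies = []
    for _ in range(n):
        req = urllib.request.Request(
            base_url + "/v1/chat/completions",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        start = time.monotonic()
        with urllib.request.urlopen(req) as resp:
            resp.readline()  # first SSE line approximates the first token
        latencies.append(time.monotonic() - start)
    return latencies


# Usage (with the server from Step 1 running):
#   lat = measure_ttft()
#   print(f"p50={percentile(lat, 50):.3f}s  p95={percentile(lat, 95):.3f}s")
```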
Step 3: Enable monitoring
vLLM exposes Prometheus metrics at the /metrics endpoint of the serving port:
curl http://localhost:8000/metrics | grep vllm
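The raw scrape can also be filtered in a short script instead of grep. A sketch: `filter_vllm_metrics` is a helper named here, and the sample scrape values are illustrative, not real output.

```python
def filter_vllm_metrics(raw_text):
    """Keep only vLLM metric sample lines from a Prometheus text-format scrape."""
    return [
        line for line in raw_text.splitlines()
        if line.startswith("vllm:")
    ]


# Example scrape fragment (illustrative values):
scrape = """\
# HELP vllm:num_requests_running Number of running requests.
vllm:num_requests_running 3
vllm:gpu_cache_usage_perc 0.42
python_gc_collections_total 7
"""
for line in filter_vllm_metrics(scrape):
    print(line)
```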
Key metrics to monitor:
vllm:time_to_first_token_seconds - latency
vllm:num_requests_running - active requests
vllm:gpu_cache_usage_perc - KV cache utilization
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
Step 5: Verify performance metrics
Check that the deployment meets the Step 2 targets: TTFT < 500ms and throughput > 100 req/sec.
Batch inference: for processing large datasets without the overhead of running a server.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
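The saved JSONL can be summarized afterwards without reloading the model. A small helper sketch; `summarize` is a name chosen here and assumes the record schema written in Step 4.

```python
import json


def summarize(path="results.jsonl"):
    """Aggregate token counts from a batch-inference results file."""
    records = []
    with open(path) as f:
        for line in f:
            records.append(json.loads(line))
    total = sum(r["tokens"] for r in records)
    return {
        "prompts": len(records),
        "total_tokens": total,
        "avg_tokens": total / len(records) if records else 0,
    }


# Usage, after the batch run above:
#   print(summarize("results.jsonl"))
```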
Quantization: fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
Common options are AWQ and GPTQ (4-bit, loaded from pre-quantized checkpoints) and FP8 (8-bit on recent GPUs); see the quantization guide in the references below for trade-offs.
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Result: the 70B model fits in ~40GB of VRAM
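The ~40GB figure follows from the 4-bit weight format. A back-of-the-envelope sketch, ignoring activations, the KV cache, and CUDA context overhead (which account for the remaining headroom):

```python
params = 70e9          # Llama-2-70B parameter count
bits_per_weight = 4    # AWQ stores weights in 4-bit
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weight_gb:.0f} GB")  # ~35 GB; cache and overhead add the rest
```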
Step 4: Verify accuracy
Test that outputs match the expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
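One low-tech way to run this comparison is to diff paired responses from the quantized and full-precision models. A sketch: `agreement` is a helper named here, and the placeholder response lists stand in for answers you would collect from the two servers yourself.

```python
from difflib import SequenceMatcher


def agreement(baseline_answers, quantized_answers):
    """Mean character-level similarity between paired responses (0.0-1.0)."""
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in zip(baseline_answers, quantized_answers)
    ]
    return sum(ratios) / len(ratios)


# Placeholder responses; in practice, query both servers with the same prompts.
baseline = ["Paris is the capital of France."]
quantized = ["Paris is the capital of France."]
print(f"agreement: {agreement(baseline, quantized):.2f}")  # 1.00 for identical text
```

A low agreement score on a fixed prompt set is a cue to re-check the quantized checkpoint, not a definitive accuracy metric.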
Use vLLM when you need high-throughput serving or large-batch offline inference; for one-off, single-prompt experimentation, a plain transformers pipeline may be simpler.
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
Or use quantization:
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses a power-of-two number of GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3
Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODEL
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs