tensorrt-llm by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill tensorrt-llm
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
Use TensorRT-LLM when:
Use vLLM instead when:
Use llama.cpp instead when:
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest
# pip install
pip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
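After installing, a quick import check confirms the wheel loads against your CUDA/TensorRT stack. A minimal sketch, assuming the package exposes __version__ as most Python packages do:

# Sanity-check the installation and report the library version
import tensorrt_llm

print(tensorrt_llm.__version__)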
from tensorrt_llm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)
# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096
# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
    }'
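Because trtllm-serve exposes an OpenAI-compatible endpoint (the curl call above hits /v1/chat/completions), any OpenAI client can talk to it. A minimal sketch using the openai Python package; the base URL and the placeholder API key are assumptions for a default local deployment:

from openai import OpenAI

# Point the client at the local trtllm-serve endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)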
from tensorrt_llm import LLM
# Load an FP8-quantized model (roughly 2× faster, about half the memory of FP16)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)
# Inference same as before
outputs = llm.generate(["Summarize this article..."])
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)
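As a rough memory sanity check for this configuration (assuming FP8 stores one byte per parameter and ignoring KV cache and activations):

# Back-of-envelope weight memory per GPU: 405B parameters in FP8, sharded across 8 GPUs
params = 405e9          # parameter count
bytes_per_param = 1     # FP8 = 1 byte per weight
gpus = 8
per_gpu_gb = params * bytes_per_param / gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB of weights per GPU")  # ~51 GB, leaving headroom on 80 GB-class cards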
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)
# Automatic in-flight batching for maximum throughput
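Each returned object keeps a reference to its prompt, so results can be matched back to inputs. A minimal sketch, assuming the vLLM-style result layout (output.prompt, output.outputs[0].text) used in the quickstart above:

# Pair each generated completion with the prompt that produced it
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")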
Benchmark configurations: Meta Llama 3-8B on a single H100 GPU; Llama 3-70B on 8× A100 80GB.
Weekly Installs: 164
Repository: https://github.com/davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: claude-code (128), opencode (126), gemini-cli (116), cursor (111), codex (110), antigravity (99)