tensorrt-llm by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill tensorrt-llm
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
Use TensorRT-LLM when:
Use vLLM instead when:
Use llama.cpp instead when:
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest
# pip install
pip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
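After installing, a quick import check confirms the wheel loads against your CUDA/TensorRT stack. A minimal sketch, assuming the package exposes __version__ as most Python packages do:

# Sanity-check the installation and report the library version
import tensorrt_llm

print(tensorrt_llm.__version__)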
from tensorrt_llm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)
# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096
# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
    }'
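Because trtllm-serve exposes an OpenAI-compatible endpoint (the curl call above hits /v1/chat/completions), any OpenAI client can talk to it. A minimal sketch using the openai Python package; the base URL and the placeholder API key are assumptions for a default local deployment:

from openai import OpenAI

# Point the client at the local trtllm-serve endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)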
from tensorrt_llm import LLM
# Load an FP8-quantized model (roughly 2× faster, about half the memory of FP16)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)
# Inference same as before
outputs = llm.generate(["Summarize this article..."])
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)
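As a rough memory sanity check for this configuration (assuming FP8 stores one byte per parameter and ignoring KV cache and activations):

# Back-of-envelope weight memory per GPU: 405B parameters in FP8, sharded across 8 GPUs
params = 405e9          # parameter count
bytes_per_param = 1     # FP8 = 1 byte per weight
gpus = 8
per_gpu_gb = params * bytes_per_param / gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB of weights per GPU")  # ~51 GB, leaving headroom on 80 GB-class cards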
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)
# Automatic in-flight batching for maximum throughput
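Each returned object keeps a reference to its prompt, so results can be matched back to inputs. A minimal sketch, assuming the vLLM-style result layout (output.prompt, output.outputs[0].text) used in the quickstart above:

# Pair each generated completion with the prompt that produced it
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")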
Benchmark configurations: Meta Llama 3-8B on a single H100 GPU; Llama 3-70B on 8× A100 80GB.
Weekly Installs: 164
Repository: https://github.com/davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: claude-code (128), opencode (126), gemini-cli (116), cursor (111), codex (110), antigravity (99)