sglang by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill sglang
用于 LLM 和 VLM 的高性能服务框架,具备 RadixAttention 以实现自动前缀缓存。
在以下情况使用 SGLang:
在以下情况使用 vLLM:
在以下情况使用 TensorRT-LLM:
# pip 安装(推荐)
pip install "sglang[all]"
# 使用 FlashInfer(更快,CUDA 11.8/12.1)
pip install sglang[all] flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# 从源码安装
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
# 基础服务器(Llama 3-8B)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000
# 启用 RadixAttention(自动前缀缓存)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000 \
--enable-radix-cache # 默认:启用
# 多 GPU(张量并行)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-70B-Instruct \
--tp 4 \
--port 30000
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
import sglang as sgl
# 设置后端
sgl.set_default_backend(sgl.OpenAI("http://localhost:30000/v1"))
# 简单生成
@sgl.function
def simple_gen(s, question):
    """Answer a question with a plain Q/A prompt template."""
    # Build the prompt, then let the model fill the answer slot.
    s += f"Q: {question}\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)
# 运行
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# 输出:"The capital of France is Paris."
import sglang as sgl
@sgl.function
def extract_person(s, text):
    """Extract person info from *text* as regex-constrained JSON."""
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"
    # The regex forces the output into a single JSON object with
    # exactly the name/age/occupation fields.
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}',
    )
# 运行
state = extract_person.run(
text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# 输出:{"name": "John Smith", "age": 35, "occupation": "software engineer"}
功能:自动缓存并跨请求重用公共前缀。
性能:
工作原理:
示例(具有系统提示词的智能体):
Request 1: [SYSTEM_PROMPT] + "What's the weather?"
→ 计算完整提示词(1000 个令牌)
Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight"
→ 重用系统提示词 KV 缓存(998 个令牌)
→ 仅计算 2 个新令牌
→ 快 5 倍!
@sgl.function
def structured_extraction(s, article):
    """Summarize *article* into JSON constrained by a JSON schema."""
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"
    # Schema: four required string fields; sentiment is a closed enum.
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        },
        "required": ["title", "author", "summary", "sentiment"],
    }
    s += sgl.gen("info", max_tokens=300, json_schema=schema)
state = structured_extraction.run(article="...")
print(state["info"])
# 输出:符合模式的有效 JSON
@sgl.function
def extract_email(s, text):
    """Pull an email address out of *text*, regex-constrained."""
    s += f"Extract email from: {text}\n"
    s += "Email: "
    # Constrain generation to a syntactically valid email address.
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    )
state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# 输出:"john.doe@example.com"
@sgl.function
def generate_code(s, description):
    """Generate Python code for *description*, constrained by an EBNF grammar."""
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"
    # Minimal Lark-style grammar: the output must be a single `def`.
    # NOTE: kept byte-identical — the grammar text is consumed at runtime.
    python_grammar = """
?start: function_def
function_def: "def" NAME "(" [parameters] "):" suite
parameters: parameter ("," parameter)*
parameter: NAME
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
"""
    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
import sglang as sgl
# 定义工具
tools = [
{
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
},
{
"name": "book_flight",
"description": "Book a flight",
"parameters": {
"type": "object",
"properties": {
"from": {"type": "string"},
"to": {"type": "string"},
"date": {"type": "string"}
}
}
}
]
@sgl.function
def agent_workflow(s, user_query, tools):
    """Run one agent turn: shared system prompt plus tool-enabled generation."""
    # The constant prefix below is cached by RadixAttention across calls.
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"
    s += f"User: {user_query}\n"
    s += "Assistant: "
    # SGLang formats the tool/function-call output; stop before the next turn.
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,
        stop=["User:", "\n\n"],
    )
# 多个查询重用系统提示词
state1 = agent_workflow.run(
user_query="What's the weather in NYC?",
tools=tools
)
# 第一次调用:计算完整的系统提示词
state2 = agent_workflow.run(
user_query="Book a flight to LA",
tools=tools
)
# 第二次调用:重用系统提示词(快 5 倍)
少样本提示(提示词中包含 10 个示例):
智能体工作流(1000 令牌的系统提示词):
JSON 解码:
| 工作负载 | vLLM | SGLang | 加速比 |
|---|---|---|---|
| 简单生成 | 2500 令牌/秒 | 2800 令牌/秒 | 1.12× |
| 少样本(10 个示例) | 500 令牌/秒 | 5000 令牌/秒 | 10× |
| 智能体(工具调用) | 800 令牌/秒 | 4000 令牌/秒 | 5× |
| JSON 输出 | 600 令牌/秒 | 2400 令牌/秒 | 4× |
@sgl.function
def multi_turn_chat(s, history, new_message):
    """Continue a chat; the growing prefix is reused via the KV cache."""
    s += "You are a helpful AI assistant.\n\n"
    # Replay earlier turns verbatim so their cached KV entries match.
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"
    # Only the new message (and the reply) require fresh computation.
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)
# 第 1 轮
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})
# 第 2 轮(重用第 1 轮的 KV 缓存)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")
# 仅计算新消息(快得多!)
# 第 3 轮(重用第 1 轮 + 第 2 轮的 KV 缓存)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# 随着历史记录增长,速度逐渐加快
# 使用草稿模型启动(快 2-3 倍)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-70B-Instruct \
--speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-num-steps 5
@sgl.function
def describe_image(s, image_path):
    """Describe the image at *image_path* with a vision-language model."""
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)
state = describe_image.run(image_path="photo.jpg")
print(state["description"])
# 自动批处理(连续批处理)
states = sgl.run_batch(
[
simple_gen.bind(question="What is AI?"),
simple_gen.bind(question="What is ML?"),
simple_gen.bind(question="What is DL?"),
]
)
# 所有 3 个请求在单个批次中处理(高效)
# 使用 OpenAI API 启动服务器
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000
# 使用 OpenAI 客户端
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "Hello"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# 与 OpenAI Python SDK 兼容
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello"}]
)
文本模型:
视觉模型:
来自 HuggingFace 的 100+ 模型
NVIDIA:A100, H100, L4, T4 (CUDA 11.8+) AMD:MI300, MI250 (ROCm 6.0+) Intel:带 GPU 的 Xeon(即将推出) Apple:通过 MPS 的 M1/M2/M3(实验性)
每周安装量
151
仓库
GitHub 星标数
22.6K
首次出现
2026 年 1 月 21 日
安全审计
安装于
claude-code: 123
opencode: 118
gemini-cli: 111
cursor: 108
codex: 100
antigravity: 96
High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching.
Use SGLang when:
Use vLLM instead when:
Use TensorRT-LLM instead when:
# pip install (recommended)
pip install "sglang[all]"
# With FlashInfer (faster, CUDA 11.8/12.1)
pip install sglang[all] flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
# Basic server (Llama 3-8B)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000
# With RadixAttention (automatic prefix caching)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000 \
--enable-radix-cache # Default: enabled
# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-70B-Instruct \
--tp 4 \
--port 30000
import sglang as sgl
# Set backend
sgl.set_default_backend(sgl.OpenAI("http://localhost:30000/v1"))
# Simple generation
@sgl.function
def simple_gen(s, question):
    """Generate an answer for *question* using a Q/A template."""
    # Prompt first, then the model completes the answer slot.
    s += f"Q: {question}\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)
# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
import sglang as sgl
@sgl.function
def extract_person(s, text):
    """Extract a person record from *text* as regex-constrained JSON."""
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"
    # Regex constraint pins the output to one JSON object with
    # name (string), age (integer), occupation (string).
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}',
    )
# Run
state = extract_person.run(
text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "age": 35, "occupation": "software engineer"}
What it does : Automatically caches and reuses common prefixes across requests.
Performance :
How it works :
Example (Agent with system prompt):
Request 1: [SYSTEM_PROMPT] + "What's the weather?"
→ Computes full prompt (1000 tokens)
Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight"
→ Reuses system prompt KV cache (998 tokens)
→ Only computes 2 new tokens
→ 5× faster!
@sgl.function
def structured_extraction(s, article):
    """Extract key facts from *article* as schema-valid JSON."""
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"
    # All four properties are required; sentiment is limited to a
    # three-value enum so downstream parsing is trivial.
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        },
        "required": ["title", "author", "summary", "sentiment"],
    }
    s += sgl.gen("info", max_tokens=300, json_schema=schema)
state = structured_extraction.run(article="...")
print(state["info"])
# Output: Valid JSON matching schema
@sgl.function
def extract_email(s, text):
    """Extract an email address from *text* under a regex constraint."""
    s += f"Extract email from: {text}\n"
    s += "Email: "
    # Generation may only emit characters matching a valid address.
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    )
state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# Output: "john.doe@example.com"
@sgl.function
def generate_code(s, description):
    """Emit a fenced Python snippet for *description*, grammar-constrained."""
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"
    # Lark-style EBNF restricting output to one function definition.
    # Grammar text is runtime data — reproduced byte-for-byte.
    python_grammar = """
?start: function_def
function_def: "def" NAME "(" [parameters] "):" suite
parameters: parameter ("," parameter)*
parameter: NAME
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
"""
    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
import sglang as sgl
# Define tools
tools = [
{
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
},
{
"name": "book_flight",
"description": "Book a flight",
"parameters": {
"type": "object",
"properties": {
"from": {"type": "string"},
"to": {"type": "string"},
"date": {"type": "string"}
}
}
}
]
@sgl.function
def agent_workflow(s, user_query, tools):
    """One tool-using agent turn; the fixed prefix is prefix-cache friendly."""
    # Identical system prompt across calls → RadixAttention reuses its KV cache.
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"
    s += f"User: {user_query}\n"
    s += "Assistant: "
    # `tools=` lets SGLang handle the function-call format; stop tokens
    # cut generation off before a new user turn starts.
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,
        stop=["User:", "\n\n"],
    )
# Multiple queries reuse system prompt
state1 = agent_workflow.run(
user_query="What's the weather in NYC?",
tools=tools
)
# First call: Computes full system prompt
state2 = agent_workflow.run(
user_query="Book a flight to LA",
tools=tools
)
# Second call: Reuses system prompt (5× faster)
Few-shot prompting (10 examples in prompt):
Agent workflows (1000-token system prompt):
JSON decoding :
| Workload | vLLM | SGLang | Speedup |
|---|---|---|---|
| Simple generation | 2500 tok/s | 2800 tok/s | 1.12× |
| Few-shot (10 examples) | 500 tok/s | 5000 tok/s | 10× |
| Agent (tool calls) | 800 tok/s | 4000 tok/s | 5× |
| JSON output | 600 tok/s | 2400 tok/s | 4× |
@sgl.function
def multi_turn_chat(s, history, new_message):
    """Append a turn to a running conversation (prefix-cache friendly)."""
    s += "You are a helpful AI assistant.\n\n"
    # Prior turns are replayed verbatim so cached prefixes stay valid.
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"
    # Only this new tail is genuinely new work for the server.
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)
# Turn 1
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})
# Turn 2 (reuses Turn 1 KV cache)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")
# Only computes new message (much faster!)
# Turn 3 (reuses Turn 1 + Turn 2 KV cache)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# Progressively faster as history grows
# Launch with draft model (2-3× faster)
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-70B-Instruct \
--speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-num-steps 5
@sgl.function
def describe_image(s, image_path):
    """Generate a detailed description of the image at *image_path*."""
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)
state = describe_image.run(image_path="photo.jpg")
print(state["description"])
# Automatic batching (continuous batching)
states = sgl.run_batch(
[
simple_gen.bind(question="What is AI?"),
simple_gen.bind(question="What is ML?"),
simple_gen.bind(question="What is DL?"),
]
)
# All 3 processed in single batch (efficient)
# Start server with OpenAI API
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000
# Use with OpenAI client
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "Hello"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Works with OpenAI Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello"}]
)
Text models :
Vision models :
100+ models from HuggingFace
NVIDIA : A100, H100, L4, T4 (CUDA 11.8+) AMD : MI300, MI250 (ROCm 6.0+) Intel : Xeon with GPU (coming soon) Apple : M1/M2/M3 via MPS (experimental)
Weekly Installs
151
Repository
GitHub Stars
22.6K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
claude-code: 123
opencode: 118
gemini-cli: 111
cursor: 108
codex: 100
antigravity: 96
React 组合模式指南:Vercel 组件架构最佳实践,提升代码可维护性
118,000 周安装
Vite 8 高级配置指南:基于Rolldown的性能优化、环境API与构建策略
136 周安装
Ant Design React 组件库使用指南 - 快速构建企业级React应用UI
118 周安装
Nano Banana Pro:基于Gemini 3 Pro的AI图像生成与编辑工具,支持1K/2K/4K分辨率
117 周安装
Godot 4 GDScript 设计模式与最佳实践 | 游戏开发架构、信号、场景优化指南
149 周安装
Agent Builder:AI智能体构建框架,简化客户服务、研究、运营等领域的AI应用开发
120 周安装
Tailwind CSS UI重构指南:基于《重构UI》的52条最佳实践与代码规范
140 周安装