speculative-decoding by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill speculative-decoding
Use Speculative Decoding when you need to cut LLM generation latency (faster per-token decoding) without changing the model's output quality.
Key Techniques : Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)
Papers : Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)
# Standard speculative decoding (transformers)
pip install transformers accelerate
# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .
# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .
# Optional: vLLM with speculative decoding
pip install vllm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
device_map="auto",
torch_dtype=torch.float16
)
# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto",
torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
# Generate with speculative decoding
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Transformers 4.36+ supports assisted generation
outputs = target_model.generate(
**inputs,
assistant_model=draft_model, # Enable speculative decoding
max_new_tokens=256,
do_sample=True,
temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
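To check the gain on your own hardware, the two models loaded above can be timed with and without the assistant model. This is only a rough measurement sketch (greedy decoding for repeatability; the `timed_generate` helper is not part of the skill):

```python
import time
import torch

def timed_generate(model, model_inputs, **kwargs):
    """Return (text, seconds) for one generate() call -- illustrative helper only."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**model_inputs, max_new_tokens=256, do_sample=False, **kwargs)
    torch.cuda.synchronize()
    return tokenizer.decode(out[0], skip_special_tokens=True), time.perf_counter() - start

_, t_plain = timed_generate(target_model, inputs)                               # baseline
_, t_spec = timed_generate(target_model, inputs, assistant_model=draft_model)   # speculative
print(f"plain: {t_plain:.1f}s  assisted: {t_spec:.1f}s  speedup: {t_plain / t_spec:.2f}x")
```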
import torch
from transformers import AutoTokenizer
from medusa.model.medusa_model import MedusaModel
# Load Medusa-enhanced model
model = MedusaModel.from_pretrained(
"FasterDecoding/medusa-vicuna-7b-v1.3", # Pre-trained with Medusa heads
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")
# Generate with Medusa (2-3× speedup)
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.medusa_generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
posterior_threshold=0.09, # Acceptance threshold
posterior_alpha=0.3, # Tree construction parameter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from lookahead.lookahead_decoding import LookaheadDecoding  # module path may differ by repo version
# Load model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Initialize lookahead decoding
lookahead = LookaheadDecoding(
model=model,
tokenizer=tokenizer,
window_size=15, # Lookahead window (W)
ngram_size=5, # N-gram size (N)
guess_size=5 # Number of parallel guesses
)
# Generate (1.5-2.3× speedup)
prompt = "Implement quicksort in Python:"
output = lookahead.generate(prompt, max_new_tokens=256)
print(output)
Idea : Use small draft model to generate candidates, large target model to verify in parallel.
Algorithm :
import torch
import torch.nn.functional as F

def speculative_decode(target_model, draft_model, input_ids, K=4):
    """One round of speculative decoding (simplified sketch)."""
    # 1. Draft model proposes K tokens; keep its per-step distributions
    draft = draft_model.generate(
        input_ids, max_new_tokens=K, do_sample=True,
        output_scores=True, return_dict_in_generate=True,
    )
    draft_tokens = draft.sequences[0, input_ids.shape[1]:]        # the proposed tokens
    p_draft = [F.softmax(s[0], dim=-1) for s in draft.scores]     # draft distribution per step
    # 2. Target model scores the prompt plus all drafts in ONE forward pass
    target_logits = target_model(draft.sequences).logits[0]
    p_target = F.softmax(target_logits[input_ids.shape[1] - 1:-1], dim=-1)
    # 3. Accept draft token i with probability min(1, p_target / p_draft)
    accepted = []
    for i in range(len(draft_tokens)):
        tok = draft_tokens[i]
        ratio = p_target[i, tok] / p_draft[i][tok]
        if torch.rand(()) < torch.clamp(ratio, max=1.0):
            accepted.append(tok.item())
        else:
            break  # rejected: resample this position from the target distribution instead
    return accepted
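A one-round driver for the sketch above might look like this (hypothetical prompt; it reuses the target model, draft model, and tokenizer loaded in the first example):

```python
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to("cuda")
accepted = speculative_decode(target_model, draft_model, input_ids, K=4)
print("accepted draft tokens:", tokenizer.decode(accepted))
```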
Performance : typically 1.5-2× wall-clock speedup over standard decoding with no quality loss (see the comparison table below), since every accepted token is one the target model would have produced anyway.
Source : arXiv 2401.10774 (2024)
Innovation : Add multiple prediction heads to existing model, predict future tokens without separate draft model.
Architecture :
Input → Base LLM (frozen) → Hidden State
├→ Head 1 (predicts token t+1)
├→ Head 2 (predicts token t+2)
├→ Head 3 (predicts token t+3)
└→ Head 4 (predicts token t+4)
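As a rough sketch of the layout in the diagram (hypothetical sizes; not the classes used by the Medusa repo), each head is a plain linear projection of the same final hidden state, so the extra guesses cost one matrix multiply each:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, num_heads = 4096, 32000, 4   # illustrative Llama-7B-like sizes
heads = nn.ModuleList(
    nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)
)

# One hidden state from the frozen base LLM at the current position...
h_t = torch.randn(1, hidden_size)
# ...yields parallel guesses for tokens t+1 .. t+4, one per head
candidates = [head(h_t).argmax(dim=-1).item() for head in heads]
print(candidates)
```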
Training : Medusa-1 freezes the base model and trains only the new heads, each with a cross-entropy loss against the token (i+1) positions ahead (see the training-loop example further below).
Tree-based Attention :
# Medusa constructs tree of candidates
# Example: Predict 2 steps ahead with top-2 per step
# Root
# / \
# T1a T1b (Step 1: 2 candidates)
# / \ / \
# T2a T2b T2c T2d (Step 2: 4 candidates total)
# Single forward pass evaluates entire tree!
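How one forward pass can score the whole tree: flatten the candidates into a single sequence and apply a tree attention mask so each node attends only to itself and its ancestors. A minimal sketch for the two-level, top-2 tree drawn above (illustrative; the Medusa code uses a larger, precomputed tree):

```python
import torch

# Depth-first flattening of the tree above: [T1a, T1b, T2a, T2b, T2c, T2d]
# parent[i] is the index of node i's parent; -1 means the root (current token).
parent = [-1, -1, 0, 0, 1, 1]
n = len(parent)

# Candidate i may attend to itself and its ancestors only, so sibling branches
# never see each other while all six nodes are scored in a single pass.
mask = torch.zeros(n, n, dtype=torch.bool)
for i in range(n):
    j = i
    while j != -1:
        mask[i, j] = True
        j = parent[j]
print(mask.int())
```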
Advantages : no separate draft model to serve, only the lightweight heads need training, and it reaches 2-3.6× speedup with no quality loss.
Source : ICML 2024
Core idea : Reformulate autoregressive decoding as solving a system of equations, which Jacobi iteration can solve in parallel.
Mathematical formulation :
Traditional: y_t = f(x, y_1, ..., y_{t-1}) (sequential)
Jacobi: y_t^{(k+1)} = f(x, y_1^{(k)}, ..., y_{t-1}^{(k)}) (parallel)
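To make the Jacobi view concrete, here is a toy greedy fixed-point loop (a small GPT-2 model is used only to keep the sketch runnable; the real Lookahead Decoding adds the n-gram pool and verification branch on top of this):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids
W = 8
guess = torch.full((1, W), tok.eos_token_id)      # arbitrary initial guesses for the window

with torch.no_grad():
    for _ in range(W):                            # converges in at most W iterations
        seq = torch.cat([prompt_ids, guess], dim=-1)
        logits = model(seq).logits
        # Every position is refined in parallel, conditioned on last iteration's guesses
        new_guess = logits[:, prompt_ids.shape[1] - 1:-1].argmax(dim=-1)
        if torch.equal(new_guess, guess):         # fixed point = greedy autoregressive output
            break
        guess = new_guess

print(tok.decode(guess[0]))
```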
Two branches :
Lookahead Branch : Generate n-grams in parallel
Verification Branch : Verify promising n-grams
# Simplified sketch of the two-branch loop; generate_ngram, verify, and
# generate_next stand in for the n-gram pool and verification logic.
class LookaheadDecoding:
def __init__(self, model, window_size=15, ngram_size=5):
self.model = model
self.W = window_size # Lookahead window
self.N = ngram_size # N-gram size
def generate_step(self, tokens):
# Lookahead branch: Generate W × N candidates
candidates = {}
for w in range(1, self.W + 1):
for n in range(1, self.N + 1):
# Generate n-gram starting at position w
ngram = self.generate_ngram(tokens, start=w, length=n)
candidates[(w, n)] = ngram
# Verification branch: Find matching n-grams
verified = []
for ngram in candidates.values():
if ngram[0] == tokens[-1]: # First token matches last input
if self.verify(tokens, ngram):
verified.append(ngram)
# Accept longest verified n-gram
return max(verified, key=len) if verified else [self.model.generate_next(tokens)]
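One possible way to flesh out the verify step (a sketch only: it runs one forward pass per candidate for readability, whereas the real implementation batches every candidate behind a single attention mask):

```python
import torch

def verify_candidates(model, tokens, ngram_pool):
    """Accept the longest cached n-gram whose tokens all match the model's greedy picks."""
    last = tokens[0, -1].item()
    best = []
    for cand in ngram_pool.get(last, []):          # candidates keyed by the last confirmed token
        seq = torch.cat([tokens, torch.tensor([cand], dtype=torch.long)], dim=-1)
        with torch.no_grad():
            preds = model(seq).logits[0, tokens.shape[1] - 1:-1].argmax(dim=-1)
        matched = []
        for tok_id, pred in zip(cand, preds.tolist()):
            if tok_id != pred:
                break
            matched.append(tok_id)
        if len(matched) > len(best):
            best = matched
    return best                                    # [] means fall back to normal decoding
```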
Performance : 1.5-2.3× speedup with no training and no draft model (see the comparison table below).
| Method | Speedup | Training Needed | Draft Model | Quality Loss |
|---|---|---|---|---|
| Draft Model Speculative | 1.5-2× | No | Yes (external) | None |
| Medusa | 2-3.6× | Minimal (heads only) | No (built-in heads) | None |
| Lookahead | 1.5-2.3× | None | No | None |
| Naive Batching | 1.2-1.5× | No | No | None |
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM
from medusa.model.medusa_model import MedusaModel
from medusa.model.kv_cache import initialize_past_key_values
# 1. Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"lmsys/vicuna-7b-v1.3",
torch_dtype=torch.float16
)
# 2. Add Medusa heads
num_heads = 4
medusa_heads = nn.ModuleList([
nn.Linear(base_model.config.hidden_size, base_model.config.vocab_size, bias=False)
for _ in range(num_heads)
])
# 3. Training loop (freeze base model for Medusa-1)
for param in base_model.parameters():
param.requires_grad = False # Freeze base
optimizer = torch.optim.Adam(medusa_heads.parameters(), lr=1e-3)
for batch in dataloader:
# Forward pass
hidden_states = base_model(**batch, output_hidden_states=True).hidden_states[-1]
# Predict future tokens with each head
loss = 0
for i, head in enumerate(medusa_heads):
logits = head(hidden_states)
# Target: tokens shifted by (i+1) positions
target = batch['input_ids'][:, i+1:]
        loss += F.cross_entropy(logits[:, :-(i + 1)].reshape(-1, logits.size(-1)), target.reshape(-1))
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Use Medusa as draft model for speculative decoding
draft_medusa = MedusaModel.from_pretrained("medusa-vicuna-7b")
target_model = AutoModelForCausalLM.from_pretrained("vicuna-33b")
# Standalone: the draft can propose candidates directly with its Medusa heads
draft_tokens = draft_medusa.medusa_generate(prompt, max_new_tokens=5)
# In practice, assisted generation drives the drafting internally and the target
# verifies the candidates in a single forward pass
inputs = tokenizer(prompt, return_tensors="pt").to(target_model.device)
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_medusa,  # Use Medusa as draft
    max_new_tokens=256
)
# Combines benefits: Medusa speed + large model quality
def select_draft_model(target_model_size):
"""Select optimal draft model for speculative decoding."""
# Rule: Draft should be 5-10× smaller
if target_model_size == "70B":
return "7B" # 10× smaller
elif target_model_size == "33B":
return "7B" # 5× smaller
elif target_model_size == "13B":
return "1B" # 13× smaller
else:
return None # Target too small, use Medusa/Lookahead instead
# Example
draft = select_draft_model("70B")
# Returns "7B" → Use Llama-2-7b as draft for Llama-2-70b
# New deployment → Medusa (best overall speedup, no draft model)
if deploying_new_model:
use_method = "Medusa"
# Existing deployment with small model available → Draft speculative
elif have_small_version_of_model:
use_method = "Draft Model Speculative"
# Want zero training/setup → Lookahead
elif want_plug_and_play:
use_method = "Lookahead Decoding"
Draft Model Speculative :
# K = number of speculative tokens
K = 4 # Good default
K = 2 # Conservative (higher acceptance)
K = 8 # Aggressive (lower acceptance, but more when accepted)
# Rule: Larger K → more speedup IF draft model is good
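A back-of-the-envelope way to reason about K: if each draft token is accepted independently with probability alpha, the expected number of tokens emitted per target forward pass is (1 - alpha^(K+1)) / (1 - alpha), the standard result from the speculative sampling analysis. The numbers below are purely illustrative:

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """Expected tokens per target-model forward pass under i.i.d. acceptance (idealized)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for K in (2, 4, 8):
    for alpha in (0.6, 0.8, 0.9):
        print(f"K={K} alpha={alpha}: {expected_tokens_per_pass(alpha, K):.2f} tokens/pass")
```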
Medusa :
# Posterior threshold (acceptance confidence)
posterior_threshold = 0.09 # Standard (from paper)
posterior_threshold = 0.05 # More conservative (slower, higher quality)
posterior_threshold = 0.15 # More aggressive (faster, may degrade quality)
# Tree depth (how many steps ahead)
medusa_choices = [[0], [0, 0], [0, 1], [0, 0, 0]] # Depth 3 (standard)
Lookahead :
# Window size W (lookahead distance)
# N-gram size N (context for generation)
# 7B model (more resources)
W, N = 15, 5
# 13B model (moderate)
W, N = 10, 5
# 33B+ model (limited resources)
W, N = 7, 5
# vLLM with speculative decoding
from vllm import LLM, SamplingParams
# Initialize with draft model
llm = LLM(
model="meta-llama/Llama-2-70b-hf",
speculative_model="meta-llama/Llama-2-7b-hf", # Draft model
num_speculative_tokens=5,
use_v2_block_manager=True,
)
# Generate
prompts = ["Tell me about AI:", "Explain quantum physics:"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.outputs[0].text)
references/draft_model.md - Draft model selection and training
references/medusa.md - Medusa architecture and training
references/lookahead.md - Lookahead decoding implementation details