rwkv-architecture by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill rwkv-architecture
RWKV (pronounced "RwaKuv") combines Transformer parallelization (training) with RNN efficiency (inference).
Installation:
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
# Install RWKV
pip install rwkv
Basic usage (GPT mode + RNN mode):
import os
from rwkv.model import RWKV
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1' # Use CUDA kernel for speed
# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)
# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy()) # Logits
# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None) # First 2 tokens
out, state = model.forward([1563], state) # Next token
out, state = model.forward([310, 247], state) # Last tokens
print(out.detach().cpu().numpy()) # Same logits as above!
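Why the two modes agree can be illustrated with a toy one-channel recurrence. This is not the real WKV kernel, just the state-carrying idea: processing a sequence in one pass or in chunks with the state threaded through performs the exact same operations, so the results match.

```python
# Toy stand-in for RWKV's per-layer state: a single time-decay channel
# s_t = decay * s_{t-1} + x_t (NOT the real WKV kernel).
def run(xs, state=None, decay=0.9):
    s = 0.0 if state is None else state
    for x in xs:
        s = decay * s + x
    return s

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
full = run(xs)              # "GPT mode": whole sequence at once
s = run(xs[:2])             # "RNN mode": first 2 tokens
s = run(xs[2:3], s)         # next token, reusing state
part = run(xs[3:], s)       # last tokens
assert full == part         # identical result: same float ops, same order
```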
Efficient token-by-token generation:
from rwkv.model import RWKV
from rwkv.utils import PIPELINE
model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")
# Initial prompt
prompt = "The future of AI is"
state = None
# Feed the whole prompt through the model to build up state
out, state = pipeline.model.forward(pipeline.encode(prompt), state)
# Generate token by token, feeding each sampled token back in
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = pipeline.model.forward([token], state)
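`pipeline.sample_logits` applies temperature and nucleus (top-p) sampling to the output logits. A rough pure-Python sketch of that idea (the real implementation works on tensors and differs in detail):

```python
import math
import random

def sample_logits_sketch(logits, temperature=1.0, top_p=0.85):
    # Softmax with temperature (shifted by the max for stability)
    m = max(logits)
    probs = [math.exp((l - m) / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Nucleus filtering: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized kept set
    r = random.random() * cum
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]

# With one dominant logit and a small top_p, sampling is deterministic
print(sample_logits_sketch([10.0, 0.0, 0.0], top_p=0.5))  # 0
```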
Key advantage: constant memory per token (no growing KV cache).
Process million-token sequences:
model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')
# Process a very long document (load_document and chunks are
# placeholder helpers, not part of the rwkv package)
state = None
long_document = load_document()  # e.g., 1M tokens as a list of token ids
# Stream through the entire document, carrying state between chunks
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)
# state now summarizes the entire 1M-token document
# Memory usage: O(1) in sequence length (constant, not O(n)!)
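The loop above assumes `chunks` and `load_document` helpers that the rwkv package does not provide; a minimal `chunks` could look like this:

```python
def chunks(tokens, chunk_size=1024):
    """Yield successive fixed-size slices of a token list."""
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]

print(list(chunks(list(range(10)), chunk_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```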
Standard fine-tuning workflow:
# Training sketch: fine-tuning runs on the RWKV-LM training repo
# (github.com/BlinkDL/RWKV-LM), not the inference-only rwkv pip
# package; the model class and dataloader come from that repo.
import pytorch_lightning as pl

# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}
# Set up trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)
# Train (build model and train_dataloader per the RWKV-LM repo)
model = build_rwkv_model(config)      # hypothetical helper wrapping the repo's model class
trainer.fit(model, train_dataloader)  # train_dataloader: your tokenized dataset
Memory comparison (1M token sequence):
# Transformer (GPT)
# Attention compute: O(n²); KV cache memory grows O(n)
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 × 2 bytes (fp16) = ~400GB (impractical!)
# RWKV
# Memory: O(1) per token
# State: hidden_dim × n_layers = 4096 × 24 × 4 bytes (fp32) = ~400KB
# (the real state is a few such vectors per layer: a small constant factor)
# 1,000,000× more efficient!
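These figures are back-of-envelope; reproducing the arithmetic, assuming an fp16 KV cache and a single fp32 state vector per layer (the real RWKV state is a few such vectors per layer, so multiply by a small constant):

```python
n_tokens, hidden, layers = 1_000_000, 4096, 24

kv_cache_bytes = n_tokens * hidden * layers * 2 * 2  # keys+values, fp16
state_bytes = hidden * layers * 4                    # one fp32 vector per layer

print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")    # ~393 GB
print(f"RWKV state: {state_bytes / 1e3:.0f} KB")     # ~393 KB
print(f"ratio: {kv_cache_bytes // state_bytes:,}x")  # 1,000,000x
```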
Speed comparison (inference):
# Transformer: O(n) per token (quadratic overall)
# First token: 1 computation
# Second token: 2 computations
# ...
# 1000th token: 1000 computations
# RWKV: O(1) per token (linear overall)
# Every token: 1 computation
# 1000th token: 1 computation (same as first!)
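Counting the total work for generating n tokens makes the asymptotics concrete:

```python
def transformer_ops(n):
    # With a KV cache, token t attends over t previous positions,
    # so total work is 1 + 2 + ... + n = n(n+1)/2: quadratic overall
    return sum(range(1, n + 1))

def rwkv_ops(n):
    # Constant work per token: linear overall
    return n

print(transformer_ops(1000))  # 500500
print(rwkv_ops(1000))         # 1000
```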
Use RWKV when: you need very long contexts, streaming inference, or constant-memory generation.
Key advantages: O(1) memory and compute per generated token, parallelizable Transformer-style training, no growing KV cache.
Use alternatives instead: when sequences are short enough that a standard Transformer's quadratic attention cost is acceptable.
Issue: Out of memory during training
Use gradient checkpointing and DeepSpeed:
trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)
Issue: Slow inference
Enable CUDA kernel:
os.environ["RWKV_CUDA_ON"] = '1'
Issue: Model not loading
Check model path and strategy:
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)
Issue: State management in RNN mode
Always pass state between forward calls:
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None) # No context from tokens1!
# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state) # Has context from tokens1
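Because the state is an ordinary Python object, you can also snapshot it to branch generation from a shared prefix without re-processing the prompt. A sketch with a toy stand-in state (for real RWKV states, which are lists of tensors, `copy.deepcopy` works the same way, though cloning tensors explicitly is more idiomatic):

```python
import copy

# Toy stand-in for an RWKV state (really a list of tensors)
state = {"layer0": [0.1, 0.2]}

snapshot = copy.deepcopy(state)  # checkpoint after processing the prefix
state["layer0"][0] = 9.9         # ...continue generating one branch

state = snapshot                 # rewind: try a different continuation
print(state["layer0"][0])        # 0.1
```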
Time-mixing and channel-mixing: See references/architecture-details.md for the WKV operation, time-decay mechanism, and receptance gates.
State management: See references/state-management.md for the att_x_prev, att_kv, and ffn_x_prev states, and numerical stability considerations.
RWKV-7 improvements: See references/rwkv7.md for the latest architectural improvements (March 2025) and multimodal capabilities.
Performance (vs Transformers):
Weekly installs: 70 · GitHub stars: 5.7K · First seen: Feb 7, 2026
Security audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (61), codex (60), cursor (60), claude-code (59), gemini-cli (59), github-copilot (58)