rwkv-architecture by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill rwkv-architecture
RWKV (pronounced "RwaKuv") combines Transformer parallelization (training) with RNN efficiency (inference).
Installation:
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
# Install RWKV
pip install rwkv
Basic usage (GPT mode + RNN mode):
import os
from rwkv.model import RWKV
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1' # Use CUDA kernel for speed
# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)
# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy()) # Logits
# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None) # First 2 tokens
out, state = model.forward([1563], state) # Next token
out, state = model.forward([310, 247], state) # Last tokens
print(out.detach().cpu().numpy()) # Same logits as above!
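Why the two modes agree can be illustrated with a toy one-channel recurrence. This is not the real WKV kernel, just the state-carrying idea: processing a sequence in one pass or in chunks with the state threaded through performs the exact same operations, so the results match.

```python
# Toy stand-in for RWKV's per-layer state: a single time-decay channel
# s_t = decay * s_{t-1} + x_t (NOT the real WKV kernel).
def run(xs, state=None, decay=0.9):
    s = 0.0 if state is None else state
    for x in xs:
        s = decay * s + x
    return s

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
full = run(xs)              # "GPT mode": whole sequence at once
s = run(xs[:2])             # "RNN mode": first 2 tokens
s = run(xs[2:3], s)         # next token, reusing state
part = run(xs[3:], s)       # last tokens
assert full == part         # identical result: same float ops, same order
```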
Efficient token-by-token generation:
from rwkv.model import RWKV
from rwkv.utils import PIPELINE
model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")
# Initial prompt
prompt = "The future of AI is"
state = None
# Feed the whole prompt through the model to build up state
out, state = pipeline.model.forward(pipeline.encode(prompt), state)
# Generate token by token, feeding each sampled token back in
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = pipeline.model.forward([token], state)
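`pipeline.sample_logits` applies temperature and nucleus (top-p) sampling to the output logits. A rough pure-Python sketch of that idea (the real implementation works on tensors and differs in detail):

```python
import math
import random

def sample_logits_sketch(logits, temperature=1.0, top_p=0.85):
    # Softmax with temperature (shifted by the max for stability)
    m = max(logits)
    probs = [math.exp((l - m) / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Nucleus filtering: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized kept set
    r = random.random() * cum
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]

# With one dominant logit and a small top_p, sampling is deterministic
print(sample_logits_sketch([10.0, 0.0, 0.0], top_p=0.5))  # 0
```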
Key advantage: constant memory per token (no growing KV cache).
Process million-token sequences:
model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')
# Process a very long document (load_document and chunks are
# placeholder helpers, not part of the rwkv package)
state = None
long_document = load_document()  # e.g., 1M tokens as a list of token ids
# Stream through the entire document, carrying state between chunks
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)
# state now summarizes the entire 1M-token document
# Memory usage: O(1) in sequence length (constant, not O(n)!)
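The loop above assumes `chunks` and `load_document` helpers that the rwkv package does not provide; a minimal `chunks` could look like this:

```python
def chunks(tokens, chunk_size=1024):
    """Yield successive fixed-size slices of a token list."""
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]

print(list(chunks(list(range(10)), chunk_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```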
Standard fine-tuning workflow:
# Training sketch: fine-tuning runs on the RWKV-LM training repo
# (github.com/BlinkDL/RWKV-LM), not the inference-only rwkv pip
# package; the model class and dataloader come from that repo.
import pytorch_lightning as pl

# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}
# Set up trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)
# Train (build model and train_dataloader per the RWKV-LM repo)
model = build_rwkv_model(config)      # hypothetical helper wrapping the repo's model class
trainer.fit(model, train_dataloader)  # train_dataloader: your tokenized dataset
Memory comparison (1M token sequence):
# Transformer (GPT)
# Attention compute: O(n²); KV cache memory grows O(n)
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 × 2 bytes (fp16) = ~400GB (impractical!)
# RWKV
# Memory: O(1) per token
# State: hidden_dim × n_layers = 4096 × 24 × 4 bytes (fp32) = ~400KB
# (the real state is a few such vectors per layer: a small constant factor)
# 1,000,000× more efficient!
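These figures are back-of-envelope; reproducing the arithmetic, assuming an fp16 KV cache and a single fp32 state vector per layer (the real RWKV state is a few such vectors per layer, so multiply by a small constant):

```python
n_tokens, hidden, layers = 1_000_000, 4096, 24

kv_cache_bytes = n_tokens * hidden * layers * 2 * 2  # keys+values, fp16
state_bytes = hidden * layers * 4                    # one fp32 vector per layer

print(f"KV cache: {kv_cache_bytes / 1e9:.0f} GB")    # ~393 GB
print(f"RWKV state: {state_bytes / 1e3:.0f} KB")     # ~393 KB
print(f"ratio: {kv_cache_bytes // state_bytes:,}x")  # 1,000,000x
```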
Speed comparison (inference):
# Transformer: O(n) per token (quadratic overall)
# First token: 1 computation
# Second token: 2 computations
# ...
# 1000th token: 1000 computations
# RWKV: O(1) per token (linear overall)
# Every token: 1 computation
# 1000th token: 1 computation (same as first!)
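Counting the total work for generating n tokens makes the asymptotics concrete:

```python
def transformer_ops(n):
    # With a KV cache, token t attends over t previous positions,
    # so total work is 1 + 2 + ... + n = n(n+1)/2: quadratic overall
    return sum(range(1, n + 1))

def rwkv_ops(n):
    # Constant work per token: linear overall
    return n

print(transformer_ops(1000))  # 500500
print(rwkv_ops(1000))         # 1000
```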
Use RWKV when: you need very long contexts, streaming inference, or constant-memory generation.
Key advantages: O(1) memory and compute per generated token, parallelizable Transformer-style training, no growing KV cache.
Use alternatives instead: when sequences are short enough that a standard Transformer's quadratic attention cost is acceptable.
Issue: Out of memory during training
Use gradient checkpointing and DeepSpeed:
trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)
Issue: Slow inference
Enable CUDA kernel:
os.environ["RWKV_CUDA_ON"] = '1'
Issue: Model not loading
Check model path and strategy:
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)
Issue: State management in RNN mode
Always pass state between forward calls:
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None) # No context from tokens1!
# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state) # Has context from tokens1
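Because the state is an ordinary Python object, you can also snapshot it to branch generation from a shared prefix without re-processing the prompt. A sketch with a toy stand-in state (for real RWKV states, which are lists of tensors, `copy.deepcopy` works the same way, though cloning tensors explicitly is more idiomatic):

```python
import copy

# Toy stand-in for an RWKV state (really a list of tensors)
state = {"layer0": [0.1, 0.2]}

snapshot = copy.deepcopy(state)  # checkpoint after processing the prefix
state["layer0"][0] = 9.9         # ...continue generating one branch

state = snapshot                 # rewind: try a different continuation
print(state["layer0"][0])        # 0.1
```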
Time-mixing and channel-mixing: See references/architecture-details.md for the WKV operation, time-decay mechanism, and receptance gates.
State management: See references/state-management.md for the att_x_prev, att_kv, and ffn_x_prev states, and numerical stability considerations.
RWKV-7 improvements: See references/rwkv7.md for the latest architectural improvements (March 2025) and multimodal capabilities.
Performance (vs Transformers):
Weekly installs: 70 · GitHub stars: 5.7K · First seen: Feb 7, 2026
Security audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (61), codex (60), cursor (60), claude-code (59), gemini-cli (59), github-copilot (58)