sparse-autoencoder-training by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill sparse-autoencoder-training
SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs) - a technique for decomposing polysemantic neural network activations into sparse, interpretable features. Based on Anthropic's groundbreaking research on monosemanticity.
GitHub: jbloomAus/SAELens (1,100+ stars)
Individual neurons in neural networks are polysemantic - they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than they have neurons, making interpretability difficult.
SAEs solve this by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept.
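The decomposition can be sketched with a toy ReLU autoencoder (illustrative only, not SAELens's implementation; all names and dimensions here are made up):

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Minimal sparse autoencoder: dense activation -> sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes negatively-aligned features, producing a sparse code
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

torch.manual_seed(0)
sae = ToySAE(d_model=16, d_sae=64)  # d_sae >> d_model
x = torch.randn(4, 16)              # 4 dense activation vectors
f = sae.encode(x)                   # [4, 64] sparse, non-negative features
x_hat = sae.decode(f)               # [4, 16] reconstruction
```

Training then pushes `x_hat` toward `x` while penalizing the size of `f`, so each input ends up explained by a few active features.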
Use SAELens when you need to:
Consider alternatives when:
```shell
pip install sae-lens
```
Requirements: Python 3.10+, transformer-lens>=2.0.0
SAEs are trained to reconstruct model activations through a sparse bottleneck:

```
Input Activation → Encoder →  Sparse Features  → Decoder → Reconstructed Activation
    (d_model)         ↓      (d_sae >> d_model)     ↓             (d_model)
               sparsity penalty              reconstruction loss
```

Loss function: `MSE(original, reconstructed) + L1_coefficient × L1(features)`
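As a quick sketch of that loss (variable names here are illustrative):

```python
import torch

def sae_loss(original, reconstructed, features, l1_coefficient=8e-5):
    """MSE reconstruction term plus an L1 sparsity penalty on feature activations."""
    mse = (original - reconstructed).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()  # average L1 norm of the feature vector
    return mse + l1_coefficient * l1

original = torch.tensor([[1.0, 2.0]])
reconstructed = torch.tensor([[1.0, 1.0]])
features = torch.tensor([[0.5, 0.0, 2.0]])
print(sae_loss(original, reconstructed, features, l1_coefficient=0.01).item())
# → 0.525 (MSE of 0.5 plus 0.01 × L1 of 2.5)
```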
In "Towards Monosemanticity", human evaluators found 70% of SAE features genuinely interpretable. Features discovered include:
```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# 1. Load model and pre-trained SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# 2. Get model activations
tokens = model.to_tokens("The capital of France is Paris")
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [batch, pos, d_model]

# 3. Encode to SAE features
sae_features = sae.encode(activations)  # [batch, pos, d_sae]
print(f"Active features: {(sae_features > 0).sum()}")

# 4. Find top features for each position
for pos in range(tokens.shape[1]):
    top_features = sae_features[0, pos].topk(5)
    token = model.to_str_tokens(tokens[0, pos:pos+1])[0]
    print(f"Token '{token}': features {top_features.indices.tolist()}")

# 5. Reconstruct activations
reconstructed = sae.decode(sae_features)
reconstruction_error = (activations - reconstructed).norm()
```
| Release | Model | Layers |
|---|---|---|
| `gpt2-small-res-jb` | GPT-2 Small | Multiple residual-stream layers |
| `gemma-2b-res` | Gemma 2B | Residual stream |
| Various on HuggingFace | Search tag `saelens` | Various |
```python
from sae_lens import SAE, LanguageModelSAERunnerConfig, SAETrainingRunner

# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,  # Model dimension
    # SAE architecture
    architecture="standard",  # or "gated", "topk"
    d_sae=768 * 8,  # Expansion factor of 8
    activation_fn="relu",
    # Training
    lr=4e-4,
    l1_coefficient=8e-5,  # Sparsity penalty
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=100_000_000,
    # Data
    dataset_path="monology/pile-uncopyrighted",
    context_size=128,
    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",
    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,
)

# 2. Train
trainer = SAETrainingRunner(cfg)
sae = trainer.run()

# 3. Evaluate
print(f"L0 (avg active features): {trainer.metrics['l0']}")
print(f"CE Loss Recovered: {trainer.metrics['ce_loss_score']}")
```
| Parameter | Typical Value | Effect |
|---|---|---|
| `d_sae` | 4-16× `d_model` | More features, higher capacity |
| `l1_coefficient` | 5e-5 to 1e-4 | Higher = sparser, less accurate |
| `lr` | 1e-4 to 1e-3 | Standard optimizer learning rate |
| `l1_warm_up_steps` | 500-2000 | Prevents early feature death |
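The L1 warm-up can be sketched as a linear ramp (illustrative; SAELens's actual schedule may differ):

```python
def l1_schedule(step: int, l1_coefficient: float, warm_up_steps: int) -> float:
    """Linearly ramp the L1 penalty from 0 to its full value over warm_up_steps."""
    if warm_up_steps == 0:
        return l1_coefficient
    return l1_coefficient * min(1.0, step / warm_up_steps)

print(l1_schedule(500, 8e-5, 1000))  # halfway through warm-up → 4e-05
```

Ramping the penalty gives features time to form before sparsity pressure kicks in, which is why a warm-up of 0 tends to kill features early.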
| Metric | Target | Meaning |
|---|---|---|
| L0 | 50-200 | Average active features per token |
| CE Loss Score | 80-95% | Cross-entropy recovered vs original |
| Dead Features | <5% | Features that never activate |
| Explained Variance | >90% | Reconstruction quality |
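These metrics can be computed from a batch of features and reconstructions roughly as follows (a sketch; SAELens tracks them internally during training):

```python
import torch

def sae_metrics(original, reconstructed, features):
    """Compute L0, dead-feature fraction, and explained variance for one batch."""
    l0 = (features > 0).float().sum(dim=-1).mean()          # avg active features per token
    dead = ((features > 0).sum(dim=0) == 0).float().mean()  # fraction never active in batch
    resid_var = (original - reconstructed).pow(2).sum()
    total_var = (original - original.mean(dim=0)).pow(2).sum()
    explained_variance = 1 - resid_var / total_var
    return l0.item(), dead.item(), explained_variance.item()

features = torch.tensor([[1.0, 0.0, 2.0, 0.0],
                         [0.0, 0.0, 3.0, 0.0]])
original = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
reconstructed = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
l0, dead, ev = sae_metrics(original, reconstructed, features)
print(l0, dead, ev)  # → 1.5 0.5 1.0
```

Note that the dead-feature estimate only counts features inactive within this batch; in practice it is tracked over many batches.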
```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# Find what activates a specific feature
feature_idx = 1234
test_texts = [
    "The scientist conducted an experiment",
    "I love chocolate cake",
    "The code compiles successfully",
    "Paris is beautiful in spring",
]

for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache["resid_pre", 8])
    activation = features[0, :, feature_idx].max().item()
    print(f"{activation:.3f}: {text}")
```
```python
def steer_with_feature(model, sae, prompt, feature_idx, strength=5.0):
    """Add an SAE feature direction to the residual stream during generation."""
    tokens = model.to_tokens(prompt)

    # Get the feature direction from the decoder
    feature_direction = sae.W_dec[feature_idx]  # [d_model]

    def steering_hook(activation, hook):
        # Add the scaled feature direction at all positions
        activation += strength * feature_direction
        return activation

    # Generate with steering; model.generate does not accept fwd_hooks,
    # so attach the hook via the hooks context manager
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_pre", steering_hook)]):
        output = model.generate(tokens, max_new_tokens=50)
    return model.to_string(output[0])
```
```python
# Which features most affect a specific output?
tokens = model.to_tokens("The capital of France is")
_, cache = model.run_with_cache(tokens)

# Get features at the final position
features = sae.encode(cache["resid_pre", 8])[0, -1]  # [d_sae]

# Logit attribution per feature:
# feature contribution = feature_activation × decoder_weight × unembedding
W_dec = sae.W_dec  # [d_sae, d_model]
W_U = model.W_U    # [d_model, vocab]

# Contribution to the "Paris" logit
paris_token = model.to_single_token(" Paris")
feature_contributions = features * (W_dec @ W_U[:, paris_token])
top_features = feature_contributions.topk(10)

print("Top features for 'Paris' prediction:")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")
```
```python
# WRONG: no warm-up, features die early
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,
    l1_warm_up_steps=0,  # Bad!
)

# RIGHT: warm up the L1 penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=8e-5,
    l1_warm_up_steps=1000,  # Gradually increase the penalty
    use_ghost_grads=True,   # Revive dead features
)
```
```python
# Reduce the sparsity penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=5e-5,  # Lower = better reconstruction
    d_sae=768 * 16,       # More capacity
)

# Increase sparsity (higher L1)
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,  # Higher = sparser, more interpretable
)

# Or use the TopK architecture
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn_kwargs={"k": 50},  # Exactly 50 active features
)
```
```python
cfg = LanguageModelSAERunnerConfig(
    train_batch_size_tokens=2048,  # Reduce batch size
    store_batch_size_prompts=4,    # Fewer prompts in buffer
    n_batches_in_buffer=8,         # Smaller activation buffer
)
```
Browse pre-trained SAE features at neuronpedia.org. Features are indexed by SAE ID; for example, gpt2-small layer 8, feature 1234 lives at neuronpedia.org/gpt2-small/8-res-jb/1234.
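A tiny helper for building those URLs (hypothetical helper; the path scheme follows the example above):

```python
def neuronpedia_url(model: str, layer: int, sae_set: str, feature_idx: int) -> str:
    """Build a Neuronpedia feature URL from the model/layer-set/feature path scheme."""
    return f"https://neuronpedia.org/{model}/{layer}-{sae_set}/{feature_idx}"

print(neuronpedia_url("gpt2-small", 8, "res-jb", 1234))
# → https://neuronpedia.org/gpt2-small/8-res-jb/1234
```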
| Class | Purpose |
|---|---|
| `SAE` | Sparse autoencoder model |
| `LanguageModelSAERunnerConfig` | Training configuration |
| `SAETrainingRunner` | Training loop manager |
| `ActivationsStore` | Activation collection and batching |
| `HookedSAETransformer` | TransformerLens + SAE integration |
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for SAE, TrainingSAE, configurations |
| references/tutorials.md | Step-by-step tutorials for training, analysis, steering |
| Architecture | Description | Use Case |
|---|---|---|
| Standard | ReLU + L1 penalty | General purpose |
| Gated | Learned gating mechanism | Better sparsity control |
| TopK | Exactly K active features | Consistent sparsity |
```python
# TopK SAE (exactly 50 features active)
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn="topk",
    activation_fn_kwargs={"k": 50},
)
```
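The TopK activation itself can be sketched like this (illustrative, not SAELens's implementation):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per row; zero out everything else."""
    values, indices = pre_acts.topk(k, dim=-1)
    out = torch.zeros_like(pre_acts)
    out.scatter_(-1, indices, torch.relu(values))  # ReLU guards against negative survivors
    return out

pre = torch.tensor([[0.1, 3.0, -1.0, 2.0, 0.5]])
print(topk_activation(pre, k=2))  # only 3.0 and 2.0 survive
```

Because exactly k features pass through, L0 is fixed by construction and no L1 penalty is needed.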
Weekly Installs: 147
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (119), claude-code (116), gemini-cli (113), cursor (106), codex (100), antigravity (93)