sparse-autoencoder-training by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill sparse-autoencoder-training
SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs) - a technique for decomposing polysemantic neural network activations into sparse, interpretable features. Based on Anthropic's groundbreaking research on monosemanticity.
GitHub: jbloomAus/SAELens (1,100+ stars)
Individual neurons in neural networks are polysemantic - they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than they have neurons, making interpretability difficult.
SAEs solve this by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept.
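The decomposition can be sketched with a toy ReLU autoencoder (illustrative only, not SAELens's implementation; all names and dimensions here are made up):

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Minimal sparse autoencoder: dense activation -> sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes negatively-aligned features, producing a sparse code
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

torch.manual_seed(0)
sae = ToySAE(d_model=16, d_sae=64)  # d_sae >> d_model
x = torch.randn(4, 16)              # 4 dense activation vectors
f = sae.encode(x)                   # [4, 64] sparse, non-negative features
x_hat = sae.decode(f)               # [4, 16] reconstruction
```

Training then pushes `x_hat` toward `x` while penalizing the size of `f`, so each input ends up explained by a few active features.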
Use SAELens when you need to:
Consider alternatives when:
```shell
pip install sae-lens
```
Requirements: Python 3.10+, transformer-lens>=2.0.0
SAEs are trained to reconstruct model activations through a sparse bottleneck:

```
Input Activation → Encoder →  Sparse Features  → Decoder → Reconstructed Activation
    (d_model)         ↓      (d_sae >> d_model)     ↓             (d_model)
               sparsity penalty              reconstruction loss
```

Loss function: `MSE(original, reconstructed) + L1_coefficient × L1(features)`
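As a quick sketch of that loss (variable names here are illustrative):

```python
import torch

def sae_loss(original, reconstructed, features, l1_coefficient=8e-5):
    """MSE reconstruction term plus an L1 sparsity penalty on feature activations."""
    mse = (original - reconstructed).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()  # average L1 norm of the feature vector
    return mse + l1_coefficient * l1

original = torch.tensor([[1.0, 2.0]])
reconstructed = torch.tensor([[1.0, 1.0]])
features = torch.tensor([[0.5, 0.0, 2.0]])
print(sae_loss(original, reconstructed, features, l1_coefficient=0.01).item())
# → 0.525 (MSE of 0.5 plus 0.01 × L1 of 2.5)
```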
In "Towards Monosemanticity", human evaluators found 70% of SAE features genuinely interpretable. Features discovered include:
```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# 1. Load model and pre-trained SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# 2. Get model activations
tokens = model.to_tokens("The capital of France is Paris")
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [batch, pos, d_model]

# 3. Encode to SAE features
sae_features = sae.encode(activations)  # [batch, pos, d_sae]
print(f"Active features: {(sae_features > 0).sum()}")

# 4. Find top features for each position
for pos in range(tokens.shape[1]):
    top_features = sae_features[0, pos].topk(5)
    token = model.to_str_tokens(tokens[0, pos:pos+1])[0]
    print(f"Token '{token}': features {top_features.indices.tolist()}")

# 5. Reconstruct activations
reconstructed = sae.decode(sae_features)
reconstruction_error = (activations - reconstructed).norm()
```
| Release | Model | Layers |
|---|---|---|
| `gpt2-small-res-jb` | GPT-2 Small | Multiple residual-stream layers |
| `gemma-2b-res` | Gemma 2B | Residual stream |
| Various on HuggingFace | Search tag `saelens` | Various |
```python
from sae_lens import SAE, LanguageModelSAERunnerConfig, SAETrainingRunner

# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,  # Model dimension
    # SAE architecture
    architecture="standard",  # or "gated", "topk"
    d_sae=768 * 8,  # Expansion factor of 8
    activation_fn="relu",
    # Training
    lr=4e-4,
    l1_coefficient=8e-5,  # Sparsity penalty
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=100_000_000,
    # Data
    dataset_path="monology/pile-uncopyrighted",
    context_size=128,
    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",
    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,
)

# 2. Train
trainer = SAETrainingRunner(cfg)
sae = trainer.run()

# 3. Evaluate
print(f"L0 (avg active features): {trainer.metrics['l0']}")
print(f"CE Loss Recovered: {trainer.metrics['ce_loss_score']}")
```
| Parameter | Typical Value | Effect |
|---|---|---|
| `d_sae` | 4-16× `d_model` | More features, higher capacity |
| `l1_coefficient` | 5e-5 to 1e-4 | Higher = sparser, less accurate |
| `lr` | 1e-4 to 1e-3 | Standard optimizer learning rate |
| `l1_warm_up_steps` | 500-2000 | Prevents early feature death |
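The L1 warm-up can be sketched as a linear ramp (illustrative; SAELens's actual schedule may differ):

```python
def l1_schedule(step: int, l1_coefficient: float, warm_up_steps: int) -> float:
    """Linearly ramp the L1 penalty from 0 to its full value over warm_up_steps."""
    if warm_up_steps == 0:
        return l1_coefficient
    return l1_coefficient * min(1.0, step / warm_up_steps)

print(l1_schedule(500, 8e-5, 1000))  # halfway through warm-up → 4e-05
```

Ramping the penalty gives features time to form before sparsity pressure kicks in, which is why a warm-up of 0 tends to kill features early.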
| Metric | Target | Meaning |
|---|---|---|
| L0 | 50-200 | Average active features per token |
| CE Loss Score | 80-95% | Cross-entropy recovered vs original |
| Dead Features | <5% | Features that never activate |
| Explained Variance | >90% | Reconstruction quality |
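These metrics can be computed from a batch of features and reconstructions roughly as follows (a sketch; SAELens tracks them internally during training):

```python
import torch

def sae_metrics(original, reconstructed, features):
    """Compute L0, dead-feature fraction, and explained variance for one batch."""
    l0 = (features > 0).float().sum(dim=-1).mean()          # avg active features per token
    dead = ((features > 0).sum(dim=0) == 0).float().mean()  # fraction never active in batch
    resid_var = (original - reconstructed).pow(2).sum()
    total_var = (original - original.mean(dim=0)).pow(2).sum()
    explained_variance = 1 - resid_var / total_var
    return l0.item(), dead.item(), explained_variance.item()

features = torch.tensor([[1.0, 0.0, 2.0, 0.0],
                         [0.0, 0.0, 3.0, 0.0]])
original = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
reconstructed = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
l0, dead, ev = sae_metrics(original, reconstructed, features)
print(l0, dead, ev)  # → 1.5 0.5 1.0
```

Note that the dead-feature estimate only counts features inactive within this batch; in practice it is tracked over many batches.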
```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# Find what activates a specific feature
feature_idx = 1234
test_texts = [
    "The scientist conducted an experiment",
    "I love chocolate cake",
    "The code compiles successfully",
    "Paris is beautiful in spring",
]

for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache["resid_pre", 8])
    activation = features[0, :, feature_idx].max().item()
    print(f"{activation:.3f}: {text}")
```
```python
def steer_with_feature(model, sae, prompt, feature_idx, strength=5.0):
    """Add an SAE feature direction to the residual stream during generation."""
    tokens = model.to_tokens(prompt)

    # Get the feature direction from the decoder
    feature_direction = sae.W_dec[feature_idx]  # [d_model]

    def steering_hook(activation, hook):
        # Add the scaled feature direction at all positions
        activation += strength * feature_direction
        return activation

    # Generate with steering; model.generate does not accept fwd_hooks,
    # so attach the hook via the hooks context manager
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_pre", steering_hook)]):
        output = model.generate(tokens, max_new_tokens=50)
    return model.to_string(output[0])
```
```python
# Which features most affect a specific output?
tokens = model.to_tokens("The capital of France is")
_, cache = model.run_with_cache(tokens)

# Get features at the final position
features = sae.encode(cache["resid_pre", 8])[0, -1]  # [d_sae]

# Logit attribution per feature:
# feature contribution = feature_activation × decoder_weight × unembedding
W_dec = sae.W_dec  # [d_sae, d_model]
W_U = model.W_U    # [d_model, vocab]

# Contribution to the "Paris" logit
paris_token = model.to_single_token(" Paris")
feature_contributions = features * (W_dec @ W_U[:, paris_token])
top_features = feature_contributions.topk(10)

print("Top features for 'Paris' prediction:")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")
```
```python
# WRONG: no warm-up, features die early
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,
    l1_warm_up_steps=0,  # Bad!
)

# RIGHT: warm up the L1 penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=8e-5,
    l1_warm_up_steps=1000,  # Gradually increase the penalty
    use_ghost_grads=True,   # Revive dead features
)
```
```python
# Reduce the sparsity penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=5e-5,  # Lower = better reconstruction
    d_sae=768 * 16,       # More capacity
)

# Increase sparsity (higher L1)
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,  # Higher = sparser, more interpretable
)

# Or use the TopK architecture
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn_kwargs={"k": 50},  # Exactly 50 active features
)
```
```python
cfg = LanguageModelSAERunnerConfig(
    train_batch_size_tokens=2048,  # Reduce batch size
    store_batch_size_prompts=4,    # Fewer prompts in buffer
    n_batches_in_buffer=8,         # Smaller activation buffer
)
```
Browse pre-trained SAE features at neuronpedia.org. Features are indexed by SAE ID; for example, gpt2-small layer 8, feature 1234 lives at neuronpedia.org/gpt2-small/8-res-jb/1234.
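A tiny helper for building those URLs (hypothetical helper; the path scheme follows the example above):

```python
def neuronpedia_url(model: str, layer: int, sae_set: str, feature_idx: int) -> str:
    """Build a Neuronpedia feature URL from the model/layer-set/feature path scheme."""
    return f"https://neuronpedia.org/{model}/{layer}-{sae_set}/{feature_idx}"

print(neuronpedia_url("gpt2-small", 8, "res-jb", 1234))
# → https://neuronpedia.org/gpt2-small/8-res-jb/1234
```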
| Class | Purpose |
|---|---|
| `SAE` | Sparse autoencoder model |
| `LanguageModelSAERunnerConfig` | Training configuration |
| `SAETrainingRunner` | Training loop manager |
| `ActivationsStore` | Activation collection and batching |
| `HookedSAETransformer` | TransformerLens + SAE integration |
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for SAE, TrainingSAE, configurations |
| references/tutorials.md | Step-by-step tutorials for training, analysis, steering |
| Architecture | Description | Use Case |
|---|---|---|
| Standard | ReLU + L1 penalty | General purpose |
| Gated | Learned gating mechanism | Better sparsity control |
| TopK | Exactly K active features | Consistent sparsity |
```python
# TopK SAE (exactly 50 features active)
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn="topk",
    activation_fn_kwargs={"k": 50},
)
```
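The TopK activation itself can be sketched like this (illustrative, not SAELens's implementation):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per row; zero out everything else."""
    values, indices = pre_acts.topk(k, dim=-1)
    out = torch.zeros_like(pre_acts)
    out.scatter_(-1, indices, torch.relu(values))  # ReLU guards against negative survivors
    return out

pre = torch.tensor([[0.1, 3.0, -1.0, 2.0, 0.5]])
print(topk_activation(pre, k=2))  # only 3.0 and 2.0 survive
```

Because exactly k features pass through, L0 is fixed by construction and no L1 penalty is needed.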
Weekly Installs: 147
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (119), claude-code (116), gemini-cli (113), cursor (106), codex (100), antigravity (93)