transformer-lens-interpretability by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill transformer-lens-interpretability
TransformerLens is the de facto standard library for mechanistic interpretability research on GPT-style language models. Created by Neel Nanda and maintained by Bryce Meyer, it provides clean interfaces to inspect and manipulate model internals via HookPoints on every activation.
GitHub: TransformerLensOrg/TransformerLens (2,900+ stars)
Use TransformerLens when you need to:
Consider alternatives when:
pip install transformer-lens
For development version:
pip install git+https://github.com/TransformerLensOrg/TransformerLens
The main class that wraps transformer models with HookPoints on every activation:
from transformer_lens import HookedTransformer
# Load a model
model = HookedTransformer.from_pretrained("gpt2-small")
# For gated models (LLaMA, Mistral)
import os
os.environ["HF_TOKEN"] = "your_token"
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")
| Family | Models |
|---|---|
| GPT-2 | gpt2, gpt2-medium, gpt2-large, gpt2-xl |
| LLaMA | llama-7b, llama-13b, llama-2-7b, llama-2-13b |
| EleutherAI | pythia-70m to pythia-12b, gpt-neo, gpt-j-6b |
| Mistral | mistral-7b, mixtral-8x7b |
| Others | phi, qwen, opt, gemma |
Run the model and cache all intermediate activations:
# Get all activations
tokens = model.to_tokens("The Eiffel Tower is in")
logits, cache = model.run_with_cache(tokens)
# Access specific activations
residual = cache["resid_post", 5] # Layer 5 residual stream
attn_pattern = cache["pattern", 3] # Layer 3 attention pattern
mlp_out = cache["mlp_out", 7] # Layer 7 MLP output
# Filter which activations to cache (saves memory)
logits, cache = model.run_with_cache(
tokens,
names_filter=lambda name: "resid_post" in name
)
| Key Pattern | Shape | Description |
|---|---|---|
| resid_pre, layer | [batch, pos, d_model] | Residual before attention |
| resid_mid, layer | [batch, pos, d_model] | Residual after attention |
| resid_post, layer | [batch, pos, d_model] | Residual after MLP |
| attn_out, layer | [batch, pos, d_model] | Attention output |
| mlp_out, layer | [batch, pos, d_model] | MLP output |
| pattern, layer | [batch, head, q_pos, k_pos] | Attention pattern (post-softmax) |
| q, layer | [batch, pos, head, d_head] | Query vectors |
| k, layer | [batch, pos, head, d_head] | Key vectors |
| v, layer | [batch, pos, head, d_head] | Value vectors |
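The shorthand (name, layer) keys are translated to full hook-point names internally; transformer_lens.utils.get_act_name makes that mapping explicit, so the two lookups below refer to the same cached tensor:
from transformer_lens import utils
full_name = utils.get_act_name("resid_post", 5)
print(full_name)                 # "blocks.5.hook_resid_post"
residual = cache[full_name]      # same tensor as cache["resid_post", 5]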
Identify which activations causally affect model output by patching clean activations into corrupted runs.
from transformer_lens import HookedTransformer, patching
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# 1. Define clean and corrupted prompts
clean_prompt = "The Eiffel Tower is in the city of"
corrupted_prompt = "The Colosseum is in the city of"
clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
# 2. Get clean activations
_, clean_cache = model.run_with_cache(clean_tokens)
# 3. Define metric (e.g., logit difference)
paris_token = model.to_single_token(" Paris")
rome_token = model.to_single_token(" Rome")
def metric(logits):
return logits[0, -1, paris_token] - logits[0, -1, rome_token]
# 4. Patch each position and layer
results = torch.zeros(model.cfg.n_layers, clean_tokens.shape[1])
for layer in range(model.cfg.n_layers):
    for pos in range(clean_tokens.shape[1]):
        def patch_hook(activation, hook):
            activation[0, pos] = clean_cache[hook.name][0, pos]
            return activation
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_hook)]
        )
        results[layer, pos] = metric(patched_logits)
# 5. Visualize results (layer x position heatmap)
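A minimal visualization sketch for step 5, assuming matplotlib is available (plotly or circuitsvis work just as well):
import matplotlib.pyplot as plt
plt.imshow(results.detach().cpu().numpy(), aspect="auto", cmap="RdBu")
plt.colorbar(label="Logit difference (Paris - Rome)")
plt.yticks(range(model.cfg.n_layers))
plt.xticks(range(clean_tokens.shape[1]), model.to_str_tokens(clean_tokens), rotation=90)
plt.xlabel("Position patched")
plt.ylabel("Layer patched")
plt.show()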
Replicate the IOI circuit discovery from "Interpretability in the Wild".
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# IOI task: "When John and Mary went to the store, Mary gave a bottle to"
# Model should predict "John" (indirect object)
prompt = "When John and Mary went to the store, Mary gave a bottle to"
tokens = model.to_tokens(prompt)
# 1. Get baseline logits
logits, cache = model.run_with_cache(tokens)
john_token = model.to_single_token(" John")
mary_token = model.to_single_token(" Mary")
# 2. Compute logit difference (IO - S)
logit_diff = logits[0, -1, john_token] - logits[0, -1, mary_token]
print(f"Logit difference: {logit_diff.item():.3f}")
# 3. Direct logit attribution by head
def get_head_contribution(layer, head):
    # Project head output to logits
    head_out = cache["z", layer][0, :, head, :]  # [pos, d_head]
    W_O = model.W_O[layer, head]  # [d_head, d_model]
    W_U = model.W_U  # [d_model, vocab]
    # Head contribution to logits at final position
    contribution = head_out[-1] @ W_O @ W_U
    return contribution[john_token] - contribution[mary_token]
# 4. Map all heads
head_contributions = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        head_contributions[layer, head] = get_head_contribution(layer, head)
# 5. Identify top contributing heads (name movers, backup name movers)
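For step 5, a sketch that ranks heads by this contribution and converts the flat indices back to (layer, head) pairs; the highest-scoring heads should correspond to the name movers reported in the IOI paper:
top = torch.topk(head_contributions.flatten(), k=5)
for score, idx in zip(top.values, top.indices):
    layer, head = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{layer}H{head}: {score.item():+.3f}")
Note that this direct logit attribution, like the example above, ignores the final LayerNorm; see the pitfall section below.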
Find induction heads that implement [A][B]...[A] → [B] pattern.
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# Create repeated sequence: [A][B][A] should predict [B]
repeated_tokens = torch.tensor([[1000, 2000, 1000]]) # Arbitrary tokens
_, cache = model.run_with_cache(repeated_tokens)
# Induction heads attend from final [A] back to first [B]
# Check attention from position 2 to position 1
induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, q_pos, k_pos]
    # Attention from pos 2 to pos 1
    induction_scores[layer] = pattern[:, 2, 1]
# Heads with high scores are induction heads
top_heads = torch.topk(induction_scores.flatten(), k=5)
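The three-token probe above is deliberately tiny; a slightly more robust sketch repeats a longer random sequence and scores each head by how much attention it pays, from each second-half position, to the token right after that token's previous occurrence:
seq_len = 20
first_half = torch.randint(1000, 10000, (1, seq_len))
rep_tokens = torch.cat([first_half, first_half], dim=1)   # [A1..A20, A1..A20]
_, rep_cache = model.run_with_cache(rep_tokens)
induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = rep_cache["pattern", layer][0]               # [head, q_pos, k_pos]
    # For a query at position q >= seq_len, the induction target is key position q - seq_len + 1
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_scores[layer] = diag[:, 1:].mean(dim=-1)     # average over second-half query positions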
# WRONG: Hooks registered with add_hook persist across later calls
model.add_hook("blocks.5.hook_resid_post", my_hook)  # quick debugging hook
model(tokens)        # hook fires here
model(other_tokens)  # ...and is still active here!
# RIGHT: Always reset hooks when you are done
model.reset_hooks()
model.run_with_hooks(tokens, fwd_hooks=[...])  # run_with_hooks cleans up its own hooks by default
# WRONG: Assuming consistent tokenization
model.to_tokens("Tim") # Single token
model.to_tokens("Neel") # Becomes "Ne" + "el" (two tokens!)
# RIGHT: Check tokenization explicitly
tokens = model.to_tokens("Neel", prepend_bos=False)
print(model.to_str_tokens(tokens)) # ['Ne', 'el']
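A small defensive check helps here; the helper below is only illustrative (not part of TransformerLens) and raises if a string does not map to exactly one token:
def assert_single_token(model, s):
    str_tokens = model.to_str_tokens(s, prepend_bos=False)
    assert len(str_tokens) == 1, f"{s!r} tokenizes to {str_tokens}"

assert_single_token(model, " Paris")    # fine for GPT-2
# assert_single_token(model, "Neel")    # would raise: ['Ne', 'el']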
# WRONG: Ignoring LayerNorm
pre_activation = residual @ model.W_in[layer]
# RIGHT: Include LayerNorm (with the default fold_ln=True, the LN weights are folded
# into W_in, so model.blocks[layer].ln2 only centers and normalizes)
ln_out = model.blocks[layer].ln2(residual)
pre_activation = ln_out @ model.W_in[layer]
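The same caveat applies when projecting a residual-stream vector straight to logits (the "logit lens"): go through model.ln_final before the unembedding. A sketch, reusing a cache from run_with_cache above (the layer index is arbitrary):
resid = cache["resid_post", 6][:, -1:, :]          # [batch, 1, d_model], final position after layer 6
scaled = model.ln_final(resid)                      # apply the final LayerNorm
lens_logits = scaled @ model.W_U + model.b_U        # [batch, 1, d_vocab]
print(model.to_str_tokens(lens_logits.argmax(dim=-1)[0]))   # top prediction read off mid-stack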
# Use selective caching
logits, cache = model.run_with_cache(
tokens,
names_filter=lambda n: "resid_post" in n or "pattern" in n,
device="cpu" # Cache on CPU
)
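If no backward pass is needed, wrapping the forward in torch.no_grad() also keeps the autograd graph from being stored alongside the cache:
import torch
with torch.no_grad():
    logits, cache = model.run_with_cache(
        tokens,
        names_filter=lambda n: "resid_post" in n,
    )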
| Class | Purpose |
|---|---|
| HookedTransformer | Main model wrapper with hooks |
| ActivationCache | Dictionary-like cache of activations |
| HookedTransformerConfig | Model configuration |
| FactoredMatrix | Efficient factored matrix operations |
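FactoredMatrix keeps low-rank products such as a head's OV circuit in factored form instead of materializing the full d_model x d_model matrix. A sketch (the layer and head indices are arbitrary; eigenvalues is one of several helpers the class exposes):
from transformer_lens import FactoredMatrix
# W_V @ W_O for layer 0, head 0, stored as its two low-rank factors
ov_circuit = FactoredMatrix(model.W_V[0, 0], model.W_O[0, 0])
print(ov_circuit.eigenvalues.shape)   # spectrum of the (square) OV circuit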
TransformerLens integrates with SAELens for Sparse Autoencoder analysis:
from transformer_lens import HookedTransformer
from sae_lens import SAE
model = HookedTransformer.from_pretrained("gpt2-small")
sae = SAE.from_pretrained("gpt2-small-res-jb", "blocks.8.hook_resid_pre")
# Run with SAE
tokens = model.to_tokens("Hello world")
_, cache = model.run_with_cache(tokens)
sae_acts = sae.encode(cache["resid_pre", 8])
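sae.encode returns per-token feature activations (shape [batch, pos, d_sae] for this residual-stream SAE), so the most active features on the final token can be read off directly; the feature indices are only meaningful relative to this particular SAE:
top_features = sae_acts[0, -1].topk(10)
print(top_features.indices.tolist())   # ids of the most active SAE features
print(top_features.values.tolist())    # their activation strengths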
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for HookedTransformer, ActivationCache, HookPoints |
| references/tutorials.md | Step-by-step tutorials for activation patching, circuit analysis, logit lens |