transformer-lens-interpretability by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill transformer-lens-interpretability
TransformerLens is the de facto standard library for mechanistic interpretability research on GPT-style language models. Created by Neel Nanda and maintained by Bryce Meyer, it provides clean interfaces to inspect and manipulate model internals via HookPoints on every activation.
GitHub: TransformerLensOrg/TransformerLens (2,900+ stars)
Use TransformerLens when you need to:
Consider alternatives when:
pip install transformer-lens
For development version:
pip install git+https://github.com/TransformerLensOrg/TransformerLens
The main class that wraps transformer models with HookPoints on every activation:
from transformer_lens import HookedTransformer
# Load a model
model = HookedTransformer.from_pretrained("gpt2-small")
# For gated models (LLaMA, Mistral)
import os
os.environ["HF_TOKEN"] = "your_token"
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")
| Family | Models |
|---|---|
| GPT-2 | gpt2, gpt2-medium, gpt2-large, gpt2-xl |
| LLaMA | llama-7b, llama-13b, llama-2-7b, llama-2-13b |
| EleutherAI | pythia-70m to pythia-12b, gpt-neo, gpt-j-6b |
| Mistral | mistral-7b, mixtral-8x7b |
| Others | phi, qwen, opt, gemma |
Run the model and cache all intermediate activations:
# Get all activations
tokens = model.to_tokens("The Eiffel Tower is in")
logits, cache = model.run_with_cache(tokens)
# Access specific activations
residual = cache["resid_post", 5] # Layer 5 residual stream
attn_pattern = cache["pattern", 3] # Layer 3 attention pattern
mlp_out = cache["mlp_out", 7] # Layer 7 MLP output
# Filter which activations to cache (saves memory)
logits, cache = model.run_with_cache(
tokens,
names_filter=lambda name: "resid_post" in name
)
| Key Pattern | Shape | Description |
|---|---|---|
| resid_pre, layer | [batch, pos, d_model] | Residual before attention |
| resid_mid, layer | [batch, pos, d_model] | Residual after attention |
| resid_post, layer | [batch, pos, d_model] | Residual after MLP |
| attn_out, layer | [batch, pos, d_model] | Attention output |
| mlp_out, layer | [batch, pos, d_model] | MLP output |
| pattern, layer | [batch, head, q_pos, k_pos] | Attention pattern (post-softmax) |
| q, layer | [batch, pos, head, d_head] | Query vectors |
| k, layer | [batch, pos, head, d_head] | Key vectors |
| v, layer | [batch, pos, head, d_head] | Value vectors |
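The shorthand (name, layer) keys are translated to full hook-point names internally; transformer_lens.utils.get_act_name makes that mapping explicit, so the two lookups below refer to the same cached tensor:
from transformer_lens import utils
full_name = utils.get_act_name("resid_post", 5)
print(full_name)                 # "blocks.5.hook_resid_post"
residual = cache[full_name]      # same tensor as cache["resid_post", 5]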
Identify which activations causally affect model output by patching clean activations into corrupted runs.
from transformer_lens import HookedTransformer, patching
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# 1. Define clean and corrupted prompts
clean_prompt = "The Eiffel Tower is in the city of"
corrupted_prompt = "The Colosseum is in the city of"
clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
# 2. Get clean activations
_, clean_cache = model.run_with_cache(clean_tokens)
# 3. Define metric (e.g., logit difference)
paris_token = model.to_single_token(" Paris")
rome_token = model.to_single_token(" Rome")
def metric(logits):
return logits[0, -1, paris_token] - logits[0, -1, rome_token]
# 4. Patch each position and layer
results = torch.zeros(model.cfg.n_layers, clean_tokens.shape[1])
for layer in range(model.cfg.n_layers):
    for pos in range(clean_tokens.shape[1]):
        def patch_hook(activation, hook):
            activation[0, pos] = clean_cache[hook.name][0, pos]
            return activation
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_hook)]
        )
        results[layer, pos] = metric(patched_logits)
# 5. Visualize results (layer x position heatmap)
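A minimal visualization sketch for step 5, assuming matplotlib is available (plotly or circuitsvis work just as well):
import matplotlib.pyplot as plt
plt.imshow(results.detach().cpu().numpy(), aspect="auto", cmap="RdBu")
plt.colorbar(label="Logit difference (Paris - Rome)")
plt.yticks(range(model.cfg.n_layers))
plt.xticks(range(clean_tokens.shape[1]), model.to_str_tokens(clean_tokens), rotation=90)
plt.xlabel("Position patched")
plt.ylabel("Layer patched")
plt.show()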
Replicate the IOI circuit discovery from "Interpretability in the Wild".
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# IOI task: "When John and Mary went to the store, Mary gave a bottle to"
# Model should predict "John" (indirect object)
prompt = "When John and Mary went to the store, Mary gave a bottle to"
tokens = model.to_tokens(prompt)
# 1. Get baseline logits
logits, cache = model.run_with_cache(tokens)
john_token = model.to_single_token(" John")
mary_token = model.to_single_token(" Mary")
# 2. Compute logit difference (IO - S)
logit_diff = logits[0, -1, john_token] - logits[0, -1, mary_token]
print(f"Logit difference: {logit_diff.item():.3f}")
# 3. Direct logit attribution by head
def get_head_contribution(layer, head):
    # Project head output to logits
    head_out = cache["z", layer][0, :, head, :]  # [pos, d_head]
    W_O = model.W_O[layer, head]  # [d_head, d_model]
    W_U = model.W_U  # [d_model, vocab]
    # Head contribution to logits at final position
    contribution = head_out[-1] @ W_O @ W_U
    return contribution[john_token] - contribution[mary_token]
# 4. Map all heads
head_contributions = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        head_contributions[layer, head] = get_head_contribution(layer, head)
# 5. Identify top contributing heads (name movers, backup name movers)
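For step 5, a sketch that ranks heads by this contribution and converts the flat indices back to (layer, head) pairs; the highest-scoring heads should correspond to the name movers reported in the IOI paper:
top = torch.topk(head_contributions.flatten(), k=5)
for score, idx in zip(top.values, top.indices):
    layer, head = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{layer}H{head}: {score.item():+.3f}")
Note that this direct logit attribution, like the example above, ignores the final LayerNorm; see the pitfall section below.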
Find induction heads that implement [A][B]...[A] → [B] pattern.
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
# Create repeated sequence: [A][B][A] should predict [B]
repeated_tokens = torch.tensor([[1000, 2000, 1000]]) # Arbitrary tokens
_, cache = model.run_with_cache(repeated_tokens)
# Induction heads attend from final [A] back to first [B]
# Check attention from position 2 to position 1
induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, q_pos, k_pos]
    # Attention from pos 2 to pos 1
    induction_scores[layer] = pattern[:, 2, 1]
# Heads with high scores are induction heads
top_heads = torch.topk(induction_scores.flatten(), k=5)
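The three-token probe above is deliberately tiny; a slightly more robust sketch repeats a longer random sequence and scores each head by how much attention it pays, from each second-half position, to the token right after that token's previous occurrence:
seq_len = 20
first_half = torch.randint(1000, 10000, (1, seq_len))
rep_tokens = torch.cat([first_half, first_half], dim=1)   # [A1..A20, A1..A20]
_, rep_cache = model.run_with_cache(rep_tokens)
induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = rep_cache["pattern", layer][0]               # [head, q_pos, k_pos]
    # For a query at position q >= seq_len, the induction target is key position q - seq_len + 1
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_scores[layer] = diag[:, 1:].mean(dim=-1)     # average over second-half query positions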
# WRONG: Hooks registered with add_hook persist across later calls
model.add_hook("blocks.5.hook_resid_post", my_hook)  # quick debugging hook
model(tokens)        # hook fires here
model(other_tokens)  # ...and is still active here!
# RIGHT: Always reset hooks when you are done
model.reset_hooks()
model.run_with_hooks(tokens, fwd_hooks=[...])  # run_with_hooks cleans up its own hooks by default
# WRONG: Assuming consistent tokenization
model.to_tokens("Tim") # Single token
model.to_tokens("Neel") # Becomes "Ne" + "el" (two tokens!)
# RIGHT: Check tokenization explicitly
tokens = model.to_tokens("Neel", prepend_bos=False)
print(model.to_str_tokens(tokens)) # ['Ne', 'el']
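A small defensive check helps here; the helper below is only illustrative (not part of TransformerLens) and raises if a string does not map to exactly one token:
def assert_single_token(model, s):
    str_tokens = model.to_str_tokens(s, prepend_bos=False)
    assert len(str_tokens) == 1, f"{s!r} tokenizes to {str_tokens}"

assert_single_token(model, " Paris")    # fine for GPT-2
# assert_single_token(model, "Neel")    # would raise: ['Ne', 'el']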
# WRONG: Ignoring LayerNorm
pre_activation = residual @ model.W_in[layer]
# RIGHT: Include LayerNorm (with the default fold_ln=True, the LN weights are folded
# into W_in, so model.blocks[layer].ln2 only centers and normalizes)
ln_out = model.blocks[layer].ln2(residual)
pre_activation = ln_out @ model.W_in[layer]
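The same caveat applies when projecting a residual-stream vector straight to logits (the "logit lens"): go through model.ln_final before the unembedding. A sketch, reusing a cache from run_with_cache above (the layer index is arbitrary):
resid = cache["resid_post", 6][:, -1:, :]          # [batch, 1, d_model], final position after layer 6
scaled = model.ln_final(resid)                      # apply the final LayerNorm
lens_logits = scaled @ model.W_U + model.b_U        # [batch, 1, d_vocab]
print(model.to_str_tokens(lens_logits.argmax(dim=-1)[0]))   # top prediction read off mid-stack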
# Use selective caching
logits, cache = model.run_with_cache(
tokens,
names_filter=lambda n: "resid_post" in n or "pattern" in n,
device="cpu" # Cache on CPU
)
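If no backward pass is needed, wrapping the forward in torch.no_grad() also keeps the autograd graph from being stored alongside the cache:
import torch
with torch.no_grad():
    logits, cache = model.run_with_cache(
        tokens,
        names_filter=lambda n: "resid_post" in n,
    )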
| Class | Purpose |
|---|---|
| HookedTransformer | Main model wrapper with hooks |
| ActivationCache | Dictionary-like cache of activations |
| HookedTransformerConfig | Model configuration |
| FactoredMatrix | Efficient factored matrix operations |
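FactoredMatrix keeps low-rank products such as a head's OV circuit in factored form instead of materializing the full d_model x d_model matrix. A sketch (the layer and head indices are arbitrary; eigenvalues is one of several helpers the class exposes):
from transformer_lens import FactoredMatrix
# W_V @ W_O for layer 0, head 0, stored as its two low-rank factors
ov_circuit = FactoredMatrix(model.W_V[0, 0], model.W_O[0, 0])
print(ov_circuit.eigenvalues.shape)   # spectrum of the (square) OV circuit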
TransformerLens integrates with SAELens for Sparse Autoencoder analysis:
from transformer_lens import HookedTransformer
from sae_lens import SAE
model = HookedTransformer.from_pretrained("gpt2-small")
sae = SAE.from_pretrained("gpt2-small-res-jb", "blocks.8.hook_resid_pre")
# Run with SAE
tokens = model.to_tokens("Hello world")
_, cache = model.run_with_cache(tokens)
sae_acts = sae.encode(cache["resid_pre", 8])
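sae.encode returns per-token feature activations (shape [batch, pos, d_sae] for this residual-stream SAE), so the most active features on the final token can be read off directly; the feature indices are only meaningful relative to this particular SAE:
top_features = sae_acts[0, -1].topk(10)
print(top_features.indices.tolist())   # ids of the most active SAE features
print(top_features.values.tolist())    # their activation strengths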
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for HookedTransformer, ActivationCache, HookPoints |
| references/tutorials.md | Step-by-step tutorials for activation patching, circuit analysis, logit lens |