pyvene-interventions by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill pyvene-interventions

pyvene 是斯坦福 NLP 用于对 PyTorch 模型执行因果干预的库。它提供了一个基于字典的声明式框架,用于激活修补、因果追踪和交换干预训练,使得干预实验可重现且可共享。
GitHub : stanfordnlp/pyvene (840+ stars) 论文 : pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (NAACL 2024)
在以下情况下使用 pyvene:
在以下情况下考虑替代方案:
pip install pyvene
标准导入:
import pyvene as pv
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
包装任何 PyTorch 模型并赋予其干预能力的主要类:
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
# 加载基础模型
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# 定义干预配置
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8,
component="block_output",
intervention_type=pv.VanillaIntervention,
)
]
)
# 创建可干预模型
intervenable = pv.IntervenableModel(config, model)
| 类型 | 描述 | 用例 |
|---|---|---|
VanillaIntervention | 在不同运行之间交换激活 | 激活修补 |
AdditionIntervention | 向基础运行添加激活 | 引导、消融 |
SubtractionIntervention | 减去激活 | 消融 |
ZeroIntervention | 将激活置零 | 组件敲除 |
RotatedSpaceIntervention | DAS 可训练干预 | 因果发现 |
CollectIntervention | 收集激活 | 探测、分析 |
# 可干预的可用组件
components = [
"block_input", # Transformer 块的输入
"block_output", # Transformer 块的输出
"mlp_input", # MLP 的输入
"mlp_output", # MLP 的输出
"mlp_activation", # MLP 隐藏层激活
"attention_input", # 注意力机制的输入
"attention_output", # 注意力机制的输出
"attention_value_output", # 注意力值向量
"query_output", # 查询向量
"key_output", # 键向量
"value_output", # 值向量
"head_attention_value_output", # 每个头的值向量
]
通过破坏输入并恢复激活来定位事实关联的存储位置。
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
# 1. 定义干净和损坏的输入
clean_prompt = "The Space Needle is in downtown"
corrupted_prompt = "The ##### ###### ## ## ########" # 噪声
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
# 2. 获取干净激活(源)
with torch.no_grad():
clean_outputs = model(**clean_tokens, output_hidden_states=True)
clean_states = clean_outputs.hidden_states
# 3. 定义恢复干预
def run_causal_trace(layer, position):
    """Patch the clean activation back in at one (layer, position) of the
    corrupted run, and return the model's probability of the correct
    completion token (" Seattle")."""
    trace_config = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="block_output",
                intervention_type=pv.VanillaIntervention,
                unit="pos",
                max_number_of_units=1,
            )
        ]
    )
    traced_model = pv.IntervenableModel(trace_config, model)

    # Base run is the corrupted prompt; the source run donates the clean
    # activation that gets swapped in at `position`.
    _, patched = traced_model(
        base=corrupted_tokens,
        sources=[clean_tokens],
        unit_locations={"sources->base": ([[[position]]], [[[position]]])},
        output_original_output=True,
    )

    # Probability assigned to the target token at the final position.
    final_probs = torch.softmax(patched.logits[0, -1], dim=-1)
    target_id = tokenizer.encode(" Seattle")[0]
    return final_probs[target_id].item()
# 4. 遍历层和位置
n_layers = model.config.n_layer
seq_len = clean_tokens["input_ids"].shape[1]
results = torch.zeros(n_layers, seq_len)
for layer in range(n_layers):
for pos in range(seq_len):
results[layer, pos] = run_causal_trace(layer, pos)
# 5. 可视化(层 x 位置热图)
# 高值表示因果重要性
测试哪些组件对特定行为是必需的。
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# IOI 任务设置
clean_prompt = "When John and Mary went to the store, Mary gave a bottle to"
corrupted_prompt = "When John and Mary went to the store, John gave a bottle to"
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
john_token = tokenizer.encode(" John")[0]
mary_token = tokenizer.encode(" Mary")[0]
def logit_diff(logits):
    """Return the IO - S logit difference at the final sequence position."""
    last_token_logits = logits[0, -1]
    return last_token_logits[john_token] - last_token_logits[mary_token]
# 在每一层修补注意力输出
def patch_attention(layer):
    """Swap the clean run's attention output into the corrupted run at
    `layer` and return the resulting IO - S logit difference."""
    patch_config = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="attention_output",
                intervention_type=pv.VanillaIntervention,
            )
        ]
    )
    patched_model = pv.IntervenableModel(patch_config, model)

    _, patched_out = patched_model(
        base=corrupted_tokens,
        sources=[clean_tokens],
    )
    return logit_diff(patched_out.logits).item()
# 找出哪些层重要
results = []
for layer in range(model.config.n_layer):
diff = patch_attention(layer)
results.append(diff)
print(f"Layer {layer}: logit diff = {diff:.3f}")
训练干预以发现因果结构。
import pyvene as pv
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
# 1. 定义可训练干预
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=6,
component="block_output",
intervention_type=pv.RotatedSpaceIntervention, # 可训练
low_rank_dimension=64, # 学习 64 维子空间
)
]
)
intervenable = pv.IntervenableModel(config, model)
# 2. 设置训练
optimizer = torch.optim.Adam(
intervenable.get_trainable_parameters(),
lr=1e-4
)
# 3. 训练循环(简化版)
for base_input, source_input, target_output in dataloader:
optimizer.zero_grad()
_, outputs = intervenable(
base=base_input,
sources=[source_input],
)
loss = criterion(outputs.logits, target_output)
loss.backward()
optimizer.step()
# 4. 分析学习到的干预
# 旋转矩阵揭示了因果子空间
rotation = intervenable.interventions["layer.6.block_output"][0].rotate_layer
# 低秩旋转寻找可解释的子空间
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8,
component="block_output",
intervention_type=pv.LowRankRotatedSpaceIntervention,
low_rank_dimension=1, # 寻找 1D 因果方向
)
]
)
在生成过程中引导模型行为。
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# 加载预训练的引导干预
intervenable = pv.IntervenableModel.load(
"zhengxuanzenwu/intervenable_honest_llama2_chat_7B",
model=model,
)
# 使用引导进行生成
prompt = "Is the earth flat?"
inputs = tokenizer(prompt, return_tensors="pt")
# 在生成过程中应用干预
outputs = intervenable.generate(
inputs,
max_new_tokens=100,
do_sample=False,
)
print(tokenizer.decode(outputs[0]))
# 本地保存
intervenable.save("./my_intervention")
# 从本地加载
intervenable = pv.IntervenableModel.load(
"./my_intervention",
model=model,
)
# 在 HuggingFace 上共享
intervenable.save_intervention("username/my-intervention")
# 从 HuggingFace 加载
intervenable = pv.IntervenableModel.load(
"username/my-intervention",
model=model,
)
# 错误:组件名称不正确
config = pv.RepresentationConfig(
component="mlp", # 无效!
)
# 正确:使用确切的组件名称
config = pv.RepresentationConfig(
component="mlp_output", # 有效
)
# 确保源和基础具有兼容的形状
# 对于位置特定的干预:
config = pv.RepresentationConfig(
unit="pos",
max_number_of_units=1, # 干预单个位置
)
# 明确指定位置
intervenable(
base=base_tokens,
sources=[source_tokens],
unit_locations={"sources->base": ([[[5]]], [[[5]]])}, # 位置 5
)
# 使用梯度检查点
model.gradient_checkpointing_enable()
# 或者干预更少的组件
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8, # 单层而不是所有层
component="block_output",
)
]
)
# pyvene v0.1.8+ 支持将 LoRA 作为干预
config = pv.RepresentationConfig(
intervention_type=pv.LoRAIntervention,
low_rank_dimension=16,
)
| 类 | 用途 |
|---|---|
IntervenableModel | 干预的主要包装器 |
IntervenableConfig | 配置容器 |
RepresentationConfig | 单个干预规范 |
VanillaIntervention | 激活交换 |
RotatedSpaceIntervention | 可训练的 DAS 干预 |
CollectIntervention | 激活收集 |
pyvene 适用于任何 PyTorch 模型。已在以下模型上测试:
有关详细的 API 文档、教程和高级用法,请参阅 references/ 文件夹:
| 文件 | 内容 |
|---|---|
| references/README.md | 概述和快速入门指南 |
| references/api.md | IntervenableModel、干预类型、配置的完整 API 参考 |
| references/tutorials.md | 因果追踪、激活修补、DAS 的分步教程 |
| 特性 | pyvene | TransformerLens | nnsight |
|---|---|---|---|
| 声明式配置 | 是 | 否 | 否 |
| HuggingFace 共享 | 是 | 否 | 否 |
| 可训练干预 | 是 | 有限 | 是 |
| 任何 PyTorch 模型 | 是 | 仅限 Transformers | 是 |
| 远程执行 | 否 | 否 | 是 (NDIF) |
每周安装次数
144
仓库
GitHub Stars
22.6K
首次出现
Jan 21, 2026
安全审计
安装于
claude-code119
opencode115
gemini-cli107
cursor106
codex95
antigravity94
pyvene is Stanford NLP's library for performing causal interventions on PyTorch models. It provides a declarative, dict-based framework for activation patching, causal tracing, and interchange intervention training - making intervention experiments reproducible and shareable.
GitHub : stanfordnlp/pyvene (840+ stars) Paper : pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (NAACL 2024)
Use pyvene when you need to:
Consider alternatives when:
pip install pyvene
Standard import:
import pyvene as pv
The main class that wraps any PyTorch model with intervention capabilities:
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Define intervention configuration
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8,
component="block_output",
intervention_type=pv.VanillaIntervention,
)
]
)
# Create intervenable model
intervenable = pv.IntervenableModel(config, model)
| Type | Description | Use Case |
|---|---|---|
VanillaIntervention | Swap activations between runs | Activation patching |
AdditionIntervention | Add activations to base run | Steering, ablation |
SubtractionIntervention | Subtract activations | Ablation |
ZeroIntervention | Zero out activations | Component knockout |
RotatedSpaceIntervention | DAS trainable intervention | Causal discovery |
CollectIntervention | Collect activations | Probing, analysis |
# Available components to intervene on
components = [
"block_input", # Input to transformer block
"block_output", # Output of transformer block
"mlp_input", # Input to MLP
"mlp_output", # Output of MLP
"mlp_activation", # MLP hidden activations
"attention_input", # Input to attention
"attention_output", # Output of attention
"attention_value_output", # Attention value vectors
"query_output", # Query vectors
"key_output", # Key vectors
"value_output", # Value vectors
"head_attention_value_output", # Per-head values
]
Locate where factual associations are stored by corrupting inputs and restoring activations.
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
# 1. Define clean and corrupted inputs
clean_prompt = "The Space Needle is in downtown"
corrupted_prompt = "The ##### ###### ## ## ########" # Noise
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
# 2. Get clean activations (source)
with torch.no_grad():
clean_outputs = model(**clean_tokens, output_hidden_states=True)
clean_states = clean_outputs.hidden_states
# 3. Define restoration intervention
def run_causal_trace(layer, position):
    """Restore the clean activation at (layer, position) in the corrupted
    run and report the probability of " Seattle" as the next token."""
    cfg = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="block_output",
                intervention_type=pv.VanillaIntervention,
                unit="pos",
                max_number_of_units=1,
            )
        ]
    )
    wrapped = pv.IntervenableModel(cfg, model)

    # The corrupted prompt is the base; the clean run supplies the
    # activation swapped in at `position`.
    _, restored = wrapped(
        base=corrupted_tokens,
        sources=[clean_tokens],
        unit_locations={"sources->base": ([[[position]]], [[[position]]])},
        output_original_output=True,
    )

    distribution = torch.softmax(restored.logits[0, -1], dim=-1)
    answer_id = tokenizer.encode(" Seattle")[0]
    return distribution[answer_id].item()
# 4. Sweep over layers and positions
n_layers = model.config.n_layer
seq_len = clean_tokens["input_ids"].shape[1]
results = torch.zeros(n_layers, seq_len)
for layer in range(n_layers):
for pos in range(seq_len):
results[layer, pos] = run_causal_trace(layer, pos)
# 5. Visualize (layer x position heatmap)
# High values indicate causal importance
Test which components are necessary for a specific behavior.
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# IOI task setup
clean_prompt = "When John and Mary went to the store, Mary gave a bottle to"
corrupted_prompt = "When John and Mary went to the store, John gave a bottle to"
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
john_token = tokenizer.encode(" John")[0]
mary_token = tokenizer.encode(" Mary")[0]
def logit_diff(logits):
    """IO - S logit difference, read off the final sequence position."""
    final = logits[0, -1]
    return final[john_token] - final[mary_token]
# Patch attention output at each layer
def patch_attention(layer):
    """Patch the clean attention output into the corrupted run at `layer`
    and return the IO - S logit difference that results."""
    layer_cfg = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="attention_output",
                intervention_type=pv.VanillaIntervention,
            )
        ]
    )
    wrapped = pv.IntervenableModel(layer_cfg, model)

    _, result = wrapped(
        base=corrupted_tokens,
        sources=[clean_tokens],
    )
    return logit_diff(result.logits).item()
# Find which layers matter
results = []
for layer in range(model.config.n_layer):
diff = patch_attention(layer)
results.append(diff)
print(f"Layer {layer}: logit diff = {diff:.3f}")
Train interventions to discover causal structure.
import pyvene as pv
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
# 1. Define trainable intervention
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=6,
component="block_output",
intervention_type=pv.RotatedSpaceIntervention, # Trainable
low_rank_dimension=64, # Learn 64-dim subspace
)
]
)
intervenable = pv.IntervenableModel(config, model)
# 2. Set up training
optimizer = torch.optim.Adam(
intervenable.get_trainable_parameters(),
lr=1e-4
)
# 3. Training loop (simplified)
for base_input, source_input, target_output in dataloader:
optimizer.zero_grad()
_, outputs = intervenable(
base=base_input,
sources=[source_input],
)
loss = criterion(outputs.logits, target_output)
loss.backward()
optimizer.step()
# 4. Analyze learned intervention
# The rotation matrix reveals causal subspace
rotation = intervenable.interventions["layer.6.block_output"][0].rotate_layer
# Low-rank rotation finds interpretable subspaces
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8,
component="block_output",
intervention_type=pv.LowRankRotatedSpaceIntervention,
low_rank_dimension=1, # Find 1D causal direction
)
]
)
Steer model behavior during generation.
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Load pre-trained steering intervention
intervenable = pv.IntervenableModel.load(
"zhengxuanzenwu/intervenable_honest_llama2_chat_7B",
model=model,
)
# Generate with steering
prompt = "Is the earth flat?"
inputs = tokenizer(prompt, return_tensors="pt")
# Intervention applied during generation
outputs = intervenable.generate(
inputs,
max_new_tokens=100,
do_sample=False,
)
print(tokenizer.decode(outputs[0]))
# Save locally
intervenable.save("./my_intervention")
# Load from local
intervenable = pv.IntervenableModel.load(
"./my_intervention",
model=model,
)
# Share on HuggingFace
intervenable.save_intervention("username/my-intervention")
# Load from HuggingFace
intervenable = pv.IntervenableModel.load(
"username/my-intervention",
model=model,
)
# WRONG: Incorrect component name
config = pv.RepresentationConfig(
component="mlp", # Not valid!
)
# RIGHT: Use exact component name
config = pv.RepresentationConfig(
component="mlp_output", # Valid
)
# Ensure source and base have compatible shapes
# For position-specific interventions:
config = pv.RepresentationConfig(
unit="pos",
max_number_of_units=1, # Intervene on single position
)
# Specify locations explicitly
intervenable(
base=base_tokens,
sources=[source_tokens],
unit_locations={"sources->base": ([[[5]]], [[[5]]])}, # Position 5
)
# Use gradient checkpointing
model.gradient_checkpointing_enable()
# Or intervene on fewer components
config = pv.IntervenableConfig(
representations=[
pv.RepresentationConfig(
layer=8, # Single layer instead of all
component="block_output",
)
]
)
# pyvene v0.1.8+ supports LoRAs as interventions
config = pv.RepresentationConfig(
intervention_type=pv.LoRAIntervention,
low_rank_dimension=16,
)
| Class | Purpose |
|---|---|
IntervenableModel | Main wrapper for interventions |
IntervenableConfig | Configuration container |
RepresentationConfig | Single intervention specification |
VanillaIntervention | Activation swapping |
RotatedSpaceIntervention | Trainable DAS intervention |
CollectIntervention | Activation collection |
pyvene works with any PyTorch model. Tested on:
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for IntervenableModel, intervention types, configurations |
| references/tutorials.md | Step-by-step tutorials for causal tracing, activation patching, DAS |
| Feature | pyvene | TransformerLens | nnsight |
|---|---|---|---|
| Declarative config | Yes | No | No |
| HuggingFace sharing | Yes | No | No |
| Trainable interventions | Yes | Limited | Yes |
| Any PyTorch model | Yes | Transformers only | Yes |
| Remote execution | No | No | Yes (NDIF) |
Weekly Installs
144
Repository
GitHub Stars
22.6K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust Hub: Pass | Socket: Pass | Snyk: Warn
Installed on
claude-code119
opencode115
gemini-cli107
cursor106
codex95
antigravity94
超能力技能使用指南:AI助手技能调用优先级与工作流程详解
46,500 周安装
| DAS trainable intervention |
| Causal discovery |
CollectIntervention | Collect activations | Probing, analysis |
| Activation collection |