nnsight-remote-interpretability by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill nnsight-remote-interpretability
nnsight (/ɛn.saɪt/) enables researchers to interpret and manipulate the internals of any PyTorch model, with the unique capability of running the same code locally on small models or remotely on massive models (70B+) via NDIF.
GitHub: ndif-team/nnsight (730+ stars)
Paper: NNsight and NDIF: Democratizing Access to Foundation Model Internals (ICLR 2025)
Write once, run anywhere: the same interpretability code works on GPT-2 locally or Llama-3.1-405B remotely. Just toggle remote=True.
# Local execution (small model)
with model.trace("Hello world"):
    hidden = model.transformer.h[5].output[0].save()

# Remote execution (massive model) - same code!
with model.trace("Hello world", remote=True):
    hidden = model.model.layers[40].output[0].save()
Use nnsight when you need to:
Consider alternatives when:
# Basic installation
pip install nnsight
# For vLLM support
pip install "nnsight[vllm]"
For remote NDIF execution, sign up at login.ndif.us for an API key.
from nnsight import LanguageModel
# Load model (uses HuggingFace under the hood)
model = LanguageModel("openai-community/gpt2", device_map="auto")
# For larger models
model = LanguageModel("meta-llama/Llama-3.1-8B", device_map="auto")
The trace context manager enables deferred execution - operations are collected into a computation graph:
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in") as tracer:
    # Access any module's output
    hidden_states = model.transformer.h[5].output[0].save()
    # Access attention patterns
    attn = model.transformer.h[5].attn.attn_dropout.input[0][0].save()
    # Modify activations
    model.transformer.h[8].output[0][:] = 0  # Zero out layer 8
    # Get final logits (model.output is the HF output object; .logits is the tensor)
    logits = model.output.logits.save()

# After the context exits, access saved values
print(hidden_states.shape)  # [batch, seq, hidden]
Inside a trace, module accesses return Proxy objects that record operations:
with model.trace("Hello"):
# These are all Proxy objects - operations are deferred
h5_out = model.transformer.h[5].output[0] # Proxy
h5_mean = h5_out.mean(dim=-1) # Proxy
h5_saved = h5_mean.save() # Save for later access
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")
prompt = "The capital of France is"

with model.trace(prompt) as tracer:
    # 1. Collect activations from multiple layers
    layer_outputs = []
    for i in range(12):  # GPT-2 has 12 layers
        layer_out = model.transformer.h[i].output[0].save()
        layer_outputs.append(layer_out)
    # 2. Get attention patterns
    attn_patterns = []
    for i in range(12):
        # Access attention weights (after softmax)
        attn = model.transformer.h[i].attn.attn_dropout.input[0][0].save()
        attn_patterns.append(attn)
    # 3. Get final logits
    logits = model.output.logits.save()

# 4. Analyze outside the context
for i, layer_out in enumerate(layer_outputs):
    print(f"Layer {i} output shape: {layer_out.shape}")
    print(f"Layer {i} norm: {layer_out.norm().item():.3f}")

# 5. Find top predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_tokens = probs.topk(5)
for token, prob in zip(top_tokens.indices, top_tokens.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
Key points: call .save() on every value you need after the context exits, then analyze with ordinary tensor operations (.shape, .norm(), etc.).

The same mechanics enable activation patching: replacing activations in a corrupted run with activations cached from a clean run.

from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

# 1. Get clean activations
with model.trace(clean_prompt) as tracer:
    clean_hidden = model.transformer.h[8].output[0].save()

# 2. Patch clean into corrupted run
with model.trace(corrupted_prompt) as tracer:
    # Replace layer 8 output with clean activations
    model.transformer.h[8].output[0][:] = clean_hidden
    patched_logits = model.output.logits.save()

# 3. Compare predictions
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]
patched_probs = torch.softmax(patched_logits[0, -1], dim=-1)
print(f"Paris prob: {patched_probs[paris_token].item():.3f}")
print(f"Rome prob: {patched_probs[rome_token].item():.3f}")
def patch_layer_position(layer, position, clean_cache, corrupted_prompt):
    """Patch a single layer/position from the clean run into the corrupted run."""
    with model.trace(corrupted_prompt) as tracer:
        # Patch only the specific position at this layer
        current = model.transformer.h[layer].output[0]
        current[:, position, :] = clean_cache[layer][:, position, :]
        logits = model.output.logits.save()
    return logits

# Cache clean activations for every layer first
with model.trace(clean_prompt):
    clean_cache = [model.transformer.h[i].output[0].save() for i in range(12)]

# Sweep over all layers and positions
seq_len = len(model.tokenizer.encode(corrupted_prompt))
results = torch.zeros(12, seq_len)
for layer in range(12):
    for pos in range(seq_len):
        logits = patch_layer_position(layer, pos, clean_cache, corrupted_prompt)
        results[layer, pos] = compute_metric(logits)
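The sweep leaves compute_metric undefined; define it before running. A minimal sketch of one common choice, the logit difference between the clean and corrupted answers (the token strings are carried over from the earlier example; any scalar metric works):

def compute_metric(logits):
    """Logit difference between ' Paris' (clean answer) and ' Rome' (corrupted answer)."""
    paris_token = model.tokenizer.encode(" Paris")[0]
    rome_token = model.tokenizer.encode(" Rome")[0]
    last = logits[0, -1]  # logits at the final token position
    return (last[paris_token] - last[rome_token]).item()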
Run the same experiments on massive models without local GPUs.
from nnsight import LanguageModel

# 1. Load large model (will run remotely)
model = LanguageModel("meta-llama/Llama-3.1-70B")

# 2. Same code, just add remote=True
with model.trace("The meaning of life is", remote=True) as tracer:
    # Access internals of a 70B model!
    layer_40_out = model.model.layers[40].output[0].save()
    logits = model.output.logits.save()

# 3. Results returned from NDIF
print(f"Layer 40 shape: {layer_40_out.shape}")

# 4. Generation with interventions
with model.generate("What is 2+2?", max_new_tokens=50, remote=True):
    # Intervene during generation
    model.model.layers[20].output[0][:, -1, :] *= 1.5
    output = model.generator.output.save()
import os
os.environ["NDIF_API_KEY"] = "your_key"
# Or configure directly
from nnsight import CONFIG
CONFIG.API_KEY = "your_key"
Share activations between different inputs in a single trace.
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace() as tracer:
    # First prompt
    with tracer.invoke("The cat sat on the"):
        cat_hidden = model.transformer.h[6].output[0].save()
    # Second prompt - inject cat's activations
    with tracer.invoke("The dog ran through the"):
        # Replace with cat's activations at layer 6
        model.transformer.h[6].output[0][:] = cat_hidden
        dog_with_cat = model.output.save()

# The dog prompt now has cat's internal representations
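To inspect the effect, decode the saved output after the trace. A short sketch, assuming dog_with_cat resolves to the standard HuggingFace output object with a .logits field:

import torch

# Top predictions for the injected run
probs = torch.softmax(dog_with_cat.logits[0, -1], dim=-1)
top = probs.topk(3)
for tok, p in zip(top.indices, top.values):
    print(f"{model.tokenizer.decode(tok)}: {p.item():.3f}")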
Access gradients during backward pass.
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The quick brown fox") as tracer:
    # Save the activation and its gradient
    hidden = model.transformer.h[5].output[0].save()
    grad = model.transformer.h[5].output[0].grad.save()
    logits = model.output.logits
    # Compute loss on a specific token
    target_token = model.tokenizer.encode(" jumps")[0]
    loss = -logits[0, -1, target_token]
    # Backward pass populates the saved gradient
    loss.backward()

print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm().item():.3f}")
Note: gradient access is not supported for vLLM or remote execution.
# GPT-2 structure
model.transformer.h[5].output[0]
# LLaMA structure
model.model.layers[5].output[0]
# Solution: Check model structure
print(model._model) # See actual module names
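To discover the right path for an unfamiliar architecture, walking the wrapped module tree also works; a small sketch using standard PyTorch named_modules (not an nnsight-specific API):

# List shallow module paths to spot layer containers like transformer.h or model.layers
for name, _ in model._model.named_modules():
    if name and name.count(".") <= 1:
        print(name)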
# WRONG: value not accessible outside the trace
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0]  # Not saved!
print(hidden)  # Error or wrong value

# RIGHT: call .save()
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0].save()
print(hidden)  # Works!
# For long-running remote jobs, increase the timeout
with model.trace("prompt", remote=True, timeout=300) as tracer:
    ...  # long operation
# Only save what you need
with model.trace("prompt"):
    # Don't save everything
    for i in range(12):
        model.transformer.h[i].output[0].save()  # Memory heavy!

# Better: save specific layers
with model.trace("prompt"):
    key_layers = [0, 5, 11]
    for i in key_layers:
        model.transformer.h[i].output[0].save()
# vLLM doesn't support gradients
# Use standard execution for gradient analysis
model = LanguageModel("gpt2", device_map="auto") # Not vLLM
| Method/Property | Purpose |
|---|---|
| model.trace(prompt, remote=False) | Start a tracing context |
| proxy.save() | Save a value for access after the trace |
| proxy[:] | Slice/index a proxy (assignment patches) |
| tracer.invoke(prompt) | Add a prompt within a trace |
| model.generate(...) | Generate with interventions |
| model.output | Final model output (use .logits for token logits) |
| model._model | Underlying HuggingFace model |
| Feature | nnsight | TransformerLens | pyvene |
|---|---|---|---|
| Any architecture | Yes | Transformers only | Yes |
| Remote execution | Yes (NDIF) | No | No |
| Consistent API | No | Yes | Yes |
| Deferred execution | Yes | No | No |
| HuggingFace native | Yes | Reimplemented | Yes |
| Shareable configs | No | No | Yes |
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for LanguageModel, tracing, proxy objects |
| references/tutorials.md | Step-by-step tutorials for local and remote interpretability |
nnsight works with any PyTorch model; the key is knowing the module structure so you can access the right components.
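For non-language models, nnsight's generic wrapper applies the same trace/proxy mechanics; a toy sketch, assuming the NNsight wrapper class exposed by the package (the module and its names here are purely illustrative):

import torch
from nnsight import NNsight

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(8, 16)
        self.fc2 = torch.nn.Linear(16, 2)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = NNsight(Tiny())
with model.trace(torch.randn(1, 8)):
    h = model.fc1.output.save()  # same proxy mechanics as LanguageModel
print(h.shape)  # torch.Size([1, 16])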