alphafold-database by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill alphafold-database
AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access structure predictions with confidence metrics, download coordinate files, retrieve bulk datasets, and integrate predictions into computational workflows.
Use this skill when working with AI-predicted protein structures: retrieving structure predictions, assessing prediction confidence, downloading coordinate files, or running bulk structural analyses.
Using Biopython (Recommended):
The Biopython library provides the simplest interface for retrieving AlphaFold structures:
from Bio.PDB import alphafold_db

# Get all predictions for a UniProt accession
predictions = list(alphafold_db.get_predictions("P00520"))

# Download structure files (mmCIF format)
for prediction in predictions:
    cif_file = alphafold_db.download_cif_for(prediction, directory="./structures")
    print(f"Downloaded: {cif_file}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))
Direct API Access:
Query predictions using REST endpoints:
import requests
# Get prediction metadata for a UniProt accession
uniprot_id = "P00520"
api_url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(api_url)
prediction_data = response.json()
# Extract AlphaFold ID
alphafold_id = prediction_data[0]['entryId']
print(f"AlphaFold ID: {alphafold_id}")
Using UniProt to Find Accessions:
Search UniProt to find protein accessions first. Note that the uploadlists endpoint below is UniProt's legacy ID-mapping service, which was retired in 2022; it is kept here for reference, and the current REST ID-mapping API at rest.uniprot.org should be preferred:

import urllib.parse, urllib.request

def get_uniprot_ids(query, query_type='PDB_ID'):
    """Query UniProt to get accession IDs (legacy uploadlists endpoint)."""
    url = 'https://www.uniprot.org/uploadlists/'
    params = {
        'from': query_type,
        'to': 'ACC',
        'format': 'txt',
        'query': query
    }
    data = urllib.parse.urlencode(params).encode('ascii')
    with urllib.request.urlopen(urllib.request.Request(url, data)) as response:
        return response.read().decode('utf-8').splitlines()

# Example: find UniProt IDs for a gene name
protein_ids = get_uniprot_ids("hemoglobin", query_type="GENE_NAME")
AlphaFold provides multiple file formats for each prediction:
File Types Available:
- Model (model_v4.cif): atomic coordinates in mmCIF/PDBx format
- Confidence (confidence_v4.json): per-residue pLDDT scores (0-100)
- PAE (predicted_aligned_error_v4.json): PAE matrix for residue-pair confidence

Download URLs:
import requests
alphafold_id = "AF-P00520-F1"
version = "v4"
# Model coordinates (mmCIF)
model_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.cif"
response = requests.get(model_url)
with open(f"{alphafold_id}.cif", "w") as f:
    f.write(response.text)
# Confidence scores (JSON)
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_{version}.json"
response = requests.get(confidence_url)
confidence_data = response.json()
# Predicted Aligned Error (JSON)
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_{version}.json"
response = requests.get(pae_url)
pae_data = response.json()
PDB Format (Alternative):
# Download as PDB format instead of mmCIF
pdb_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.pdb"
response = requests.get(pdb_url)
with open(f"{alphafold_id}.pdb", "wb") as f:
    f.write(response.content)
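The per-entry file URLs above all follow one pattern, which can be captured in a small helper. A sketch, using the file-type suffixes listed earlier; the function and dictionary names are ours:

```python
AF_FILES_BASE = "https://alphafold.ebi.ac.uk/files"

# Suffix and extension for each per-entry file type.
FILE_TYPES = {
    "model_cif": ("model", "cif"),
    "model_pdb": ("model", "pdb"),
    "confidence": ("confidence", "json"),
    "pae": ("predicted_aligned_error", "json"),
}

def alphafold_file_url(alphafold_id, file_type="model_cif", version="v4"):
    """Build the download URL for one AlphaFold DB file."""
    suffix, ext = FILE_TYPES[file_type]
    return f"{AF_FILES_BASE}/{alphafold_id}-{suffix}_{version}.{ext}"

print(alphafold_file_url("AF-P00520-F1", "pae"))
# → https://alphafold.ebi.ac.uk/files/AF-P00520-F1-predicted_aligned_error_v4.json
```

Centralizing the URL pattern keeps the version suffix in one place when the database moves beyond v4.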
AlphaFold predictions include confidence estimates critical for interpretation:
pLDDT (per-residue confidence):
import requests
# Load confidence scores
alphafold_id = "AF-P00520-F1"
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
confidence = requests.get(confidence_url).json()
# Extract pLDDT scores
plddt_scores = confidence['confidenceScore']
# Interpret confidence levels
# pLDDT > 90: Very high confidence
# pLDDT 70-90: High confidence
# pLDDT 50-70: Low confidence
# pLDDT < 50: Very low confidence
high_confidence_residues = [i for i, score in enumerate(plddt_scores) if score > 90]
print(f"High confidence residues: {len(high_confidence_residues)}/{len(plddt_scores)}")
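The threshold comments above can be turned into a small classifier; the band names follow the AlphaFold DB conventions quoted in the comments, and `plddt_category` is a hypothetical helper name:

```python
def plddt_category(score):
    """Map a pLDDT score (0-100) to its AlphaFold DB confidence band."""
    if score > 90:
        return "very high"
    if score > 70:
        return "high"
    if score > 50:
        return "low"
    return "very low"

scores = [95.2, 82.1, 61.4, 33.0]
print([plddt_category(s) for s in scores])
# → ['very high', 'high', 'low', 'very low']
```

Applying this over a full `plddt_scores` list gives a quick per-residue confidence profile without plotting.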
PAE (Predicted Aligned Error):
PAE indicates confidence in relative domain positions:
import numpy as np
import matplotlib.pyplot as plt
import requests

# Load the PAE matrix (the file contains a list with a single record)
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()

# Visualize the PAE matrix (v4 stores it under 'predicted_aligned_error')
pae_matrix = np.array(pae[0]['predicted_aligned_error'])
plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {alphafold_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{alphafold_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE values (<5 Å) indicate confident relative positioning
# High PAE values (>15 Å) suggest uncertain domain arrangements
For large-scale analyses, use Google Cloud datasets:
Google Cloud Storage:
# Install gsutil
uv pip install gsutil
# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/
# Download entire proteomes (by taxonomy ID)
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*.tar .
# Download specific files
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .
BigQuery Metadata Access:
from google.cloud import bigquery
# Initialize client
client = bigquery.Client()
# Query metadata
query = """
SELECT
entryId,
uniprotAccession,
organismScientificName,
globalMetricValue,
fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
AND fractionPlddtVeryHigh > 0.8
LIMIT 100
"""
results = client.query(query).to_dataframe()
print(f"Found {len(results)} high-confidence human proteins")
Download by Species:
import subprocess

def download_proteome(taxonomy_id, output_dir="./proteomes"):
    """Download all AlphaFold predictions for a species"""
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    cmd = f"gsutil -m cp {pattern} {output_dir}/"
    subprocess.run(cmd, shell=True, check=True)

# Download the E. coli proteome (tax ID: 83333)
download_proteome(83333)

# Download the human proteome (tax ID: 9606)
download_proteome(9606)
Work with downloaded AlphaFold structures using Biopython:

from Bio.PDB import MMCIFParser
import numpy as np

# Parse mmCIF file
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")

# Extract coordinates
coords = []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:  # alpha carbons only
                coords.append(residue['CA'].get_coord())
coords = np.array(coords)
print(f"Structure has {len(coords)} residues")

# Calculate distances
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(coords))

# Identify contacts (< 8 Å)
contacts = np.where((distance_matrix > 0) & (distance_matrix < 8))
print(f"Number of contacts: {len(contacts[0]) // 2}")
Extract B-factors (pLDDT values):
AlphaFold stores pLDDT scores in the B-factor column:
from Bio.PDB import MMCIFParser

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")

# Extract pLDDT from B-factors
plddt_scores = []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                plddt_scores.append(residue['CA'].get_bfactor())

# Identify high-confidence regions
high_conf_regions = [(i, score) for i, score in enumerate(plddt_scores, 1) if score > 90]
print(f"High confidence residues: {len(high_conf_regions)}")
Process multiple predictions efficiently:
from Bio.PDB import alphafold_db
import numpy as np
import pandas as pd
import requests

uniprot_ids = ["P00520", "P12931", "P04637"]  # multiple proteins
results = []
for uniprot_id in uniprot_ids:
    try:
        # Get predictions
        predictions = list(alphafold_db.get_predictions(uniprot_id))
        if predictions:
            pred = predictions[0]
            # Download structure
            cif_file = alphafold_db.download_cif_for(pred, directory="./batch_structures")
            # Get confidence data
            alphafold_id = pred['entryId']
            conf_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
            conf_data = requests.get(conf_url).json()
            # Calculate statistics
            plddt_scores = conf_data['confidenceScore']
            avg_plddt = np.mean(plddt_scores)
            high_conf_fraction = sum(1 for s in plddt_scores if s > 90) / len(plddt_scores)
            results.append({
                'uniprot_id': uniprot_id,
                'alphafold_id': alphafold_id,
                'avg_plddt': avg_plddt,
                'high_conf_fraction': high_conf_fraction,
                'length': len(plddt_scores)
            })
    except Exception as e:
        print(f"Error processing {uniprot_id}: {e}")

# Create summary DataFrame
df = pd.DataFrame(results)
print(df)
# Install Biopython for structure access
uv pip install biopython
# Install requests for API access
uv pip install requests
# For visualization and analysis
uv pip install numpy matplotlib pandas scipy
# For Google Cloud access (optional)
uv pip install google-cloud-bigquery gsutil
AlphaFold can also be accessed via the 3D-Beacons federated API:
import requests
# Query via 3D-Beacons
uniprot_id = "P00520"
url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
response = requests.get(url)
data = response.json()
# Filter for AlphaFold structures (each entry nests its metadata under 'summary')
af_structures = [s for s in data['structures'] if s['summary']['provider'] == 'AlphaFold DB']
UniProt Accession: Primary identifier for proteins (e.g., "P00520"). Required for querying AlphaFold DB.
AlphaFold ID: Internal identifier format: AF-[UniProt accession]-F[fragment number] (e.g., "AF-P00520-F1").
pLDDT (predicted Local Distance Difference Test): Per-residue confidence metric (0-100). Higher values indicate more confident predictions.
PAE (Predicted Aligned Error): Matrix indicating confidence in relative positions between residue pairs. Low values (<5 Å) suggest confident relative positioning.
Database Version: Current version is v4. File URLs include version suffix (e.g., model_v4.cif).
Fragment Number: Large proteins may be split into fragments. Fragment number appears in AlphaFold ID (e.g., F1, F2).
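The identifier format above is regular enough to parse programmatically. A small sketch; the helper name and return shape are ours, not part of any official API:

```python
import re

AF_ID_PATTERN = re.compile(r"^AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)$")

def parse_alphafold_id(alphafold_id):
    """Split an AlphaFold ID into its UniProt accession and fragment number."""
    match = AF_ID_PATTERN.match(alphafold_id)
    if match is None:
        raise ValueError(f"Not a valid AlphaFold ID: {alphafold_id!r}")
    return match["accession"], int(match["fragment"])

print(parse_alphafold_id("AF-P00520-F1"))  # → ('P00520', 1)
```

This is handy when iterating a proteome tarball, where fragment files for one accession (F1, F2, ...) need to be grouped back together.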
pLDDT Thresholds: >90 very high confidence (suitable for detailed interpretation); 70-90 high (backbone generally reliable); 50-70 low (treat with caution); <50 very low (often indicates disorder).
PAE Guidelines: <5 Å indicates confident relative positioning between residue pairs; >15 Å suggests the relative domain arrangement is uncertain.
A comprehensive API reference accompanies this skill. Consult it for detailed endpoint documentation, bulk download strategies, or when working with large-scale datasets.