pubchem-database by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill pubchem-database
PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES; retrieve molecular properties; perform similarity and substructure searches; and access bioactivity data using the PUG-REST API and PubChemPy.
This skill should be used when:
Search for compounds using multiple identifier types:
By Chemical Name:
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
By CID (Compound ID):
compound = pcp.Compound.from_cid(2244)  # Aspirin
By SMILES:
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
By InChI:
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
By Molecular Formula:
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
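When identifiers arrive from mixed sources, it can help to pick the namespace before calling `pcp.get_compounds`. The following is a minimal sketch, not part of the skill: `guess_namespace` is a hypothetical helper, and it only recognizes the unambiguous cases (numeric CID, InChI prefix, InChIKey layout). SMILES strings and formulas cannot be told apart from names reliably (`CCO` is valid as all three), so everything else falls back to `'name'` and those namespaces should be passed explicitly.

```python
import re

# InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
_INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier):
    """Return a PubChemPy namespace for the unambiguous identifier types.

    Hypothetical helper: falls back to 'name' when nothing else matches.
    """
    text = str(identifier).strip()
    if text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if _INCHIKEY_RE.match(text):
        return "inchikey"
    return "name"


for example in (2244, "InChI=1S/CH4/h1H4", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "aspirin"):
    print(example, "->", guess_namespace(example))
```

The result can then be passed straight through, for example `pcp.get_compounds(identifier, guess_namespace(identifier))`.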
Retrieve molecular properties for compounds using either high-level or low-level approaches:
Using PubChemPy (Recommended):
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
Get Specific Properties:
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
Batch Property Retrieval:
import pandas as pd
import pubchempy as pcp
compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
    # One request per name; each call returns a list of property dictionaries
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)
df = pd.DataFrame(all_properties)
Available Properties: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see references/api_reference.md for the complete list).
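The low-level route mentioned above goes through PUG-REST URLs directly. As a rough sketch of how such a URL is assembled, the helper below follows the `compound/<namespace>/<identifier>/property/<list>/<output>` pattern; `property_url` and `PUG_REST_BASE` are illustrative names, not part of PubChemPy or of this skill's scripts.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(identifier, namespace, properties, output="JSON"):
    """Build a PUG-REST URL that requests a property table for one compound.

    Hypothetical helper: the identifier is percent-encoded, the property
    names are joined with commas as PUG-REST expects.
    """
    ident = quote(str(identifier), safe=",")
    props = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/{namespace}/{ident}/property/{props}/{output}"


url = property_url("aspirin", "name", ["MolecularFormula", "MolecularWeight"])
print(url)
```

Fetching that URL with `requests.get(url).json()` should yield the rows under `["PropertyTable"]["Properties"]`. SMILES input containing `/` or `#` is usually safer sent in a POST body than embedded in the path.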
Find structurally similar compounds using Tanimoto similarity:
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f" MW: {compound.molecular_weight}")
Note: Similarity searches on large queries run asynchronously and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
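To make the `Threshold` value more concrete, here is a small worked example of the Tanimoto coefficient itself: shared fingerprint bits divided by the bits set in either molecule. The bit sets are toy data chosen for illustration; PubChem computes the score on its own substructure fingerprints, which this sketch does not reproduce.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 0.0  # Convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints: positions of the bits that are set
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}

score = tanimoto(query, candidate)
print(f"Tanimoto = {score:.2f}")  # 3 shared bits / 6 distinct bits = 0.50
print("Passes Threshold=85:", score * 100 >= 85)
```

A threshold of 85 therefore keeps only candidates whose fingerprints overlap with the query by at least 0.85 on this scale.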
Find compounds containing a specific structural motif:
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
Common Substructures:
- Benzene ring: c1ccccc1
- Pyridine: c1ccncc1
- Phenol: c1ccc(O)cc1
- Carboxylic acid: C(=O)O
Convert between different chemical structure formats:
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
Generate 2D structure images:
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # Fail early instead of writing an error body to disk
with open('structure.png', 'wb') as f:
    f.write(response.content)
Get all known names and synonyms for a compound:
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f" - {syn}")
Retrieve biological activity data from assays:
import requests
# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
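The rows returned above are positional, so they are easier to work with once paired with their column names. The sketch below assumes the table carries its header under `Columns` -> `Column` and each row's values under `Cell`, and that one column is named `Activity Outcome`; both are assumptions about the response layout and worth checking against a real response. `table_to_records` and `count_outcomes` are hypothetical helpers, and the sample table is made-up data.

```python
from collections import Counter


def table_to_records(table):
    """Turn an assay summary table into a list of {column: value} dicts.

    Assumes the layout {'Columns': {'Column': [...]}, 'Row': [{'Cell': [...]}]}.
    """
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


def count_outcomes(records, column="Activity Outcome"):
    """Count records per outcome label; missing labels are grouped as 'Unknown'."""
    return Counter(rec.get(column, "Unknown") for rec in records)


# Made-up table in the assumed layout, standing in for data.get('Table', {})
sample_table = {
    "Columns": {"Column": ["AID", "Activity Outcome"]},
    "Row": [
        {"Cell": ["1", "Active"]},
        {"Cell": ["2", "Inactive"]},
        {"Cell": ["3", "Inactive"]},
    ],
}

records = table_to_records(sample_table)
print(count_outcomes(records))
```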
For more complex bioactivity queries, use the scripts/bioactivity_query.py helper script, which provides:
Access detailed compound information through PUG-View:
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
Get Specific Section:
# Get only drug information
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug and Medication Information"
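Once the full PUG-View record has been downloaded, a section can also be located client-side. PUG-View nests sections under a `Section` list, each with a `TOCHeading`, so a short recursive search is enough. This is a sketch on a hand-made sample record; `find_section` is a hypothetical helper, not one of the skill's scripts.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading matches."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


# Hand-made stand-in for response.json(); real records are far larger
sample = {
    "Record": {
        "RecordNumber": 2244,
        "Section": [
            {
                "TOCHeading": "Names and Identifiers",
                "Section": [{"TOCHeading": "Computed Descriptors"}],
            },
            {"TOCHeading": "Safety and Hazards"},
        ],
    }
}

hit = find_section(sample["Record"]["Section"], "Computed Descriptors")
print(hit["TOCHeading"] if hit else "not found")
```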
Install PubChemPy for Python-based access:
uv pip install pubchempy
For direct API access and bioactivity queries:
uv pip install requests
Optional for data analysis:
uv pip install pandas
This skill includes Python scripts for common PubChem tasks:
scripts/compound_search.py provides utility functions for searching and retrieving compound information:
Key Functions:
- search_by_name(name, max_results=10): Search compounds by name
- search_by_smiles(smiles): Search by SMILES string
- get_compound_by_cid(cid): Retrieve compound by CID
- get_compound_properties(identifier, namespace, properties): Get specific properties
- similarity_search(smiles, threshold, max_records): Perform similarity search
- substructure_search(smiles, max_records): Perform substructure search
- get_synonyms(identifier, namespace): Get all synonyms
- batch_search(identifiers, namespace, properties): Batch search multiple compounds
Usage:
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
scripts/bioactivity_query.py provides functions for retrieving biological activity data:
Key Functions:
- get_bioassay_summary(cid): Get bioassay summary for compound
- get_compound_bioactivities(cid, activity_outcome): Get filtered bioactivities
- get_assay_description(aid): Get detailed assay information
- get_assay_targets(aid): Get biological targets for assay
- search_assays_by_target(target_name, max_results): Find assays by target
- get_active_compounds_in_assay(aid, max_results): Get active compounds
- get_compound_annotations(cid, section): Get PUG-View annotations
- summarize_bioactivities(cid): Generate bioactivity summary statistics
Usage:
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
Rate Limits:
Best Practices:
Error Handling:
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError
try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    # get_compounds returned an empty list
    print("No results returned")
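Timeouts are often transient, so a retry with a growing pause is a common complement to the handlers above. The following is a minimal sketch under that assumption: `call_with_retry` is a hypothetical wrapper, the delays are arbitrary starting values, and the `sleep` parameter exists only so the pause can be replaced in tests. PubChem's usage policy asks clients to stay at or below five requests per second, which is one more reason to pause between attempts.

```python
import time


def call_with_retry(func, *args, attempts=3, delay=0.25,
                    retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying on the given exceptions with exponential backoff.

    The pause doubles after every failed attempt (delay, 2*delay, 4*delay, ...).
    The last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay * 2 ** (attempt - 1))


# Intended use with PubChemPy (not executed here, needs network access):
#   import pubchempy as pcp
#   compounds = call_with_retry(
#       pcp.get_compounds, 'aspirin', 'name',
#       retry_on=(pcp.TimeoutError,),
#   )
```

Restricting `retry_on` to timeout-style errors keeps genuine mistakes, such as a malformed request, from being retried pointlessly.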
Convert between different chemical identifiers:
import pubchempy as pcp
# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]
# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
Screen compounds using Lipinski's Rule of Five:
import pubchempy as pcp
def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]
    # molecular_weight may come back as a string, so convert before comparing
    mw = float(compound.molecular_weight)
    # Lipinski's Rule of Five; None marks a rule that could not be evaluated
    rules = {
        'MW <= 500': mw <= 500,
        # Compare against None explicitly: an XLogP of 0 is a valid value
        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations
rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
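The same check can be run on the dictionaries returned by `pcp.get_properties`, which avoids fetching full compound records when screening many compounds. The version below is a sketch that works on a plain dictionary so it can be tried offline; `lipinski_report` and `LIMITS` are illustrative names, and the aspirin values in the sample are typed in by hand rather than fetched.

```python
# Upper limits of Lipinski's Rule of Five, keyed by PubChem property name
LIMITS = {
    "MolecularWeight": 500,
    "XLogP": 5,
    "HBondDonorCount": 5,
    "HBondAcceptorCount": 10,
}


def _to_float(value):
    """Convert numbers or numeric strings to float; return None otherwise."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def lipinski_report(props):
    """Return ({property: True/False/None}, violation_count) for one compound.

    None means the property was missing and the rule was not evaluated.
    """
    report = {}
    for key, limit in LIMITS.items():
        value = _to_float(props.get(key))
        report[key] = None if value is None else value <= limit
    violations = sum(1 for ok in report.values() if ok is False)
    return report, violations


# Hand-entered sample in the shape of one get_properties() row
aspirin = {"CID": 2244, "MolecularWeight": "180.16", "XLogP": 1.2,
           "HBondDonorCount": 1, "HBondAcceptorCount": 4}
report, violations = lipinski_report(aspirin)
print(report, violations)
```

Keeping missing values as `None` rather than counting them as failures mirrors the behaviour of `check_drug_likeness` above.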
Identify structurally similar compounds to a known drug:
import pubchempy as pcp
# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles
# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)
# Filter by drug-like properties
candidates = []
for comp in similar:
    # Skip compounds with missing values; an XLogP of 0 is still valid
    if comp.molecular_weight is None or comp.xlogp is None:
        continue
    # molecular_weight may come back as a string, so convert before comparing
    if 200 <= float(comp.molecular_weight) <= 600 and -1 <= comp.xlogp <= 5:
        candidates.append(comp)
print(f"Found {len(candidates)} drug-like candidates")
Compare properties across multiple compounds:
import pubchempy as pcp
import pandas as pd
compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")
df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
Screen for compounds containing specific pharmacophores:
import pubchempy as pcp
# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'
# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
# Further filter by properties
filtered_hits = [
    comp for comp in hits
    # molecular_weight may come back as a string, so convert before comparing
    if comp.molecular_weight and float(comp.molecular_weight) < 500
]
print(f"Found {len(filtered_hits)} compounds with desired substructure")
For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult references/api_reference.md. This comprehensive reference includes:
Compound Not Found:
Timeout Errors:
Empty Property Values:
if compound.xlogp:
Rate Limit Exceeded:
Similarity/Substructure Search Hangs:
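For the empty-property case listed above, one thing to keep in mind is that a plain truthiness test also rejects a legitimate value of zero. A small formatting helper that checks for `None` explicitly sidesteps this; `fmt_prop` is a hypothetical name used only for this sketch.

```python
def fmt_prop(value, digits=2, missing="n/a"):
    """Format a property for display, treating only None as missing."""
    if value is None:
        return missing
    try:
        return f"{float(value):.{digits}f}"
    except (TypeError, ValueError):
        return str(value)  # Non-numeric values such as formulas pass through


print(fmt_prop(None))      # n/a
print(fmt_prop(0.0))       # 0.00 (a truthiness check would have skipped this)
print(fmt_prop("180.16"))  # 180.16
print(fmt_prop("C9H8O4"))  # C9H8O4
```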
Additional functions in scripts/compound_search.py:
- download_structure(identifier, namespace, format, filename): Download structures
- print_compound_info(compound): Print formatted compound information
Additional function in scripts/bioactivity_query.py:
- find_compounds_by_bioactivity(target, threshold, max_compounds): Find compounds by target