pytdc by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill pytdc
PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).
This skill should be used when:
Install PyTDC from PyPI (shown here with uv's pip interface; plain pip install PyTDC also works):
uv pip install PyTDC
To upgrade to the latest version:
uv pip install PyTDC --upgrade
Core dependencies (automatically installed):
Additional packages are installed automatically as needed for specific features.
The basic pattern for accessing any TDC dataset follows this structure:
from tdc.<problem> import <Task>
data = <Task>(name='<Dataset>')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
df = data.get_data(format='df')
Where:
<problem>: One of single_pred, multi_pred, or generation
<Task>: Specific task category (e.g., ADME, DTI, MolGen)
<Dataset>: Dataset name within that task
Example - Loading ADME data:
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold')
# Returns dict with 'train', 'valid', 'test' DataFrames
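To make the seed and frac arguments concrete, here is a minimal pure-Python sketch of a reproducible fractional split. It is an illustration only, not TDC's implementation; the function name random_split and the toy input are assumptions.

```python
import random

def random_split(records, frac=(0.7, 0.1, 0.2), seed=1):
    """Shuffle records reproducibly and cut them into train/valid/test."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }

# Toy stand-in for 100 dataset rows
split = random_split(range(100), frac=(0.7, 0.1, 0.2), seed=1)
sizes = {name: len(part) for name, part in split.items()}
print(sizes)
```

Calling the function again with the same seed returns the same partition, which is why a seed is passed wherever results need to be reproducible.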
Single-instance prediction involves forecasting properties of individual biomedical entities (molecules, proteins, etc.).
Predict pharmacokinetic properties of drug molecules.
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang') # Intestinal permeability
# Other datasets: HIA_Hou, Bioavailability_Ma, Lipophilicity_AstraZeneca, etc.
Common ADME datasets:
Predict toxicity and adverse effects of compounds.
from tdc.single_pred import Tox
data = Tox(name='hERG') # Cardiotoxicity
# Other datasets: AMES, DILI, Carcinogens_Lagunin, etc.
Common toxicity datasets:
Bioactivity predictions from screening data.
from tdc.single_pred import HTS
data = HTS(name='SARSCoV2_Vitro_Touret')
Quantum mechanical properties of molecules.
from tdc.single_pred import QM
data = QM(name='QM7')
Single prediction datasets typically return DataFrames with columns:
Drug_ID or Compound_ID: Unique identifier
Drug or X: SMILES string or molecular representation
Y: Target label (continuous or binary)
Multi-instance prediction involves forecasting properties of interactions between multiple biomedical entities.
Predict binding affinity between drugs and protein targets.
from tdc.multi_pred import DTI
data = DTI(name='BindingDB_Kd')
split = data.get_split()
Available datasets:
Data format: Drug_ID, Target_ID, Drug (SMILES), Target (sequence), Y (binding affinity)
Predict interactions between drug pairs.
from tdc.multi_pred import DDI
data = DDI(name='DrugBank')
split = data.get_split()
Multi-class classification task predicting interaction types. Dataset contains 191,808 DDI pairs with 1,706 drugs.
Predict protein-protein interactions.
from tdc.multi_pred import PPI
data = PPI(name='HuRI')
Generation tasks involve creating novel biomedical entities with desired properties.
Generate diverse, novel molecules with desirable chemical properties.
from tdc.generation import MolGen
data = MolGen(name='ChEMBL_V29')
split = data.get_split()
Use with oracles to optimize for specific properties:
from tdc import Oracle
oracle = Oracle(name='GSK3B')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O') # Evaluate SMILES
See references/oracles.md for all available oracle functions.
Predict reactants needed to synthesize a target molecule.
from tdc.generation import RetroSyn
data = RetroSyn(name='USPTO')
split = data.get_split()
Dataset contains 1,939,253 reactions from USPTO database.
Generate molecule pairs (e.g., prodrug-drug pairs).
from tdc.generation import PairMolGen
data = PairMolGen(name='Prodrug')
For detailed oracle documentation and molecular generation workflows, refer to references/oracles.md and scripts/molecular_generation.py.
Benchmark groups provide curated collections of related datasets for systematic model evaluation.
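from tdc.benchmark_group import admet_group
group = admet_group(path='data/')
predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    # Get the benchmark dataset (keys: 'train_val', 'test', 'name')
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    # Split train_val into train and valid for this seed
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train model here (the user supplies `model`)
    y_pred_test = model.predict(test['Drug'])
    predictions_list.append({name: y_pred_test})
# Evaluate across the required 5 seeds (reports mean and standard deviation)
results = group.evaluate_many(predictions_list)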
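The point of the 5-seed protocol is to report a score together with its spread across runs. The sketch below shows that aggregation in plain Python; the helper name summarize_seeds and the MAE values are hypothetical, and the use of the sample standard deviation is an assumption (a library may use the population form instead).

```python
from statistics import mean, stdev

def summarize_seeds(scores_by_seed):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    scores = list(scores_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return round(mean(scores), 3), round(stdev(scores), 3)

# Hypothetical MAE values from five runs, keyed by seed
mae_by_seed = {1: 0.40, 2: 0.42, 3: 0.38, 4: 0.41, 5: 0.39}
avg, spread = summarize_seeds(mae_by_seed)
print(f"MAE: {avg} +/- {spread}")
```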
ADMET Group includes 22 datasets covering absorption, distribution, metabolism, excretion, and toxicity.
Available benchmark groups include collections for:
For benchmark evaluation workflows, see scripts/benchmark_evaluation.py.
TDC provides comprehensive data processing utilities organized into four categories.
Retrieve train/validation/test partitions with various strategies:
# Scaffold split (holds out unseen scaffolds; request it explicitly with method='scaffold')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
# Random split
split = data.get_split(method='random', seed=42, frac=[0.8, 0.1, 0.1])
# Cold split (for DTI/DDI tasks)
split = data.get_split(method='cold_drug', seed=1) # Unseen drugs in test
split = data.get_split(method='cold_target', seed=1) # Unseen targets in test
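Scaffold and cold splits share one idea: every record that belongs to the same group (a scaffold, a drug, a target) must land in the same partition. The following is a minimal sketch of that idea, assuming the group keys are already computed; group_split and the scaffold labels are hypothetical, and this is not TDC's own algorithm.

```python
from collections import defaultdict

def group_split(keys, frac=(0.7, 0.1, 0.2)):
    """Assign whole groups to train/valid/test so no group spans two sets.

    keys[i] is the group of record i (a scaffold string, a drug ID, ...).
    Returns a dict of record indices.
    """
    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)
    # Largest groups first, so the most common groups land in train
    ordered = sorted(groups.values(), key=len, reverse=True)
    train_cap = round(frac[0] * len(keys))
    valid_cap = round(frac[1] * len(keys))
    split = {"train": [], "valid": [], "test": []}
    for members in ordered:
        if len(split["train"]) + len(members) <= train_cap:
            split["train"].extend(members)
        elif len(split["valid"]) + len(members) <= valid_cap:
            split["valid"].extend(members)
        else:
            split["test"].extend(members)
    return split

# Hypothetical precomputed scaffold keys for ten molecules
scaffolds = ["benzene"] * 4 + ["pyridine"] * 3 + ["indole", "furan", "thiophene"]
parts = group_split(scaffolds)
print(parts)
```

Passing drug IDs or target IDs as keys instead of scaffolds gives the cold-split behaviour, where the test set contains only entities never seen in training.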
Available split strategies:
random: Random shuffling
scaffold: Scaffold-based (for chemical diversity)
cold_drug, cold_target, cold_drug_target: For DTI tasks
temporal: Time-based splits for temporal datasets
Use standardized metrics for evaluation:
from tdc import Evaluator
# For binary classification
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred)
# For regression
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)
Available metrics: ROC-AUC, PR-AUC, F1, Accuracy, RMSE, MAE, R2, Spearman, Pearson, and more.
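For reference, three of the metrics listed above can be written out in a few lines of plain Python. This is a sketch of the textbook definitions, not the Evaluator's code; the function names are assumptions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(rmse([1, 2, 3], [1, 2, 5]))
print(mae([1, 2, 3], [1, 2, 5]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise ROC-AUC is quadratic in the number of samples, which is fine for a sanity check but not for large test sets.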
TDC provides 11 key processing utilities:
from tdc.chem_utils import MolConvert
# Molecule format conversion
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
Processing utilities include:
For comprehensive utilities documentation, see references/utilities.md.
TDC provides 17+ oracle functions for molecular optimization:
from tdc import Oracle
# Single oracle
oracle = Oracle(name='DRD2')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
# One oracle, several molecules (pass a list of SMILES strings)
oracle = Oracle(name='JNK3')
scores = oracle(['CCO', 'c1ccccc1', 'CC(C)Cc1ccc(cc1)C(C)C(O)=O'])
For complete oracle documentation, see references/oracles.md.
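An oracle, as used above, is simply a callable that maps one SMILES string (or a list of them) to a score, which makes it easy to rank candidates. The sketch below imitates that calling convention with a hypothetical ToyOracle whose score has no chemical meaning; it only illustrates the interface and a top-k selection step.

```python
class ToyOracle:
    """Hypothetical stand-in scorer: fraction of characters that are 'C' or 'c'."""

    def score_one(self, smiles):
        if not smiles:
            return 0.0
        return sum(ch in "Cc" for ch in smiles) / len(smiles)

    def __call__(self, smiles):
        # Accept one SMILES string or a list of them
        if isinstance(smiles, str):
            return self.score_one(smiles)
        return [self.score_one(s) for s in smiles]

def top_k(candidates, oracle, k=2):
    """Return the k highest-scoring (candidate, score) pairs."""
    scores = oracle(candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy = ToyOracle()
best = top_k(["CCO", "c1ccccc1", "O=O", "CCCC"], toy, k=2)
print(best)
```

Because top_k only relies on the call signature, any callable with the same string-or-list behaviour could be passed in place of the toy scorer.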
from tdc.utils import retrieve_dataset_names
# Get all ADME datasets
adme_datasets = retrieve_dataset_names('ADME')
# Get all DTI datasets
dti_datasets = retrieve_dataset_names('DTI')
# Get label mapping (DDI interaction-type labels)
from tdc.utils import get_label_map
label_map = get_label_map(name='DrugBank', task='DDI')
# Convert binding-affinity labels from nM to log scale
# (here `data` is a DTI dataset such as BindingDB_Kd)
data.convert_to_log(form='binding')
from tdc.utils import cid2smiles, uniprot2seq
# Convert PubChem CID to SMILES
smiles = cid2smiles(2244)
# Convert UniProt ID to amino acid sequence
sequence = uniprot2seq('P12345')
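The nM-to-log-scale label conversion shown above follows a simple formula: the p-value is the negative base-10 logarithm of the affinity in molar units, so p = 9 - log10(value in nM). A minimal sketch of that textbook definition follows; the helper names are assumptions, and a library implementation may add a small offset to avoid taking the log of zero.

```python
import math

def nm_to_p(value_nm):
    """Convert an affinity in nM to the negative log10 of its molar value."""
    if value_nm <= 0:
        raise ValueError("affinity must be positive")
    return 9.0 - math.log10(value_nm)

def p_to_nm(p_value):
    """Invert nm_to_p: recover the affinity in nM from its p-value."""
    return 10 ** (9.0 - p_value)

# A 1000 nM (1 micromolar) binder sits at p = 6; stronger binders score higher
print(nm_to_p(1000.0))
print(p_to_nm(nm_to_p(250.0)))
```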
See scripts/load_and_split_data.py for a complete example:
from tdc.single_pred import ADME
from tdc import Evaluator
# Load data
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42)
train, valid, test = split['train'], split['valid'], split['test']
# Train model (user implements)
# model.fit(train['Drug'], train['Y'])
# Evaluate
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'], predictions)
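The commented-out fit and predict steps above need some model to stand in for them. One possible placeholder is a mean-value baseline, sketched here on hypothetical toy rows shaped like the Drug and Y columns; MeanBaseline is an illustrative name, not part of TDC.

```python
class MeanBaseline:
    """Predicts the training-set mean for every input."""

    def fit(self, inputs, targets):
        targets = list(targets)
        if not targets:
            raise ValueError("cannot fit on an empty training set")
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, inputs):
        return [self.mean_ for _ in inputs]

# Hypothetical toy rows shaped like the 'Drug' and 'Y' columns
train = {"Drug": ["CCO", "CCN", "CCC"], "Y": [-5.0, -4.0, -6.0]}
test = {"Drug": ["CCCl", "CCBr"], "Y": [-5.5, -4.5]}

model = MeanBaseline().fit(train["Drug"], train["Y"])
predictions = model.predict(test["Drug"])
# MAE computed by hand so the sketch runs without any dependency
baseline_mae = sum(abs(t - p) for t, p in zip(test["Y"], predictions)) / len(predictions)
print(baseline_mae)
```

A baseline like this gives a floor that any trained model should beat on the same split.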
See scripts/benchmark_evaluation.py for a complete example with multiple seeds and proper evaluation protocol.
See scripts/molecular_generation.py for an example of goal-directed generation using oracle functions.
This skill includes bundled resources for common TDC workflows:
load_and_split_data.py: Template for loading and splitting TDC datasets with various strategies
benchmark_evaluation.py: Template for running benchmark group evaluations with proper 5-seed protocol
molecular_generation.py: Template for molecular generation using oracle functions
datasets.md: Comprehensive catalog of all available datasets organized by task type
oracles.md: Complete documentation of all 17+ molecule generation oracles
utilities.md: Detailed guide to data processing, splitting, and evaluation utilities
Weekly Installs: 124
GitHub Stars: 22.6K
First Seen: Jan 21, 2026