deepchem by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill deepchem
DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It supports molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.
This skill should be used when:
DeepChem provides specialized loaders for various chemical data formats:
import deepchem as dc
# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['solubility', 'toxicity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')
# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')
# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
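To make the CSV layout implied by `tasks` and `feature_field` concrete, here is a minimal standard-library sketch of the table shape the CSV example above assumes: one SMILES column plus one column per task. It is not DeepChem code. The file contents, the `read_table` helper, and the choice to give blank labels a weight of 0.0 are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file contents: one SMILES column plus one column per task.
CSV_TEXT = """smiles,solubility,toxicity
CCO,-0.77,0
c1ccccc1,-1.64,1
CC(C)O,,0
"""


def read_table(text, feature_field, tasks):
    """Return (smiles, labels, weights); a blank label gets weight 0.0 (assumed convention)."""
    smiles, labels, weights = [], [], []
    for row in csv.DictReader(io.StringIO(text)):
        smiles.append(row[feature_field])
        values = [row[task] for task in tasks]
        labels.append([float(v) if v else 0.0 for v in values])
        weights.append([1.0 if v else 0.0 for v in values])
    return smiles, labels, weights


smiles, y, w = read_table(CSV_TEXT, "smiles", ["solubility", "toxicity"])
print(smiles)
print(y[2], w[2])  # the third row has no solubility value
```

Keeping a per-task weight next to each label lets a multitask model skip missing measurements instead of dropping the whole molecule.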
Key Loaders:
- CSVLoader: Tabular data with molecular identifiers
- SDFLoader: Molecular structure files
- FASTALoader: Protein/DNA sequences
- ImageLoader: Molecular images
- JsonLoader: JSON-formatted datasets

Convert molecules into numerical representations for ML models.
Is the model a graph neural network?
├─ YES → Use graph featurizers
│  ├─ Standard GNN → MolGraphConvFeaturizer
│  ├─ Message passing → DMPNNFeaturizer
│  └─ Pretrained → GroverFeaturizer
│
└─ NO → What type of model?
   ├─ Traditional ML (RF, XGBoost, SVM)
   │  ├─ Fast baseline → CircularFingerprint (ECFP)
   │  ├─ Interpretable → RDKitDescriptors
   │  └─ Maximum coverage → MordredDescriptors
   │
   ├─ Deep learning (non-graph)
   │  ├─ Dense networks → CircularFingerprint
   │  └─ CNN → SmilesToImage
   │
   ├─ Sequence models (LSTM, Transformer)
   │  └─ SmilesToSeq
   │
   └─ 3D structure analysis
      └─ CoulombMatrix
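The decision tree above can also be written as a small lookup, which is handy when a script has to pick a featurizer from a configuration value. This is only a sketch: the featurizer names come from the tree, while the `model_family` and `variant` keys and the `choose_featurizer` helper are hypothetical labels invented here.

```python
# Featurizer names are taken from the decision tree; the keys are hypothetical labels.
FEATURIZER_TREE = {
    ("gnn", "standard"): "MolGraphConvFeaturizer",
    ("gnn", "message_passing"): "DMPNNFeaturizer",
    ("gnn", "pretrained"): "GroverFeaturizer",
    ("traditional", "baseline"): "CircularFingerprint",
    ("traditional", "interpretable"): "RDKitDescriptors",
    ("traditional", "max_coverage"): "MordredDescriptors",
    ("deep", "dense"): "CircularFingerprint",
    ("deep", "cnn"): "SmilesToImage",
    ("sequence", "any"): "SmilesToSeq",
    ("3d", "any"): "CoulombMatrix",
}


def choose_featurizer(model_family, variant="any"):
    """Return the featurizer class name the tree suggests for a model family."""
    try:
        return FEATURIZER_TREE[(model_family, variant)]
    except KeyError:
        raise ValueError(f"no rule for {model_family!r} / {variant!r}") from None


print(choose_featurizer("gnn", "standard"))
print(choose_featurizer("sequence"))
```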
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)
# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()
# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()
# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
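To illustrate what a fixed `size` means for a fingerprint, the toy below hashes SMILES substrings into a bit vector of a chosen length. It is deliberately not ECFP: real circular fingerprints hash atom neighbourhoods of the molecular graph, whereas this sketch hashes raw text fragments. The `toy_hashed_fingerprint` function and its `max_len` parameter are made up for this example.

```python
import zlib


def toy_hashed_fingerprint(smiles, size=2048, max_len=3):
    """Fold every substring of up to max_len characters into a fixed-size bit vector."""
    bits = [0] * size
    for n in range(1, max_len + 1):
        for start in range(len(smiles) - n + 1):
            fragment = smiles[start:start + n]
            # crc32 is deterministic across runs, unlike the built-in hash() for str.
            bits[zlib.crc32(fragment.encode("utf-8")) % size] = 1
    return bits


fp_a = toy_hashed_fingerprint("CCO")
fp_b = toy_hashed_fingerprint("c1ccccc1")
print(len(fp_a), sum(fp_a), sum(fp_b))
```

Whatever the input length, the output always has `size` entries, which is what lets fingerprints feed models that expect a fixed number of features (for example `n_features=2048`).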
Selection Guide:
See references/api_reference.md for complete featurizer documentation.
Critical: For drug discovery tasks, use ScaffoldSplitter to prevent data leakage from similar molecular structures appearing in both training and test sets.
# Scaffold splitting (recommended for molecules)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
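The idea behind scaffold splitting can be sketched without any chemistry library: group molecules by a scaffold key and assign whole groups to one subset, so no scaffold is shared between subsets. The sketch below assumes the scaffold keys are already computed (in practice they would be Bemis-Murcko scaffolds derived with RDKit). The `group_split` helper and the example keys are hypothetical, and this is not DeepChem's implementation.

```python
from collections import defaultdict


def group_split(keys, frac_train=0.8, frac_valid=0.1):
    """Split indices so that all items sharing a key land in the same subset."""
    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)
    # Largest groups first, so big scaffold families go to training.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_cutoff = frac_train * len(keys)
    valid_cutoff = (frac_train + frac_valid) * len(keys)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold labels for ten molecules.
keys = ["benzene"] * 6 + ["pyridine"] * 2 + ["indole", "furan"]
train, valid, test = group_split(keys)
print(train, valid, test)
```

Because groups are never divided, the realised fractions only approximate the requested ones; that is the price of keeping related structures out of the test set.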
Available Splitters:
- ScaffoldSplitter: Split by molecular scaffolds (prevents leakage)
- ButinaSplitter: Clustering-based molecular splitting
- MaxMinSplitter: Maximize diversity between sets
- RandomSplitter: Random splitting
- RandomStratifiedSplitter: Preserves class distributions

| Dataset Size | Task | Recommended Model | Featurizer |
|---|---|---|---|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |
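The size-based rows of the table can be read as a simple rule of thumb. The sketch below encodes only those three rows. Reading the third row as "more than 100K samples" is an assumption, and the `recommend` helper is hypothetical. The returned strings are the names listed in the table, not instantiated models.

```python
def recommend(n_samples):
    """Return (model suggestion, featurizer) for a dataset size, following the table."""
    if n_samples < 1_000:
        return ("SklearnModel (RandomForest)", "CircularFingerprint")
    if n_samples <= 100_000:
        return ("GBDTModel or MultitaskRegressor", "CircularFingerprint")
    # Assumed reading of the third row: more than 100K samples.
    return ("GCNModel, AttentiveFPModel, DMPNNModel", "MolGraphConvFeaturizer")


for size in (500, 20_000, 250_000):
    print(size, recommend(size))
```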
from sklearn.ensemble import RandomForestRegressor
# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
n_tasks=2,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
# Graph Convolutional Network
model = dc.models.GCNModel(
n_tasks=1,
mode='regression',
batch_size=128,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)
# Attentive Fingerprint
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
featurizer=dc.feat.MolGraphConvFeaturizer(), # GCNModel needs graph features; strings such as 'ECFP', 'GraphConv', 'Weave', 'Raw' suit other models
splitter='scaffold', # or 'random', 'stratified'
reload=False
)
train, valid, test = datasets
# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
Common Datasets:
- load_tox21(), load_bbbp(), load_hiv(), load_clintox()
- load_delaney(), load_freesolv(), load_lipo()
- load_qm7(), load_qm8(), load_qm9()

See references/api_reference.md for the complete dataset list.
Leverage pretrained models for improved performance, especially on small datasets:
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5 # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)
# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
task='regression',
n_tasks=1
)
model.fit(train, nb_epoch=20)
When to use transfer learning:
Use the scripts/transfer_learning.py script for guided transfer learning workflows.
# Define metrics
classification_metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]
regression_metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
dc.metrics.Metric(dc.metrics.rms_score, name='RMSE')
]
# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
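For readers who want to see what two of these metrics measure, here is a from-scratch sketch in plain Python. ROC-AUC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score (ties count as half), and RMSE is the square root of the mean squared error. The helper names are made up for illustration; in practice the library metrics above should be used.

```python
import math


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```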
# Predict on test set
predictions = model.predict(test)
# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
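# Apply the same transformations as in training
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)
# Pass the transformers so predict() can undo target normalization on the outputs
predictions = model.predict(new_dataset, transformers)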
For evaluating a model on standard benchmarks:
import deepchem as dc
# 1. Load benchmark
tasks, datasets, _ = dc.molnet.load_bbbp(
featurizer=dc.feat.MolGraphConvFeaturizer(), # GCNModel expects graph features
splitter='scaffold'
)
train, valid, test = datasets
# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
For training on custom molecular datasets:
import deepchem as dc
# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')
# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
transform_y=True, dataset=train
)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
# 4. Train model
model = dc.models.MultitaskRegressor(
n_tasks=1,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25
)
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
For leveraging pretrained models:
import deepchem as dc
# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=dc.feat.DummyFeaturizer() # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5
)
# 4. Fine-tune
model.fit(train, nb_epoch=10)
# 5. Evaluate
predictions = model.predict(test)
See references/workflows.md for 8 detailed workflow examples covering molecular generation, materials science, protein analysis, and more.
This skill includes three production-ready scripts in the scripts/ directory:
predict_solubility.py: Train and evaluate solubility prediction models. Works with the Delaney benchmark or custom CSV data.
# Use Delaney benchmark
python scripts/predict_solubility.py
# Use custom data
python scripts/predict_solubility.py \
--data my_data.csv \
--smiles-col smiles \
--target-col solubility \
--predict "CCO" "c1ccccc1"
graph_neural_network.py: Train various graph neural network architectures on molecular data.
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21
# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
--model attentivefp \
--data molecules.csv \
--task-type regression \
--targets activity \
--epochs 100
transfer_learning.py: Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp
# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
--model grover \
--data small_dataset.csv \
--target activity \
--task-type classification \
--epochs 20
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
transformers = [
dc.trans.NormalizationTransformer(
transform_y=True, # Also normalize target values
dataset=train
)
]
for transformer in transformers:
    train = transformer.transform(train)
    test = transformer.transform(test)
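The point of passing `dataset=train` is that the normalization statistics come from the training split only and are then reused everywhere else. A minimal pure-Python sketch of that pattern follows. The `ZScore` class is a hypothetical stand-in, not the DeepChem transformer, and it also shows how predictions made in normalized units are mapped back to the original scale.

```python
import statistics


class ZScore:
    """Z-score scaler whose mean and standard deviation are fitted on training values only."""

    def __init__(self, train_values):
        self.mean = statistics.fmean(train_values)
        # Fall back to 1.0 so constant targets do not cause division by zero.
        self.std = statistics.pstdev(train_values) or 1.0

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def untransform(self, values):
        return [v * self.std + self.mean for v in values]


train_y = [1.0, 2.0, 3.0, 4.0, 5.0]
test_y = [6.0]

scaler = ZScore(train_y)              # statistics come from train only
scaled_train = scaler.transform(train_y)
scaled_test = scaler.transform(test_y)  # test reuses the training statistics
restored = scaler.untransform(scaled_test)
print(scaler.mean, scaled_test, restored)
```

Fitting the scaler on the full dataset instead would let test-set statistics influence training, which is a mild form of the leakage the splitting advice is meant to avoid.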
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
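Both options can be illustrated in plain Python. The first helper assigns per-sample weights so that every class contributes the same total weight, which is the general idea behind balancing (the exact scaling DeepChem applies may differ). The second computes balanced accuracy as the mean of per-class recall. Both helper names are hypothetical.

```python
from collections import Counter


def balancing_weights(labels):
    """Per-sample weights that give every class the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[label]) for label in labels]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a majority-class guesser is not rewarded."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


y_true = [0] * 8 + [1] * 2
always_zero = [0] * 10

weights = balancing_weights(y_true)
plain_accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(weights[0], weights[-1])
print(plain_accuracy, balanced_accuracy(y_true, always_zero))
```

On this 8:2 example a model that always predicts the majority class scores 0.8 on plain accuracy but only 0.5 on balanced accuracy.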
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
# Use smaller batch sizes
model = dc.models.GCNModel(n_tasks=1, mode='regression', batch_size=32) # batch_size=32 instead of 128; n_tasks is required
Problem: Using random splitting allows similar molecules in train/test sets. Solution: Always use ScaffoldSplitter for molecular datasets.
Problem: Graph neural networks perform worse than simple fingerprints. Solutions:
Problem: Model memorizes training data. Solutions:
Problem: Module not found errors. Solution: Ensure DeepChem is installed with required dependencies:
uv pip install deepchem
# For PyTorch models (quotes keep shells such as zsh from expanding the brackets)
uv pip install "deepchem[torch]"
# For all features
uv pip install "deepchem[all]"
This skill includes comprehensive reference documentation:
references/api_reference.md: Complete API documentation including:
When to reference: Search this file when you need specific API details, parameter names, or want to explore available options.
references/workflows.md: Eight detailed end-to-end workflows:
When to reference: Use these workflows as templates for implementing complete solutions.
Basic installation:
uv pip install deepchem
For PyTorch models (GCN, GAT, etc.):
uv pip install "deepchem[torch]"
For all features:
uv pip install "deepchem[all]"
If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.
Weekly Installs: 153
GitHub Stars: 23.4K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: claude-code (130), opencode (127), gemini-cli (119), cursor (116), antigravity (109), codex (107)
Materials datasets: load_perovskite(), load_bandgap(), load_mp_formation_energy()