deepchem by k-dense-ai/claude-scientific-skills
npx skills add https://github.com/k-dense-ai/claude-scientific-skills --skill deepchem
DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It enables molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.
This skill should be used when:
DeepChem provides specialized loaders for various chemical data formats:
import deepchem as dc
# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['solubility', 'toxicity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')
# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')
# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
Key Loaders:
- CSVLoader: Tabular data with molecular identifiers
- SDFLoader: Molecular structure files
- FASTALoader: Protein/DNA sequences
- ImageLoader: Molecular images
- JsonLoader: JSON-formatted datasets
Convert molecules into numerical representations for ML models.
Is the model a graph neural network?
├─ YES → Use graph featurizers
│ ├─ Standard GNN → MolGraphConvFeaturizer
│ ├─ Message passing → DMPNNFeaturizer
│ └─ Pretrained → GroverFeaturizer
│
└─ NO → What type of model?
├─ Traditional ML (RF, XGBoost, SVM)
│ ├─ Fast baseline → CircularFingerprint (ECFP)
│ ├─ Interpretable → RDKitDescriptors
│ └─ Maximum coverage → MordredDescriptors
│
├─ Deep learning (non-graph)
│ ├─ Dense networks → CircularFingerprint
│ └─ CNN → SmilesToImage
│
├─ Sequence models (LSTM, Transformer)
│ └─ SmilesToSeq
│
└─ 3D structure analysis
└─ CoulombMatrix
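The decision tree above can be read as a nested lookup. The sketch below encodes it as a plain Python dictionary; the function name `choose_featurizer` and the string keys (`"gnn"`, `"fast_baseline"`, and so on) are hypothetical labels introduced here for illustration, and the function returns featurizer names as strings rather than importing DeepChem.

```python
# Hypothetical helper: maps (model family, goal) to the featurizer name
# suggested by the decision tree. Keys are illustrative labels only.
FEATURIZER_TREE = {
    "gnn": {
        "standard": "MolGraphConvFeaturizer",
        "message_passing": "DMPNNFeaturizer",
        "pretrained": "GroverFeaturizer",
    },
    "traditional_ml": {
        "fast_baseline": "CircularFingerprint",
        "interpretable": "RDKitDescriptors",
        "max_coverage": "MordredDescriptors",
    },
    "deep_non_graph": {
        "dense": "CircularFingerprint",
        "cnn": "SmilesToImage",
    },
    "sequence": {"default": "SmilesToSeq"},
    "3d": {"default": "CoulombMatrix"},
}


def choose_featurizer(model_family: str, goal: str = "default") -> str:
    """Return the featurizer name for a branch of the decision tree."""
    try:
        return FEATURIZER_TREE[model_family][goal]
    except KeyError:
        raise ValueError(
            f"No entry for model_family={model_family!r}, goal={goal!r}"
        ) from None


print(choose_featurizer("gnn", "standard"))
print(choose_featurizer("traditional_ml", "interpretable"))
print(choose_featurizer("3d"))
```

The returned name can then be looked up under `dc.feat`, as in the featurizer calls that follow.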
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)
# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()
# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()
# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
Selection Guide:
See references/api_reference.md for complete featurizer documentation.
Critical: For drug discovery tasks, use ScaffoldSplitter to prevent data leakage from similar molecular structures appearing in both training and test sets.
# Scaffold splitting (recommended for molecules)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
dataset,
frac_train=0.8,
frac_valid=0.1,
frac_test=0.1
)
# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
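To see why scaffold splitting prevents leakage, it helps to look at the grouping idea on its own. The following is a minimal plain-Python sketch, not DeepChem's implementation: it assumes each molecule already carries a precomputed scaffold label (the labels and the `group_split` name are hypothetical), keeps every scaffold group intact, and fills the training set with the largest groups first.

```python
from collections import defaultdict


def group_split(ids, group_keys, frac_train=0.8):
    """Split ids so that no group key appears in both train and test."""
    groups = defaultdict(list)
    for item, key in zip(ids, group_keys):
        groups[key].append(item)

    # Largest groups first; ties are broken by key so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    cutoff = frac_train * len(ids)
    train, test = [], []
    for _, members in ordered:
        # A whole group goes to train only if it still fits under the cutoff.
        if len(train) + len(members) <= cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical molecules with made-up scaffold labels.
ids = [f"m{i}" for i in range(1, 11)]
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole"] * 2

train_ids, test_ids = group_split(ids, scaffolds, frac_train=0.8)
print(train_ids)
print(test_ids)
```

Because whole groups move together, the test set contains only scaffolds the model never saw during training, which is the property a random split cannot guarantee.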
Available Splitters:
- ScaffoldSplitter: Split by molecular scaffolds (prevents leakage)
- ButinaSplitter: Clustering-based molecular splitting
- MaxMinSplitter: Maximize diversity between sets
- RandomSplitter: Random splitting
- RandomStratifiedSplitter: Preserves class distributions

| Dataset Size | Task | Recommended Model | Featurizer |
|---|---|---|---|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |
from sklearn.ensemble import RandomForestRegressor
# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
n_tasks=2,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
# Graph Convolutional Network
model = dc.models.GCNModel(
n_tasks=1,
mode='regression',
batch_size=128,
learning_rate=0.001
)
model.fit(train, nb_epoch=50)
# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)
# Attentive Fingerprint
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
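featurizer=dc.feat.MolGraphConvFeaturizer(),  # GCNModel expects graph features; strings such as 'ECFP', 'GraphConv', 'Weave', 'Raw' select other featurizers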
splitter='scaffold', # or 'random', 'stratified'
reload=False
)
train, valid, test = datasets
# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
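The ROC-AUC score used above has a simple interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The plain-Python sketch below computes it by direct pairwise comparison for a single task; the `roc_auc` name and the toy scores are illustrative, and this is not how `dc.metrics.roc_auc_score` is implemented.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric stays informative on imbalanced toxicity data.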
Common Datasets:
- load_tox21(), load_bbbp(), load_hiv(), load_clintox()
- load_delaney(), load_freesolv(), load_lipo()
- load_qm7(), load_qm8(), load_qm9()
- load_perovskite(), load_bandgap(), load_mp_formation_energy()
See references/api_reference.md for complete dataset list.
Leverage pretrained models for improved performance, especially on small datasets:
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5 # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)
# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
task='regression',
n_tasks=1
)
model.fit(train, nb_epoch=20)
When to use transfer learning:
Use the scripts/transfer_learning.py script for guided transfer learning workflows.
# Define metrics
classification_metrics = [
dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]
regression_metrics = [
dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
dc.metrics.Metric(dc.metrics.rms_score, name='RMSE')
]
# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
# Predict on test set
predictions = model.predict(test)
# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
# Apply same transformations as training
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)
predictions = model.predict(new_dataset)
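For reference, the three regression metrics named above (MAE, RMSE, R²) reduce to a few lines of arithmetic. This is a plain-Python sketch with made-up numbers, intended only to show what each score measures; the `regression_report` helper is hypothetical and not part of DeepChem.

```python
import math


def regression_report(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for two equal-length lists of floats."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 compares the model's squared error with that of predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy targets and predictions.
report = regression_report([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(report)
```

MAE and RMSE are in the units of the target, so if the targets were normalized before training, these scores are in normalized units unless the predictions are transformed back first.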
For evaluating a model on standard benchmarks:
import deepchem as dc
# 1. Load benchmark
tasks, datasets, _ = dc.molnet.load_bbbp(
featurizer=dc.feat.MolGraphConvFeaturizer(),  # GCNModel expects graph features
splitter='scaffold'
)
train, valid, test = datasets
# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
For training on custom molecular datasets:
import deepchem as dc
# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')
# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
transform_y=True, dataset=train
)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
# 4. Train model
model = dc.models.MultitaskRegressor(
n_tasks=1,
n_features=2048,
layer_sizes=[1000, 500],
dropouts=0.25
)
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
For leveraging pretrained models:
import deepchem as dc
# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
tasks=['activity'],
feature_field='smiles',
featurizer=dc.feat.DummyFeaturizer() # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
model='seyonec/ChemBERTa-zinc-base-v1',
task='classification',
n_tasks=1,
learning_rate=2e-5
)
# 4. Fine-tune
model.fit(train, nb_epoch=10)
# 5. Evaluate
predictions = model.predict(test)
See references/workflows.md for 8 detailed workflow examples covering molecular generation, materials science, protein analysis, and more.
This skill includes three production-ready scripts in the scripts/ directory:
predict_solubility.py: Train and evaluate solubility prediction models. Works with Delaney benchmark or custom CSV data.
# Use Delaney benchmark
python scripts/predict_solubility.py
# Use custom data
python scripts/predict_solubility.py \
--data my_data.csv \
--smiles-col smiles \
--target-col solubility \
--predict "CCO" "c1ccccc1"
graph_neural_network.py: Train various graph neural network architectures on molecular data.
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21
# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
--model attentivefp \
--data molecules.csv \
--task-type regression \
--targets activity \
--epochs 100
transfer_learning.py: Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp
# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
--model grover \
--data small_dataset.csv \
--target activity \
--task-type classification \
--epochs 20
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
transformers = [
dc.trans.NormalizationTransformer(
transform_y=True, # Also normalize target values
dataset=train
)
]
for transformer in transformers:
    train = transformer.transform(train)
    test = transformer.transform(test)
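The key point in the normalization snippet above is that the statistics come from the training set only and are then reused unchanged for every other split. A minimal plain-Python sketch of that pattern follows; the `TargetNormalizer` class is hypothetical and may differ in detail from how `NormalizationTransformer` computes its statistics.

```python
import statistics


class TargetNormalizer:
    """Z-score targets using mean and std estimated on the training set only."""

    def __init__(self, train_targets):
        self.mean = statistics.fmean(train_targets)
        self.std = statistics.pstdev(train_targets)
        if self.std == 0:
            raise ValueError("training targets have zero variance")

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def untransform(self, values):
        # Map normalized predictions back to the original units.
        return [v * self.std + self.mean for v in values]


train_y = [1.0, 2.0, 3.0, 4.0, 5.0]
test_y = [6.0]

normalizer = TargetNormalizer(train_y)   # fitted on train only
train_z = normalizer.transform(train_y)
test_z = normalizer.transform(test_y)    # reuses the train statistics
restored = normalizer.untransform(test_z)
print(test_z, restored)
```

Fitting on the full dataset instead would let test-set statistics influence training, which is a milder form of the leakage that scaffold splitting is meant to avoid.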
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
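Both options above address the same problem: with skewed classes, plain accuracy rewards always predicting the majority. The sketch below shows, in plain Python with toy labels, one common way to derive per-sample weights so each class carries equal total weight, and how balanced accuracy (the mean of per-class recall) exposes a majority-only predictor. The helper names are hypothetical, and the weighting scheme is an illustration rather than the exact one `BalancingTransformer` applies.

```python
from collections import Counter


def balanced_weights(labels):
    """Per-sample weights giving every class the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[y]) for y in labels]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: 8 inactive (0) and 2 active (1) molecules.
labels = [0] * 8 + [1] * 2
weights = balanced_weights(labels)

# A degenerate model that always predicts the majority class.
majority_pred = [0] * 10
plain_acc = sum(p == y for p, y in zip(majority_pred, labels)) / len(labels)
bal_acc = balanced_accuracy(labels, majority_pred)
print(weights[0], weights[-1], plain_acc, bal_acc)
```

Here the majority-only predictor scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is the gap the balanced metric is meant to reveal.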
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
# Use smaller batch sizes
model = dc.models.GCNModel(n_tasks=1, batch_size=32)  # Instead of 128; n_tasks is a required argument
Problem: Using random splitting allows similar molecules in train/test sets.
Solution: Always use ScaffoldSplitter for molecular datasets.
Problem: Graph neural networks perform worse than simple fingerprints. Solutions:
Problem: Model memorizes training data. Solutions:
Problem: Module not found errors. Solution: Ensure DeepChem is installed with required dependencies:
uv pip install deepchem
# For PyTorch models
uv pip install deepchem[torch]
# For all features
uv pip install deepchem[all]
This skill includes comprehensive reference documentation:
references/api_reference.md: Complete API documentation including:
When to reference: Search this file when you need specific API details, parameter names, or want to explore available options.
references/workflows.md: Eight detailed end-to-end workflows:
When to reference: Use these workflows as templates for implementing complete solutions.
Basic installation:
uv pip install deepchem
For PyTorch models (GCN, GAT, etc.):
uv pip install deepchem[torch]
For all features:
uv pip install deepchem[all]
If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.
Weekly Installs: 55
Repository
GitHub Stars: 17.3K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (48), codex (47), gemini-cli (47), cursor (45), claude-code (44), github-copilot (44)