I’m interested in machine learning, and Python is a good tool for it.
Deep learning is one of the hot topics in this area.
There are several deep learning libraries for Python, such as “Theano” and “Pylearn2”.
But these packages are difficult for me ;-( .
So I used nolearn to do deep learning.
Nolearn is easy to install; if you want, you can install it via pip.
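For example, a one-line install (note: the `nolearn.dbn` module used below shipped with the older 0.5 series of nolearn, so pinning that version may be necessary on newer setups):

```shell
# Install nolearn; the DBN module used in this post came with the 0.5 series,
# so you may need to pin an older version if the current release dropped it.
pip install nolearn
```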
Let’s write some code.
First, I got the sample data from the following link.
here
The dataset contains 4337 structures with AMES categorisation (mutagen/nonmutagen).
First, I converted the mutagen/nonmutagen tag to 1/0.
```python
from rdkit import Chem

# Read the original SD file.
mols = [mol for mol in Chem.SDMolSupplier("cas_4337.sdf")]

# Write a new SD file with a binary "Act" tag: mutagen -> 1, nonmutagen -> 0.
writer = Chem.SDWriter("ames.sdf")
for mol in mols:
    if mol.GetProp("Ames test categorisation") == "mutagen":
        mol.SetProp("Act", str(1))
    else:
        mol.SetProp("Act", str(0))
    writer.write(mol)
writer.close()
```
OK, let’s predict some molecules.
To calculate molecular descriptors, I used RDKit.
1. Calc descriptors
2. Split train/test data sets
3. Build model and predict the test set.
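The three steps above can be sketched on synthetic data before running the full script. In this sketch, `make_classification` stands in for the RDKit descriptors and `LogisticRegression` stands in for the DBN; the feature and sample counts are arbitrary illustrative choices. (Newer scikit-learn releases moved `train_test_split` to `sklearn.model_selection`, which is what this sketch uses.)

```python
# A minimal sketch of steps 1-3 using synthetic features in place of
# RDKit descriptors, and LogisticRegression in place of the DBN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. "Calc descriptors" -- here, random features instead of RDKit descriptors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = scale(np.nan_to_num(X))  # same NaN cleanup and scaling as the real script

# 2. Split train/test data sets (70/30, as in this post).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 3. Build a model and predict the test set.
clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
```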
```python
# Ames prediction using a DBN
import numpy as np
from rdkit import Chem
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import Descriptors
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from nolearn.dbn import DBN

# Define the descriptor calculator using every descriptor RDKit provides.
nms = [x[0] for x in Descriptors._descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(nms)

def calc_descs(mol):
    return calc.CalcDescriptors(mol)

# Read molecules, calculate descriptors, and collect the activity labels.
mols = [mol for mol in Chem.SDMolSupplier("ames.sdf")]
descs = [calc_descs(mol) for mol in mols]
acts = [int(mol.GetProp("Act")) for mol in mols]

# Convert lists to arrays, replace NaN with 0, and scale the descriptors.
descs = scale(np.nan_to_num(np.asarray(descs)))
acts = np.asarray(acts)

# Split the data into 70% training and 30% test sets.
train_descs, test_descs, train_acts, test_acts = train_test_split(
    descs, acts, test_size=0.3, random_state=0)
print train_descs.shape, train_acts.shape
print test_descs.shape, test_acts.shape

# Define the network: input layer, one hidden layer, two output classes.
# If you increase epochs, it will take a long time.
dbn = DBN(
    [descs.shape[1], descs.shape[1] / 3, 2],
    learn_rates=0.27,
    minibatch_size=train_descs.shape[0],
    epochs=10,
    verbose=0,
)
dbn.fit(train_descs, train_acts)
pred = dbn.predict(test_descs)
print(classification_report(test_acts, pred))
```
Let’s try!
```
iwatobipen-MacBook-Air:cas_4337 iwatobipen$ python ames_pred.py
(3503, 196) (3503,)
(1502, 196) (1502,)
             precision    recall  f1-score   support

          0       0.84      0.81      0.82       669
          1       0.85      0.87      0.86       833

avg / total       0.84      0.84      0.84      1502
```
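As a reminder of what the table reports, precision and recall can be reproduced by hand on a toy label set. The labels below are made up for illustration and are not the Ames results.

```python
from sklearn.metrics import precision_score, recall_score

# Toy true/predicted labels (illustrative only).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

# For class 1: precision = TP / (TP + FP), recall = TP / (TP + FN).
# Here TP = 3, FP = 1, FN = 1, so both are 3/4.
print(precision_score(y_true, y_pred))  # 3 correct out of 4 predicted positives
print(recall_score(y_true, y_pred))     # 3 recovered out of 4 actual positives
```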
The results were not so bad ;-) .
Nolearn is powerful but easy to use.