Make QSAR models and predict activity using pandas_ml and RDKit

At mishima.syk, y_sama introduced us to pandas_ml.
I know this is rather late, but the module interested me, so I tried pandas_ml for QSAR.
pandas_ml is a Python library that integrates pandas, scikit-learn, xgboost and seaborn.
To use it, I first installed the xgboost Python binding and then installed pandas_ml with pip.
I tried to build a QSAR model that classifies compounds as Ames positive or negative.
First, I read the SDF with RDKit's PandasTools and added a Morgan fingerprint for each molecule.

import numpy as np
import pandas as pd
import pandas_ml as pdml
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, PandasTools
from sklearn.preprocessing import LabelEncoder

# Load the SDF into a DataFrame; the molecules are stored in the ROMol column.
df = PandasTools.LoadSDF('cas_4337.sdf')

# Encode the Ames category strings as integer class labels.
le = LabelEncoder()
df['target'] = le.fit_transform(df["Ames test categorisation"])

def getFparr(mol):
    # Morgan fingerprint with radius 2 as a bit vector (RDKit's default length).
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    # ConvertToNumpyArray fills arr in place, resizing it to the fingerprint length.
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

df['morganFP'] = df.ROMol.apply(getFparr)
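
One detail worth knowing here: LabelEncoder assigns the integer codes in sorted order of the class labels, so which category becomes 0 and which becomes 1 depends only on the strings. Below is a minimal pure-Python sketch of that behaviour. The label strings are hypothetical placeholders, not values I checked in the SDF, and fit_label_mapping is just an illustrative helper, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Mimic LabelEncoder: sorted unique labels get the codes 0, 1, 2, ..."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


# Hypothetical category strings, used only for illustration.
labels = ["nonmutagen", "mutagen", "mutagen", "nonmutagen"]

mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

With the real encoder, le.classes_ shows the fitted labels in code order, which is useful when reading the confusion matrices later.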

Next, I made a DataFrame that holds both X (the fingerprint bits) and y (the target).

res = pd.DataFrame(df.morganFP.tolist())
res["target"] = df.target

Now the data is ready for pandas_ml. The first step is to make a ModelFrame.
With that object it is easy to call many methods, e.g. to split the data into training and test sets or to build estimators such as random forest, SVC or xgboost.

df4ml = pdml.ModelFrame( res, target='target' )
train_df, test_df = df4ml.cross_validation.train_test_split()
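
As far as I understand, this call wraps scikit-learn's train_test_split, which holds out 25% of the rows (rounded up) when no size is given. The confusion matrices in this post each sum to 1085 compounds, which fits that default if the SDF has 4337 entries, as the file name suggests. Here is a rough standard-library sketch of such a shuffled hold-out split. shuffle_split is a hypothetical helper and 4337 is an assumption taken from the file name.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) for a random hold-out split."""
    # The test set size is rounded up, the rest goes to training.
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 4337 is assumed from the file name cas_4337.sdf.
train_idx, test_idx = shuffle_split(4337)
print(len(train_idx), len(test_idx))
```

Because the split is random, the exact counts in the confusion matrices will differ from run to run unless a seed is fixed.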

I predicted Ames results with three methods: xgboost, SVC and random forest.
The code is as follows.
It stays very simple regardless of the ML algorithm.
pandas_ml can also do cross-validation and grid search of parameters, and connect several steps using a pipeline.

est = df4ml.xgboost.XGBClassifier()
train_df.fit( est )
pred = test_df.predict( est )
test_df.metrics.confusion_matrix()

Predicted	0	1
Target		
0	486	126
1	96	377

est2 = df4ml.ensemble.RandomForestClassifier()
train_df.fit(est2)
predict_rf=test_df.predict(est2)
test_df.metrics.confusion_matrix()

Predicted	0	1
Target		
0	527	85
1	126	347

est3 = df4ml.svm.SVC()
train_df.fit(est3)
predict_svc=test_df.predict(est3)
test_df.metrics.confusion_matrix()

Predicted	0	1
Target		
0	466	146
1	147	326
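
To compare the three models more easily, the confusion matrices above can be reduced to a few summary numbers. The sketch below only does arithmetic on the counts already shown (rows are the target class, columns the predicted class). summarize is a helper name I made up for this illustration.

```python
# Confusion matrices copied from the outputs above:
# rows = target class (0, 1), columns = predicted class (0, 1).
matrices = {
    "xgboost": [[486, 126], [96, 377]],
    "random forest": [[527, 85], [126, 347]],
    "svc": [[466, 146], [147, 326]],
}


def summarize(cm):
    """Compute accuracy and per-class recall from a 2x2 confusion matrix."""
    (t0_p0, t0_p1), (t1_p0, t1_p1) = cm
    total = t0_p0 + t0_p1 + t1_p0 + t1_p1
    return {
        "accuracy": (t0_p0 + t1_p1) / total,
        "recall_0": t0_p0 / (t0_p0 + t0_p1),
        "recall_1": t1_p1 / (t1_p0 + t1_p1),
    }


for name, cm in matrices.items():
    stats = summarize(cm)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

On this single split, random forest has the highest accuracy (about 0.81), xgboost is close behind (about 0.80) and SVC is lower (about 0.73). All three used default parameters and one random split, so I would not read too much into these differences.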

In summary, I think pandas_ml is very useful for building models.
(In my personal opinion, users should study sklearn, xgboost and pandas first, before using pandas_ml.)
I uploaded the code to my repo.
https://github.com/iwatobipen/chemo_info/blob/master/qsar/pdml_rdkit.ipynb
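
For readers who want to see what happens underneath, or who cannot install pandas_ml, the same split, fit, predict and confusion matrix steps can be written in plain scikit-learn. This is only a sketch under stated assumptions: the feature matrix is random synthetic bits standing in for the Morgan fingerprints, and the label rule is invented so that the model has something to learn. It is not the Ames data and not the notebook's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fingerprint matrix: 400 "molecules" x 64 bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 64))
# Invented label rule (not chemistry): class 1 if more than 4 of the first 8 bits are set.
y = (X[:, :8].sum(axis=1) > 4).astype(int)

# Default split keeps 75% for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

Swapping RandomForestClassifier for sklearn.svm.SVC or xgboost's XGBClassifier keeps the rest of the code unchanged, which is the same convenience that pandas_ml exposes through the ModelFrame.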

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis and my family.