At mishima.syk, y_sama introduced us to pandas_ml.
I know I'm late to this, but the module caught my interest, so I tried pandas_ml for QSAR.
pandas_ml is a Python library that integrates pandas, scikit-learn, xgboost and seaborn.
To use pandas_ml, I first installed the xgboost Python binding and then installed pandas_ml with pip.
I tried to build a QSAR model: an Ames positive/negative classifier.
First, I read the SDF using PandasTools from RDKit and added Morgan fingerprints.
from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd
import pandas_ml as pdml
from rdkit.Chem import AllChem, DataStructs
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load the SDF into a DataFrame; the molecules end up in the ROMol column.
df = PandasTools.LoadSDF('cas_4337.sdf')

# Encode the Ames category strings as integer class labels.
le = LabelEncoder()
label = le.fit( df["Ames test categorisation"] )
df['target'] = le.transform( df["Ames test categorisation"] )

def getFparr( mol ):
    # Morgan fingerprint (radius 2) as a bit vector, converted to a NumPy array.
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp, arr )
    return arr

df['morganFP'] = df.ROMol.apply(getFparr)
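To make the `target` column easier to interpret, here is a minimal plain-Python sketch of what the label encoding step does: the distinct labels are sorted and each one is mapped to its position in that sorted list. The function name `encode_labels` and the label strings are hypothetical placeholders, not values taken from the SDF.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, using sorted order of the labels."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]

# Hypothetical category strings, for illustration only.
example = ["positive", "negative", "negative", "positive"]
classes, encoded = encode_labels(example)
print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 0, 0, 1]
```

With the real data, inspecting `le.classes_` shows which Ames category became 0 and which became 1, which matters when reading the confusion matrices.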
Next, I made a DataFrame that holds both X (the fingerprint bits, one column per bit) and Y (the target).
res = pd.DataFrame(df.morganFP.tolist())
res["target"] = df.target
Now we are ready to use pandas_ml. The first step is to make a ModelFrame.
Through this object, it is easy to call many methods, e.g. train_test_split, or to create random forest, SVC, xgboost estimators, etc.
df4ml = pdml.ModelFrame( res, target='target' )
train_df, test_df = df4ml.cross_validation.train_test_split()
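As a rough illustration of what the split does, the sketch below mimics a shuffled hold-out split in plain Python. It assumes the scikit-learn default of holding out 25% of the rows (rounded up), and it assumes the SDF yields 4337 molecules, as the file name suggests. The helper name `holdout_split` and the seed are made up for this example.

```python
import math
import random

def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # test size is rounded up
    return indices[n_test:], indices[:n_test]

# 4337 is an assumption based on the SDF file name.
train_idx, test_idx = holdout_split(4337)
print(len(train_idx), len(test_idx))  # 3252 1085
```

Under these assumptions the test set has 1085 rows, which matches the row total of each confusion matrix reported later in this post.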
I predicted Ames results using three methods: xgboost, SVC, and random forest.
The code is as follows.
It looks very simple regardless of the ML algorithm.
pandas_ml can also do cross-validation and grid search of parameters, and can connect several steps using a pipeline.
est = df4ml.xgboost.XGBClassifier()
train_df.fit( est )
pred = test_df.predict( est )
test_df.metrics.confusion_matrix()

Predicted    0    1
Target
0          486  126
1           96  377

est2 = df4ml.ensemble.RandomForestClassifier()
train_df.fit(est2)
predict_rf=test_df.predict(est2)
test_df.metrics.confusion_matrix()

Predicted    0    1
Target
0          527   85
1          126  347

est3 = df4ml.svm.SVC()
train_df.fit(est3)
predict_svc=test_df.predict(est3)
test_df.metrics.confusion_matrix()

Predicted    0    1
Target
0          466  146
1          147  326
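To compare the three models more directly, the confusion matrices above can be turned into accuracy and per-class recall. This is a small plain-Python sketch using only the numbers printed above. It assumes rows are the true class and columns the predicted class, as the "Target" and "Predicted" labels indicate, and it does not assume which of 0 or 1 is the Ames-positive class. The helper name `summarize` is made up for this example.

```python
# Confusion matrices copied from the output above: rows = target, columns = predicted.
results = {
    "xgboost": [[486, 126], [96, 377]],
    "random forest": [[527, 85], [126, 347]],
    "SVC": [[466, 146], [147, 326]],
}

def summarize(matrix):
    """Return accuracy and the recall of each class for a 2x2 confusion matrix."""
    (c00, c01), (c10, c11) = matrix
    total = c00 + c01 + c10 + c11
    return {
        "accuracy": (c00 + c11) / total,
        "recall_0": c00 / (c00 + c01),
        "recall_1": c11 / (c10 + c11),
    }

for name, matrix in results.items():
    stats = summarize(matrix)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

On this single split with default settings, accuracy comes out at about 0.795 for xgboost, 0.806 for random forest, and 0.730 for SVC. The gap between xgboost and random forest is small, so one split is probably not enough to rank them.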
In summary, I think that pandas_ml is very useful for building models.
(In my personal opinion, users should study sklearn, xgboost, and pandas first before using pandas_ml.)
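For readers who want to start from scikit-learn itself, here is a minimal sketch of the cross-validation, grid search, and pipeline features mentioned earlier, written against plain scikit-learn rather than pandas_ml. It is not taken from my notebook: the data is synthetic (a stand-in for the fingerprint matrix and Ames labels), and the parameter grid and step names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the fingerprint matrix (X) and the class labels (y).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Pipeline: drop constant features, then fit a random forest.
pipe = Pipeline([
    ("drop_constant", VarianceThreshold()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Grid search with 3-fold cross-validation over an arbitrary small grid.
search = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [25, 50], "model__max_depth": [3, None]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

The `model__` prefix routes each parameter to the pipeline step named `model`. With real fingerprints, the `VarianceThreshold` step would remove bits that are never set in any molecule.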
I uploaded the code to my repo.
https://github.com/iwatobipen/chemo_info/blob/master/qsar/pdml_rdkit.ipynb