Vote Vote Vote #chemoinformatics

A few days ago, I posted about an ensemble classification method named ‘blending’. The method is not implemented in scikit-learn, so I am writing the function myself now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifier’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
The following explanation is from the sklearn documentation.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
It has two voting modes: one is ‘hard’ and the other is ‘soft’.
From the documentation…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
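For intuition, here is a minimal toy sketch of the two modes (my own example on a synthetic dataset, not from the original post):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
estimators = [('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
              ('lr', LogisticRegression(max_iter=1000))]

# hard: majority vote over the predicted labels
hard = VotingClassifier(estimators, voting='hard').fit(X, y)
# soft: argmax of the averaged class probabilities
soft = VotingClassifier(estimators, voting='soft').fit(X, y)
print(hard.predict(X[:5]))
print(soft.predict_proba(X[:5]))  # probabilities are only available in soft mode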

I used the classifier for QSAR modeling.

In the following code, I used four classifiers and BACE.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass a list of named estimators from the defined dictionary to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    # print('fold {}'.format(i))
    # NB: the classifier is simply refit on each fold; only the fit from the
    # last fold survives, and the validation split is not used here.
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))
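As a side note (not in the original post): if the goal is only a cross-validated performance estimate rather than per-fold models, cross_val_score is shorter. This sketch reuses train_X and train_y from the snippet above:

from sklearn.model_selection import cross_val_score

# 10-fold stratified CV estimate for the voting classifier
scores = cross_val_score(voting_clf, train_X, train_y, cv=StratifiedKFold(10))
print(scores.mean(), scores.std())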

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example, the voting method does not outperform the random forest or the support vector classifier. But it is worth knowing that sklearn provides such a useful feature for ensemble learning, I think. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize pharmacophore in RDKit #RDKit

RDKit has a pharmacophore feature assignment function. The function retrieves molecular features based on pre-defined ph4 (pharmacophore) definitions.
And the RDKit IPythonConsole module can draw molecules in a Jupyter notebook.
Today I tried to visualize ph4 features in a notebook.
The code is very simple.

from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import rdBase
from rdkit.RDPaths import RDDocsDir
from rdkit.RDPaths import RDDataDir
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import os
print(rdBase.rdkitVersion)
IPythonConsole.ipython_useSVG=True
> 2018.09.1

First, load the feature definitions.

fdefFile = os.path.join(RDDataDir,'BaseFeatures.fdef')
featFact = ChemicalFeatures.BuildFeatureFactory(fdefFile)

Then calculate the pharmacophore features and compute the 2D coordinates.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDDocsDir,"Book/data/cdk2.sdf"))]
featslists = [featFact.GetFeaturesForMol(mol) for mol in mols]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Next, I defined a drawing function. To highlight the ph4 features, highlightAtomLists and legends are used as optional arguments of MolsToGridImage.

def drawp4core(mol, feats):
    atoms_list = {}
    for feat in feats:
        atom_ids = feat.GetAtomIds()
        feat_type = feat.GetType()
        # note: if the molecule has several features of the same type,
        # only the last one is kept in this dict
        atoms_list[feat_type] = atom_ids
    return Draw.MolsToGridImage([mol]*len(atoms_list), legends=list(atoms_list.keys()), highlightAtomLists=list(atoms_list.values()))

Test the function.

im = drawp4core(mols[1], featslists[1])
im

I got the following image.

The function worked and picked up the pharmacophore features of the given molecule. ;-)

Applicable Domain on Deep Neural Networks #JCIM #chemoinformatics

I read an interesting article from JCIM.
Dissecting Machine-Learning Prediction of Molecular Activity: Is an Applicability Domain Needed for Quantitative Structure–Activity Relationship Models Based on Deep Neural Networks?

The URL is below.
https://pubs.acs.org/doi/10.1021/acs.jcim.8b00348

One of the pros of DNNs is feature extraction, and there are many articles which use DNNs for molecular activity prediction. But is it true that DNNs outperform all other machine learning methods?

The authors of the article analyzed the performance of DNNs. They used ECFP4 as the input feature and predicted biological activities extracted from the ChEMBL DB.
Their approach was reasonable: they built models on a training set, checked the performance on test data, and evaluated the RMSE across several layers defined by molecular similarity. Layer 1 means the test data is similar to the training data, and Layer 6 means it is not similar to the training set.
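Here is a rough sketch of how such similarity layers could be computed with RDKit. This is only my interpretation of the idea; the fingerprint settings and cutoffs are arbitrary assumptions, not the authors' protocol:

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def similarity_layer(train_smiles, query_smiles, cutoffs=(0.9, 0.8, 0.7, 0.6, 0.5)):
    # assign a query molecule to a layer by its nearest-neighbor
    # Tanimoto similarity to the training set (layer 1 = most similar)
    train_fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
                 for s in train_smiles]
    qfp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(query_smiles), 2, 2048)
    nn_sim = max(DataStructs.BulkTanimotoSimilarity(qfp, train_fps))
    for i, cutoff in enumerate(cutoffs):
        if nn_sim >= cutoff:
            return i + 1
    return len(cutoffs) + 1

print(similarity_layer(['c1ccccc1O', 'c1ccccc1N'], 'c1ccccc1OC'))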

They analyzed the performance of predictive methods such as kNN, RF and DNN, and their analysis revealed that DNNs showed performance similar to RF and kNN. Fig 5 also shows that the DNN cannot predict the objective value when the query molecule is not similar to the training set.
This indicates that the DNN does not learn features of the molecule from the fingerprint, but rather learns patterns of the fingerprint.
More details are described in this article.
In a real drug discovery project, medicinal chemists sometimes deliberately design compounds that are not similar to the seed compounds. For the chemists this is nothing special, but it is a difficult aspect of medicinal-chemistry sense for an AI to learn.

Biological activity prediction is a challenging area, I think, and ECFP is still the de facto standard descriptor in chemoinformatics. I would like to develop a new concept of molecular descriptors.

Generate possible list of SMILES with RDKit #RDKit

In computer vision, data augmentation techniques are often used to enlarge datasets. In chemoinformatics, on the other hand, canonical SMILES representations are normally used.
At last year's RDKit UGM, Dr. Esben Bjerrum proposed a new approach for RNNs with SMILES: he expanded 602 training molecules to almost 8000 molecules with a SMILES enumeration technique.
https://github.com/rdkit/UGM_2017/blob/master/Presentations/Bjerrum_RDKitUGM_Smiles_Enumeration_for_RNN.pdf
This approach seems to work well.
At this year's UGM hackathon, this random SMILES generation function was implemented, and it can now be called from the new version of RDKit!
I appreciate the RDKit developers!

It is very easy to use; please see the code below.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
print(rdBase.rdkitVersion)
>2018.09.1

I used a kinase inhibitor as an example.

testsmi = 'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N'
mol = Chem.MolFromSmiles(testsmi)
mol


The default output of MolToSmiles is the canonical SMILES.

print(Chem.MolToSmiles(mol))
>CC(Oc1cc(-c2cnn(C3CCNCC3)c2)cnc1N)c1c(Cl)ccc(F)c1Cl

But if you call MolToSmiles with the doRandom=True option, the function returns a random but valid SMILES.

mols = []
for _ in range(50):
  smi = Chem.MolToSmiles(mol, doRandom=True)
  print(smi)
  m = Chem.MolFromSmiles(smi)
  mols.append(m)

>Fc1c(Cl)c(C(Oc2cc(-c3cn(nc3)C3CCNCC3)cnc2N)C)c(cc1)Cl
>O(c1cc(-c2cnn(c2)C2CCNCC2)cnc1N)C(c1c(Cl)c(ccc1Cl)F)C
>--snip--
>c1(N)ncc(-c2cnn(C3CCNCC3)c2)cc1OC(c1c(Cl)ccc(F)c1Cl)C
#check molecules
Draw.MolsToGridImage(mols, molsPerRow = 10)

Different SMILES but the same molecule!

There are many deep learning approaches which use SMILES as input. I think this function is useful for augmenting the input data of such models.
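For example, a small helper like this (my own sketch) could expand a SMILES dataset with random equivalent representations:

from rdkit import Chem

def augment_smiles(smiles, n_aug=10):
    # generate up to n_aug distinct random SMILES for one molecule
    mol = Chem.MolFromSmiles(smiles)
    variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n_aug * 2)}
    return list(variants)[:n_aug]

print(augment_smiles('c1ccccc1O'))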

I uploaded my example code to Google Colab and to my GitHub repo.

Colab
https://colab.research.google.com/drive/1dMmgCpskrfI1afh3qmPdv8ixkGTuOCDH

github
https://github.com/iwatobipen/playground/blob/master/random_smiles_rdkit.ipynb

nbviewer
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/random_smiles_rdkit.ipynb

Tracking progress of machine learning #MachineLearning

To conduct machine learning, you need to optimize hyperparameters.
For example, scikit-learn provides a grid search method, and you know there are several packages to do that, such as hyperopt or GPyOpt. But how do you manage the models you have built? It is difficult for me.
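For reference, scikit-learn's built-in grid search looks like this (a minimal sketch on a toy regression dataset, not part of the MLflow example below):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
gs = GridSearchCV(SVR(), {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-4]}, cv=5)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)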
Recently I have been interested in mlflow. MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions: tracking experiments, packaging projects, and managing/deploying models.
MLflow can track the hyperparameters of each model, serve the models, and it also provides a good web UI.
I used it in a very simple example.
The code is below.
At first, I got the sample data and visualized the dataset with PCA.

# on the ipython notebook
%matplotlib inline
!wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_elasticnet_wine/wine-quality.csv -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
data = pd.read_csv('./data/wine-quality.csv')

cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))
plt.colorbar()

Next, a train function is defined.
The mlflow.log_param function tracks the parameters, mlflow.log_metric records the scores, and mlflow.sklearn.log_model stores the model.
After running the code, an mlruns folder is generated in the current directory, and the data is stored there.


def train():

    import os
    import warnings
    import sys
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    warnings.filterwarnings("ignore")
    np.random.seed(40)
    data = pd.read_csv('./data/wine-quality.csv')
    train, test = train_test_split(data)

    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                print(c,g)
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)
                # cross-validated score on the training set (computed but not logged)
                score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
                svr.fit(train_x, train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print("  RMSE: %s" % rmse)
                print("  MAE: %s" % mae)
                print("  R2: %s" % r2)
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")

Run the function.

train()
0.01 1.0
  RMSE: 0.876717955304063
  MAE: 0.6586558965180616
  R2: 0.007250505904323745
0.01 0.1
  RMSE: 0.872902609847314
  MAE: 0.6523680676966712
  R2: 0.015872299345786156
--snip--
10000 0.0001
  RMSE: 0.7902872331540974
  MAE: 0.570097398346025
  R2: 0.19334133272639453

After running the code, MLflow provides a very useful web UI. To access the UI, just type the following command in the terminal ;-) and then open http://127.0.0.1:5000/#/.

iwatobipen$ mlflow server

I can check the summary of the training runs with their metrics, like below.

And each model is stored. I can see the details and access each model, like below.

It is useful for managing many experiments and many models, I think.
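One thing the post does not show: a logged model can be loaded back for reuse. With a recent MLflow, something like the following should work (the run ID placeholder is hypothetical and should be taken from the web UI):

import mlflow.sklearn

# 'runs:/<run_id>/model' points at the 'model' artifact of one run
model = mlflow.sklearn.load_model('runs:/<run_id>/model')
# predictions = model.predict(test_x)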

Ensemble learning with scikit-learn and XGBoost #machine learning

I often post about deep learning topics, but today I would like to post about ensemble learning.
There are lots of documents describing ensemble learning, and I found the following document very informative:
https://mlwave.com/kaggle-ensembling-guide/
I was interested in one of the methods, named ‘blending’.
According to the URL above, the merits of ‘blending’ are…

Blending has a few benefits:

It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

There are two layers in blending. The first layer is a set of classifiers trained on the training data, and the second layer is a classifier trained on the predictions the first layer makes on held-out data, as in the compact sketch below.
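Here is a compact single-holdout sketch of the idea (my own simplified illustration, using a logistic regression as the second layer):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

# layer 1: base models fit on the training half
base = [RandomForestClassifier(n_estimators=50, random_state=0),
        GradientBoostingClassifier(random_state=0)]
for clf in base:
    clf.fit(X_tr, y_tr)

# layer 2: a meta-model fit on the base models' holdout predictions
meta_X = np.hstack([clf.predict_proba(X_hold) for clf in base])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y_hold)
print(meta.score(meta_X, y_hold))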
I tried to write code for blending. In the following full code I used scikit-learn and XGBoost.

First, import libraries and define dictionaries for my convenience.

import click
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
l1_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

l2_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

Then I defined the model-building function. The following code can be applied to multi-class classification problems.
The code is a little bit complicated and it only returns the set of trained classifiers. I would like to improve the code in the near future.

@click.command()
@click.option('--l1', default='all', type=str, help='models of first layers input format is csv. RF/ETC/GBC/XGB/SVC')
@click.option('--l2', default='XGB', type=str, help='model of second layer default is XGB')
@click.option('--nfolds', default=10, type=int, help='number of KFolds default is 10')
@click.option('--traindata', default='train.npz', type=str, help='data for training')
def buildmodel(l1, l2, nfolds, traindata):
    skf = StratifiedKFold(nfolds)
    dataset = np.load(traindata)['arr_0']
    X = dataset[:,:-1]
    y = dataset[:,-1]
    idx = np.random.permutation(y.size)
    X = X[idx]
    y = y[idx]
    num_cls = len(set(y))
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

    if l1 == 'all':
        l1 = list(l1_clf_dict.keys())
        clfs = list(l1_clf_dict.values())
    else:
        # split the csv string into a list so the reporting loop below works too
        l1 = l1.split(',')
        clfs = [l1_clf_dict[clf] for clf in l1]

    dataset_blend_train = np.zeros((train_X.shape[0], len(clfs), num_cls ))
    dataset_blend_test = np.zeros((test_X.shape[0], len(clfs), num_cls ))

    for j, clf in enumerate(clfs):
        dataset_blend_test_j = np.zeros((test_X.shape[0], nfolds, num_cls))
        for i, (train, val) in enumerate(skf.split(train_X, train_y)):
            print('fold {}'.format(i))
            X_train = train_X[train]
            y_train = train_y[train]
            X_val = train_X[val]
            y_val = train_y[val]
            clf.fit(X_train, y_train)
            # use clfs predicted value for next layer's training
            y_pred = clf.predict_proba(X_val)
            dataset_blend_train[val, j, :] = y_pred
            dataset_blend_test_j[:, i, :] = clf.predict_proba(test_X)
        dataset_blend_test[:, j, :] = dataset_blend_test_j.mean(1)
    l2_clf = l2_clf_dict[l2]
    print('Blending')
    print(dataset_blend_train.shape)
    dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
    l2_clf.fit(dataset_blend_train, train_y)

    dataset_blend_test = dataset_blend_test.reshape((dataset_blend_test.shape[0], -1))
    y_pred = l2_clf.predict_proba(dataset_blend_test)
    print(classification_report(test_y, np.argmax(y_pred, 1)))
    print(confusion_matrix(test_y, np.argmax(y_pred, 1)))

    print("*"*50)
    for i, key in enumerate(l1):
        print('layer 1 {}'.format(key))
        l1_pred = clfs[i].predict_proba(test_X)
        print(classification_report(test_y, np.argmax(l1_pred, 1)))
        print(confusion_matrix(test_y, np.argmax(l1_pred, 1)))  
        print("*"*50) 
    return (clfs, l2_clf)

if __name__=='__main__':
    buildmodel()

Now I have just finished the base code.
Let's make sample data and run the code. :)

# make data
import numpy as np
from sklearn.datasets import load_iris 
x = load_iris().data
y = load_iris().target
data = np.hstack((x, y.reshape(y.size, 1)))
np.savez('train.npz', data)

Run the code :)

iwatobipen$ python blending.py
train.npz
fold 0
fold 1
fold 2
--snip--
fold 7
fold 8
fold 9
Blending
(120, 5, 3)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      0.83      0.91         6
         2.0       0.92      1.00      0.96        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  5  1]
 [ 0  0 11]]
**************************************************
layer 1 RF
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.86      1.00      0.92         6
         2.0       1.00      0.91      0.95        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]
**************************************************
layer 1 ETC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.71      0.83      0.77         6
         2.0       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.87      0.88      0.88        30
weighted avg       0.91      0.90      0.90        30

[[13  0  0]
 [ 0  5  1]
 [ 0  2  9]]
**************************************************
layer 1 GBC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 XGB
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 SVC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         6
         2.0       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[13  0  0]
 [ 0  6  0]
 [ 0  0 11]]
**************************************************

All the classifiers in the first layer showed good performance. This is a very simple case with a very small dataset, so it is hard to estimate the merit of blending here. I will check other datasets next.

Ref.
http://www.chioka.in/stacking-blending-and-stacked-generalization/

Visualize important features of machine learning #RDKit

As you know, RDKit 2018.09.1 has very exciting methods named ‘DrawMorganBit’ and ‘DrawMorganBits’, which can render the bit information of a fingerprint.
This is described in the following blog post:
http://rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
And if you can read Japanese, excellent posts are also available, e.g.:
https://future-chem.com/rdkit-fp-visualize/

What I want to do in this blog post is visualize the important features of a random forest classifier.
With fingerprint-based predictors it is sometimes difficult to understand the feature importance of each bit. Using the new method, I think I can visualize the important features of active molecules.

Let’s try it.
At first, import several packages.

import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
from rdkit.Chem import rdBase
print(rdBase.rdkitVersion)

Then load the dataset and define the mol2fp function.

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
solclass = {'(A) low':0, '(B) medium': 1, '(C) high': 2}
n_splits = 10
def mol2fp(mol,nBits=1024):
    bitInfo={}
    # pass nBits explicitly; otherwise the RDKit default (2048 bits) would not
    # match the 1024-bit feature vectors assumed later in this post
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=nBits, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo
trainmols = [m for m in Chem.SDMolSupplier(trainpath)]
testmols = [m for m in Chem.SDMolSupplier(testpath)]

Then get X and Y.
In this blog post I used the solubility classification data.

trainX = np.array([mol2fp(m)[0] for m in trainmols])
trainY = np.array([solclass[m.GetProp('SOL_classification')] for m in trainmols], dtype=np.int)

testX = np.array([mol2fp(m)[0] for m in testmols])
testY = np.array([solclass[m.GetProp('SOL_classification')] for m in testmols], dtype=np.int)

Then train a RandomForestClassifier.
I used StratifiedKFold instead of KFold.
The difference between the two methods is well described at the following URL:
https://www.kaggle.com/ogrellier/kfold-or-stratifiedkfold
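A tiny illustration of the difference (my own, not from the post): StratifiedKFold keeps the class ratio in every fold, while plain KFold may not.

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # imbalanced labels
X = np.zeros((100, 1))
for _, test in StratifiedKFold(n_splits=5).split(X, y):
    print(np.bincount(y[test]))  # each fold holds 18 of class 0 and 2 of class 1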

rfclf = RandomForestClassifier(n_estimators=50)
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(trainX, trainY):
    trainx = trainX[train]
    trainy = trainY[train]
    testx = trainX[test]
    testy = trainY[test]
    rfclf.fit(trainx, trainy)
    predy = rfclf.predict(testx)
    print(accuracy_score(testy, predy))

Then get the important features; this part is a little bit tricky. I also picked the indices of the high-solubility molecules in the test set.

fimportance = rfclf.feature_importances_
fimportance_dict = dict(zip(range(1024), fimportance))
sorteddata = sorted(fimportance_dict.items(), key=lambda x:-x[1])
top50feat = [x[0] for x in sorteddata][:50]
solblemol_idx = []
for i, v in enumerate(testY):
    if v == 2:
        solblemol_idx.append(i)

Now we are ready.
Test!!!!

testidx = solblemol_idx[-1] 
testmols[testidx]

rfclf.predict([testX[testidx]])
array([2])

Get the fingerprint and bit information, and extract the important features from the on-bits.

fp, bitinfo = mol2fp(testmols[testidx])
onbit = [bit for bit in bitinfo.keys()]
importantonbits = list(set(onbit) & set(top50feat))

Finally, visualize the important bit information together with the substructures.

tpls = [(testmols[testidx], x, bitinfo) for x in importantonbits]
Draw.DrawMorganBits(tpls, legends = [str(x) for x in importantonbits])

The classifier detected substructures with oxygen atoms and sp3 carbons.
The example molecule has amino groups, but the predictor did not detect nitrogen as an important feature.
I think this is because the model is not well optimized.
Still, being able to visualize the features of a molecule like this is useful.
All the code is uploaded to Google Colab. Readers who are interested in the code can run it on Colab!
https://colab.research.google.com/drive/16NmSqHMD3Tw562PeCh0u22OB9oPyqCsw