New molecular fingerprint for chemoinformatics #map4 #RDKit #memo #chemoinformatics

Molecular fingerprint(FP) is a very important for chemoinformatics because it is used for building many predictive models not only ADMET but also biological activities.

As you know, ECFP (Morgan Fingerprint) is one of golden standard FP of chemoinformatics. Because it shows stable performance against any problems. After ECFP is reported, many new fingerprint algorithm is proposed but still ECFP is used in many case.

It means that more improvement of fingerprint is required but it’s too difficult task.

Recently new fingerprint algorithm named MAP4 is proposed from Prof. Reymond lab. URL is below.

MAP of MAP4 means “MinHashed atom-pair fingerprint up to a diameter of four bonds“.

The calculation process is below.

First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, and rooted SMILES string .

Second, the minimum topological distance TPj,k separating each atom pair(j, k) in the input molecule is calculated.

Third,all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.

Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and its corresponding transposed vector is finally MinHashed to form the MAP4 vector.

It seems similar approach to ECFP but this approach uses minhashing techniques. It works very fast and this fingerprint outperform compared to their developed MHFP, ECPF and other fingerprints which are impremented RDKit.

In their article Fig2 showed performance of virtual screening against various targets and MAP4FP outperformed in many targets.

In the Fig8 shows Jaccard distance between set of molecules. MAP4 shows better performance against not only small molecule but also large molecule such as peptide, compared to other finger print such as atom pair FP, ECFP etc.

So I would like to use the FP on my notebook, so I tried to use it.

The author disclosed source code so I could get code from github.

git clone
cd map4

I think original repo has bug for folded fingerprint calculation so I maed PR to original repo.

And following code used modified version of original code.

I compared the FP against morganFP came from rdkit.

Today’s code was uploaded my gist and github.

MAP4Calculator provides calculate and calculate_many methods. The first one calculate MAP4FP of molecule and second one calculates MAP4FP of list of molecules.

is_folded option is false in default, so to get folded(fixed length of) finger print, user need to change the option from False to True.

The test data shown above, Moragn FP seems more sensitive to difference of compound structure. Because similarity is lower to MAP4FP. and folded and unfolded MAP4FP showed almost similar performance.

Today I showed how to calculate MAP4FP, so I would like to check the FP performance with real drug discovery project with any machine learning algorithms. :-)

I also uploaded the code to my github repo.

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
def show_png(data):
    bio = io.BytesIO(data)
    img =
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794), trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

Any comments suggestions and requests will be greatly appreciated. ;)


Python package for Ensemble learning #Chemoinformatics #Scikit learn

Ensemble learning is a technique for machine learning. I wrote post about blending learning before. URL is below.
I implemented the code by myself at that time.

Ensemble learning sometime outperform than single model. So it is useful for try to use the method. Fortunately now we can use ensemble learning very easily by using a python package named ‘ML-Ens‘ Installation is very easy, only use pip command common way for pythonista I think ;)

After installing the package user can build and train ensemble learning model with few lines. I would like to introduce two example of them one is stacking method and the other is a blending method. OK let’s go to code.

At first, load dataset and make input features. I used morgan fingerprint as input data.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
import numpy as np
import pandas as pd
from IPython.display import HTML
traindf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf'))
testdf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf'))
# Chek data

cls2lab = {'(A) low':0, '(B) medium':1, '(C) high':2}

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in traindf.ROMol]
testfp =  [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in testdf.ROMol]
trainX = np.array([fp2np(fp) for fp in trainfp])
testX = np.array([fp2np(fp) for fp in testfp])
trainY = np.array([cls2lab[i] for i in traindf.SOL_classification.to_list()])
testY =  np.array([cls2lab[i] for i in testdf.SOL_classification.to_list()])

Then import several package for ensemble learning. SuperLearner is class for stacking and BlendEnsemble is class for blending.

Making ensemble model is easy. Just use add method to layer addition and finally call add_meta method for adding final prediction layer.

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn.svm import SVR, SVC

# For stacnking
ensemble = SuperLearner(scorer=accuracy_score, random_state=794, verbose=2)
ensemble.add([RandomForestClassifier(n_estimators=100, random_state=794), SVC(gamma='auto', C=1000)])
ensemble.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto')), trainY)
pred = ensemble.predict(testX)
accuracy_score(testY, pred)

# Blending
from mlens.ensemble import BlendEnsemble
ensemble2 = BlendEnsemble(scorer=accuracy_score, test_size=0.2, verbose=2)
ensemble2.add([RandomForestClassifier(n_estimators=794, random_state=794),
ensemble2.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto')), trainY)
pred_b = ensemble2.predict(testX)
accuracy_score(pred_b, testY)

Also more models can added with add method. I uploaded whole code on my gist. After calling fit, it is easy to access result data by using data method.

code example

Unfortunately the ensemble models described in the post don’t outperform single random forest model but mlens is nice tool for ensemble learning there is still room of improvement for model performance such as kind of models, hyper parameters etc.

Original document give more informations. Please go to link if reader has interest.

Python package of machine learning for imbalanced data #machine_learning #chemoinformatics

Recently I’m struggling with imbalanced data. I didn’t have any idea to handle it. So my predictive model showed poor performance.

Some days ago, I found useful package for imbalanced data learning which name is ‘imbalanced learn‘.

It can be installed from conda. The package provides methods for over sampling and under sampling.

I had interested it and try to use it for drug discovery dataset.

Following example used two methods, SMOTE(Synthetic Minority Over-sampling Technique) and ADASYN(Adaptive Synthetic sampling approach).

Both methods are over sampling approach so they generate artificial data.

At first import packages and load dataset.

%matplotlib inline
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from sklearn.decomposition import PCA
df = pd.read_csv('chembl_5HT.csv')
df = df.dropna()

Made imbalanced label.

# define class pIC50 >9 is active and other is inactive.
df['CLS'] = np.array(df.pchembl_value > 9,

Then calculate finger print with RDKit.

mols = [Chem.MolFromSmiles(smi) for smi in df.canonical_smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fp2np(fp) for fp in fps])
Y = df.CLS.to_numpy()
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state=123, test_size=0.2)

Finally, apply the data to 3 approaches, default, SMOTE and ADASYN.

>(11340, 2048)
rf = RandomForestClassifier(n_estimators=10), train_Y)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      2586
           1       0.57      0.42      0.48       250

    accuracy                           0.92      2836
   macro avg       0.76      0.69      0.72      2836
weighted avg       0.91      0.92      0.92      2836

[[2506   80]
 [ 145  105]]

Then try to re sampling. After resampling, ratio of negative and positive is fifty fifty.

X_resampled, Y_resampled = SMOTE().fit_resample(train_X, train_Y)
>(20710, 2048)

rf = RandomForestClassifier(n_estimators=10), Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      2586
           1       0.47      0.48      0.48       250

    accuracy                           0.91      2836
   macro avg       0.71      0.72      0.71      2836
weighted avg       0.91      0.91      0.91      2836

[[2451  135]
 [ 129  121]]
X_resampled, Y_resampled = ADASYN().fit_resample(train_X, train_Y)
>(20884, 2048)
rf = RandomForestClassifier(n_estimators=10), Y_resampled)
pred_Y = rf.predict(test_X)

clsreport = classification_report(test_Y, pred_Y)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      2586
           1       0.44      0.47      0.46       250

    accuracy                           0.90      2836
   macro avg       0.70      0.71      0.70      2836
weighted avg       0.90      0.90      0.90      2836

[[2437  149]
 [ 132  118]]

Hmm, re sampling results are not improved..

Let’s check chemical space with PCA.

pca = PCA(n_components=3)
res = pca.fit_transform(X)

col = {0:'blue', 1:'yellow'}
color = [col[] for i in Y]
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

Negative and positive(yellow) is located very close.

In my example, generated fingerprint is not true molecule. Finally did PCA with re sampled data.

pca = PCA(n_components=3)
res = pca.fit_transform(X_resampled)
col = {0:'blue', 1:'yellow'}
color = [col[] for i in Y_resampled]
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

Wow both class located same chemical space… It seems difficult to classify for me. But randomforest showed better performance.

I need to learn more and more how to handle imbalanced data in drug discovery….

Imbalanced-learn has also under sampling method. I would like to try under sampling method if I have time to do it.

Machine learning workflow tool for none programmer #memo #machinelearning #dss

I’m on summer vacation. This summer is high temperature and humidity….. So it is tough for me to running. ;-( And now very big typhoon is coming to Japan. Oops…

Let’s leave that aside for now.

Today I would like to introduce very cool tool for machine learning. Recently we can use many machine learning package with python. Of course I use them for example scikit-learn, tenthorflow, pytorch and kearas etc etc…

It is good tool for programmer but it is little bit difficult for none programmer. Today I found very interesting tool for machine learning which name is data science studio(DSS).

DSS is developed by dataiku where begines in 2013 very new company. DSS is the collaborative data science software. I saw demo on youtube and felt nice. I would like to use it. Fortunately DSS can use freely. User can choice some options for installation, I chose Docker for DSS installation.

It was very easy to build the environment. Just type following 2 lines. ;-)

docker pull dataiku/dss
docker run -p 10000:10000 -d dataiku/dss

Now dss container is build and start. I can access the container localhost:10000.

I made test project named boston which is machine learning project with boston dataset from sklearn.

After uploading the sample data, I could see data make regression model very easily. Following screen shot is view of script. It is easy to filter, remove data, just click and select action from column header.

And also it is very easy to make chart just select chart style and drag&drop the column which you want to visualize. Following example is the scatter plot TAX vs LSAT.

Model building is easy! Just click on/off panel which you want to use. Of course it is easy to set up many combination of hyperprameters.

Now ready just click train button for training.

DSS track all training result and after training I could access all training results.

Following sc is decision trees of Gradient Boost and variables importance.

DSS has many algorithms as default but it is easy to implement your own algorithms.

Work flow of machine learning is datapreparation, analyze data, prepare and optimize model and test the model. DSS is very easy to make the work flow.

The tool don’t have chemoinformatics tools such as rdkit, openbabel etc, but I think it can install DSS server.

Recently progress of machine learning are is very fast. I have to think which is more efficient coding by myself or using workflow tool kit.

Coding is very interesting for me but it will take more time….

Make original sklearn classifier-2 #sklearn #chemoinfo

After posted ‘Make original sklearn classifier’, I could get comment from my follower @yamasaKit_-san and @kzfm-san. (Thanks!)

So I checked diversity of models with principal component analysis(PCA).
The example is almost same as yesterday but little bit different at last part.
Last part of my code is below. Extract feature importances from L1 layer classifiers and mono-random forest classifier. And then conduct PCA.

labels = ["rf", "et", "gbc", "xgbc", "mono_rf"]
feature_importances_list = [clf.feature_importances_ for clf in blendclf.l1_clfs_]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
res = pca.fit_transform(feature_importances_list)

Then plot results with matplotlib. And I used adjustText library for plot labeling. adjustText is powerful package for making nice view of labeled plot.

from adjustText import adjust_text
x, y = res[:,0], res[:,1]
plt.plot(x, y, 'bo')
plt.xlim(-0.1, 0.3)
plt.ylim(-0.05, 0.1)

texts = [plt.text(x[i], y[i], '{}'.format(labels[i])) for i in range(len(labels))]

The plot indicates that two models ET, RF learned similar feature importances to mono Randomforest but XGBC and GBC learned different feature importance. So, combination of ET and RF is redundant for Ensemble learning I think.

To build good ensemble model, I need to think about combination of models and how to tune many hyper parameters.

You can check whole code from the URL below.

Tracking progress of machine learning #MachineLearning

To conduct machine learning it is needed to optimize hyper parameters.
For example scikit-learn provides grid search method. And you know there are several packages to do that such as hyperopt or gyopt etc. How do you mange builded models? It is difficult for me.
Recently I am interested in mlflow . MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions.
MLflow can track each model of hyper parameters and serve the models and also it can provide good web UI.
I used it in very simple example.
Code is below.
At first I got sample data and visualize the data set with PCA.

# on the ipython notebook
%matplotlib inline
!wget -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import as cm
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
data = pd.read_csv('./data/wine-quality.csv')

cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))

Next train function is defined.
mlflow.log_param function can track scoring parameters and mlflow.sklearn.log_model can store the model.
After running the code, mlruns folder is generated in current directory and stored data.

def train():

    import os
    import warnings
    import sys
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    data = pd.read_csv('./data/wine-quality.csv')
    train, test = train_test_split(data)

    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
      , train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print("  RMSE: %s" % rmse)
                print("  MAE: %s" % mae)
                print("  R2: %s" % r2)
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")

Run the function.

0.01 1.0
  RMSE: 0.876717955304063
  MAE: 0.6586558965180616
  R2: 0.007250505904323745
0.01 0.1
  RMSE: 0.872902609847314
  MAE: 0.6523680676966712
  R2: 0.015872299345786156
10000 0.0001
  RMSE: 0.7902872331540974
  MAE: 0.570097398346025
  R2: 0.19334133272639453

After running the code, MLflow can provide very useful webUI. To access the UI, just type following command from terminal ;-).
And then access

iwatobipen$ mlflow server

I can check the summary of the training with metrics like below.

And each model is stored. I can see details and access each model like below.

It is useful to manage many experiments and many models I think.