Probabilistic Random Forest approach to predict experimental value #RDKit #chemoinformatics #machine_learning

To build a predictive model, input values (X) and target values (y) are required. In drug discovery, however, target values always carry experimental error. Any experimental (target) value therefore has some uncertainty, which makes it difficult to build a predictive model.

Recently, Ola Engkvist's group at AZ (AstraZeneca) published an interesting article in the Journal of Cheminformatics. The article is below.

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00539-7#Sec25

The authors use a probabilistic random forest (PRF) to handle noisy data and improve the performance of bioactivity prediction models. The difference between a normal RF and PRF is that the ‘PRF algorithm treats the labels as probability distribution functions, rather than deterministic quantities’. Fig. 2 in the article shows a schematic representation of how a pChEMBL value is converted into the ideal y-label probability using the cumulative distribution function (cdf) with different bioactivity thresholds and standard deviation (SD) values. The case where SD is 0 corresponds to a traditional RF. The figure shows how the target label is affected by experimental error.

Fig. 2
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00539-7/figures/2
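As I understand the figure, the conversion can be written as a short formula. This is my own notation, not taken from the article: x is the measured pChEMBL value, T is the bioactivity threshold, σ is the assumed SD of the experimental error, F is the cdf of a normal distribution centred on x, and Φ is the standard normal cdf.

```latex
P(y = \text{active}) \;=\; 1 - F(T;\ \mu = x,\ \sigma) \;=\; \Phi\!\left(\frac{x - T}{\sigma}\right)
```

For example, with x = 7.0, T = 6.5 and σ = 0.5 the label probability is Φ(1) ≈ 0.84 instead of a hard 1. As σ approaches 0 the curve turns into a step function at T, which is the deterministic label a traditional RF uses.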

In the article, they compared the bioactivity prediction performance of PRF against a normal RF, and in many cases PRF outperformed RF. So I became interested in PRF. Fortunately, the PRF source code is available in the author’s GitHub repo. So I cloned the repo and installed it, then checked its performance with the solubility data provided with RDKit.

The whole code is uploaded to my gist (prf_demo.ipynb).


To use PRF, the target values should be converted to probability distribution functions. To do this, the scipy.stats.norm.cdf function is used.
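Here is a minimal sketch of that conversion step. It is not the code from the notebook: the function name to_active_probability, the threshold of 6.5, the SD values and the measured values are all made up for illustration, and I assume Gaussian experimental error. Only scipy.stats.norm.cdf itself comes from the text above.

```python
import numpy as np
from scipy.stats import norm


def to_active_probability(values, threshold, sd):
    """Probability that the true value lies at or above `threshold`,
    assuming Gaussian experimental error with standard deviation `sd`.

    Hypothetical helper for illustration, not part of the PRF package.
    """
    values = np.asarray(values, dtype=float)
    if sd == 0:
        # No uncertainty: hard 0/1 labels, as in a traditional RF.
        return (values >= threshold).astype(float)
    # P(true value > threshold) = 1 - cdf(threshold; mean=value, sd)
    return 1.0 - norm.cdf(threshold, loc=values, scale=sd)


# Made-up measurements, not taken from the solubility data set.
measured = np.array([4.0, 6.0, 6.5, 7.0, 9.0])
for sd in (0.0, 0.3, 0.6):
    probs = to_active_probability(measured, threshold=6.5, sd=sd)
    print(f"SD={sd}:", np.round(probs, 3))
```

Values far from the threshold stay close to 0 or 1, while values near the threshold move toward 0.5 as the SD grows. These soft labels are what would be passed to PRF in place of hard class labels.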

On this data set, RF outperformed PRF in most cases, but PRF outperformed RF when the data had a large deviation.

In summary, PRF seems to me a reasonable and useful approach. Much experimental data has deviation, so it is important to build models that take this deviation into account.

It’s interesting to me that RF is an old machine learning approach but still has room for improvement.

Thanks for reading ;)

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
