Useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task for chemoinfomatics. I often use rdkit to do it. And today I found very useful package for descriptor calculation which name is descriptorus. URL is below.

https://github.com/bp-kelley/descriptastorus

It is very easy to install the package. Just following command.

pip install git+https://github.com/bp-kelley/descriptastorus

After did it, I could use the package.

By using the package, following descriptors are calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried to use the package. Very simple example.

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
gen1 = MakeGenerator((‘rdkit2d’,))
smi = ‘c1ccncc1’
data=gen1.process(smi)
data
>out
[True,
3.000000000000001,
75.86113958768547,
4.242640687119286,
3.333964941448087,
3.333964941448087,
…]

Of course it is easy to get column names.

for col in gen1.GetColumns():
print(col[0])
>out

RDKit2D_calculated
BalabanJ
BertzCT
Chi0
Chi0n
Chi0v
Chi1
Chi1n
Chi1v
Chi2n
Chi2v
Chi3n
Chi3v
Chi4n
Chi4v
EState_VSA1
EState_VSA10
…

The first row indicates whether the calculation is successful or not.

And the package provides normalized descriptors which are useful for machine learning. It is easy to get it.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
for i in range(len(data2)):
print(data2[i], data3[i])
>out
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759
….

Descriptorus can make a DescriptaStore. I failed it because I couldn’t kyotocabinet in my env….

In summary, the package is very useful for chemoinformatician because user can get many rdkit’s descriptors only typing few lines.

It is amazing for me that we can use such an useful open source packages very easily.

Thanks for OSS developers!

Today’s gist is below.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw descriptorus_example.ipynb hosted with ❤ by GitHub

2 thoughts on “Useful package for descriptor calculation #chemoinformatics #rdkit”

Thank you for sharing this package ;)
What is the difference between normalized and non-normalized descriptors? I guess one has to normalize with respect to some dataset.
I plan to use these descriptors to do PCA for some smiles, I guess the normalized descriptors are better for PCA?

iwatobipen says:

05/11/2019 at 22:30

I agree your comment. However it isn’t always true that normalized input is better than unnormalized data for PCA.
So how about to try PCA with both dataset?

Reply

Useful package for descriptor calculation #chemoinformatics #rdkit

Published by iwatobipen

2 thoughts on “Useful package for descriptor calculation #chemoinformatics #rdkit”

Leave a comment Cancel reply

Related

Published by iwatobipen

2 thoughts on “Useful package for descriptor calculation #chemoinformatics #rdkit”

Leave a comment Cancel reply