Useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task in chemoinformatics, and I often use RDKit for it. Today I found a very useful package for descriptor calculation called descriptastorus. The URL is below.

https://github.com/bp-kelley/descriptastorus

The package is very easy to install. Just run the following command.

pip install git+https://github.com/bp-kelley/descriptastorus

After that, I could use the package.

With the package, the following descriptors can be calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried the package with a very simple example.

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
# Build a generator for the plain RDKit 2D descriptors
gen1 = MakeGenerator(('rdkit2d',))
# Pyridine as a SMILES string
smi = 'c1ccncc1'
data = gen1.process(smi)
data
>out
[True,
3.000000000000001,
75.86113958768547,
4.242640687119286,
3.333964941448087,
3.333964941448087,
…]

Of course it is easy to get column names.

for col in gen1.GetColumns():
    print(col[0])
>out

RDKit2D_calculated
BalabanJ
BertzCT
Chi0
Chi0n
Chi0v
Chi1
Chi1n
Chi1v
Chi2n
Chi2v
Chi3n
Chi3v
Chi4n
Chi4v
EState_VSA1
EState_VSA10

The first entry, RDKit2D_calculated, indicates whether the calculation succeeded.
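Since the column names and the values come back in the same order, it is handy to pair them up and check the success flag at the same time. Below is a minimal sketch of that idea. The helper name row_to_dict is my own (it is not part of descriptastorus), the type entries in columns are stand-ins, and treating a missing row as a failure is an assumption on my side. The sample values are the first six entries of the pyridine output shown above.

```python
from typing import Dict, Optional, Sequence, Tuple


def row_to_dict(columns: Sequence[Tuple[str, object]],
                row: Optional[Sequence[object]]) -> Optional[Dict[str, object]]:
    """Pair column names with values; return None if the calculation failed."""
    # Assumption: a missing/empty row or a falsy first entry means failure.
    if not row or not row[0]:
        return None
    names = [col[0] for col in columns]
    # Skip the leading success flag so only descriptor values remain.
    return dict(zip(names[1:], row[1:]))


# Hypothetical stand-in for gen1.GetColumns(): (name, type) pairs.
columns = [("RDKit2D_calculated", bool), ("BalabanJ", float), ("BertzCT", float),
           ("Chi0", float), ("Chi0n", float), ("Chi0v", float)]
# First six values copied from the output of gen1.process(smi) above.
row = [True, 3.000000000000001, 75.86113958768547,
       4.242640687119286, 3.333964941448087, 3.333964941448087]

descriptors = row_to_dict(columns, row)
print(descriptors["BertzCT"])
```

With the real generator, the same helper should work as row_to_dict(gen1.GetColumns(), gen1.process(smi)).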

The package also provides normalized descriptors, which are useful for machine learning. They are just as easy to get.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
# Print each raw value next to its normalized counterpart
for raw, norm in zip(data2, data3):
    print(raw, norm)
>out
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759
….
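To make the comparison easier to read, the printed pairs can be lined up with descriptor names. This is only a sketch that uses the numbers copied from the output above. I am assuming that the normalized generator returns its columns in the same order as the plain rdkit2d generator, so the names from gen1.GetColumns() apply to both lists.

```python
# Descriptor names in the order printed by gen1.GetColumns() (success flag skipped).
names = ["BalabanJ", "BertzCT", "Chi0", "Chi0n", "Chi0v", "Chi1", "Chi1n", "Chi1v"]

# Raw and normalized values for pyridine, copied from the output above.
raw = [3.000000000000001, 75.86113958768547, 4.242640687119286,
       3.333964941448087, 3.333964941448087, 3.0,
       1.8497311128276561, 1.8497311128276561]
normalized = [0.9749367759562906, 0.0018713336795980954, 0.0001226379059079931,
              0.00012708489238074725, 9.889979057774486e-05, 0.00031922002005261777,
              0.0003685049537727884, 0.0005439421913480759]

# Print a small aligned table: name, raw value, normalized value.
print(f"{'descriptor':<12}{'raw':>12}{'normalized':>14}")
for name, r, n in zip(names, raw, normalized):
    print(f"{name:<12}{r:>12.4f}{n:>14.6f}")

# Every normalized value shown here falls between 0 and 1.
in_unit_interval = all(0.0 <= n <= 1.0 for n in normalized)
print(in_unit_interval)
```

For these eight descriptors the raw values differ by an order of magnitude, while the normalized ones all sit between 0 and 1.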

Descriptastorus can also make a DescriptaStore. I could not try it because I couldn’t install kyotocabinet in my environment….

In summary, the package is very useful for chemoinformaticians because users can get many of RDKit’s descriptors by typing only a few lines.

It is amazing to me that we can use such useful open source packages so easily.

Thanks to the OSS developers!


Today’s gist is below.

2 thoughts on “Useful package for descriptor calculation #chemoinformatics #rdkit”

  1. Thank you for sharing this package ;)
    What is the difference between normalized and non-normalized descriptors? I guess one has to normalize with respect to some dataset.
    I plan to use these descriptors to do PCA for some smiles, I guess the normalized descriptors are better for PCA?

    • I agree with your comment. However, normalized input is not always better than unnormalized data for PCA.
      So how about trying PCA with both datasets?
