Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use Pandas for data analysis. RDKit provides useful method named PandasTools. The method can load sdf and return data as pandas dataframe. By using dataframe, It isn’t needed to do something with for loop.

I found an interesting information in rdkit issues. A package named pyjanitor.

The package wraps pandas and provides useful method not only basic data handling but also chemoinformatics.

The package can be installed with conda from conda-forge channel. I used the library. Following code is example on my jupyter notebook. At first, read libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load sdf data which is provided from rdkit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)

pyjanitor has chemoinformatics function, one is morgan_fingerprint. It can calculate data and add the result in current dataframe. To do it, it is very easy ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get fingerprint information as pandas dataframe which shape is (num of compounds, num of bits).

Conduct PCA with fingerprint.

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)

janitor.chemistry has smi2mol method. It converts SMILES to mol objects in pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has add_column method. To use the method, I can add multiple columns with chain method.

# add_column function is not native pandas method.
# df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetAtoms) 
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])

It is easy to make fingerprint dataframe with janitor. And pyjanitor has other useful methods for pandas data handling.



Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: