I often use pandas for data analysis. RDKit provides a useful module named PandasTools, which can load an SDF file and return the data as a pandas dataframe. With a dataframe, there is no need to write explicit for loops.
I found an interesting piece of information in the RDKit issues: a package named pyjanitor.
The package wraps pandas and provides useful methods, not only for basic data handling but also for chemoinformatics.
The package can be installed with conda from the conda-forge channel. I tried the library, and the following code is an example from my Jupyter notebook. First, import the libraries for the trial.
%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig
import pandas as pd
import janitor
from janitor import chemistry
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from sklearn.decomposition import PCA
plt.style.use('ggplot')
Then load the SDF data that ships with RDKit.
path = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)
pyjanitor has chemoinformatics functions, and one of them is morgan_fingerprint. It calculates fingerprints from the molecule column of the dataframe and returns the result as a dataframe. It is very easy to do ;)
fp1 = chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')
Now I have the fingerprint information as a pandas dataframe whose shape is (number of compounds, number of bits).
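For comparison, here is a minimal sketch of how a similar bit-fingerprint dataframe could be built with plain RDKit and pandas. This is not pyjanitor's actual implementation. The helper name morgan_bits_frame and the three example SMILES are my own assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits_frame(mols, radius=2, nbits=512):
    """Hypothetical helper: one row per molecule, one 0/1 column per bit."""
    rows = []
    for mol in mols:
        # Folded Morgan fingerprint as an explicit bit vector.
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
        arr = np.zeros(nbits, dtype=np.uint8)
        # Switch on the positions that are set in the bit vector.
        arr[list(fp.GetOnBits())] = 1
        rows.append(arr)
    return pd.DataFrame(np.vstack(rows))


# Toy molecules used only for this sketch (not the cdk2 data set).
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

fp_df = morgan_bits_frame(mols)
print(fp_df.shape)  # (number of molecules, number of bits)
```

The point of the janitor function is that this loop disappears into a single call on the dataframe.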
Next, conduct PCA with the count-based fingerprint.
fp2 = chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)
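To check what the PCA result looks like, a small sketch like the one below may help. Note that fp_counts here is a randomly generated stand-in for the count fingerprint dataframe, not real chemistry data, and the column names PC1 to PC3 are my own choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for a (compounds x bits) count fingerprint table.
# Assumption for illustration only: 20 "compounds", 512 "bits".
rng = np.random.default_rng(0)
fp_counts = pd.DataFrame(rng.poisson(0.3, size=(20, 512)))

pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp_counts)

# Wrap the projected coordinates in a dataframe for easier handling.
pca_df = pd.DataFrame(pca_res, columns=["PC1", "PC2", "PC3"])

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(pca_df.head())
```

With the real fingerprint dataframe in place of fp_counts, pca_df can be passed to a scatter plot of PC1 against PC2.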

janitor.chemistry also has a smiles2mol method. It converts a column of SMILES strings to mol objects in a pandas dataframe.
df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df.head(2)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')
janitor also has an add_column method. With it, I can add multiple columns by method chaining.
# add_column is a pyjanitor method, not a native pandas method.
# Native pandas equivalent:
# df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetNumAtoms)
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])
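For reference, native pandas can also add several columns in one chained expression with DataFrame.assign. The sketch below assumes a toy dataframe built from three SMILES strings. The NumBonds column is my own addition for illustration.

```python
import pandas as pd
from rdkit import Chem

# Toy dataframe used only for this sketch (not the cdk2 data set).
toy = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"]})
toy["ROMol"] = toy["SMILES"].apply(Chem.MolFromSmiles)

# assign returns a new dataframe, so calls can be chained
# in the same way as janitor's add_column.
out = toy.assign(
    NumAtm=lambda d: d["ROMol"].apply(lambda m: m.GetNumAtoms()),
    NumBonds=lambda d: d["ROMol"].apply(lambda m: m.GetNumBonds()),
)

print(out[["SMILES", "NumAtm", "NumBonds"]])
```

The difference is mostly style: add_column takes the column name and values as arguments, while assign takes keyword arguments.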
It is easy to make a fingerprint dataframe with janitor, and pyjanitor has other useful methods for pandas data handling.
code
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/pyjanitor_test.ipynb
pyjanitor
https://pyjanitor.readthedocs.io/index.html