Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicator for molecular profiling. Visualize and analyze these descriptors are important to have a bird’s-eye view of given molecules set.

I often use “pandas” and “seaborn” to do it. Seaborn is powerful tool to make cool visualization but difficult to obtain statistics data.

Yesterday, I found interesting tool to analyze pandas data frame named “pandas_profiling”.
It seems very easy to make analyze report. It can be installed with conda/pip.

Let’s install the package and use!
First, call library.

import os
import pandas
import pandas_profiling
import pandas as pd
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import rdBase 
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used cdk2.sdf dataset as test.
datadir =  os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

Then calculate descriptors and make dataframe

data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
df = pd.DataFrame(data)

Let’s make report. It is very very easy!!!! ;-)


Then you can get analyze repot with bar chart.
Snap shots are below.

This package provides not only summary of the dataset but also details of the data. It seems very cool package isn’t it?
You can check whole code is following URL.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.