Molecular descriptors are good indicators for molecular profiling. Visualizing and analyzing these descriptors is important for getting a bird’s-eye view of a given set of molecules.
I often use “pandas” and “seaborn” for this. Seaborn is a powerful tool for making cool visualizations, but it is difficult to get summary statistics out of it.
Yesterday, I found an interesting tool for analyzing pandas data frames, named “pandas_profiling”.
It seems to make generating an analysis report very easy. It can be installed with conda or pip.
https://github.com/pandas-profiling/pandas-profiling
Let’s install the package and use it!
First, import the libraries.
import os
import pandas as pd
import pandas_profiling
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import rdBase
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors

# I used the cdk2.sdf dataset as a test set.
datadir = os.path.join(RDConfig.RDDocsDir, "Book/data/cdk2.sdf")
Then calculate the descriptors and make a data frame.
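# Assumed setup (these lines are missing from the snippet as posted):
# load the molecules and compute every descriptor listed in _descList.
mols = [mol for mol in Chem.SDMolSupplier(datadir) if mol is not None]
desc_name = [desc[0] for desc in _descList]  # _descList holds (name, function) pairs
calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_name)
descs_list = [calc.CalcDescriptors(mol) for mol in mols]

# Collect the values column by column, one list per descriptor.
data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)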
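The loop above can also be written more compactly, because pandas accepts a list of rows together with the column names. Here is a minimal sketch with toy stand-ins for desc_name and descs_list; the descriptor names are just examples and the numbers are made up, not computed by RDKit.

```python
import pandas as pd

# Toy stand-ins for desc_name and descs_list; the values are illustrative only.
desc_name = ["MolWt", "MolLogP", "TPSA"]
descs_list = [
    (180.2, 1.3, 63.6),
    (151.2, 0.9, 49.3),
]

# Each tuple becomes one row and each descriptor name becomes one column.
df = pd.DataFrame(descs_list, columns=desc_name)
print(df.shape)
```

With the real data, the same one-liner should give the same data frame as the dictionary-based loop.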
Let’s make the report. It is very, very easy! ;-)
pandas_profiling.ProfileReport(df)
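In a Jupyter notebook the call above renders the report inline when it is the last expression in a cell. When running a plain script, the report can be written to an HTML file instead. This is a small sketch, assuming the installed pandas_profiling version provides ProfileReport.to_file; the helper name save_profile_report and the output file name are my own choices for illustration.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str = "descriptor_report.html") -> str:
    """Write a pandas_profiling report for df to an HTML file and return the path."""
    # Imported lazily so the helper can be defined even where
    # pandas_profiling is not installed yet.
    import pandas_profiling

    report = pandas_profiling.ProfileReport(df)
    # The output path (a hypothetical file name) is passed positionally.
    report.to_file(path)
    return path
```

Calling save_profile_report(df) after building the data frame should leave an HTML file that can be opened in any browser.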
Then you get an analysis report with bar charts.
Screenshots are below.
This package provides not only a summary of the dataset but also details of the data. It seems like a very cool package, doesn’t it?
You can check the whole code at the following URL.
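For comparison, a few of the numbers of this kind can also be pulled by hand with plain pandas. This is only a rough sketch on a toy data frame (the column names and values are made up), not what pandas_profiling does internally.

```python
import pandas as pd

# Toy data frame with one missing value; the numbers are illustrative only.
df = pd.DataFrame({
    "MolWt": [180.2, 151.2, 206.3, None],
    "NumHDonors": [1, 2, 1, 1],
})

summary = df.describe()    # count, mean, std, min, quartiles, max per column
missing = df.isna().sum()  # number of missing values per column
distinct = df.nunique()    # number of distinct non-missing values per column
corr = df.corr()           # pairwise Pearson correlation between columns
```

Doing this by hand for every descriptor column gets tedious quickly, which is why a one-line report is so convenient.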
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/pandas_profiling_test.ipynb