Tips for MCS of RDKit

FindMCS is a useful function for me, because I sometimes want to extract the common substructure from a set of compounds.
But with a large set of compounds, it tends to give me boring results like an ethyl group and so on. It’s no wonder.

The FindMCS function of RDKit has a unique solution to that. With the “threshold” option I can define the minimum fraction of molecules that must contain the MCS.
One tip I found: the result of FindMCS with this option depends on the order of the molecules.
See the following code.

from rdkit import Chem
from rdkit.Chem import MCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import RDConfig
from rdkit.Chem import FragmentCatalog

mol1 = Chem.MolFromSmiles("Cc1ccccc1")
mol2 = Chem.MolFromSmiles( "CCc1ccccc1" )
mol3 = Chem.MolFromSmiles( "Oc1ccccc1" )
mol4 = Chem.MolFromSmiles( "COc1ccccc1" )


OK, get the MCS.

res = MCS.FindMCS([mol1,mol2,mol3,mol4], threshold=0.5)
res2 = MCS.FindMCS([mol4,mol3,mol2,mol1], threshold=0.5)




A different order of molecules gave a different result. I will keep that in mind!!!
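The rdkit.Chem.MCS module used above comes from older RDKit releases. As a hedged sketch, the same comparison can be tried with the newer rdkit.Chem.rdFMCS module, assuming it is available in your RDKit version. The helper name mcs_for_order is made up for this example, and I do not assume that rdFMCS shows the same order dependence; the code only prints both results so they can be compared.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

SMILES = ["Cc1ccccc1", "CCc1ccccc1", "Oc1ccccc1", "COc1ccccc1"]


def mcs_for_order(smiles_list, threshold=0.5):
    """Run FindMCS on the molecules in the given order (hypothetical helper)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    # threshold is the fraction of molecules that must contain the MCS.
    return rdFMCS.FindMCS(mols, threshold=threshold)


forward = mcs_for_order(SMILES)
backward = mcs_for_order(list(reversed(SMILES)))

# Print atom count, bond count and SMARTS for both input orders.
for label, res in (("forward", forward), ("reversed", backward)):
    print(label, res.numAtoms, res.numBonds, res.smartsString)
```

If the two SMARTS strings differ, the order matters for this module as well; if they are the same, the effect may be specific to the older MCS module.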

An article in JMC.

My background is organic chemistry. I was trained in medicinal chemistry on the job.
Training in medicinal chemistry was discussed in the following article, and it was impressive for me.

“On-the-job training” is important for medicinal chemists because it builds their background in molecular design etc.
The author said:
For example, medicinal chemists who work primarily on CNS diseases will likely face different problems than oncology specialists.

BTW, the environment around us is changing.

Chemistry CRO’s worldwide have realized explosive growth as the demand grows for low cost, budget flexible synthesis support. And large pharmas have undoubtedly realized significant operating cost savings as a result of chemistry staff reductions and elimination of expensive synthesis lab space.
I think the sentence is true for large pharma, but it does not apply to mid-size or small pharma. Still, this trend will become mainstream in research.
It means a loss of opportunities for on-the-job training in medicinal chemistry and synthetic chemistry.

What is the key factor in training medicinal chemists?
Knowledge, skill, mindset…? We often discuss drug-likeness, and there are several parameters for it, for example LE, LLE, LogP etc. But I think the key is experience: what kind of problems they solved, and what kind of chemical series they made.
They will get a sense of medicinal chemistry through these experiences.

One of the references of the article describes an interesting experiment.
In the reference, the author tried a simple experiment with a group of 13 chemist volunteers. Each chemist marked the compounds they judged unacceptable in a list of 2000 candidate structures. Each list was unique, with the exception of 250 embedded compounds that had previously been rejected as lead candidates.
Surprisingly, only one embedded compound was rejected by all 13 chemists! Only one!!! @_@
The result also indicated that the likelihood of a chemist repeating the same judgement on a compound was roughly 50%. I think it is a huge task to get consistent compound selection among chemists….. Of course, I have no confidence that I could select the same set of compounds from a large dataset.

Back to the article. Table 1 shows a notable fact. The table describes the ratio of publications in JMC that considered properties during optimization.
The ratio is low in U.S. academia, and the author discusses the data. If you are interested, I recommend reading the article.

Automatic molecular designer

Recently, there are many articles and news items about deep learning on the web, in journals, etc.

“AI” is a hot word. You know, one very unique example is a system named “Deep Dream”, a computer vision program created by Google. The system generates new images from a trained network. Neural network apps can also generate songs, documents, poems and so on.

I am interested in making a molecular designer using AI like Deep Dream. It means a MedChem AI!

Yesterday I found an amazing article. The title was “Automatic chemical design using a data-driven continuous representation of molecules”.

One of the authors belongs to Twitter Inc. That’s interesting too. Is Twitter Inc. interested in drug discovery?? Hmm…

The strategy of the article felt like text mining to me, because they used SMILES strings as the input layer.

They used an autoencoder (an encoder and a decoder) to generate molecules as SMILES strings. RDKit is used for descriptor calculation and for validating the SMILES.😉
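To make the “SMILES as input layer” idea concrete, here is a minimal sketch of one common way to turn a SMILES string into a fixed-size numeric array: character-level one-hot encoding with padding. The character set, the padding character and the maximum length of 20 are assumptions for illustration, not values taken from the article.

```python
# Assumed toy character set; a real model would build this from its training data.
CHARSET = [" ", "C", "O", "N", "c", "1", "(", ")", "="]
MAX_LEN = 20  # assumed fixed length; shorter strings are padded with spaces


def one_hot_smiles(smiles, charset=CHARSET, max_len=MAX_LEN):
    """Encode a SMILES string as a max_len x len(charset) matrix of 0/1."""
    if len(smiles) > max_len:
        raise ValueError("SMILES is longer than max_len")
    index = {ch: i for i, ch in enumerate(charset)}
    padded = smiles.ljust(max_len)  # pad on the right with spaces
    matrix = []
    for ch in padded:
        if ch not in index:
            raise ValueError("character %r is not in the charset" % ch)
        row = [0] * len(charset)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix


def decode_one_hot(matrix, charset=CHARSET):
    """Turn a one-hot matrix back into a string, dropping the padding."""
    chars = [charset[row.index(1)] for row in matrix]
    return "".join(chars).rstrip()


encoded = one_hot_smiles("COc1ccccc1")
print(len(encoded), len(encoded[0]))
print(decode_one_hot(encoded))
```

Every molecule becomes a matrix of the same shape, which is what a neural network needs as input. The compression into a continuous vector needs a neural network library and is not shown here.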

After training and optimizing the hyperparameters, the NN could generate similar but different molecules. It was amazing for me.

I think the work is nice, but there are some problems.

1st, this system uses SMILES strings as input and converts them to a fixed-length vector. Does the vector represent the primitives of molecules?

2nd, I think the vector depends on 2D descriptors. For the protein, molecules are 3D objects. So, we need to think of molecules in 3D.

Will data scientists produce an artificial medicinal chemist, a molecular designer?

I started to think about the role of the medicinal chemist again.







Extract Chemical Data From PDF, HTML, text etc.

I think medicinal chemists often grapple with many patents, papers, etc.
You know, recently there are many commercially available patent databases. If we can use these databases, we can get the data embedded in patents. But if we don’t have them, we need to extract the data from PDF or XML. This is a time-consuming step for me. ;-(

Yesterday, I read a very cool article in J Chem Inf Model.
The title of the article is “ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature”.
DOI: 10.1021/acs.jcim.6b00207
Fortunately the authors released the toolkit under the MIT license, and it is available for download from the following URL.
ChemDataExtractor can also be installed using the pip command!!😉

What’s ChemDataExtractor?
ChemDataExtractor is a text-mining pipeline. The system reads PDF/HTML/XML and categorizes the content into abstract, full text, captions and tables. Then it processes the document text and the tables.

PDF documents are analyzed with the PDFMiner framework.
This means that ChemDataExtractor can analyze PDF files like patents!

Surprisingly, the system has a table parser! In the article, Fig. 6 shows the scheme of the text parsing system. I think it is useful for extracting biological data from tables.

Now I tested PDF parsing.
In my default environment (Python 3.5), ChemDataExtractor did not work, so I switched to a Python 2.7 environment.
With Python 2.7, the system worked.
First, I got sample data from PatentLens.
Then I read the PDF. My code is as follows.

from pdfminer.pdfdocument import PDFDocument
from chemdataextractor import Document
from chemdataextractor.reader import PdfReader
import cirpy
# Open the PDF in binary mode.
pdffile = open("AU_2004_236144_A1.pdf", "rb")

# This code is different from the example in the official documentation.
# readers=[PdfDocumentReader] did not work.
# cems means chemical entity mentions (chemical named entities).
doc = Document.from_file(pdffile, readers=[PdfReader()])


OK, I got 927 chemical named entities.
Then I extracted some data.


[Span('vinca  alkaloids', 883, 899),
 Span('sertraline', 163, 173),
 Span('thienyl', 148, 155),
 Span('vincristine', 907, 918),
 Span('prednisolone', 202, 214),
 Span('vindesine', 935, 944),

It worked fine.

On the official site, more examples are described.
The interactive online demo is also worth trying.

The system includes OPSIN/CIRpy, so the demo site can convert names to structures too.

I think the article is useful not only for medicinal chemists who want to extract structure data but also for computational chemists doing text mining.

Draw radar chart with R

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative. (from wikipedia)
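To make the geometry in that definition concrete, here is a minimal sketch that computes where the corners of a radar polygon fall when the axes are spread evenly around a shared origin. The function name radar_vertices and the compound profile are made up for illustration; a plotting library would then draw the polygon.

```python
import math


def radar_vertices(values, max_value=1.0):
    """Return (x, y) points of the radar polygon for one profile.

    Axis i points in direction 2*pi*i/n, and a value is drawn at a
    distance value/max_value from the shared origin.
    """
    n = len(values)
    if n < 3:
        raise ValueError("a radar chart needs at least three variables")
    points = []
    for i, value in enumerate(values):
        angle = 2.0 * math.pi * i / n
        radius = value / max_value
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


# Hypothetical compound profile, already scaled to 0-1 on every axis.
profile = {"MW": 0.8, "LogP": 0.5, "HBD": 0.2, "HBA": 0.6}
for name, (x, y) in zip(profile, radar_vertices(list(profile.values()))):
    print("%-5s x=%.2f y=%.2f" % (name, x, y))
```

Because only the distance from the origin carries information, every variable has to be rescaled to a common range before plotting.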

It’s useful for visualizing multiple parameters in drug discovery.
For example, visualizing a compound profile, Lipinski’s rule, an ADMET profile etc….
Yesterday, I found a new R library for making radar charts, named ggradar.
You know, ggradar is a library based on ggplot2. It sounds nice! So I tried ggradar.
I wrote a simple example using the iris dataset.
I used dplyr for data preparation. ( dplyr is a cool library !!!! )

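suppressMessages( library(dplyr) );
library(scales);   # provides rescale()
library(ggradar);
# Rescale each measurement to 0-1 within each species, then take the mean per species.
irisdata <- iris %>% 
            group_by( Species ) %>%
            mutate_each(funs(rescale)) %>%
            summarise( mean(Sepal.Length), mean(Sepal.Width), mean( Petal.Length ), mean( Petal.Width )  )
# The first column (Species) is used as the group label.
ggradar( irisdata )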

Then I got the following image.

ggradar has many options: font color, font size, line size etc.
But for now the legend text size cannot be changed. An issue has been submitted on GitHub.

ggradar is an early-stage project, but it is a useful and attractive library for me.






Predict in vivo parameter from in silico data.

Volume of distribution is calculated from the following equation.
Vd = X / Cp0, where X is the dose and Cp0 is the plasma concentration at time 0 (one-compartment model).
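As a small worked example of this equation, here is a sketch in Python. The dose and Cp0 values are hypothetical numbers chosen only to show the arithmetic and the units.

```python
def volume_of_distribution(dose_mg, cp0_mg_per_l):
    """Vd = X / Cp0 for a one-compartment model, in litres."""
    if cp0_mg_per_l <= 0:
        raise ValueError("Cp0 must be positive")
    return dose_mg / cp0_mg_per_l


# Hypothetical numbers: a 100 mg dose and a Cp0 of 2.5 mg/L.
vd = volume_of_distribution(100.0, 2.5)
print(vd)  # 40.0
```

With the dose in mg and Cp0 in mg/L, Vd comes out in litres.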
This is a theoretical volume, but it is important for PK/PD. If we could predict Vdss, we could predict the dose of a drug.
Recently I read a report written by a scientist at Biogen.
The author used chemist-friendly parameters for the prediction, because prediction models tend to be black boxes.
In the report, the author used LogD at various pH values, molecular charge, and some donor/acceptor parameters. I think these parameters are chemist friendly.😉
They used two methods, PLS and Random Forest. RF showed better performance than PLS.
By the way, I was interested in the validation method: they used the Leave-Class-Out (LCO) method instead of Leave-One-Out (LOO).
The LCO method removes one chemical series from the training set and tests the model on the removed class. This means the test set is not similar to the training set.
And surprisingly, the method works fine!
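The LCO idea can be sketched in a few lines of plain Python. This is my own minimal illustration, not the code from the report; the function name leave_class_out_splits and the series labels are hypothetical.

```python
from collections import defaultdict


def leave_class_out_splits(class_labels):
    """Yield (held_out_class, train_indices, test_indices) for each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(class_labels):
        by_class[label].append(idx)
    for held_out in sorted(by_class):
        # The whole held-out series is the test set; everything else is training data.
        test_idx = by_class[held_out]
        train_idx = [i for i, label in enumerate(class_labels) if label != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical series labels for six compounds.
series = ["A", "A", "B", "B", "B", "C"]
for held_out, train_idx, test_idx in leave_class_out_splits(series):
    print(held_out, train_idx, test_idx)
```

Each chemical series is held out once, so the model is always tested on a series it has never seen. scikit-learn offers the same kind of split as LeaveOneGroupOut, if that library is available.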
I think the key point of the report is that the author used PhysChem parameters for the prediction, not fingerprints. If molecular fingerprints were used to build the model, the LCO method would not work (I think..).
Fortunately, the author provided the dataset in the supporting information. So for readers interested in the paper, reproducing it or trying to build another model is easy.
Predicting in vivo parameters is very attractive for me!