Extract Chemical Data From PDF, HTML, text etc.

I think medicinal chemists often grapple with large numbers of patents, papers, and so on.
Many commercial patent databases are available these days, and with access to one we can get the data embedded in patents directly. Without one, we need to extract the data from PDF or XML ourselves. This is a time-consuming step for me. ;-(

Yesterday, I read a very cool article in J Chem Inf Model.
The title of the article is “ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature”.
DOI: 10.1021/acs.jcim.6b00207
Fortunately, the authors released the toolkit under the MIT license, and it is available to download from the following URL.
ChemDataExtractor can also be installed with the pip command!! 😉

What’s ChemDataExtractor?
ChemDataExtractor is a text-mining pipeline. The system reads PDF/HTML/XML and categorizes the content into abstract, full text, captions, and tables. It then processes the document text and the tables.
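To see how a document gets split into categories, a rough sketch like the one below could count the parsed elements by type. The helper names count_element_types and summarize_file are my own, not part of the toolkit, and summarize_file assumes chemdataextractor is installed and that the parsed document exposes its pieces as doc.elements.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements by class name (e.g. Heading, Paragraph, Table)."""
    return Counter(type(element).__name__ for element in elements)


def summarize_file(path):
    """Parse a file with ChemDataExtractor and count its element types.

    Assumption: chemdataextractor is installed. It is imported lazily so
    that count_element_types can be used on its own.
    """
    from chemdataextractor import Document

    # Open in binary mode so the reader receives raw bytes.
    with open(path, "rb") as f:
        doc = Document.from_file(f)
    return count_element_types(doc.elements)
```

Calling summarize_file on a paper or patent should give a quick feel for how many paragraphs, headings and tables the reader found before looking at the extracted chemistry.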

PDF documents are analyzed with the PDFMiner framework.
This means that ChemDataExtractor can analyze PDF files such as patents!

Surprisingly, the system has a table parser! In the article, Fig. 6 shows the scheme of the text parsing system. I think it is useful for extracting biological data from tables.

Next, I tested PDF parsing.
In my default environment (Python 3.5), chemdataextractor did not work, so I switched to a Python 2.7 environment.
With Python 2.7, the system worked.
First, I got sample data from patentlens.
Then I read the PDF. My code is as follows.

from chemdataextractor import Document
from chemdataextractor.reader import PdfReader

# This code differs from the example in the official documentation:
# readers=[PdfDocumentReader] did not work.
# Open the PDF in binary mode so the reader receives raw bytes.
with open("AU_2004_236144_A1.pdf", "rb") as pdffile:
    doc = Document.from_file(pdffile, readers=[PdfReader()])

# cems means chemical entity mentions, i.e. chemical named entities.
cems = doc.cems
print(len(cems))


OK, I got 927 chemical named entities.
Then I extracted some data.


[Span('vinca  alkaloids', 883, 899),
 Span('sertraline', 163, 173),
 Span('thienyl', 148, 155),
 Span('vincristine', 907, 918),
 Span('prednisolone', 202, 214),
 Span('vindesine', 935, 944),

It worked fine.
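The raw list contains spacing quirks such as 'vinca  alkaloids' with two spaces, so some tidying is useful before counting names. The following is only a sketch: the Span namedtuple is a stand-in I made up to mimic the text/start/end fields shown above, and the second 'Sertraline' entry is invented purely to illustrate merging duplicates.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the Span objects printed above (text, start, end).
Span = namedtuple("Span", ["text", "start", "end"])


def normalize(name):
    """Collapse runs of whitespace and lowercase the name."""
    return " ".join(name.split()).lower()


def count_names(spans):
    """Return a Counter of normalized entity names."""
    return Counter(normalize(span.text) for span in spans)


spans = [
    Span("vinca  alkaloids", 883, 899),
    Span("sertraline", 163, 173),
    Span("Sertraline", 40, 50),  # invented duplicate, for illustration only
    Span("vincristine", 907, 918),
]
counts = count_names(spans)
for name, n in counts.most_common():
    print("%s\t%d" % (name, n))
```

With real output, the same count_names call should work on doc.cems directly, assuming each entity exposes a text attribute as the printed Spans suggest.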

More examples are described on the official site.
The interactive online demo is also worth trying.

The system has OPSIN/cirpy, so the demo site can convert names to structures too.
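Name-to-structure conversion can also be tried locally. Below is a minimal sketch, assuming cirpy is installed and network access is available; names_to_smiles and its resolve parameter are my own helper names, not part of ChemDataExtractor. The resolver is passed in as an argument so the function can be tried offline with a fake lookup.

```python
def names_to_smiles(names, resolve=None):
    """Map each chemical name to a SMILES string, or None if unresolved.

    resolve: callable taking (name, representation). If omitted,
    cirpy.resolve is used, which queries a web service.
    """
    if resolve is None:
        import cirpy  # imported lazily; only needed for the default resolver
        resolve = cirpy.resolve

    result = {}
    for name in names:
        try:
            result[name] = resolve(name, "smiles")
        except IOError:
            # Network failure: record the name as unresolved and move on.
            result[name] = None
    return result
```

For example, names_to_smiles(["sertraline", "vincristine"]) would look up the names through cirpy, and names that cannot be resolved come back as None.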

I think the article is useful not only for medicinal chemists who want to extract structure data but also for computational chemists interested in text mining.

