I think medicinal chemists often grapple with large numbers of patents, papers, and so on.
Several commercial patent databases have become available recently. If we can use one of them, we can get the data that is embedded in patents. Without one, though, we have to extract the data from PDF or XML ourselves, which is a time-consuming step for me. ;-(
Yesterday, I read a very cool article in J Chem Inf Model.
The title of the article is “ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature”.
Fortunately, the authors released the toolkit under the MIT license, and it is available to download from the following URL.
ChemDataExtractor can also be installed with the pip command!! 😉
What’s ChemDataExtractor?
ChemDataExtractor is a text-mining pipeline. The system reads PDF, HTML, and XML documents, splits them into elements such as the abstract, full text, captions, and tables, and then processes the text and the tables.
PDF documents are analyzed with the PDFMiner framework.
This means that ChemDataExtractor can analyze PDF files such as patents!
Surprisingly, the system also has a table parser! In the article, Fig. 6 shows the scheme of the text parsing system. I think it will be useful for extracting biological data from tables.
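To get a feel for the categorization step described above, here is a minimal sketch that counts how many elements of each kind a parsed document contains. The helper name `count_element_types` is my own, not part of ChemDataExtractor, and the three small classes are stand-ins so that the snippet runs without the library installed.

```python
from collections import Counter


def count_element_types(elements):
    """Count document elements by class name (e.g. Heading, Paragraph, Table)."""
    return Counter(type(el).__name__ for el in elements)


# Stand-in classes (hypothetical) so the sketch runs without ChemDataExtractor.
class Heading(object):
    pass


class Paragraph(object):
    pass


class Table(object):
    pass


sample_elements = [Heading(), Paragraph(), Paragraph(), Table()]
element_counts = count_element_types(sample_elements)
print(sorted(element_counts.items()))
# [('Heading', 1), ('Paragraph', 2), ('Table', 1)]
```

With the real library, I would expect `count_element_types(doc.elements)` to give the same kind of summary for a parsed `Document`, which is a quick way to see whether any tables were found in a PDF at all.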
Next, I tested PDF parsing.
In my default environment (Python 3.5), ChemDataExtractor did not work, so I switched to a Python 2.7 environment.
With Python 2.7, the system worked.
First, I got sample data from Patent Lens.
Then I read the PDF. My code is as follows.
from pdfminer.pdfdocument import PDFDocument
from chemdataextractor import Document
from chemdataextractor.reader import PdfReader
import cirpy

# Open the PDF in binary mode so the reader receives raw bytes.
pdffile = open("AU_2004_236144_A1.pdf", "rb")
# This code is different from the official documentation example:
# readers=[PdfDocumentReader] did not work.
# cems means chemical entity mentions (chemical named entities).
doc = Document.from_file(pdffile, readers=[PdfReader()])
len(doc.cems)
Out: 927
OK, I got 927 chemical named entities.
Then I extracted some data.
doc.cems[600:620]
Out:
[Span('vinca alkaloids', 883, 899),
 Span('sertraline', 163, 173),
 Span('thienyl', 148, 155),
 Span('vincristine', 907, 918),
 Span('prednisolone', 202, 214),
 Span('vindesine', 935, 944),
 ..........
It worked fine.
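A list of 927 spans contains many repeats, so a natural next step is to collapse it into unique names with counts. The following is only a sketch: `count_cem_names` is a hypothetical helper of mine, and the `Span` namedtuple is a stand-in that assumes the real spans expose a `.text` attribute, as the output above suggests.

```python
from collections import Counter, namedtuple

# Stand-in for ChemDataExtractor's Span (assumed to expose .text, .start, .end).
Span = namedtuple("Span", ["text", "start", "end"])


def count_cem_names(cems):
    """Count chemical names case-insensitively, ignoring empty strings."""
    names = (span.text.strip().lower() for span in cems)
    return Counter(name for name in names if name)


sample_cems = [
    Span("sertraline", 163, 173),
    Span("Sertraline", 40, 50),
    Span("vincristine", 907, 918),
    Span("thienyl", 148, 155),
]
name_counts = count_cem_names(sample_cems)
print(name_counts.most_common(1))
# [('sertraline', 2)]
```

On the real data, this would be called as `count_cem_names(doc.cems)`. Lowercasing is a crude form of normalization; fragments such as 'thienyl' are still counted as names, so some manual filtering is likely needed.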
More examples are described on the official site.
The interactive online demo is also worth trying.
The system includes OPSIN and CIRpy, so the demo site can convert names to structures too.
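The same name-to-structure conversion can be tried locally with CIRpy. Below is a minimal sketch, assuming `cirpy.resolve(name, "smiles")` is used as the lookup (it queries the NCI Chemical Identifier Resolver over the network and returns `None` when a name cannot be resolved). The function `names_to_smiles` and the `fake_lookup` dictionary are my own illustrations, and the demo uses the fake lookup so that it runs offline.

```python
def names_to_smiles(names, resolve=None):
    """Map each chemical name to a SMILES string, or None if unresolved.

    `resolve` is any callable taking a name and returning a SMILES string
    or None. If it is omitted, CIRpy is used, which needs network access.
    """
    if resolve is None:
        import cirpy  # imported lazily so the offline demo below needs no install

        def resolve(name):
            return cirpy.resolve(name, "smiles")

    return {name: resolve(name) for name in names}


# Offline demo with a hypothetical lookup table instead of the web service.
fake_lookup = {"ethanol": "CCO"}
demo = names_to_smiles(["ethanol", "not-a-compound"], resolve=fake_lookup.get)
print(sorted(demo.items()))
# [('ethanol', 'CCO'), ('not-a-compound', None)]
```

Passing the resolver in as an argument keeps the slow network call separate from the bookkeeping, which is handy when hundreds of extracted names need to be converted and cached.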
I think the article is useful not only for medicinal chemists who want to extract structure data but also for computational chemists interested in text mining.