Extract Chemical Data From PDF, HTML, Text, etc.

I think medicinal chemists often grapple with a large number of patents, journal articles, and so on.
Recently, many commercial patent databases have become available. If we can use these databases, we can get the data that is embedded in patents. But if we don’t have access to them, we need to extract the data from PDF or XML ourselves. This is a time-consuming step for me. ;-(

Yesterday, I read a very cool article in J Chem Inf Model.
The title of the article is “ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature”.
DOI: 10.1021/acs.jcim.6b00207
http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00207
Fortunately, the authors released the toolkit under the MIT license, and it is available to download from the following URL.
http://chemdataextractor.org
ChemDataExtractor can also be installed using the pip command!! 😉

What’s ChemDataExtractor?
ChemDataExtractor is a text-mining pipeline. The system reads PDF/HTML/XML files and categorizes their content into abstract, full text, captions, and tables. It then processes the document text and the tables.

PDF documents are analyzed with the PDFMiner framework.
This means that ChemDataExtractor can analyze PDF files such as patents!

Surprisingly, the system has a table parser! In the article, Fig. 6 shows the scheme of the text parsing system. I think the table parser is useful for extracting biological data from tables.

Next, I tested PDF parsing.
In my default environment (Python 3.5), ChemDataExtractor did not work, so I switched to a Python 2.7 environment.
With Python 2.7, the system worked.
First, I got sample data from PatentLens.
https://www.lens.org/lens/
Then I read the PDF. My code is as follows.

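from pdfminer.pdfdocument import PDFDocument  # not used directly below
from chemdataextractor import Document
from chemdataextractor.reader import PdfReader
import cirpy  # not used directly below

# This code differs from the example in the official documentation:
# readers=[PdfDocumentReader] did not work, so PdfReader() is used instead.
# The PDF is opened in binary mode so its bytes reach the parser unchanged.
with open("AU_2004_236144_A1.pdf", "rb") as pdffile:
    doc = Document.from_file(pdffile, readers=[PdfReader()])

# cems holds the chemical named entities found in the document.
len(doc.cems)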

Out[37]:927

OK, I got 927 chemical named entities.
Then I extracted some data.

doc.cems[600:620]

Out[54]:
[Span('vinca  alkaloids', 883, 899),
 Span('sertraline', 163, 173),
 Span('thienyl', 148, 155),
 Span('vincristine', 907, 918),
 Span('prednisolone', 202, 214),
 Span('vindesine', 935, 944),
..........

It worked fine.
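
Because a long patent repeats the same names many times, it can help to tally the mentions before reading through them. The following is a minimal sketch, assuming each item in doc.cems exposes its string as a `.text` attribute, as the Span output above suggests. The `Mention` namedtuple and the `count_mentions` helper are hypothetical names introduced here so that the example runs without ChemDataExtractor installed, and the last sample row is an invented repeat for illustration.

```python
from __future__ import print_function

from collections import Counter, namedtuple

# Hypothetical stand-in for ChemDataExtractor's Span (text, start, end),
# used only so this sketch runs without the library installed.
Mention = namedtuple("Mention", ["text", "start", "end"])


def count_mentions(mentions):
    """Tally mentions by text, ignoring case and repeated whitespace."""
    normalized = (" ".join(m.text.split()).lower() for m in mentions)
    return Counter(normalized)


# Values copied from the output above, plus one invented repeat.
sample = [
    Mention("vinca  alkaloids", 883, 899),
    Mention("sertraline", 163, 173),
    Mention("thienyl", 148, 155),
    Mention("vincristine", 907, 918),
    Mention("Vincristine", 10, 21),  # invented repeat, for illustration only
]

counts = count_mentions(sample)
for name, n in counts.most_common(3):
    print(name, n)
```

With the real library, `count_mentions(doc.cems)` should give the same kind of tally, provided the `.text` assumption holds.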

The official site describes more examples.
The interactive online demo is also worth trying.
http://chemdataextractor.org/demo

The system includes OPSIN and CIRpy, so the demo site can convert names to structures too.
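
For the name-to-structure step, here is a minimal sketch using CIRpy, which queries the NCI Chemical Identifier Resolver web service and therefore needs network access. `cirpy.resolve(name, "smiles")` is CIRpy's lookup call. The `names_to_smiles` helper and its `resolver` parameter are hypothetical names introduced here so that the lookup can be swapped for an offline function. Whether the demo site does the conversion this way internally is an assumption.

```python
from __future__ import print_function


def names_to_smiles(names, resolver=None):
    """Map each unique name to a SMILES string, or None when unresolved."""
    if resolver is None:
        # Imported lazily so the helper can be tried without CIRpy or a network.
        import cirpy

        def resolver(name):
            return cirpy.resolve(name, "smiles")

    results = {}
    for name in names:
        if name not in results:  # look each name up only once
            results[name] = resolver(name)
    return results


# Offline stand-in resolver with a tiny lookup table, for illustration only.
known = {"ethanol": "CCO", "benzene": "c1ccccc1"}
offline = names_to_smiles(
    ["ethanol", "benzene", "ethanol", "thienyl"], resolver=known.get
)
print(offline)
```

With the real resolver, something like `names_to_smiles(c.text for c in doc.cems)` would send one request per unique name. Fragment names such as thienyl may not resolve to a full structure.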

I think the article is useful not only for medicinal chemists who want to extract structure data but also for computational chemists interested in text mining.
