Extract chemical information from patent data #pat-informatics #chemoinformatics

As you know, patent informatics is important for drug discovery projects. SureChEMBL is one of the datasets of chemical structures extracted from patent documents by OCR. It is valuable because it is a freely available data source.

I was surprised to find that Google Patents recently started providing chemical data too.

It does not seem to cover every structure, but it looks cool. Let's look at an example; the URL is below.
https://patents.google.com/patent/WO2012125893A1/en?oq=WO2012125893A1.

The machine-extracted information, including structures, is listed in the ‘Concept’ table on the page. I’m not sure how the machine decides which structures to extract, so the table does not contain every structure in the patent.

The page source is HTML, so I tried to extract the data with Python ;).

To parse the HTML, I used BeautifulSoup, which is very useful for this kind of work. The SMILES data is located at ‘concept=>ul=>span(itemprop=smiles)’. Additional information is provided as well, such as the name and the domain, which shows where the data was extracted from.

The following code extracts some of the data and builds a pandas data frame.
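Here is a minimal sketch of that kind of extraction, not necessarily the exact code I ran. It assumes each concept is an `li` inside a `ul`, with `span` tags carrying `itemprop` values `name`, `domain` and `smiles`. The `SAMPLE_HTML` fragment, its compound entries and the helper names `_text` and `extract_concepts` are hypothetical and only for illustration; the real Google Patents markup may differ and can change.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment modeled on the layout described in the post.
# The real page markup may differ; treat this as an assumption.
SAMPLE_HTML = """
<section itemprop="concept">
  <ul>
    <li>
      <span itemprop="name">ethanol</span>
      <span itemprop="domain">Chemical compound</span>
      <span itemprop="smiles">CCO</span>
    </li>
    <li>
      <span itemprop="name">benzene</span>
      <span itemprop="domain">Chemical compound</span>
      <span itemprop="smiles">c1ccccc1</span>
    </li>
    <li>
      <span itemprop="name">cancer</span>
      <span itemprop="domain">Diseases</span>
    </li>
  </ul>
</section>
"""


def _text(parent, prop):
    """Return the stripped text of the first span with the given itemprop, or None."""
    tag = parent.find("span", attrs={"itemprop": prop})
    return tag.get_text(strip=True) if tag is not None else None


def extract_concepts(html):
    """Collect name, domain and SMILES for every concept entry that has a SMILES span."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for smiles_tag in soup.find_all("span", attrs={"itemprop": "smiles"}):
        # Sibling fields are assumed to live in the same list item as the SMILES span.
        item = smiles_tag.find_parent("li") or smiles_tag.parent
        records.append(
            {
                "name": _text(item, "name"),
                "domain": _text(item, "domain"),
                "smiles": smiles_tag.get_text(strip=True),
            }
        )
    return pd.DataFrame(records, columns=["name", "domain", "smiles"])


df = extract_concepts(SAMPLE_HTML)
print(df)
```

Entries without a SMILES span, like the disease concept in the sample, are skipped. For a live page, the HTML could be fetched first, for example with `requests.get(url).text`, and passed to `extract_concepts`; whether the live markup matches this layout needs to be checked.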


If the machine could extract all the data in the Experimental, Description and Claims sections, it would be a powerful tool for pat-informatics.

Google provides many services which are freely available.

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.

