Extract chemical information from patent data #pat-informatics #chemoinformatics

As you know, patent informatics is important for drug discovery projects. SureChEMBL is one of the datasets of chemical structures extracted from patent documents by OCR. It is valuable because it is a freely available data source.

I was surprised to find that Google Patents now provides chemical data too.

It does not seem to cover every structure, but it looks cool. Let's look at an example; the URL is below:
https://patents.google.com/patent/WO2012125893A1/en?oq=WO2012125893A1

The machine-extracted information, including structures, is listed in the ‘Concept’ table on that page. I’m not sure which structures get picked up by the machine, so the table does not contain every structure in the patent.

The page source is HTML, so I tried to extract the data with Python ;).

To parse the HTML, I used BeautifulSoup, which is very useful for this kind of work. The SMILES data is located at ‘concept=>ul=>span(itemprop=smiles)’. Additional information is provided as well, such as the name and the domain, which shows where the data was extracted from.
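Here is a minimal sketch of that parsing step. It runs on a small hand-written HTML fragment that only mimics the ‘concept=>ul=>span’ layout described above, so the tag names other than the itemprop values, the sample compound, and the helper name extract_concepts are all my assumptions; the real page markup may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout described in the post.
# It is NOT copied from the real Google Patents page.
SAMPLE_HTML = """
<section itemprop="concept">
  <ul>
    <li>
      <span itemprop="name">ethanol</span>
      <span itemprop="domain">Chemical compound</span>
      <span itemprop="smiles">CCO</span>
    </li>
    <li>
      <span itemprop="name">inflammation</span>
      <span itemprop="domain">Diseases</span>
    </li>
  </ul>
</section>
"""


def span_text(item, prop):
    """Return the stripped text of the span with the given itemprop, or None."""
    span = item.find("span", attrs={"itemprop": prop})
    return span.get_text(strip=True) if span is not None else None


def extract_concepts(html):
    """Collect name, domain and SMILES for every concept entry that has a SMILES."""
    soup = BeautifulSoup(html, "html.parser")
    concept = soup.find(attrs={"itemprop": "concept"})
    if concept is None:
        return []
    records = []
    for item in concept.find_all("li"):
        smiles = span_text(item, "smiles")
        if smiles is None:
            # Entries without a structure (e.g. disease terms) are skipped.
            continue
        records.append({
            "name": span_text(item, "name"),
            "domain": span_text(item, "domain"),
            "smiles": smiles,
        })
    return records


records = extract_concepts(SAMPLE_HTML)
print(records)
```

For a real patent, the HTML string would come from downloading the page first (for example with the requests library); whether the downloaded source contains the same fields as the rendered page is something to check by hand.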

The following code extracts some of the data and builds a pandas DataFrame.
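As a rough sketch of the DataFrame step, assume the parsed entries are already available as a list of dictionaries with name, domain and SMILES keys. The records below are placeholders I made up for illustration, not data from the patent, and dropping duplicate SMILES is my own choice rather than something the page requires.

```python
import pandas as pd

# Placeholder records for illustration only (not taken from WO2012125893A1).
records = [
    {"name": "ethanol", "domain": "Chemical compound", "smiles": "CCO"},
    {"name": "benzene", "domain": "Chemical compound", "smiles": "c1ccccc1"},
    {"name": "ethanol", "domain": "Chemical compound", "smiles": "CCO"},
]

# Fix the column order explicitly so the table layout is predictable.
df = pd.DataFrame(records, columns=["name", "domain", "smiles"])

# The same structure may be listed more than once; keep the first occurrence.
df = df.drop_duplicates(subset="smiles").reset_index(drop=True)

print(df)
```

From here the SMILES column can be handed to a chemoinformatics toolkit for further work such as drawing or substructure searching.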

If the machine could extract all the data in the Experimental, Description and Claims sections, it would be a powerful tool for pat-informatics.

Google provides many useful services like this for free.