Retrieving information from WIPO's RSS feed

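PATENTSCOPE can output the results of a query as an RSS feed,
which looks useful for all sorts of analysis.
I'm still short on practice, so I leaned on a book I've been reading lately
and wrote a script that prints the title, number, link, and summary of each entry.
I kept running into encoding errors, so I encoded everything to utf_8, though I'm not sure that is really the right fix.
For now I used PCT & PDE4 @ FP as the query
and saved
http://patentscope.wipo.int/search/rss.jsf?query=FP%3APDE4+&office=+%28OF%3Awo%29&rss=true&sortOption=Pub+Date+Desc
to pde4.txt.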
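
To see what that URL actually carries, here is a small sketch (Python 3, standard library only) that decodes the query string and then rebuilds it. It only reads the URL saved above; I have not checked WIPO's documentation, so any meaning you attach to parameters such as office or sortOption is a guess on my part.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# The feed URL saved in pde4.txt.
FEED_URL = (
    "http://patentscope.wipo.int/search/rss.jsf"
    "?query=FP%3APDE4+&office=+%28OF%3Awo%29&rss=true"
    "&sortOption=Pub+Date+Desc"
)

# Decode the percent-encoded query string into (name, value) pairs.
parts = urlsplit(FEED_URL)
params = parse_qsl(parts.query)
for name, value in params:
    print(f"{name} = {value!r}")

# Re-encode the same pairs. Changing the "query" value before this step
# should give a feed URL for a different search, assuming the remaining
# parameters can stay as they are.
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(params)}"
```

Building the URL with urlencode avoids escaping the colon, spaces, and parentheses by hand.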

Then I wrote patentparser.py:

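#! /usr/bin/python
# -*- coding: utf_8 -*-
# Written for Python 2 (print statements, explicit encode calls).

import re
import sys

import feedparser

# Read one feed URL per line, dropping trailing newlines and blank lines.
with open(sys.argv[1], 'r') as feedlistfile:
    urls = [line.strip() for line in feedlistfile if line.strip()]

feeds = [feedparser.parse(url) for url in urls]

for feed in feeds:
    for entry in feed.entries:
        link = entry["link"]
        # The publication number (e.g. WO2012149251) is embedded in the link.
        num = re.search(r"WO\d+", link)
        if num is None:
            continue  # no WO number in this link, so skip the entry
        patent_no = num.group(0)
        title = entry["title"]
        summary = entry["summary"]
        published = entry["published"]
        # feedparser returns unicode, so encode to UTF-8 before printing.
        print "PATENT_NO\t" + patent_no.encode('utf_8')
        print "LINK\t" + link.encode('utf_8')
        print "PUBLISHED\t" + published.encode('utf_8')
        print "TITLE\t" + title.encode('utf_8')
        print "SUMMARY\t" + summary.encode('utf_8')
        print "=" * 50
        print " "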
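
On the encoding worry: as far as I understand it, Python 2 falls back to ASCII when unicode is printed to redirected output, which would explain the errors, and encoding explicitly to UTF-8 is the usual workaround. The sketch below shows the same idea in Python 3, where the encoding can be set once on the stream instead of on every string. The write_record helper and the sample values are made up for illustration.

```python
import io

def write_record(stream, fields):
    """Write LABEL<TAB>value lines followed by the separator line."""
    for label, value in fields:
        stream.write(f"{label}\t{value}\n")
    stream.write("=" * 50 + "\n")

# Stand-in for a redirected stdout: a binary stream wrapped so that every
# str written to it is encoded as UTF-8 in exactly one place.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# Made-up sample values; the second one contains a non-ASCII character.
write_record(out, [("PATENT_NO", "WO0000000000"), ("TITLE", "β-2 AGONIST")])
out.flush()
encoded = raw.getvalue()
```

For real standard output, sys.stdout.reconfigure(encoding="utf-8") (Python 3.7 and later) should have the same effect.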

It works, for the time being.
Redirecting the output to a file:

$ python patentparser.py pde4.txt > out.txt
$ cat out.txt
PATENT_NO	WO2012149251
LINK	http://patentscope.wipo.int/search/en/detail.jsf?docId=WO2012149251&recNum=1&docAn=US2012035359&queryString=FP:PDE4 &maxRec=298
PUBLISHED	Fri, 02 Nov 2012 00:00:00 CET
TITLE	METHODS AND COMPOSITIONS USING PDE4 INHIBITORS FOR THE TREATMENT AND MANAGEMENT OF AUTOIMMUNE AND INFLAMMATORY DISEASES
SUMMARY	Methods of treating, preventing, or managing autoimmune inflammatory diseases and disorders including but not limited to spondylitis, juvenile rheumatoid arthritis, psoriasis, psoriatic arthritis, osteoarthritis, ankylosing spondylitis, and rheumatoid arthritis by the administration of phosphodiesterase 4 (PDE4) inhibitors in combination with other therapeutics are disclosed. Pharmaceutical compositions, dosage forms, and kits suitable for use in methods of the invention are also disclosed.
==================================================
 
PATENT_NO	WO2012110946
LINK	http://patentscope.wipo.int/search/en/detail.jsf?docId=WO2012110946&recNum=2&docAn=IB2012050657&queryString=FP:PDE4 &maxRec=298
PUBLISHED	Fri, 24 Aug 2012 00:00:00 CEST
TITLE	PHARMACEUTICAL COMPOSITION COMPRISING THE PDE4 ENZYME INHIBITOR REVAMILAST AND A DISEASE MODIFYING AGENT, PREFERABLY METHOTREXATE
SUMMARY	The present patent application relates to a pharmaceutical composition comprising a PDE4 enzyme inhibitor and a disease modifying agent; a process for preparing such composition; and its use in treating an autoimmune disease in a subject.
==================================================
 
PATENT_NO	WO2012098495
LINK	http://patentscope.wipo.int/search/en/detail.jsf?docId=WO2012098495&recNum=3&docAn=IB2012050215&queryString=FP:PDE4 &maxRec=298
PUBLISHED	Fri, 27 Jul 2012 00:00:00 CEST
TITLE	PHARMACEUTICAL COMPOSITION THAT INCLUDES REVAMILAST AND A BETA-2 AGONIST
SUMMARY	The present patent application relates to a pharmaceutical composition that includes a PDE4 enzyme inhibitor, namely revamilast, and a beta- 2 adrenergic receptor agonist; a process for preparing such a composition; and its use in treating a respiratory disorder in a subject.
==================================================
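
Output in this shape is regular enough to read back in. As one possible way to tidy it up, the sketch below (Python 3) parses the LABEL/TAB/value records into dictionaries and renders them as CSV. The function names and the choice of CSV are my own assumptions, and the sample is a shortened copy of the first two records.

```python
import csv
import io

SEPARATOR = "=" * 50

def parse_records(text):
    """Turn LABEL<TAB>value lines into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith(SEPARATOR):
            if current:
                records.append(current)
            current = {}
        elif "\t" in line:
            label, value = line.split("\t", 1)
            current[label] = value
    if current:  # keep a final record that has no closing separator
        records.append(current)
    return records

def to_csv(records, columns):
    """Render the chosen columns as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Shortened copy of two records from out.txt.
sample = (
    "PATENT_NO\tWO2012149251\n"
    "PUBLISHED\tFri, 02 Nov 2012 00:00:00 CET\n"
    + SEPARATOR + "\n"
    " \n"
    "PATENT_NO\tWO2012110946\n"
    "PUBLISHED\tFri, 24 Aug 2012 00:00:00 CEST\n"
    + SEPARATOR + "\n"
)
records = parse_records(sample)
csv_text = to_csv(records, ["PATENT_NO", "PUBLISHED"])
print(csv_text)
```

The csv module quotes the PUBLISHED values automatically because they contain commas.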

So next I'd like to tidy up the output format a bit more.
I also need to add a proxy handler.
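
For the proxy part, one approach that avoids depending on feedparser's own networking is to download the feed through a urllib opener and pass the bytes on for parsing. This is only a sketch in Python 3: the proxy address is a placeholder and the helper names are hypothetical.

```python
import urllib.request

# Hypothetical proxy address; replace it with the real one for your network.
PROXY_URL = "http://proxy.example.com:8080"

def make_proxy_opener(proxy_url):
    """Build a urllib opener that sends http/https requests via proxy_url."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)

def fetch_feed(url, opener, timeout=30):
    """Download the raw feed bytes through the given opener."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()

opener = make_proxy_opener(PROXY_URL)
# data = fetch_feed(feed_url, opener)   # needs network access and a real proxy
# feed = feedparser.parse(data)         # parse() also accepts raw feed content
```

Some feedparser releases also document a handlers argument to parse() that takes urllib handlers directly, but I have not checked which versions still support it.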
