Trying out CasperJS

CasperJS is a navigation scripting and testing utility for PhantomJS and SlimerJS, written in JavaScript.
As you may know, PhantomJS and SlimerJS are headless browsers.
Some years ago, I used Selenium for web scraping because it has Python bindings and is easy to use.
Today, I tried CasperJS as a test.
Installation is very easy. Just use Homebrew (for Mac users) or npm (PhantomJS needs to be installed first). 😉
I wrote a simple script that searches for patents in Google Patents and echoes each result link.
First, create a casper object. Then queue each subsequent action with ‘casper.then( function() { /* your function */ } );’.
The fill function is useful for form input: when its third argument is true it also submits the form, so no separate button-click command is needed.
The following code opens Google Patents and searches for patents that mention JAK3.
Then it echoes the URLs.


var casper = require( 'casper' ).create();

// Runs inside the page (via casper.evaluate) and collects the href of each result link.
function getLinks() {
        var links = [];
        var list = document.querySelectorAll( 'article > a' );

        for ( var i = 0; i < list.length; i++ ){
            links.push( list[i].href );
        }
        return links;
}

casper.start().viewport( 1600,1000 );

// Open the top page, print its title, and take a screenshot.
casper.thenOpen( 'https://patents.google.com/',
                 function(){
                   this.echo( this.getTitle() );
                 });
casper.then(
                 function(){ this.capture('top.png'); }
);

// Fill the search form; the third argument (true) submits it.
casper.then( function(){
             this.fill("form", { q : "JAK3" }, true);
});
// Give the results 5 seconds to load, then take a screenshot.
casper.wait( 5000,
                 function(){ this.capture('res.png'); }
);

casper.then(
                 function(){
                        // Declare locals with var so they do not leak as globals.
                        var links = this.evaluate( getLinks );
                        this.echo( links.length + 'patents found' );
                        for ( var i = 0; i < links.length; i++ ){
                                    this.echo( links[i]  );
                        }
});

casper.run();
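
The fixed 5 second wait above may be too short or too long depending on the network. As a rough sketch, CasperJS's waitForSelector could be used instead so that the script continues as soon as a result link appears. The helper name waitForResults and the 10000 ms timeout are my own assumptions, and the selector 'article > a' is simply copied from getLinks, so it may not match the live page.

```javascript
// Hypothetical helper (not part of CasperJS): queue a step that waits until
// `selector` appears in the page, then takes the result screenshot.
// `casper` is expected to be the object returned by require('casper').create().
function waitForResults(casper, selector, timeoutMs) {
    casper.waitForSelector(
        selector,
        // Called when the selector is found; `this` is the casper object.
        function onFound() { this.capture('res.png'); },
        // Called when the timeout expires without a match.
        function onTimeout() { this.echo('No results within ' + timeoutMs + ' ms'); },
        timeoutMs
    );
    return casper;
}

// Possible usage in place of casper.wait( 5000, ... ):
// waitForResults( casper, 'article > a', 10000 );
```

I have not checked this against the current Google Patents markup, so treat it as a starting point only.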

To run the code, just type casperjs yourscript.js.

 iwatobipen$ casperjs googlepat.js 
Google Patents
10patents found
https://patents.google.com/patent/US6210654B1/en?q=jak3
https://patents.google.com/patent/US6136595A/en?q=jak3
https://patents.google.com/patent/US5741899A/en?q=jak3
https://patents.google.com/patent/US7598257B2/en?q=jak3
https://patents.google.com/patent/US7335667B2/en?q=jak3
https://patents.google.com/patent/US7491732B2/en?q=jak3
https://patents.google.com/patent/US6080747A/en?q=jak3
https://patents.google.com/patent/US20060030018A1/en?q=jak3
https://patents.google.com/patent/US5916792A/en?q=jak3
https://patents.google.com/patent/US6160010A/en?q=jak3
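
Each echoed URL contains the publication number, so it could be pulled out with a small helper if only the numbers are needed. This is just a sketch: getPatentNumber is a name I made up, and it assumes the URLs keep the /patent/<number>/ shape seen in the output above.

```javascript
// Hypothetical helper: extract the publication number from a Google Patents URL
// of the form https://patents.google.com/patent/<number>/en?q=...
function getPatentNumber(url) {
    var match = /\/patent\/([^\/?#]+)/.exec(url);
    return match ? match[1] : null; // null when the URL has no /patent/ segment
}

var urls = [
    'https://patents.google.com/patent/US6210654B1/en?q=jak3',
    'https://patents.google.com/patent/US20060030018A1/en?q=jak3'
];
console.log(urls.map(getPatentNumber).join('\n'));
// prints:
// US6210654B1
// US20060030018A1
```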

It works fine, and I got the following screenshots.
CasperJS has more functions for scraping. I’ll read the API documentation as soon as possible.

[Screenshot: res.png]

[Screenshot: top.png]
