Trying out CasperJS

CasperJS is a navigation scripting and testing utility for PhantomJS and SlimerJS, written in JavaScript.
As you may know, PhantomJS and SlimerJS are headless browsers.
Some years ago, I used Selenium for web scraping because it has Python bindings and is easy to use.
Today, I tried CasperJS.
Installation is very easy: just use Homebrew (for Mac users) or npm (PhantomJS needs to be installed first). 😉
I wrote a simple script that searches for patents on Google Patents and echoes each result link.
First, create a casper object. Then add each following step with ‘casper.then( function() { /* your function */ } );’.
The fill function is useful for form input: when its third argument is true it also submits the form, so no separate button-click command is needed.
The following code opens Google Patents, searches for patents that mention JAK3, and then echoes the URLs.


var casper = require( 'casper' ).create();

// Runs inside the page: collect the href of every result link.
function getLinks() {
    var links = [];
    var list = document.querySelectorAll( 'article > a' );
    for ( var i = 0; i < list.length; i++ ) {
        links.push( list[i].href );
    }
    return links;
}

casper.start().viewport( 1600, 1000 );

// Open the top page and print its title.
casper.thenOpen( 'https://patents.google.com/', function() {
    this.echo( this.getTitle() );
});

casper.then( function() {
    this.capture( 'top.png' );
});

// Fill in the search form; the third argument (true) submits it.
casper.then( function() {
    this.fill( "form", { q : "JAK3" }, true );
});

// Wait 5 seconds for the results to load, then take a screenshot.
casper.wait( 5000, function() {
    this.capture( 'res.png' );
});

// Collect the result links from the page and print them.
casper.then( function() {
    var links = this.evaluate( getLinks );
    this.echo( links.length + 'patents found' );
    for ( var i = 0; i < links.length; i++ ) {
        this.echo( links[i] );
    }
});

casper.run();
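
As a side note, the loop inside getLinks can also be written with Array.prototype.map. This is only a sketch: collectHrefs is a name I made up, and the plain objects below stand in for anchor elements so the snippet can run outside a browser. querySelectorAll returns a NodeList, which is array-like but not an Array, so map is borrowed with call.

```javascript
// Hypothetical helper: same result as the loop in getLinks, written with map.
// `list` may be a NodeList or any array-like whose items have an href property.
function collectHrefs(list) {
    // A NodeList is not a real Array, so borrow Array.prototype.map.
    return Array.prototype.map.call(list, function (a) {
        return a.href;
    });
}

// Inside the page this would be called as:
//   collectHrefs(document.querySelectorAll('article > a'));

// Outside a browser, plain objects can stand in for anchor elements.
var fakeAnchors = [
    { href: 'https://patents.google.com/patent/US6210654B1/en?q=jak3' },
    { href: 'https://patents.google.com/patent/US6136595A/en?q=jak3' }
];
var hrefs = collectHrefs(fakeAnchors);
console.log(hrefs.length);
```

This does not change the script; it is just another way to write the same loop.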

To run the scraping script, just type casperjs yourscript.js.

 iwatobipen$ casperjs googlepat.js 
Google Patents
10patents found
https://patents.google.com/patent/US6210654B1/en?q=jak3
https://patents.google.com/patent/US6136595A/en?q=jak3
https://patents.google.com/patent/US5741899A/en?q=jak3
https://patents.google.com/patent/US7598257B2/en?q=jak3
https://patents.google.com/patent/US7335667B2/en?q=jak3
https://patents.google.com/patent/US7491732B2/en?q=jak3
https://patents.google.com/patent/US6080747A/en?q=jak3
https://patents.google.com/patent/US20060030018A1/en?q=jak3
https://patents.google.com/patent/US5916792A/en?q=jak3
https://patents.google.com/patent/US6160010A/en?q=jak3
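
Each URL above carries the publication number in its path, so it can be pulled out with a small regular expression. This is a minimal sketch: patentNumber is a hypothetical helper, and the URL shape (/patent/<number>/...) is only what I see in the output above, not something I checked in any documentation.

```javascript
// Hypothetical helper: extract the publication number from a URL shaped like
// https://patents.google.com/patent/<NUMBER>/en?q=...
// Returns null when the URL does not match that (assumed) shape.
function patentNumber(url) {
    var m = /\/patent\/([^\/?#]+)/.exec(url);
    return m ? m[1] : null;
}

// Two of the URLs echoed by the script.
var urls = [
    'https://patents.google.com/patent/US6210654B1/en?q=jak3',
    'https://patents.google.com/patent/US20060030018A1/en?q=jak3'
];
var numbers = urls.map(patentNumber);
console.log(numbers.join('\n'));
```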

It works fine, and I got the following screenshots.
CasperJS has more functions for scraping. I’ll read the API documentation as soon as possible.

[Screenshot: res.png]

[Screenshot: top.png]
