Download images from the web.

Web crawling is useful for collecting a broad range of information.
There are lots of documents about web crawling with Python.
Today I used Scrapy for crawling.
Scrapy is an open source framework for web crawling.
The library can be installed with pip (pip install scrapy).
After installing Scrapy, I ran the following command.

iwatobipen$ scrapy startproject imagedl

Then an imagedl folder was created, containing the following files.

imagedl/
├── imagedl
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

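Next I edited the settings.py file. This file defines the base settings of the crawler.
The ITEM_PIPELINES setting is needed to enable image handling, and IMAGES_STORE is the path of the download folder.
Note that the images pipeline also needs the Pillow library to be installed.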

BOT_NAME = 'imagedl'
SPIDER_MODULES = ['imagedl.spiders']
NEWSPIDER_MODULE = 'imagedl.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline':1}
IMAGES_STORE = './stock'
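
As a rough illustration of where files end up under IMAGES_STORE: as far as I know, the default ImagesPipeline names each file after the SHA-1 hash of its URL and saves it in a full/ subfolder. The sketch below reproduces that naming with the standard library only. The function name expected_image_path and the example URL are made up for illustration, and the real pipeline may normalize the URL or behave differently between Scrapy versions.

```python
import hashlib
import os


def expected_image_path(url, images_store="./stock"):
    """Guess where the default ImagesPipeline would save the image at `url`.

    Assumption: files are stored as full/<sha1 of the URL>.jpg
    under the IMAGES_STORE folder.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(images_store, "full", digest + ".jpg")


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(expected_image_path("http://example.com/a.png"))
```

This can be handy for checking whether a given image URL has already been downloaded into the stock folder.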

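Next, items.py is defined.
The image_urls and images fields are the standard way to handle images: the pipeline reads the URLs from image_urls and writes the download results to images.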

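import scrapy

class ImagedlItem(scrapy.Item):
    # URLs of the images to download (filled in by the spider)
    image_urls = scrapy.Field()
    # Download results (filled in by the images pipeline)
    images = scrapy.Field()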

Finally, define spiders/imageget.py.
name is the name of this crawler, which is used when running it.
start_urls lists the pages to crawl.
The original version of this code used HtmlXPathSelector to select the image URLs in the web page; that class is deprecated, so the code below uses response.xpath instead, which does the same job.

import scrapy
from imagedl.items import ImagedlItem

class ImSpider(scrapy.Spider):
    name = 'hoge'
    allowed_domains = ['test.org']
    start_urls = [ "https://www.google.co.jp/search?q=%E6%9D%B1%E4%BA%AC%E3%82%B0%E3%83%BC%E3%83%AB&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjNw6yLzYPKAhXHHaYKHR7_Bv0Q_AUICCgC&biw=867&bih=602" ]

    def parse(self, response):
        item = ImagedlItem()
        # Take the src attribute of every <img> tag in the page.
        image_urls = response.xpath('//img/@src').getall()
        # Make each URL absolute, because the images pipeline needs full URLs.
        item['image_urls'] = [response.urljoin(x) for x in image_urls]
        return item
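
To see what the //img/@src selection collects without running a crawl, here is a small stand-in that uses only the Python standard library (html.parser instead of Scrapy's XPath selectors, so it is not the spider's actual code path). The names ImgSrcCollector and collect_image_urls, the sample HTML, and the base URL are all made up for illustration. It also resolves relative src values against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags that have no src attribute
            self.image_urls.append(urljoin(self.base_url, src))


def collect_image_urls(html, base_url):
    """Return the image URLs found in `html`, in document order."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    return parser.image_urls


if __name__ == "__main__":
    # Hypothetical page content and URL, for illustration only.
    sample = '<p><img src="/a.png"><img src="http://example.com/b.jpg"><img alt="x"></p>'
    print(collect_image_urls(sample, "http://example.org/page"))
```

The returned list plays the same role as item['image_urls'] in the spider.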

All the settings are done.
Now run the crawler.

iwatobipen$ scrapy crawl hoge

After running the command, the stock folder appeared and the images were stored in it.
If you are interested in what kind of images they are, check the URL.
They are images from a Japanese anime.
