Download images from the web.

Web crawling is useful for gathering a broad range of information.
There are lots of documents about web crawling with Python.
Today I used Scrapy for crawling.
Scrapy is an open source framework for web crawling.
The library can be installed with pip install scrapy.
After installing Scrapy, I ran the following command.

iwatobipen$ scrapy startproject imagedl

Then the imagedl folder was created, with some files in it.

imagedl/
├── imagedl
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Next I edited the settings.py file. The file defines the base settings of the crawler.
The ITEM_PIPELINES setting is needed for image handling, and IMAGES_STORE is the path of the download folder.

BOT_NAME = 'imagedl'
SPIDER_MODULES = ['imagedl.spiders']
NEWSPIDER_MODULE = 'imagedl.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline':1}
IMAGES_STORE = './stock'
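
The two image settings above are the minimum. As a rough sketch, a few optional settings that ImagesPipeline also reads could be added to the same settings.py. The setting names come from Scrapy's image pipeline, but every value below is only an illustrative assumption, not something I used in this post. Note that ImagesPipeline depends on the Pillow library, so it has to be installed too.

```python
# Optional additions to settings.py (all values are illustrative assumptions).

# Skip images smaller than this size, in pixels (e.g. icons and spacers).
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# Do not re-download an image that was fetched within this many days.
IMAGES_EXPIRES = 90

# Also generate thumbnails; the keys are names chosen freely here.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```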

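Next, items.py is defined.
The image_urls and images fields are the standard way to handle images: the pipeline reads the URLs to download from image_urls and writes the download results to images.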

import scrapy


class ImagedlItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs for the pipeline to download
    images = scrapy.Field()      # filled in by the pipeline after download

Finally, define spiders/imageget.py.
name is the name of this crawler, which is used on the command line.
start_urls is the list of pages to crawl.
I used an XPath selector to pick the image URLs out of the web page.

import scrapy

from imagedl.items import ImagedlItem


class ImSpider(scrapy.Spider):
    name = 'hoge'
    allowed_domains = ['test.org']
    start_urls = ["https://www.google.co.jp/search?q=%E6%9D%B1%E4%BA%AC%E3%82%B0%E3%83%BC%E3%83%AB&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjNw6yLzYPKAhXHHaYKHR7_Bv0Q_AUICCgC&biw=867&bih=602"]

    def parse(self, response):
        item = ImagedlItem()
        # Select the src attribute of every <img> tag on the page.
        image_urls = response.xpath('//img/@src').extract()
        # ImagesPipeline needs absolute URLs, so resolve any relative ones.
        item['image_urls'] = [response.urljoin(x) for x in image_urls]
        return item
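
To see what the XPath //img/@src picks out without running a crawl, here is a small sketch that does a similar extraction with only the Python standard library. It is not how Scrapy works internally: I swapped in html.parser in place of an XPath engine, and the class name ImgSrcCollector, the function collect_image_urls, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag, roughly like the XPath //img/@src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, value))


def collect_image_urls(html, base_url):
    """Return the absolute image URLs found in an HTML string."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page content, only for trying the function out.
SAMPLE_HTML = """
<html><body>
  <img src="/img/a.png">
  <p>no image here</p>
  <img src="http://example.com/b.jpg" alt="b">
  <img alt="no src">
</body></html>
"""

if __name__ == "__main__":
    for url in collect_image_urls(SAMPLE_HTML, "http://example.com/gallery/index.html"):
        print(url)
```

The list returned here has the same shape as what the spider puts into item['image_urls']: one absolute URL per image.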

All the settings are done.
Now run the crawler.

iwatobipen$ scrapy crawl hoge

After running the command, the stock folder appeared and the images were stored in it.
If you are curious about what kind of images they are, check the URL.
They are images from a Japanese anime.
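
The file names in the stock folder look random at first. As far as I know, by default ImagesPipeline saves each image under a full/ subfolder and names it with the SHA-1 hash of the image URL plus a .jpg extension. The sketch below reproduces that naming under this assumption. The function name default_image_path and the example URL are mine, not part of Scrapy.

```python
import hashlib


def default_image_path(url, images_store="./stock"):
    """Guess where ImagesPipeline would save the image downloaded from url.

    Assumes the default naming: <IMAGES_STORE>/full/<sha1 of the URL>.jpg
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "{}/full/{}.jpg".format(images_store, image_guid)


if __name__ == "__main__":
    # Hypothetical image URL, only for illustration.
    print(default_image_path("http://example.com/a.png"))
```

This makes it possible to work out which file in the folder came from which URL. The same URL always maps to the same file name.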
