Download images from the web.

Web crawling is useful for collecting a broad range of information.
There are lots of documents about web crawling with Python.
Today I used Scrapy for crawling.
Scrapy is an open source framework for web crawling.
The library can be installed with pip (pip install scrapy).
After installing Scrapy, I ran the following command.

iwatobipen$ scrapy startproject imagedl

Then the imagedl folder was created, with some files in it.

├── imagedl
│   ├──
│   ├──
│   ├──
│   ├──
│   └── spiders
│       └──
└── scrapy.cfg

Next I edited the settings file. The file defines the base settings of the crawler.
The ITEM_PIPELINES setting is needed for image handling, and IMAGES_STORE is the path of the download folder.

BOT_NAME = 'imagedl'
SPIDER_MODULES = ['imagedl.spiders']
NEWSPIDER_MODULE = 'imagedl.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline':1}
IMAGES_STORE = './stock'
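
The images pipeline reads a few more optional settings that are handy for a crawl like this one. The following is only a sketch, not part of my actual project: the numeric values and the thumbnail name 'small' are assumptions chosen for illustration. Note that the images pipeline also needs the Pillow library to be installed.

```python
# Illustrative additions to the settings shown above; the values are assumptions.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = './stock'

# Skip images smaller than this many pixels (icons, spacers and so on).
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

# Do not download again an image that was fetched within the last 30 days.
IMAGES_EXPIRES = 30

# Also store resized copies; 'small' is a hypothetical thumbnail name.
IMAGES_THUMBS = {'small': (50, 50)}
```

With the size filter on, tiny decoration images on the page are dropped instead of filling the download folder.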

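Next, the item is defined.
image_urls and images are the field names that the images pipeline uses by default: the spider fills image_urls, and the pipeline fills images with the download results.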

import scrapy

class ImagedlItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Finally, define the spider.
name is the name of this crawler, used on the command line to run it.
start_urls is the list of pages to crawl.
I used HtmlXPathSelector to select the image URLs in the web page.

import scrapy
from scrapy.selector import HtmlXPathSelector
from imagedl.items import ImagedlItem

class ImSpider(scrapy.Spider):
    name = 'hoge'
    allowed_domains = ['']
    start_urls = [""]

    def parse(self, response):
        # HtmlXPathSelector is deprecated in newer Scrapy releases;
        # response.xpath('//img/@src') is the replacement there.
        hxs = HtmlXPathSelector(response)
        item = ImagedlItem()
        # Collect the src attribute of every <img> tag on the page.
        image_urls = hxs.select('//img/@src').extract()
        item['image_urls'] = list(image_urls)
        return item
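
One thing to watch: '//img/@src' often returns relative paths, while the images pipeline needs absolute URLs to download from. The sketch below shows the idea with the standard library only, so it runs without Scrapy. The class ImgSrcCollector, the function absolute_image_urls, and the example.com page are all hypothetical names made up for this illustration, and the HTML parser is a stand-in for the XPath selector, not what the spider above actually uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (stand-in for '//img/@src')."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.srcs.append(value)


def absolute_image_urls(page_url, html):
    """Return the image URLs found in html, resolved against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.srcs]


# Hypothetical page and markup, for illustration only.
page = 'http://example.com/gallery/index.html'
markup = ('<img src="a.png">'
          '<img src="/static/b.jpg">'
          '<img src="http://cdn.example.com/c.gif">')
urls = absolute_image_urls(page, markup)
for url in urls:
    print(url)
```

Inside a Scrapy spider the same resolution can be done with response.urljoin(src) on each extracted value before putting it into item['image_urls'].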

All the settings are done.
Then run the crawler.

iwatobipen$ scrapy crawl hoge

After running the command, the stock folder appeared and the images were stored in it.
If you are interested in what kind of images they are, check the URL.
The images are from a Japanese anime.
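
The files in the stock folder do not keep their original names. As far as I know, the default behaviour of the images pipeline is to save each image as a JPEG under a full/ subfolder, named by the SHA-1 hash of its URL. The sketch below reproduces that naming with the standard library so a downloaded file can be matched back to its source. The function expected_image_path and the example URL are hypothetical, and the naming rule itself is an assumption that may differ between Scrapy versions or with a customised pipeline.

```python
import hashlib
import os


def expected_image_path(image_url, store='./stock'):
    """Guess where the images pipeline stores image_url (assumed default naming)."""
    digest = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return os.path.join(store, 'full', digest + '.jpg')


# Hypothetical URL, for illustration only.
path = expected_image_path('http://example.com/gallery/a.png')
print(path)
```

The same mapping is also recorded in the images field of each item after the download, so the hash rarely needs to be computed by hand.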


Published by iwatobipen

I'm a medicinal chemist at a mid-sized pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
