Crawl the web

A crawler is a useful tool for gathering lots of information from the web, and there are plenty of libraries for the job.
Today I tried Scrapy.
Scrapy is a fast, high-level screen scraping and web crawling framework.
You can install it with pip.
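For example (the exact command may differ a little depending on your Python setup):

iwatobipen$ pip install scrapy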
After installing it, I made a sample project first.

iwatobipen$ scrapy startproject peng

Then the peng folder was made.

iwatobipen$ tree peng
peng
├── peng
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Next, I wrote some code.
First, I added the following settings to settings.py:

ITEM_PIPELINES = { 'scrapy.contrib.pipeline.images.ImagesPipeline' : 1 }
IMAGES_STORE = './imagestore'
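As a side note, in Scrapy 1.0 and later the same pipeline lives under scrapy.pipelines rather than scrapy.contrib (and it needs Pillow installed for image handling), so the equivalent settings in a newer release would look like this:

ITEM_PIPELINES = { 'scrapy.pipelines.images.ImagesPipeline' : 1 }
IMAGES_STORE = './imagestore'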

Then I wrote items.py. The ImagesPipeline expects the item to have image_urls and images fields, so I defined both:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class PengItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
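A scrapy.Item behaves like a dict, so the spider just assigns the URLs to image_urls, and after the pipeline downloads the files it fills in images with the download results. A quick illustration (the URL here is just a made-up example):

item = PengItem()
item[ 'image_urls' ] = [ 'https://example.com/penguin.jpg' ]  # hypothetical URL
print( item[ 'image_urls' ] )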

Next, I made a spider in the spiders folder, named spider_pen.py:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from peng.items import PengItem

class Penguin( BaseSpider ):
    name = 'pen'
    allowed_domains = [ 'pen.com' ]
    start_urls = [
            'http://en.wikipedia.org/wiki/Penguin'
            ]
    def parse( self, response ):
        hxs = HtmlXPathSelector( response )
        item = PengItem()
        image_urls = hxs.select( '//img/@src' ).extract()
        item[ 'image_urls' ] = [ "http:"+url for url in image_urls ]
        return item
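BaseSpider and HtmlXPathSelector come from older Scrapy releases and have since been deprecated. With a recent Scrapy the same spider could be written roughly as below; this is just a sketch of the equivalent code, and I set allowed_domains to the Wikipedia host instead of the 'pen.com' placeholder:

import scrapy
from peng.items import PengItem

class Penguin( scrapy.Spider ):
    name = 'pen'
    allowed_domains = [ 'en.wikipedia.org' ]
    start_urls = [
            'https://en.wikipedia.org/wiki/Penguin'
            ]
    def parse( self, response ):
        item = PengItem()
        # Wikipedia image src attributes are protocol-relative ( //upload.wikimedia.org/... ),
        # so resolve them against the response URL to get absolute URLs.
        image_urls = response.xpath( '//img/@src' ).extract()
        item[ 'image_urls' ] = [ response.urljoin( url ) for url in image_urls ]
        yield item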

That’s all!
Let’s crawl!

iwatobipen$ scrapy crawl pen

After running the command, I got cute penguin image files from the wiki page.

├── imagestore
│   └── full
│       ├── 0a69851e1dd15bf066331b3d69b543f4e1efe7a1.jpg
│       ├── 29581fa7c4782c5c230a2aa5db9ab0084afe4bba.jpg
│       ├── 4724bd7d31f7d0a757c6f77c8603fe124dcb118d.jpg
│       ├── 4730fada49f7ea362bd2b2b400ce496853212c68.jpg
│       ├── 4b0ed9bab8d1c5c87353ecb01c43bc4d6805597d.jpg
│       ├── 4e04d9f5660403e8abd4fbf492653ba585a5e1da.jpg
│       ├── 54bd7e92b742cabeb7c875bb8a5f17505e6e3ec8.jpg
│       ├── 59addae2158171e432742924c4cf8dd1d4c6f5cb.jpg
│       ├── 6281195be681d151f210796859e1d5a8c62ba57b.jpg
│       ├── 6b187ae572b1f0c218714193e208b891009510e1.jpg
│       ├── 74142314fb3a6fbcfe1b5e71f9c4f4fe6aad2f83.jpg
│       ├── 8bbcca134af2579ac8983e60cde42185f01889a4.jpg
│       ├── 94c8efe79732da3ff78dbcb41542f80d8c611252.jpg
│       ├── 995f1c580641c7ad7e04eaaa517e863cc41d02de.jpg
│       ├── 9eb924c83589b64dedbdcc579256e62933573dc6.jpg
│       ├── a91559fe9c0beb20c90263ca95fa6413b67ecace.jpg
│       ├── a9958d0f37a757d1521559f8b49d88d126b846c9.jpg
│       ├── ae5e639dbf9e73e0b2aad744aa07ed784a24f871.jpg
│       ├── f34bc03737d7f1ec7098ee637de5ebd7a2e91551.jpg
│       ├── fbb24d5d48701796c44a6fbb6b02d35be8cc3687.jpg
│       └── fe5ae9a4bd5c1e95a12d9c5978c09c0051865547.jpg
├── peng
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── spider_pen.py
│       └── spider_pen.pyc
└── scrapy.cfg
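As another side note, Scrapy can also export the scraped items themselves; for example, the feed export option can dump the list of image URLs to JSON (I only needed the images here):

iwatobipen$ scrapy crawl pen -o items.json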

Scrapy is a very powerful tool.
