Three Introductory Use Cases for Scrapy - Alibaba Cloud Developer Community

2016-08-30

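Note:
This article is based on the dmoz spider example from the official Scrapy documentation.

That example is fairly old, however, and the page structure of dmoz.org has changed since it was written, so I have updated the XPath expressions accordingly.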

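Overview:
This article presents three introductory use cases for Scrapy:

Crawl a single page
Starting from a directory page, crawl every page it links to
Crawl the first page, then follow a link on it to the next page, and so on until there are no more pages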

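Scenarios two and three can both be regarded as link following (Following links).

The defining trait of link following is that the parse function, typically as its last step, must yield an instance of the Request class that carries a callback function.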
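
To make the callback hand-off concrete, here is a toy model in plain Python. It is only a sketch and not Scrapy's engine: ToyRequest, FAKE_SITE, parse_index, parse_contents and crawl are hypothetical names invented for this illustration, and pages are looked up in a dictionary instead of being downloaded.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory site: URL -> (titles on the page, links on the page).
FAKE_SITE = {
    "/index": ([], ["/books", "/resources"]),
    "/books": (["Book A", "Book B"], []),
    "/resources": (["Resource X"], []),
}


@dataclass
class ToyRequest:
    """Stand-in for a request object: a URL plus the function that handles it."""
    url: str
    callback: Callable


def parse_index(url):
    # Directory page: yield one request per link, each pointing at another callback.
    _, links = FAKE_SITE[url]
    for link in links:
        yield ToyRequest(link, callback=parse_contents)


def parse_contents(url):
    # Content page: yield plain items and no further requests.
    titles, _ = FAKE_SITE[url]
    for title in titles:
        yield {"title": title, "page": url}


def crawl(start_url, callback):
    """Run callbacks until no requests remain, collecting every yielded item."""
    queue = deque([ToyRequest(start_url, callback)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, ToyRequest):
                queue.append(result)   # a request: schedule its callback later
            else:
                items.append(result)   # anything else is treated as an item
    return items


items = crawl("/index", parse_index)
```

The point of the sketch is that a callback never calls the next callback itself. It only yields a request that names it, and the loop decides when to run it.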

This article was written with: Windows 7 (64-bit) + Python 3.5 (64-bit) + Scrapy 1.2

Scenario One

Description:

Crawl the content of a single page

Example code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

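    def parse(self, response):
        # Each listed site sits in its own <div class="title-and-desc">.
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            # default='' keeps .strip() from failing when a node is missing.
            item['title'] = div.xpath('a/div/text()').extract_first(default='').strip()
            item['link'] = div.xpath('a/@href').extract_first()
            # The trailing space in "site-descr " is kept exactly as in the original XPath.
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first(default='').strip()
            yield item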
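
The XPath expressions in this spider imply a certain page layout, and extract_first() returns None when nothing matches, so calling .strip() on its result can fail for incomplete entries. The sketch below replays the same paths offline with the standard library's xml.etree.ElementTree, which supports only a subset of XPath and is not Scrapy's selector API. The SAMPLE markup is an assumption reconstructed from the XPath expressions, not a copy of a real dmoz.org page, and clean and extract_entries are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed markup, inferred from the XPath expressions used by the spider.
SAMPLE = """
<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book-a"><div class="site-title">Book A </div></a>
    <div class="site-descr "> A book about Python. </div>
  </div>
  <div class="title-and-desc">
    <a href="http://example.com/book-b"><div class="site-title">Book B</div></a>
  </div>
</body></html>
"""


def clean(element):
    """Return the stripped text of an element, or '' if the element or its text is missing."""
    if element is None or element.text is None:
        return ""
    return element.text.strip()


def extract_entries(markup):
    """Collect title, link and description for every title-and-desc block."""
    root = ET.fromstring(markup.strip())
    entries = []
    for div in root.iterfind(".//div[@class='title-and-desc']"):
        link = div.find("a")
        entries.append({
            "title": clean(div.find("a/div")),
            "link": link.get("href") if link is not None else None,
            # The second sample entry has no description, so this yields ''.
            "desc": clean(div.find("div[@class='site-descr ']")),
        })
    return entries


entries = extract_entries(SAMPLE)
```

The second entry has no description block. Without the None check in clean, that entry would raise an AttributeError instead of producing an empty string.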

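Scenario Two

Description:

① Open the directory page and extract the links.

② Then crawl the content of the pages those links point to.
The callback of the scrapy.Request yielded in ① points to ②.

Official documentation: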

...extract the links for the pages you are interested, follow them and then extract the data you want for all of them.

Example code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/'  # this is the directory page
    ]
    
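    def parse(self, response):
        # Step 1: collect the subcategory links listed on the directory page.
        for a in response.xpath('//section[@id="subcategories-section"]//div[@class="cat-item"]/a'):
            href = a.xpath('@href').extract_first()
            if href is None:
                continue
            # urljoin resolves a relative href against the URL of the current page.
            url = response.urljoin(href)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Step 2: extract one item per listed site on each subcategory page.
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            # default='' keeps .strip() from failing when a node is missing.
            item['title'] = div.xpath('a/div/text()').extract_first(default='').strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first(default='').strip()
            yield item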
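
The spider relies on response.urljoin to turn an extracted href into a full URL. As far as I know, that method resolves the href against the page's URL in the same way as the standard library's urllib.parse.urljoin, so the resolution rules can be tried without running a crawl. The href values below are illustrative assumptions, not links copied from dmoz.org.

```python
from urllib.parse import urljoin

# URL of the directory page the hrefs were found on.
BASE = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# A root-relative href replaces the whole path of the base URL.
root_relative = urljoin(BASE, "/Computers/Programming/Languages/Python/Books/")

# A bare segment is appended to the base directory.
bare_segment = urljoin(BASE, "Books")

# An absolute URL is returned unchanged.
absolute = urljoin(BASE, "http://example.com/x")

print(root_relative)
print(bare_segment)
print(absolute)
```

Whether the trailing slash in the second case matters depends on how the server treats the two forms, so it seems safer to pass the href through unchanged than to rebuild it by hand.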

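Scenario Three

Description:

① Open a page, scrape its content, and extract the link to the next page.

② Then crawl the content of the page that the next-page link points to.
The callback of the scrapy.Request yielded in ① points back to ① itself.

Official documentation: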

A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it

Example code:

import scrapy

from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

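    def parse(self, response):
        # Emit one item per <h3> element; the element's raw HTML is used as the title.
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        # Follow every link on the page, reusing this same method as the callback.
        for url in response.xpath('//a/@href').extract():
            # urljoin turns relative hrefs into absolute URLs, which Request requires.
            yield scrapy.Request(response.urljoin(url), callback=self.parse)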
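
A callback that keeps yielding requests to itself raises the question of when the crawl stops. Scrapy filters duplicate requests by default, and the toy loop below imitates that with a set of visited URLs. It is a plain-Python sketch under stated assumptions, not Scrapy code: PAGES, parse and crawl are hypothetical, and the third page deliberately links back to the first to show that the loop still ends.

```python
from urllib.parse import urljoin

# Hypothetical paginated site: each page has some titles and a "next" href.
PAGES = {
    "http://www.example.com/1.html": {"titles": ["A", "B"], "next": "2.html"},
    "http://www.example.com/2.html": {"titles": ["C"], "next": "3.html"},
    "http://www.example.com/3.html": {"titles": ["D"], "next": "1.html"},  # links back
}


def parse(url):
    """Yield the page's items, then a ('follow', url) marker handled by parse again."""
    page = PAGES[url]
    for title in page["titles"]:
        yield {"title": title}
    if page["next"]:
        yield ("follow", urljoin(url, page["next"]))


def crawl(start_url):
    """Apply parse to every reachable page exactly once and collect the items."""
    seen = {start_url}
    pending = [start_url]
    items = []
    while pending:
        url = pending.pop(0)
        for result in parse(url):
            if isinstance(result, tuple):
                _, next_url = result
                if next_url not in seen:   # skip pages that were already scheduled
                    seen.add(next_url)
                    pending.append(next_url)
            else:
                items.append(result)
    return items


titles = [item["title"] for item in crawl("http://www.example.com/1.html")]
```

Without the seen set, the link from the third page back to the first would make this loop run forever.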

Note:
The third scenario has not been tested!
