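Preface
Hello everyone, I'm the ever-upright Latiao Ge (辣条哥).
Today's topic is Scrapy, an application framework written for crawling websites and extracting structured data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has many uses, including data mining, monitoring, and automated testing. It uses the Twisted asynchronous networking library to handle network communication.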

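Contents
Preface
1. Introduction
2. Components
2.1 Downloader middleware
2.2 Spider middleware
3. Project commands
3.1 Create a project
3.2 cd into the project
3.3 Run the project
3.4 Configure settings
4. The interactive shell
4.1 Target data
4.2 Spider file
4.3 items file
4.4 pipelines file
4.5 settings file
5. Project notes
6. scrapy shell
7. Selectors
8. items file
9. pipelines file
10. settings file
11. Scrapy shell
12. Scrapy selectors
13. Nested selectors
14. scrapy.Spider
15. logger
16. from_crawler
17. start_requests()
18. parse, the default callback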
1. Introduction
Scrapy is an efficient, structured web-scraping framework written in pure Python.

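Why use it:
1. It lets us concentrate on making requests and parsing responses.
2. It meets enterprise-grade requirements.

Installation
When this article was written, Scrapy supported Python 2.7 and Python 3.4 and above; current releases support Python 3 only.
Python packages can be installed globally (system-wide) or in user space.
Execution flow

spiders: the web crawlers
items: the scraped records
engine: the engine
scheduler: the scheduler
downloader: the downloader
item pipelines: the item pipelines
middleware: the middleware layers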

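Data flow:
The architecture diagram shows the structure of the Scrapy framework, its components, and the data flow inside the system (drawn as red arrows).
Data flow in Scrapy is controlled by the execution engine and proceeds as follows:

1. The engine gets the initial requests from the spider.
2. The engine puts the requests into the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing through every downloader middleware in turn.
5. Once the page is downloaded, the downloader produces a response containing the page data and sends it back, again passing through every downloader middleware.
6. The engine receives the response from the downloader and sends it to the spider for parsing, passing through every spider middleware.
7. The spider processes the response, extracts items, generates new requests, and sends both to the engine.
8. The engine sends the processed items to the item pipelines, sends the new requests to the scheduler, and asks for the next request.
9. The process repeats until the scheduler has no more requests.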
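
The loop described above can be pictured with a toy model in plain Python. This is only a sketch of the control flow, not how Scrapy is implemented: `toy_download`, `toy_spider_parse`, and `run_engine` are made-up names, the "downloader" returns fake data, and everything runs synchronously, whereas Scrapy is asynchronous.

```python
from collections import deque


def toy_download(url):
    """Stand-in for the downloader: returns a fake 'response' dict."""
    return {"url": url, "body": f"<html>page for {url}</html>"}


def toy_spider_parse(response):
    """Stand-in for a spider callback: yields one item and maybe a new request."""
    yield {"type": "item", "source": response["url"]}
    # Follow exactly one extra link from the first page so the loop terminates.
    if response["url"].endswith("/page1"):
        yield {"type": "request", "url": "https://example.com/page2"}


def run_engine(start_urls):
    scheduler = deque(start_urls)      # steps 1-2: initial requests are queued
    pipeline_output = []               # stands in for the item pipelines
    while scheduler:                   # step 9: stop when the queue is empty
        url = scheduler.popleft()      # step 3: next request
        response = toy_download(url)   # steps 4-5: download
        for result in toy_spider_parse(response):  # steps 6-7: parse
            if result["type"] == "request":
                scheduler.append(result["url"])    # step 8: queue new requests
            else:
                pipeline_output.append(result)     # step 8: hand items to pipelines
    return pipeline_output


items = run_engine(["https://example.com/page1"])
print(items)
```

The queue plays the role of the scheduler, and the loop ends exactly when it is empty, which mirrors the last step of the list.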

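2. Components
Scrapy Engine
The engine controls the data flow between all components of the system and triggers events when certain actions occur.
Scheduler
The scheduler receives requests from the engine and enqueues them so it can hand them back when the engine asks for them later.
Downloader
The downloader fetches web pages and hands them to the engine, which in turn passes them to the spiders.
Spider
Spiders are custom classes written by the user to parse responses and extract items from them, or further requests to follow.
Item pipeline
The item pipeline post-processes items after the spider has extracted them. Typical tasks include cleaning, validation, and persistence (such as storing the item in a database).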
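
As a rough illustration of the "cleaning" task, a pipeline is a plain class with a `process_item(item, spider)` method. The class name `ScoreCleaningPipeline` and the `score` field handling below are assumptions made for this sketch, and a plain dict stands in for the item so it runs without Scrapy installed.

```python
class ScoreCleaningPipeline:
    """Hypothetical pipeline: normalizes a 'score' field to a float."""

    def process_item(self, item, spider):
        raw = str(item.get("score", "")).strip()
        try:
            item["score"] = float(raw)
        except ValueError:
            # Keep the item but mark the score as missing.
            item["score"] = None
        return item  # returning the item passes it on to the next pipeline


pipeline = ScoreCleaningPipeline()
cleaned = pipeline.process_item({"film_name": "Example", "score": " 9.7 "}, spider=None)
print(cleaned)
```

In a real project the class would also have to be listed in ITEM_PIPELINES before Scrapy calls it.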

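2.1 Downloader middleware
Downloader middlewares are hooks that sit between the engine and the downloader. They process requests passing from the engine to the downloader and responses passing from the downloader back to the engine.
Use a downloader middleware if you need to do any of the following:
Process a request just before it is sent to the downloader (that is, right before Scrapy sends the request to the website)
Change a received response before it is passed to a spider
Send a new request instead of passing the received response to a spider
Pass a response to a spider without fetching a web page
Silently drop some requests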
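
A minimal sketch of the first two cases, assuming a made-up middleware named `DefaultUserAgentMiddleware`. The method names `process_request` and `process_response` are the hooks Scrapy calls on downloader middlewares; the fake request object is only there so the sketch runs without Scrapy.

```python
from types import SimpleNamespace


class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware that fills in a missing User-Agent."""

    def __init__(self, user_agent="my-crawler/0.1"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # None tells Scrapy to keep processing this request

    def process_response(self, request, response, spider):
        # Runs before the response reaches the spider; pass it through unchanged.
        return response


# Stand-in request object (an assumption) so the sketch runs without Scrapy.
fake_request = SimpleNamespace(headers={})
mw = DefaultUserAgentMiddleware()
result = mw.process_request(fake_request, spider=None)
print(fake_request.headers)
```

To take effect in a project, such a class would need an entry in DOWNLOADER_MIDDLEWARES in the settings file.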

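2.2 Spider middleware
Spider middlewares are hooks that sit between the engine and the spiders. They can process the responses going into a spider and the items and requests coming out of it.
Use a spider middleware if you need to:
Post-process the requests or items produced by spider callbacks
Post-process start_requests
Handle spider exceptions
Call the errback instead of the callback for some requests, based on the response content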
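
The first case can be sketched with `process_spider_output`, the hook Scrapy calls with whatever a spider callback produced. The middleware name `RequireFilmNameMiddleware` and the filtering rule are assumptions for illustration, and plain dicts stand in for items.

```python
class RequireFilmNameMiddleware:
    """Hypothetical spider middleware: drops dict results with no film_name."""

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, dict) and not entry.get("film_name"):
                continue  # skip incomplete items
            yield entry   # pass everything else (including requests) through


mw = RequireFilmNameMiddleware()
callback_output = [
    {"film_name": "Example", "score": "9.7"},
    {"film_name": "", "score": "8.0"},
]
kept = list(mw.process_spider_output(response=None, result=callback_output, spider=None))
print(kept)
```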
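Basic usage

3. Project commands
3.1 Create a project
scrapy startproject <project_name> [project_dir]
Note: "<>" marks a required argument and "[]" an optional one.
scrapy startproject db

The project directory and the Python package inside it are both named db.

3.2 cd into the project
scrapy genspider [options] <name> <domain>

scrapy genspider example example.com

This creates the spider under the project's spiders/ directory; example is the spider name (and file name), and example.com is the domain to crawl.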

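3.3 Run the project
scrapy crawl <spider_name>  # the point here is the overall workflow

3.4 Configure settings
Two settings matter here: ROBOTSTXT_OBEY and DEFAULT_REQUEST_HEADERS.

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

4. The interactive shell
scrapy shell <url> (here, the start URL) fetches the page and gives us the same response object the project would receive.
Use it to test XPath expressions against that response.

4.1 Target data
Information on 250 movies
For each movie: title, director information (may include the cast), and rating
Save the movie information directly to a local file
Save the movie information through an item pipeline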

4.2 Spider file

# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DbItem  # a "safe" dict: only declared fields are allowed


class Db250Spider(scrapy.Spider):  # inherit from the base Spider class
    name = 'db250'  # spider name; required and must be unique
    # allowed_domains = ['movie.douban.com']  # optional; if omitted, any domain is allowed
    start_urls = ['https://movie.douban.com/top250']  # initial URL(s); required
    page_num = 0

    def parse(self, response):  # default callback: handles the response
        node_list = response.xpath('//div[@class="info"]')
        # Open in append mode so later pages do not overwrite earlier ones.
        with open("film.txt", "a", encoding="utf-8") as f:
            for node in node_list:
                # Movie title; extract() (new here) returns a list of strings
                film_name = node.xpath("./div/a/span/text()").extract()[0]
                # Director information
                director_name = node.xpath("./div/p/text()").extract()[0].strip()
                # Rating
                score = node.xpath('./div/div/span[@property="v:average"]/text()').extract()[0]

                # Storage without a pipeline
                item = {}
                item["film_name"] = film_name
                item["director_name"] = director_name
                item["score"] = score
                content = json.dumps(item, ensure_ascii=False)
                f.write(content + "\n")

                # Storage through the pipeline
                item_pipe = DbItem()  # create a DbItem and use it like a dict
                item_pipe['film_name'] = film_name
                item_pipe['director_name'] = director_name
                item_pipe['score'] = score
                yield item_pipe

        # Request the next page: build its URL
        self.page_num += 1
        if self.page_num == 3:  # stop after three pages; the full list of 250 spans ten
            return
        page_url = "https://movie.douban.com/top250?start={}&filter=".format(self.page_num * 25)
        yield scrapy.Request(page_url)

Page URL pattern

"https://movie.douban.com/top250?start=25&filter="
"https://movie.douban.com/top250?start=50&filter="
"https://movie.douban.com/top250?start=75&filter="

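The pattern is simply an offset that grows by 25 per page. A small sketch that generates every page URL for the 250-movie list; the helper name `page_urls` is made up, and it assumes that `start=0` addresses the first page in the same way as the bare URL.

```python
BASE_URL = "https://movie.douban.com/top250?start={}&filter="
PAGE_SIZE = 25  # inferred from the start=25, 50, 75 pattern


def page_urls(total_items=250, page_size=PAGE_SIZE):
    """Return one URL per page, with offsets 0, 25, 50, ... below total_items."""
    return [BASE_URL.format(offset) for offset in range(0, total_items, page_size)]


urls = page_urls()
print(len(urls))
print(urls[1])
```

Generating the URLs up front is an alternative to the counter kept on the spider; either approach yields the same sequence of requests.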
4.3 items file
import scrapy


class DbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    film_name = scrapy.Field()
    director_name = scrapy.Field()
    score = scrapy.Field()

4.4 pipelines file
import json


class DbPipeline(object):
    def open_spider(self, spider):
        # Called when the spider is opened
        self.f = open("film_pipe.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        json_data = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(json_data)
        return item

    def close_spider(self, spider):
        # Called when the spider is closed
        self.f.close()  # close the file

4.5 settings file

# -*- coding: utf-8 -*-

# Scrapy settings for db project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'db'

SPIDER_MODULES = ['db.spiders']
NEWSPIDER_MODULE = 'db.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'db (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'db.middlewares.DbSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'db.middlewares.DbDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'db.pipelines.DbPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

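5. Project notes
A new project defaults to ROBOTSTXT_OBEY = True in settings, so Scrapy follows the site's robots.txt rules and, for this site, no data can be crawled.
Change it to ROBOTSTXT_OBEY = False.

Some sites only return data when a User-Agent header is set in settings (so the crawler looks like a regular client).
The pipeline must be enabled in settings (ITEM_PIPELINES) before items are passed to the pipelines file.
Fields must be declared in items before an Item object can carry them, much as MySQL columns must be defined before data can be written to them.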

6. scrapy shell

# Scrapy shell

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)  # the scrapy module
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002624C415F98>  # the crawler object
[s]   item       {}  # the item object
[s]   request    # the request object
[s]   response   <200 https://movie.douban.com/top250>  # the response object
[s]   settings   <scrapy.settings.Settings object at 0x000002624C415EB8>  # the settings
[s]   spider     # the spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)  # get a response from a URL
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects  # get a response from a Request object
[s]   shelp()           Shell help (print this help)  # list the commands
[s]   view(response)    View response in a browser  # open the response in the local browser
7. Selectors
# The opening tags of this sample were lost when the post was rendered. The lines
# down to the "hd" block are reconstructed; the class names hd, title, other and
# playable and the href value are assumptions.
html_str = """
<div class="info">
    <div class="hd">
        <a href="#">
            <span class="title">肖申克的救赎</span>
            <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
            <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
        </a>
        <span class="playable">[可播放]</span>
    </div>
    <div class="bd">
        <p class="">
            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
        </p>
        <div class="star">
            <span class="rating5-t"></span>
            <span class="rating_num" property="v:average">9.7</span>
            <span property="v:best" content="10.0"></span>
            <span>1980500人评价</span>
        </div>
        <p class="quote">
            <span class="inq">希望让人自由。</span>
        </p>
    </div>
</div>
"""
from scrapy.selector import Selector

# 1. Build a Selector from the text argument
selc_text = Selector(text=html_str)
print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])
print(selc_text.xpath('./body/div[@class="info"]//div/a/span/text()').extract()[0])
print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract_first())

# 2. Build a Selector from a response
from scrapy.http import HtmlResponse
response = HtmlResponse(url="http://www.example.com", body=html_str.encode(), encoding="utf-8")
selc_resp = Selector(response=response)
print(selc_resp.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])
# The response builds the same selector lazily and exposes shortcuts for it.
print(response.selector.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])
print(response.xpath('//div[@class="info"]//div/a/span/text()').extract()[0])

# 3. Nested expressions: css, xpath and re can be chained freely
print(response.css("a").xpath('./span[1]/text()').extract()[0])
print(response.css("a").xpath('./span[1]/text()').re("的..")[0])
print(response.css("a").xpath('./span[1]/text()').re_first("的.."))


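Scraping secondary pages and merging data across requests (movies)
1. Detail (secondary) pages are handled by a separate callback, here named get_detail:

def get_detail(self, response):
    pass

2. The key to passing data along and merging it is the meta argument of Request.

Spider file

# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DbItem  # a "safe" dict: only declared fields are allowed


class Db250Spider(scrapy.Spider):  # inherit from the base Spider class
    name = 'db250'  # spider name; required and must be unique
    # allowed_domains = ['movie.douban.com']  # optional; if omitted, any domain is allowed
    start_urls = ['https://movie.douban.com/top250']  # initial URL(s); required
    page_num = 0

    def parse(self, response):  # default callback: handles the list page
        node_list = response.xpath('//div[@class="info"]')
        for node in node_list:
            # Movie title
            film_name = node.xpath("./div/a/span/text()").extract()[0]
            # Director information
            director_name = node.xpath("./div/p/text()").extract()[0].strip()
            # Rating
            score = node.xpath('./div/div/span[@property="v:average"]/text()').extract()[0]

            # Partial item, to be completed on the detail page
            item_pipe = DbItem()  # create a DbItem and use it like a dict
            item_pipe['film_name'] = film_name
            item_pipe['director_name'] = director_name
            item_pipe['score'] = score
            # yield item_pipe
            # print("movie info", dict(item_pipe))
            # The synopsis lives on the detail page, so request it and carry the item along
            detail_url = node.xpath('./div/a/@href').extract()[0]
            yield scrapy.Request(detail_url, callback=self.get_detail, meta={"info": item_pipe})

        # Request the next page: build its URL
        self.page_num += 1
        if self.page_num == 4:  # stop after four pages
            return
        page_url = "https://movie.douban.com/top250?start={}&filter=".format(self.page_num * 25)
        yield scrapy.Request(page_url)

    def get_detail(self, response):
        item = DbItem()
        # Parse the detail-page response:
        # 1. meta comes back together with the response
        # 2. read it through response.meta
        # 3. copy it into the new item with update()
        info = response.meta["info"]
        item.update(info)
        # Synopsis text
        description = response.xpath('//div[@id="link-report"]//span[@property="v:summary"]/text()').extract()[0].strip()
        # print('description', description)

        item["description"] = description
        # Save through the pipeline
        yield item

Target data: the movie information plus each movie's synopsis, which is found in the HTML source of the secondary (detail) page.

Request flow: visit the list page and extract the movie information and the detail-page URL; visit the detail-page URL; extract the synopsis from the detail-page response.

Storage problem: responses arrive in no guaranteed order, so pass the partial item through meta to keep all the information for one movie together.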
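
Why carrying the partial item with each request solves the ordering problem can be shown with a toy model that does not use Scrapy at all. The function names, the rows, and the descriptions below are all made up for illustration; the tuple `(detail_url, meta)` stands in for a request created with a meta dict.

```python
import random


def parse_list_page(rows):
    """Yield (detail_url, meta) pairs, standing in for requests that carry meta."""
    for row in rows:
        partial_item = {"film_name": row["film_name"], "score": row["score"]}
        yield row["detail_url"], {"info": partial_item}


def parse_detail_page(meta, description):
    """Merge the partial item carried in meta with data from the detail page."""
    item = dict(meta["info"])
    item["description"] = description
    return item


# Made-up rows and descriptions, for illustration only.
rows = [
    {"film_name": "Film A", "score": "9.0", "detail_url": "https://example.com/a"},
    {"film_name": "Film B", "score": "8.5", "detail_url": "https://example.com/b"},
]
descriptions = {"https://example.com/a": "About A", "https://example.com/b": "About B"}

pending = list(parse_list_page(rows))
random.shuffle(pending)  # responses may come back in any order
items = [parse_detail_page(meta, descriptions[url]) for url, meta in pending]
by_name = {item["film_name"]: item for item in items}
print(by_name["Film A"])
```

However the pending requests are shuffled, each description ends up on the right movie, because the pairing travels with the request instead of depending on arrival order.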

8. items file
import scrapy


class DbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    film_name = scrapy.Field()
    director_name = scrapy.Field()
    score = scrapy.Field()
    description = scrapy.Field()

9. pipelines file
import json


class DbPipeline(object):
    def open_spider(self, spider):
        # Called when the spider is opened
        self.f = open("film_pipe.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        json_data = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(json_data)
        return item

    def close_spider(self, spider):
        # Called when the spider is closed
        self.f.close()  # close the file

10. settings file
Most of the comments have been removed here.

# -*- coding: utf-8 -*-

# Scrapy settings for db project

BOT_NAME = 'db'

SPIDER_MODULES = ['db.spiders']
NEWSPIDER_MODULE = 'db.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'db (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'db.pipelines.DbPipeline': 300,
}

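11. Scrapy shell
scrapy shell is a debugging tool.

Running scrapy shell https://movie.dou…com/top250 in the project directory gives the following information:

When started inside a project, scrapy shell automatically loads the configuration in settings (the robots.txt policy, request headers, and so on), so the request it sends gets the same, correct response the spider would get.

Shortcuts:
shelp()
fetch(url[, redirect=True])
fetch(request)
view(response)
Scrapy objects:
crawler
spider
request
response
settings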

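12. Scrapy selectors
Scrapy provides a parsing mechanism built on the lxml library, called selectors.
The name comes from the fact that they "select" the parts of an HTML document specified by XPath or CSS expressions.
The Scrapy selector API is small and simple.

Selectors provide two methods for selecting elements:

xpath(): uses XPath syntax
css(): uses CSS selector syntax
Shortcuts:
response.xpath()
response.css()
Both return a list of selectors.
Extracting text:
selector.extract() returns a list of strings
selector.extract_first() returns the string of the first selector, or None if nothing matched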

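13. Nested selectors
Sometimes selecting an element takes several chained calls to the selection methods (.xpath() or .css()):
response.css('img').xpath('@src')

Selectors also have a .re() method for extracting data with regular expressions.
It returns a list of strings rather than selectors, so it ends the chain.
It is usually called after xpath() or css() to filter text data.
re_first() returns the first matching string.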
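14. scrapy.Spider
name: the spider's name

A string that defines the name of this spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. It is the most important spider attribute, and it is required.

start_urls: the starting URLs

A list of URLs where the spider begins to crawl. The first pages downloaded will be those listed here; subsequent Requests are generated successively from data contained in the start URLs.

custom_settings: per-spider settings

Settings that override the project-wide configuration when this spider runs. It must be defined as a class attribute, because the settings are updated before instantiation.

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []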

15. logger
A Python logger created with the spider's name. You can use it to send log messages.

@property
def logger(self):
    logger = logging.getLogger(self.name)
    return logging.LoggerAdapter(logger, {'spider': self})

def log(self, message, level=logging.DEBUG, **kw):
    """Log the given message at the given log level

    This helper wraps a log call to the logger within the spider, but you
    can use it directly (e.g. Spider.logger.info('msg')) or use any other
    Python logger too.
    """
    self.logger.log(level, message, **kw)
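
The same pattern can be tried with the standard library alone. In this sketch `FakeSpider` is a made-up stand-in for a spider, and the handler writing to an in-memory buffer is only there so the output can be inspected; inside a spider you would simply call self.logger.info(...).

```python
import io
import logging


class FakeSpider:
    """Hypothetical stand-in for a spider; only the name matters here."""
    name = "db250"

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {"spider": self})


# Send log output to an in-memory buffer so it can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
base_logger = logging.getLogger("db250")
base_logger.addHandler(handler)
base_logger.setLevel(logging.INFO)

spider = FakeSpider()
spider.logger.info("parsed %d movies", 25)
log_line = buffer.getvalue().strip()
print(log_line)
```

Because the logger is named after the spider, log lines from different spiders can be told apart by the name field.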

16. from_crawler
This is the class method Scrapy uses to create spiders. You normally do not need to override it.

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider

def _set_crawler(self, crawler):
    self.crawler = crawler
    self.settings = crawler.settings
    crawler.signals.connect(self.close, signals.spider_closed)

17. start_requests()
This method must return an iterable containing the first requests to crawl. It is called only once.

def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

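18. parse, the default callback
This is the default callback Scrapy uses to process downloaded responses when their requests do not specify a callback.

def parse(self, response):
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))

close: closing the spider

Called when the spider closes.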
