Scrapy Crawler Framework (Part 2)

Summary: Scrapy Crawler Framework (Part 2)

Continued from Scrapy Crawler Framework (Part 1): https://developer.aliyun.com/article/1618014

3.2 Creating a Spider
    To create a spider, first create a spider module file and place it in the spiders folder. A spider module is a class that crawls data from one or more websites; it must inherit from scrapy.Spider. The scrapy.Spider class provides the start_requests() method to issue the initial network requests, and the returned results are then parsed by the parse() method. The commonly used attributes and methods of scrapy.Spider are as follows:

§ name: A string that names the spider. Scrapy looks spiders up by this name, so it must be unique, although multiple instances of the same spider may be created. When crawling a single website, that site's name is commonly used as the spider name.

§ allowed_domains: A list of the domains the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domains are not in this list will not be crawled.

§ start_urls: The initial list of URLs. If no specific URLs are given, the spider starts crawling from this list.

§ custom_settings: A dictionary of settings specific to this spider. It overrides the project-wide settings, so it must be defined as a class variable, because the settings are applied before the spider is instantiated.

§ settings: A Settings object through which the project's global settings can be accessed.

§ logger: A Python logger created with the spider's name.

§ start_requests(): Generates the initial network requests and must return an iterable. By default it builds a request from each URL in start_urls, using the GET method; to request pages via POST instead, override this method using FormRequest().

§ parse(): The default method Scrapy uses to process a response when the request specifies no other callback. It handles the response, extracts data and follow-up requests, and returns an iterable of Request and/or Item objects.

§ closed(): Called when the spider closes. It can be used instead of registering a signal listener, for example to release resources or perform other cleanup.
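As a rough sketch (plain Python standing in for Scrapy, so ToySpider and the request dictionaries below are illustrative stand-ins, not real Scrapy objects), the default start_requests() behavior amounts to yielding one GET request per entry in start_urls:

```python
# Toy stand-in for scrapy.Spider's default start_requests():
# one GET request is generated per URL in start_urls.
class ToySpider:
    name = "toy"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]

    def start_requests(self):
        for url in self.start_urls:
            # A real spider would yield scrapy.Request(url, callback=self.parse)
            yield {"url": url, "method": "GET", "callback": self.parse}

    def parse(self, response):
        pass

requests = list(ToySpider().start_requests())
print(len(requests))           # 2
print(requests[0]["method"])   # GET
```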

3.2.1 Crawling a Page and Saving It as an HTML File
Taking the web page shown in the figure below as an example, crawl the page and save its source code as an HTML file in the project folder.

(Figure: the web page to be crawled)

Create a spider file named "crawl.py" in the spiders folder. In that file, define a QuotesSpider class that inherits from scrapy.Spider, override start_requests() to issue the network requests, and override parse() to write the retrieved HTML code to files. Example code:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 11:18 AM
# File         : crawl.py
# IDE          : PyCharm
# Import the framework
import scrapy

class QuotesSpider(scrapy.Spider):
    # Define the spider name
    name = 'quotes_1'
    def start_requests(self):
        # Target URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL in the list
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Get the page number
        page = response.url.split('/')[-2]
        # Build the file name from the page number
        filename = 'quotes-%s.html' % page
        # Open the file in binary write mode; it is created if it does not exist
        with open(filename, 'wb') as f:
            # Write the retrieved HTML code to the file
            f.write(response.body)
        # Log the name of the saved file
        self.log('Saved file %s' % filename)
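The page number used for the file name in parse() comes from a plain string split on the response URL; for example:

```python
# Splitting the URL on '/' yields
# ['http:', '', 'quotes.toscrape.com', 'page', '1', ''],
# so index -2 is the page number.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split('/')[-2]
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```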

To run a spider created in a Scrapy project, enter "scrapy crawl quotes_1" in a command window, where "quotes_1" is the spider name defined above. Since I use the PyCharm IDE, the command is entered in the Terminal panel at the bottom. The output after the run completes is shown below:

liuxiaowei@MacBookAir spiders % scrapy crawl quotes_1  # command to run the spider
2022-02-17 11:23:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 11:23:47 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 11:23:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 11:23:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
 'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
 'ROBOTSTXT_OBEY': True,
 .................    # intermediate output omitted
2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-1.html
2022-02-17 11:23:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-2.html
2022-02-17 11:23:49 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 11:23:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
............ # intermediate output omitted
2022-02-17 11:23:49 [scrapy.core.engine] INFO: Spider closed (finished)

3.2.2 Sending a POST Request with FormRequest()
Example code:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 12:43 PM
# File         : POST请求.py
# IDE          : PyCharm

# Import the framework
import scrapy
# Import the json module
import json
class QuotesSpider(scrapy.Spider):
    name = "quotes_2"
    # Form parameters as a dictionary
    data = {'1': '能力是有限的, 而努力是无限的。',
            '2': '星光不问赶路人, 时光不负有心人。'}
    def start_requests(self):
        return [scrapy.FormRequest('http://httpbin.org/post', formdata=self.data, callback=self.parse)]


    # Handle the response
    def parse(self, response):
        # Convert the response data to a dictionary
        response_dict = json.loads(response.text)
        # Print the converted response data
        print(response_dict)

The output is as follows:

liuxiaowei@MacBookAir spiders % scrapy crawl quotes_2    # command to run the spider
2022-02-17 12:53:01 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 12:53:01 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 12:53:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 12:53:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
 'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['scrapyDemo.spiders']}
2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet Password: 6965cfb5ccb132d6
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-17 12:53:01 [scrapy.core.engine] INFO: Spider opened
2022-02-17 12:53:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 12:53:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2022-02-17 12:53:02 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
{'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd4ae-3eaa8de12c3f3606567f0039'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
2022-02-17 12:53:02 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 12:53:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 772,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 1214,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.026007,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 2, 17, 4, 53, 2, 396943),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 54693888,
 'memusage/startup': 54693888,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 2, 17, 4, 53, 1, 370936)}
2022-02-17 12:53:02 [scrapy.core.engine] INFO: Spider closed (finished)
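As the Content-Type header in the log shows, FormRequest sends the dictionary as an application/x-www-form-urlencoded request body. The standard library's urllib.parse can sketch what that encoding and its round trip look like:

```python
from urllib.parse import urlencode, parse_qs

data = {'1': '能力是有限的, 而努力是无限的。',
        '2': '星光不问赶路人, 时光不负有心人。'}
# Percent-encode the fields into a form body, as FormRequest does
body = urlencode(data)
# Decoding the body recovers the original fields
decoded = {key: values[0] for key, values in parse_qs(body).items()}
print(decoded == data)  # True
```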

Note

Besides launching the spider by entering "scrapy crawl quotes_2" in a command window, Scrapy also provides an API for starting spiders from within a program: the CrawlerProcess class. First pass the project settings to CrawlerProcess when initializing it, then pass the spider name to the crawl() method, and finally start the spider with the start() method. The code is as follows:

# Import the CrawlerProcess class
from scrapy.crawler import CrawlerProcess
# Import the function that retrieves the project settings
from scrapy.utils.project import get_project_settings

# Program entry point
if __name__ == "__main__":
    # Create a CrawlerProcess object, passing in the project settings
    process = CrawlerProcess(get_project_settings())
    # Specify the name of the spider to start
    process.crawl('quotes_2')
    # Start the spider
    process.start()
The output is as follows:

/Users/liuxiaowei/PycharmProjects/爬虫练习/venv/bin/python /Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架/scrapyDemo/scrapyDemo/spiders/POST请求.py
2022-02-17 13:02:16 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 13:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 13:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 13:02:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyDemo',
 'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['scrapyDemo.spiders']}
2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 7aa61c26ffb3372a
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-17 13:02:16 [scrapy.core.engine] INFO: Spider opened
2022-02-17 13:02:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
{'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd6d9-1e241f7e7f705c1172c103b5'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
2022-02-17 13:02:17 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 13:02:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 772,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 1214,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.030657,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 2, 17, 5, 2, 17, 995487),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 48164864,
 'memusage/startup': 48164864,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 2, 17, 5, 2, 16, 964830)}
2022-02-17 13:02:17 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

Caution

If running a Scrapy-created crawler project raises "SyntaxError: invalid syntax", the cause is that Python 3.7 made "async" a reserved keyword. To fix this kind of error, open the file Python3.7/Lib/site-packages/twisted/conch/manhole.py and rename every occurrence of "async" in it to an identifier that is not a keyword, such as "async_".

3.3 Extracting Data
The Scrapy framework can select a specific part of an HTML document with a CSS or XPath expression and extract the corresponding data. CSS (Cascading Style Sheets) controls the layout, fonts, colors, backgrounds, and other effects of an HTML page. XPath is a language for locating information in an XML document by its elements and attributes.

3.3.1 Extracting Data with CSS
When extracting a piece of data from an HTML document with CSS, you can specify the name of an HTML tag. For example, to get the data of the title tag in the sample page used earlier, run the following command:

response.css('title').extract()

Example code:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 2:18 PM
# File         : css提取数据.py
# IDE          : PyCharm

# Import the framework
import scrapy

class QuotesSpider(scrapy.Spider):
    # Define the spider name
    name = 'quotes_3'
    def start_requests(self):
        # Target URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL in the list
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Get the title tag data
        title = response.css('title').extract()
        # Print the title
        print(title)

The output is as follows:

liuxiaowei@MacBookAir spiders % scrapy crawl quotes_3       
2022-02-17 14:25:03 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 14:25:03 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 14:25:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
['<title>Quotes to Scrape</title>']
2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
['<title>Quotes to Scrape</title>']
2022-02-17 14:25:05 [scrapy.core.engine] INFO: Spider closed (finished)

Note

A CSS extraction returns a list of the nodes that match the expression, so to extract the text inside the tag you can use either of the following:

response.css('title::text').extract_first()
or
response.css('title::text')[0].extract()

The output is as follows:

2022-02-17 14:32:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
['<title>Quotes to Scrape</title>'] 
 Quotes to Scrape
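The two forms differ when a selector matches nothing: extract_first() returns None, while indexing with [0] raises IndexError. A plain list stands in for the SelectorList here, and extract_first below is a hypothetical helper mimicking the real method:

```python
def extract_first(results, default=None):
    # Mimic SelectorList.extract_first(): return a default instead of raising
    return results[0] if results else default

print(extract_first(['Quotes to Scrape']))  # Quotes to Scrape
print(extract_first([]))                    # None
try:
    [][0]
except IndexError:
    print('indexing an empty result raises IndexError')
```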

3.3.2 Extracting Data with XPath
When extracting a piece of data from an HTML document with an XPath expression, the data is located according to XPath syntax. For example, to get the same title tag information, use the following command:

response.xpath('//title/text()').extract_first()
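Scrapy evaluates XPath through lxml; the standard library's xml.etree.ElementTree implements only a subset of XPath, but it is enough to sketch how '//title' locates the title on a minimal stand-in document:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the sample page
html = '<html><head><title>Quotes to Scrape</title></head><body/></html>'
root = ET.fromstring(html)
# './/title' finds the element anywhere under the root; .text is its content
title = root.find('.//title').text
print(title)  # Quotes to Scrape
```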

The example below uses XPath to retrieve multiple pieces of information from the test page used above. The code is as follows:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 2:37 PM
# File         : crawl_Xpath.py
# IDE          : PyCharm

# Import the framework
import scrapy

class QuotesSpider(scrapy.Spider):
    # Define the spider name
    name = 'quotes'
    def start_requests(self):
        # Target URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL in the list
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)
    # Handle the response
    def parse(self, response):
        # Get every quote block
        for quote in response.xpath(".//*[@class='quote']"):
            # Get the quote text
            text = quote.xpath('.//*[@class="text"]/text()').extract_first()
            # Get the author
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            # Get the tags
            tags = quote.xpath('.//*[@class="tag"]/text()').extract()
            # Print the information as a dictionary
            print(dict(text=text, author=author, tags=tags))

The output is as follows:

liuxiaowei@MacBookAir spiders % scrapy crawl quotes
2022-02-17 14:38:57 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 14:38:57 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 14:38:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-17 14:38:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2022-02-17 14:38:59 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-17 14:38:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
2022-02-17 14:38:59 [scrapy.core.engine] INFO: Spider closed (finished)

Note

Scrapy's selector objects also provide a .re() method, which extracts data with a regular expression. It can be called directly as response.xpath().re(), passing the desired regular expression to .re().
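The same regex-extraction idea can be sketched with the standard library's re module; the markup snippet and pattern below are illustrative, not Scrapy's API:

```python
import re

# A fragment like the author markup on the sample page
snippet = '<small class="author">Albert Einstein</small>'
# Capture the text between the opening and closing tags
authors = re.findall(r'<small class="author">(.*?)</small>', snippet)
print(authors)  # ['Albert Einstein']
```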

3.3.3 Paginated Data Extraction
To collect the information from the whole website, pagination is needed. For example, to get every author name on the test site from the previous section, the example code is as follows:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 2:57 PM
# File         : 翻页提取数据.py
# IDE          : PyCharm

# Import the framework
import scrapy

class QuotesSpider(scrapy.Spider):
    # Define the spider name
    name = 'quotes_4'
    def start_requests(self):
        # Target URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL in the list
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)
    # Handle the response
    def parse(self, response):
        # Get every quote block (CSS equivalent: div.quote)
        for quote in response.xpath('.//*[@class="quote"]'):
            # Get the author
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            # Print the author name
            print(author)
        # Follow the next-page link
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

The program output is as follows:

# Author names from page 1
2022-02-17 15:00:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
# Below are some of the author names from page 10
2022-02-17 15:03:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
J.K. Rowling
Jimi Hendrix
J.M. Barrie
E.E. Cummings
Khaled Hosseini
Harper Lee
Madeleine L'Engle
Mark Twain
Dr. Seuss
George R.R. Martin
2022-02-17 15:03:54 [scrapy.core.engine] INFO: Closing spider (finished)
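response.follow accepts the relative href extracted from the "Next" link and resolves it against the current page URL, much as urllib.parse.urljoin does:

```python
from urllib.parse import urljoin

current = 'http://quotes.toscrape.com/page/1/'
href = '/page/2/'  # a typical value of li.next a::attr(href)
next_url = urljoin(current, href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```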

3.3.4 Creating Items
Crawling web data is the process of extracting structured data from unstructured sources. For example, the parse() method of the QuotesSpider class already obtains the text, author, and tags information; to package these pieces as structured data, Scrapy provides the Item class. An Item object is a simple container for the scraped data, with a dict-like API and a convenient syntax for declaring its available fields: an Item is declared with a simple class definition and Field objects. When the scrapyDemo project was created, an items.py file was generated automatically in the project directory for defining the Item class that stores the data; the class must inherit from scrapy.Item. Example code:

import scrapy

class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # Field for the quote text
    text = scrapy.Field()
    # Field for the author
    author = scrapy.Field()
    # Field for the tags
    tags = scrapy.Field()
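An Item behaves like a dictionary whose keys are restricted to the declared fields; assigning an undeclared field raises KeyError. A toy stand-in (plain Python, ToyItem is not real Scrapy) conveys that API:

```python
class ToyItem(dict):
    # Stand-in for scrapy.Item: only declared fields may be assigned
    fields = ('text', 'author', 'tags')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super().__setitem__(key, value)

item = ToyItem()
item['author'] = 'Albert Einstein'
print(item['author'])  # Albert Einstein
try:
    item['birthday'] = '1879'
except KeyError as err:
    print('rejected:', err)
```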

Example code:

#_*_coding:utf-8_*_
# Author       : liuxiaowei
# Created      : 2/17/22 3:22 PM
# File         : 包装结构化数据.py
# IDE          : PyCharm

import scrapy  # Import the framework
from scrapyDemo.items import ScrapydemoItem  # Import the ScrapydemoItem class


class QuotesSpider(scrapy.Spider):
    name = "quotes_5"  # Define the spider name

    def start_requests(self):
        # Target URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # Send one request per URL in the list
        for url in urls:
            # Send the request
            yield scrapy.Request(url=url, callback=self.parse)

    # Handle the response
    def parse(self, response):
        # Get every quote block
        for quote in response.xpath(".//*[@class='quote']"):
            # Get the quote text
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            # Get the author
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            # Get the tags
            tags = quote.xpath(".//*[@class='tag']/text()").extract()
            # Create an Item object
            item = ScrapydemoItem(text=text, author=author, tags=tags)
            yield item  # Output the item


The program output is as follows:

liuxiaowei@MacBookAir spiders % scrapy crawl  quotes_5
2022-02-17 15:30:04 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
2022-02-17 15:30:04 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-02-17 15:30:04 [scrapy.core.engine] INFO: Spider opened
2022-02-17 15:30:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-17 15:30:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
 'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
 'text': '“The world as we have created it is a process of our thinking. It '
         'cannot be changed without changing our thinking.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'J.K. Rowling',
 'tags': ['abilities', 'choices'],
 'text': '“It is our choices, Harry, that show what we truly are, far more '
         'than our abilities.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 'text': '“There are only two ways to live your life. One is as though nothing '
         'is a miracle. The other is as though everything is a miracle.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Jane Austen',
 'tags': ['aliteracy', 'books', 'classic', 'humor'],
 'text': '“The person, be it gentleman or lady, who has not pleasure in a good '
         'novel, must be intolerably stupid.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Marilyn Monroe',
 'tags': ['be-yourself', 'inspirational'],
 'text': "“Imperfection is beauty, madness is genius and it's better to be "
         'absolutely ridiculous than absolutely boring.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein',
 'tags': ['adulthood', 'success', 'value'],
 'text': '“Try not to become a man of success. Rather become a man of value.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'André Gide',
 'tags': ['life', 'love'],
 'text': '“It is better to be hated for what you are than to be loved for what '
         'you are not.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Thomas A. Edison',
 'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Eleanor Roosevelt',
 'tags': ['misattributed-eleanor-roosevelt'],
 'text': '“A woman is like a tea bag; you never know how strong it is until '
         "it's in hot water.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Steve Martin',
 'tags': ['humor', 'obvious', 'simile'],
 'text': '“A day without sunshine is like, you know, night.”'}
2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Marilyn Monroe',
 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters'],
 'text': "“This life is what you make it. No matter what, you're going to mess "
         "up sometimes, it's a universal truth. But the good part is you get "
         "to decide how you're going to mess it up. Girls will be your friends "
         "- they'll act like it anyway. But just remember, some come, some go. "
         "The ones that stay with you through everything - they're your true "
         "best friends. Don't let go of them. Also remember, sisters make the "
         "best friends in the world. As for lovers, well, they'll come and go "
         'too. And baby, I hate to say it, most of them - actually pretty much '
         "all of them are going to break your heart, but you can't give up "
         "because if you give up, you'll never find your soulmate. You'll "
         'never find that half who makes you whole and that goes for '
         "everything. Just because you fail once, doesn't mean you're gonna "
         'fail at everything. Keep trying, hold on, and always, always, always '
         "believe in yourself, because if you don't, then who will, sweetie? "
         'So keep your head high, keep your chin up, and most importantly, '
         "keep smiling, because life's a beautiful thing and there's so much "
         'to smile about.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'J.K. Rowling',
 'tags': ['courage', 'friends'],
 'text': '“It takes a great deal of bravery to stand up to our enemies, but '
         'just as much to stand up to our friends.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Albert Einstein',
 'tags': ['simplicity', 'understand'],
 'text': "“If you can't explain it to a six year old, you don't understand it "
         'yourself.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Bob Marley',
 'tags': ['love'],
 'text': '“You may not be her first, her last, or her only. She loved before '
         'she may love again. But if she loves you now, what else matters? '
         "She's not perfect—you aren't either, and the two of you may never be "
         'perfect together but if she can make you laugh, cause you to think '
         'twice, and admit to being human and making mistakes, hold onto her '
         'and give her the most you can. She may not be thinking about you '
         'every second of the day, but she will give you a part of her that '
         "she knows you can break—her heart. So don't hurt her, don't change "
         "her, don't analyze and don't expect more than she can give. Smile "
         'when she makes you happy, let her know when she makes you mad, and '
         "miss her when she's not there.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Dr. Seuss',
 'tags': ['fantasy'],
 'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a '
         'necessary ingredient in living.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Douglas Adams',
 'tags': ['life', 'navigation'],
 'text': '“I may not have gone where I intended to go, but I think I have '
         'ended up where I needed to be.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Elie Wiesel',
 'tags': ['activism',
          'apathy',
          'hate',
          'indifference',
          'inspirational',
          'love',
          'opposite',
          'philosophy'],
 'text': "“The opposite of love is not hate, it's indifference. The opposite "
         "of art is not ugliness, it's indifference. The opposite of faith is "
         "not heresy, it's indifference. And the opposite of life is not "
         "death, it's indifference.”"}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Friedrich Nietzsche',
 'tags': ['friendship',
          'lack-of-friendship',
          'lack-of-love',
          'love',
          'marriage',
          'unhappy-marriage'],
 'text': '“It is not a lack of love, but a lack of friendship that makes '
         'unhappy marriages.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Mark Twain',
 'tags': ['books', 'contentment', 'friends', 'friendship', 'life'],
 'text': '“Good friends, good books, and a sleepy conscience: this is the '
         'ideal life.”'}
2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': 'Allen Saunders',
 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
 'text': '“Life is what happens to us while we are making other plans.”'}
2022-02-17 15:30:05 [scrapy.core.engine] INFO: Closing spider (finished)