Python爬虫:scrapy框架请求参数meta、headers、cookies一探究竟(2)

简介: Python爬虫:scrapy框架请求参数meta、headers、cookies一探究竟(2)

cookies

上面的信息中少了个response.cookies,如果添加上回报错:


AttributeError: 'TextResponse' object has no attribute 'cookies'

说明响应是不带cookies参数的


通过 http://httpbin.org/cookies 测试cookies


# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging
class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = [
        'http://httpbin.org/cookies'
    ]
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, cookies={"username": "pengshiyu"})
    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("request headers: %s" % response.request.headers)
        logging.debug("request cookies: %s" % response.request.cookies)
if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())

返回值如下:


response text: 
{
    "cookies":
        {
        "username":"pengshiyu"
        }
}
request headers: 
{
    b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], 
    b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'], 
    b'Accept-Encoding': [b'gzip,deflate'], 
    b'Cookie': [b'username=pengshiyu']
}
request cookies: 
{
    'username': 'pengshiyu'
}

服务器端已经接收到我的cookie值了,不过request的headers也包含了相同的cookie,保存到了键为Cookie下面


其实并没有什么cookie,浏览器请求的·cookies·被包装到了·headers·中发送给服务器端

既然这样,在headers中包含Cookie试试


def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers={"Cookie": {"username": "pengshiyu"}})

返回结果


response text: 
{
    "cookies":{}
}
request headers: 
{
    b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], 
    b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'], 
    b'Accept-Encoding': [b'gzip,deflate']
}
request cookies: 
{}

cookies 是空的,设置失败


我们找到 default_settings 中的cookie中间件


'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700
class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""
    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug
    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))
    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)
        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)
    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response
        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)
        return response
    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})
    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})
    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])
        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']
        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']
        return cookie_str
    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in \
                    six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies
        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)
        return jar.make_cookies(response, request)

观察源码,发现以下几个方法


# process_request
jar.add_cookie_header(request)   # 添加cookie到headers
# process_response
jar.extract_cookies(response, request)  # 提取出cookie
# _debug_cookie 
request.headers.getlist('Cookie')  # 从headers获取cookie
# _debug_set_cookie
response.headers.getlist('Set-Cookie')  # 从headers获取Set-Cookie

几个参数:


# settings
COOKIES_ENABLED
COOKIES_DEBUG
# meta
dont_merge_cookies
cookiejar
# headers
Cookie
Set-Cookie

image.png

使用最开始cookie部分的代码,为了看的清晰,我删除了headers中其他参数,下面逐个做测试


1、COOKIES_ENABLED


COOKIES_ENABLED = True (默认)
response text: 
{
    "cookies":{"username":"pengshiyu"}
}
request headers: 
{
    b'Cookie': [b'username=pengshiyu']
}
request cookies: 
{
    'username': 'pengshiyu'
}

一切ok

COOKIES_ENABLED = False


response text: 
{
    "cookies":{}
}
request headers: 
{}
request cookies: 
{
    'username': 'pengshiyu'
}

虽然request的cookies有内容,不过headers没有加进去,所以服务器端没有获取到cookie


注意:查看请求的真正cookie,应该在request的header中查看


2、COOKIES_DEBUG

COOKIES_DEBUG = False (默认)


DEBUG: Crawled (200) http://httpbin.org/cookies> (referer: None)

COOKIES_DEBUG = True

多输出了下面一句,可以看到我设置的cookie


[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: http://httpbin.org/cookies>
Cookie: username=pengshiyu

当然,debug模式下服务器肯定能正常接收我的cookie


3、dont_merge_cookies

设置meta={"dont_merge_cookies": True} 默认为 False


response text: 
{
    "cookies":{}
}
request headers: 
{}
request cookies: 
{
    'username': 'pengshiyu'
}

服务器并没有接收到我的cookie


4、cookiejar

直接通过response.request.meta.get("cookiejar")获取



response text: 
{"cookies":{"username":"pengshiyu"}}
request headers: 
{b'Cookie': [b'username=pengshiyu']}
request cookies: 
{'username': 'pengshiyu'}
request cookiejar: 
None

啥也没有


5、Cookie

直接获取:response.request.headers.get("Cookie"))

headers Cookie: 
b'username=pengshiyu'

看来这里已经被处理成字节串了


修改Request请求参数

cookies={"username": "pengshiyu", "password": "123456"}


# response.request.headers.get("Cookie"))
headers Cookie: 
b'username=pengshiyu; password=123456'
# request.headers.getlist('Cookie')
headers Cookies: 
[b'username=pengshiyu; password=123456']

很明显,两个获取方式,一个获取的是字符串,一个获取的是列表


6、Set-Cookie


同样,我通过以下


response.headers.get("Set-Cookie")
response.headers.getlist("Set-Cookie")

还是啥都没有


headers Set-Cookie: None
headers Set-Cookies: []

不过,到目前为止,cookie设置的大概流程应该如下:

image.png


image.png

request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'username=pengshiyu; password=123456'
response text: {"cookies":{"password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
response Set-Cookies: []

7、接收服务器传递过来的cookie


将请求链接改为 :’http://httpbin.org/cookies/set/key/value

开启 COOKIES_DEBUG

在debug中看到如下变化


Sending cookies to: http://httpbin.org/cookies/set/key/value>
Cookie: username=pengshiyu; password=123456
Received cookies from: <302 http://httpbin.org/cookies/set/key/value>
Set-Cookie: key=value; Path=/
Redirecting (302) to http://httpbin.org/cookies> from http://httpbin.org/cookies/set/key/value>
Sending cookies to: http://httpbin.org/cookies>
Cookie: key=value; username=pengshiyu; password=123456

日志看出他进行了两次请求,看到中间的cookie变化:


发送 -> 接收 -> 发送


第二次发送的cookie包含了第一次请求时服务器端传递过来的cookie,说明scrapy对服务器端和客户端的cookie进行了管理


最后的cookie输出


request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'key=value; username=pengshiyu; password=123456'
response text: {"cookies":{"key":"value","password":"123456","username":"pengshiyu"}}
response Set-Cookie: None

request的cookies并没有变化,而request.headers.get(“Cookie”)已经发生了变化


8、接收服务器传递过来的 同key键cookie

将请求链接换为:httpbin.org/cookies/set/username/pengpeng


Sending cookies to: http://httpbin.org/cookies/set/username/pengpeng>
Cookie: username=pengshiyu
Received cookies from: <302 http://httpbin.org/cookies/set/username/pengpeng>
Set-Cookie: username=pengpeng; Path=/
Redirecting (302) to http://httpbin.org/cookies> from http://httpbin.org/cookies/set/username/pengpeng>
Sending cookies to: http://httpbin.org/cookies>
Cookie: username=pengshiyu

发现虽然收到了username=pengpeng但是,第二次发请求的时候,又发送了原来的的cookieusername=pengshiyu


这说明客户端设置的cookie优先级高于服务器端传递过来的cookie


9、取消使用中间件CookiesMiddleware


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None
}

请求链接:http://httpbin.org/cookies


request cookies: {'username': 'pengshiyu'}
request cookiejar: None
request Cookie: None
response text: {"cookies":{}}
response Set-Cookie: None
response Set-Cookies: []

这个效果类似COOKIES_ENABLED = False


10、自定义cookie池


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []
        cookie = random.choice(cookies)
        request.cookies = cookie

同样需要设置


DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.RandomCookiesMiddleware': 600
}

注意到scrapy的中间件CookiesMiddleware值是700,为了cookie设置生效,需要在这个中间件启用之前就设置好自定义的cookie,优先级按照从小到大的顺序执行,所以我们自己自定义的cookie中间件需要小于 < 700


'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,

总结


image.png

image.png

常用的中间件如下


import random
from fake_useragent import UserAgent
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)
class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        proxies = []
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy
class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []
        cookie = random.choice(cookies)
        request.cookies = cookie

当然,cookies 和 proxies 需要结合自己的情况补全


相关文章
|
1月前
|
数据采集 存储 JSON
Python网络爬虫:Scrapy框架的实战应用与技巧分享
【10月更文挑战第27天】本文介绍了Python网络爬虫Scrapy框架的实战应用与技巧。首先讲解了如何创建Scrapy项目、定义爬虫、处理JSON响应、设置User-Agent和代理,以及存储爬取的数据。通过具体示例,帮助读者掌握Scrapy的核心功能和使用方法,提升数据采集效率。
81 6
|
2月前
|
数据采集 中间件 开发者
Scrapy爬虫框架-自定义中间件
Scrapy爬虫框架-自定义中间件
55 1
|
2月前
|
数据采集 中间件 Python
Scrapy爬虫框架-通过Cookies模拟自动登录
Scrapy爬虫框架-通过Cookies模拟自动登录
102 0
|
1月前
|
数据采集 前端开发 中间件
Python网络爬虫:Scrapy框架的实战应用与技巧分享
【10月更文挑战第26天】Python是一种强大的编程语言,在数据抓取和网络爬虫领域应用广泛。Scrapy作为高效灵活的爬虫框架,为开发者提供了强大的工具集。本文通过实战案例,详细解析Scrapy框架的应用与技巧,并附上示例代码。文章介绍了Scrapy的基本概念、创建项目、编写简单爬虫、高级特性和技巧等内容。
60 4
|
1月前
|
数据采集 中间件 API
在Scrapy爬虫中应用Crawlera进行反爬虫策略
在Scrapy爬虫中应用Crawlera进行反爬虫策略
|
2月前
|
消息中间件 数据采集 数据库
小说爬虫-03 爬取章节的详细内容并保存 将章节URL推送至RabbitMQ Scrapy消费MQ 对数据进行爬取后写入SQLite
小说爬虫-03 爬取章节的详细内容并保存 将章节URL推送至RabbitMQ Scrapy消费MQ 对数据进行爬取后写入SQLite
28 1
|
2月前
|
消息中间件 数据采集 数据库
小说爬虫-02 爬取小说详细内容和章节列表 推送至RabbitMQ 消费ACK确认 Scrapy爬取 SQLite
小说爬虫-02 爬取小说详细内容和章节列表 推送至RabbitMQ 消费ACK确认 Scrapy爬取 SQLite
22 1
|
2月前
|
数据采集 JavaScript 前端开发
使用Panther进行爬虫时,如何优雅地处理登录和Cookies?
使用Panther进行爬虫时,如何优雅地处理登录和Cookies?
|
2月前
|
数据采集 SQL 数据库
小说爬虫-01爬取总排行榜 分页翻页 Scrapy SQLite SQL 简单上手!
小说爬虫-01爬取总排行榜 分页翻页 Scrapy SQLite SQL 简单上手!
85 0
|
4月前
|
数据采集 存储 中间件
Python进行网络爬虫:Scrapy框架的实践
【8月更文挑战第17天】网络爬虫是自动化程序,用于从互联网收集信息。Python凭借其丰富的库和框架成为构建爬虫的首选语言。Scrapy作为一款流行的开源框架,简化了爬虫开发过程。本文介绍如何使用Python和Scrapy构建简单爬虫:首先安装Scrapy,接着创建新项目并定义爬虫,指定起始URL和解析逻辑。运行爬虫可将数据保存为JSON文件或存储到数据库。此外,Scrapy支持高级功能如中间件定制、分布式爬取、动态页面渲染等。在实践中需遵循最佳规范,如尊重robots.txt协议、合理设置爬取速度等。通过本文,读者将掌握Scrapy基础并了解如何高效地进行网络数据采集。
222 6