cookies
The output above is missing `response.cookies`; if you add it, you get an error:

AttributeError: 'TextResponse' object has no attribute 'cookies'

So the response object carries no cookies attribute.
Let's test cookies against http://httpbin.org/cookies
```python
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/cookies']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, cookies={"username": "pengshiyu"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("request headers: %s" % response.request.headers)
        logging.debug("request cookies: %s" % response.request.cookies)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
```
The output is as follows:
```
response text: {
  "cookies": {
    "username": "pengshiyu"
  }
}
request headers: {
  b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
  b'Accept-Language': [b'en'],
  b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'],
  b'Accept-Encoding': [b'gzip,deflate'],
  b'Cookie': [b'username=pengshiyu']
}
request cookies: {'username': 'pengshiyu'}
```
The server received my cookie value, and the request headers carry the same cookie, stored under the `Cookie` key.

In other words, there is no separate cookie channel: the request's `cookies` are packed into the `headers` that are sent to the server.

In that case, let's try putting `Cookie` into the headers directly:
```python
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, headers={"Cookie": {"username": "pengshiyu"}})
```
The result:
```
response text: {"cookies": {}}
request headers: {
  b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
  b'Accept-Language': [b'en'],
  b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'],
  b'Accept-Encoding': [b'gzip,deflate']
}
request cookies: {}
```
This time `cookies` is empty: setting the Cookie header directly failed (the middleware source below shows why: `process_request` pops any existing `Cookie` header and rebuilds it from its cookie jar).

Let's look up the cookie middleware in `default_settings`:
```python
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700
```
```python
class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']

        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)
```
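The string-building in `_format_cookie` is simple enough to check in isolation. Here is a standalone copy of that logic (renamed `format_cookie`, outside the class, purely for illustration):

```python
# Standalone copy of the _format_cookie logic above, for illustration only.
def format_cookie(cookie):
    # build "name=value", plus optional Path/Domain attributes
    cookie_str = '%s=%s' % (cookie['name'], cookie['value'])
    if cookie.get('path'):
        cookie_str += '; Path=%s' % cookie['path']
    if cookie.get('domain'):
        cookie_str += '; Domain=%s' % cookie['domain']
    return cookie_str

print(format_cookie({'name': 'username', 'value': 'pengshiyu', 'path': '/'}))
# username=pengshiyu; Path=/
```

Note that `_get_request_cookies` feeds these strings into a fake `Response` as `Set-Cookie` headers, so the cookies you pass to `Request` are parsed as if the server had sent them.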
Reading the source, a few methods stand out:
```python
# process_request
jar.add_cookie_header(request)          # write cookies into the request headers
# process_response
jar.extract_cookies(response, request)  # pull cookies out of the response
# _debug_cookie
request.headers.getlist('Cookie')       # read Cookie from the request headers
# _debug_set_cookie
response.headers.getlist('Set-Cookie')  # read Set-Cookie from the response headers
```
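Those two jar calls come from the `CookieJar` interface that Scrapy's own `CookieJar` wraps, so the receive-then-resend cycle can be sketched with the standard library alone. A minimal sketch, assuming a `FakeResponse` shim that provides the one method `extract_cookies()` actually reads (the URLs and cookie values are illustrative):

```python
from http.cookiejar import CookieJar
from urllib.request import Request
import email.message


class FakeResponse:
    """Just enough of an HTTP response for CookieJar.extract_cookies()."""
    def __init__(self, set_cookie):
        msg = email.message.Message()
        msg['Set-Cookie'] = set_cookie
        self._msg = msg

    def info(self):
        return self._msg


jar = CookieJar()
req = Request('http://httpbin.org/cookies/set/key/value')

# like process_response: merge the response's Set-Cookie into the jar
jar.extract_cookies(FakeResponse('key=value; Path=/'), req)

# like process_request: write the jar into the next request's headers
req2 = Request('http://httpbin.org/cookies')
jar.add_cookie_header(req2)
print(req2.get_header('Cookie'))  # -> key=value
```

This is the whole trick of the middleware: one jar per `cookiejar` meta key, `extract_cookies` on the way in, `add_cookie_header` on the way out.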
And a few relevant parameters:
```
# settings
COOKIES_ENABLED
COOKIES_DEBUG

# meta
dont_merge_cookies
cookiejar

# headers
Cookie
Set-Cookie
```
Using the code from the cookie section at the beginning (with the other header fields removed for clarity), let's test these one by one.
1. COOKIES_ENABLED
COOKIES_ENABLED = True (the default):

```
response text: {"cookies": {"username": "pengshiyu"}}
request headers: {b'Cookie': [b'username=pengshiyu']}
request cookies: {'username': 'pengshiyu'}
```
Everything works.
COOKIES_ENABLED = False
```
response text: {"cookies": {}}
request headers: {}
request cookies: {'username': 'pengshiyu'}
```
Although `request.cookies` still has content, nothing was written into the headers, so the server received no cookie.

Note: to see the cookies actually sent with a request, look at the request's headers.
2. COOKIES_DEBUG
COOKIES_DEBUG = False (the default):
DEBUG: Crawled (200) <GET http://httpbin.org/cookies> (referer: None)
COOKIES_DEBUG = True
One extra line is printed, showing the cookie I set:
```
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: username=pengshiyu
```
And of course, in debug mode the server still receives my cookie normally.
3. dont_merge_cookies
Set meta={"dont_merge_cookies": True} (the default is False):
```
response text: {"cookies": {}}
request headers: {}
request cookies: {'username': 'pengshiyu'}
```
The server did not receive my cookie.
4. cookiejar
Reading it directly with response.request.meta.get("cookiejar"):
```
response text: {"cookies":{"username":"pengshiyu"}}
request headers: {b'Cookie': [b'username=pengshiyu']}
request cookies: {'username': 'pengshiyu'}
request cookiejar: None
```
Nothing there: the `cookiejar` key only exists if you set it yourself in `meta` (for example `meta={'cookiejar': 1}`), which is how you keep separate cookie sessions in one spider.
5. Cookie
Reading it directly with response.request.headers.get("Cookie"):
headers Cookie: b'username=pengshiyu'
So here the cookie has already been serialized into a byte string.
Change the Request's cookies parameter:

cookies={"username": "pengshiyu", "password": "123456"}
```
# response.request.headers.get("Cookie")
headers Cookie: b'username=pengshiyu; password=123456'
# request.headers.getlist('Cookie')
headers Cookies: [b'username=pengshiyu; password=123456']
```
Clearly the two accessors differ: `get` returns a byte string, while `getlist` returns a list.
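Since the header value is a raw byte string, a small helper can decode it back into a dict for inspection. `parse_cookie_header` below is hypothetical, not part of Scrapy:

```python
# Hypothetical helper: decode a raw Cookie header byte string
# (as stored in request.headers) back into a plain dict.
def parse_cookie_header(raw):
    text = raw.decode('utf-8')
    pairs = (item.split('=', 1) for item in text.split('; ') if item)
    return dict(pairs)

print(parse_cookie_header(b'username=pengshiyu; password=123456'))
# {'username': 'pengshiyu', 'password': '123456'}
```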
6. Set-Cookie
Similarly, I tried both of these:

response.headers.get("Set-Cookie")
response.headers.getlist("Set-Cookie")
Still nothing:
```
headers Set-Cookie: None
headers Set-Cookies: []
```
Still, the cookie state observed so far looks like this:
```
request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'username=pengshiyu; password=123456'
response text: {"cookies":{"password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
response Set-Cookies: []
```
7. Receiving cookies from the server
Change the request URL to http://httpbin.org/cookies/set/key/value and enable COOKIES_DEBUG. The debug log shows the following changes:
```
Sending cookies to: <GET http://httpbin.org/cookies/set/key/value>
Cookie: username=pengshiyu; password=123456

Received cookies from: <302 http://httpbin.org/cookies/set/key/value>
Set-Cookie: key=value; Path=/

Redirecting (302) to <GET http://httpbin.org/cookies> from <GET http://httpbin.org/cookies/set/key/value>

Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: key=value; username=pengshiyu; password=123456
```
The log shows two requests were made; watch the cookies change in between:

send -> receive -> send

The cookies sent on the second request include the cookie the server set during the first one, which shows that Scrapy manages both server-side and client-side cookies.
The final cookie output:
```
request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'key=value; username=pengshiyu; password=123456'
response text: {"cookies":{"key":"value","password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
```
request.cookies did not change, while request.headers.get("Cookie") did.
8. Receiving a server cookie with the same key
Change the request URL to httpbin.org/cookies/set/username/pengpeng
```
Sending cookies to: <GET http://httpbin.org/cookies/set/username/pengpeng>
Cookie: username=pengshiyu

Received cookies from: <302 http://httpbin.org/cookies/set/username/pengpeng>
Set-Cookie: username=pengpeng; Path=/

Redirecting (302) to <GET http://httpbin.org/cookies> from <GET http://httpbin.org/cookies/set/username/pengpeng>

Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: username=pengshiyu
```
Although username=pengpeng was received, the second request sent the original cookie username=pengshiyu again.

This shows that cookies set on the client side take priority over cookies sent by the server.
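A toy model of why this happens, using plain dicts (purely illustrative, not Scrapy code): `process_request` re-applies the Request's own cookies to the jar on every request, including the redirected one, so for an equal key the client value always lands last:

```python
# Toy model of the merge order: the Request's own cookies are
# re-applied to the jar on every request, so they win for equal keys.
jar = {}
request_cookies = {'username': 'pengshiyu'}

jar.update(request_cookies)           # first request: client cookie into the jar
jar.update({'username': 'pengpeng'})  # 302 response: server Set-Cookie merged in
jar.update(request_cookies)           # redirected request: client cookie re-applied
print(jar)  # {'username': 'pengshiyu'}
```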
9. Disabling the CookiesMiddleware
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
}
```
Request URL: http://httpbin.org/cookies
```
request cookies: {'username': 'pengshiyu'}
request cookiejar: None
request Cookie: None
response text: {"cookies":{}}
response Set-Cookie: None
response Set-Cookies: []
```
The effect is similar to COOKIES_ENABLED = False.
10. A custom cookie pool
```python
import random


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []  # fill with your own cookie dicts; empty, random.choice raises
        cookie = random.choice(cookies)
        request.cookies = cookie
```
It likewise needs to be registered in the settings:
```python
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.RandomCookiesMiddleware': 600,
}
```
Note that Scrapy's built-in CookiesMiddleware runs at priority 700. For our cookies to take effect, the custom middleware must set them before that middleware runs; process_request is executed in ascending priority order, so our custom cookie middleware needs a priority lower than 700:
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
Summary
The commonly used middlewares are as follows:
```python
import random
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        proxies = []  # fill with proxy URLs
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []  # fill with cookie dicts
        cookie = random.choice(cookies)
        request.cookies = cookie
```
Of course, the cookies and proxies lists need to be filled in to suit your own situation.
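They also all have to be registered together. A sketch of the settings wiring (the module path `myscrapy.middlewares` and the exact priority values are assumptions, not fixed requirements):

```python
# Hypothetical settings.py wiring; module path and priorities are assumptions.
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.RandomUserAgentMiddleware': 400,
    'myscrapy.middlewares.RandomProxyMiddleware': 500,
    # must stay below 700 so the built-in CookiesMiddleware
    # still picks up request.cookies afterwards
    'myscrapy.middlewares.RandomCookiesMiddleware': 600,
}
```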