Scrapy's request parameters come up all the time, but I had never looked into them closely.
Today I'll explore the three important parameters carried by a Scrapy request: headers, cookies, and meta.
Default parameters
First, create a myscrapy project and a my_spider spider inside it.
We test the request parameters against http://httpbin.org/get, which echoes back what it receives.
Run the spider:
# -*- coding: utf-8 -*-
from scrapy import Spider


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = [
        'http://httpbin.org/get'
    ]

    def parse(self, response):
        self.write_to_file("*" * 40)
        self.write_to_file("response text: %s" % response.text)
        self.write_to_file("response headers: %s" % response.headers)
        self.write_to_file("response meta: %s" % response.meta)
        self.write_to_file("request headers: %s" % response.request.headers)
        self.write_to_file("request cookies: %s" % response.request.cookies)
        self.write_to_file("request meta: %s" % response.request.meta)

    def write_to_file(self, words):
        # Append each entry on its own line so the log stays readable
        with open("logging.log", "a") as f:
            f.write(words + "\n")


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
The information saved to the file:
response text: {
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"
  },
  "origin": "223.72.90.254",
  "url": "http://httpbin.org/get"
}

response headers: {
  b'Server': [b'gunicorn/19.8.1'],
  b'Date': [b'Sun, 22 Jul 2018 10:03:15 GMT'],
  b'Content-Type': [b'application/json'],
  b'Access-Control-Allow-Origin': [b'*'],
  b'Access-Control-Allow-Credentials': [b'true'],
  b'Via': [b'1.1 vegur']
}

response meta: {
  'download_timeout': 180.0,
  'download_slot': 'httpbin.org',
  'download_latency': 0.5500118732452393
}

request headers: {
  b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
  b'Accept-Language': [b'en'],
  b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'],
  b'Accept-Encoding': [b'gzip,deflate']
}

request cookies: {}

request meta: {
  'download_timeout': 180.0,
  'download_slot': 'httpbin.org',
  'download_latency': 0.5500118732452393
}
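Note that the b'' prefixes mean request.headers and response.headers are Scrapy Headers objects whose keys and values are bytes. A single value can be read with get(), which accepts a plain str key; a small sketch:

# Headers.get() returns the raw bytes value, or None if the header is absent
user_agent = response.request.headers.get('User-Agent')
print(user_agent.decode())  # e.g. Scrapy/1.5.1 (+https://scrapy.org)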
meta
Comparing the output above, the request's meta and the response's meta are identical: meta's job is to carry information along with a request and hand it over to the corresponding response.
Modify the code to test this passing behavior:
# -*- coding: utf-8 -*-
from scrapy import Spider, Request


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = [
        'http://httpbin.org/get'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={"uid": "this is uid of meta"})

    def parse(self, response):
        print("request meta: %s" % response.request.meta.get("uid"))
        print("response meta: %s" % response.meta.get("uid"))
The output:

request meta: this is uid of meta
response meta: this is uid of meta
Both ways of reading the request's meta work. meta behaves like a dict, so values are fetched by key exactly as with a dictionary.
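In practice, meta is most often used to carry data across chained requests, for example from a listing page into the callback that parses a detail page. A minimal sketch inside a spider (the URL and the carried field are made up for illustration):

def parse(self, response):
    # Pretend a title was scraped on the listing page; carry it along
    yield Request(
        "http://httpbin.org/get",
        meta={"title": "title scraped from the listing page"},
        callback=self.parse_detail,
    )

def parse_detail(self, response):
    # The value attached to the request comes back on the response
    title = response.meta.get("title")
    self.logger.info("carried over: %s" % title)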
Proxy settings also go through meta. Here is an example proxy middleware:
import random

# Placeholder proxy pool; replace with real, working proxies
proxies = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware picks the proxy up from request.meta
        request.meta["proxy"] = random.choice(proxies)
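For the middleware to take effect it also has to be enabled in settings.py; a sketch, assuming it lives in the myscrapy project's middlewares.py (the priority 543 is just a conventional choice):

DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.ProxyMiddleware': 543,
}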
headers
Open Scrapy's default_settings file, which can be located from this import path:
from scrapy.settings import default_settings
Inside, the defaults are defined like this:
USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
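These two settings can also be overridden project-wide in settings.py instead of per request; a sketch with arbitrary example values:

USER_AGENT = 'Chrome'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}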
Now change the request headers and check what the server reports back:
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = [
        'http://httpbin.org/get',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers={"User-Agent": "Chrome"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("response headers: %s" % response.headers)
        logging.debug("request headers: %s" % response.request.headers)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
The output:
response text: {
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Chrome"
  },
  "origin": "122.71.64.121",
  "url": "http://httpbin.org/get"
}

response headers: {
  b'Server': [b'gunicorn/19.8.1'],
  b'Date': [b'Sun, 22 Jul 2018 10:29:26 GMT'],
  b'Content-Type': [b'application/json'],
  b'Access-Control-Allow-Origin': [b'*'],
  b'Access-Control-Allow-Credentials': [b'true'],
  b'Via': [b'1.1 vegur']
}

request headers: {
  b'User-Agent': [b'Chrome'],
  b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
  b'Accept-Language': [b'en'],
  b'Accept-Encoding': [b'gzip,deflate']
}
Both the request headers and the headers the server received and echoed back now show User-Agent: Chrome, which confirms the default User-Agent has been overridden.
default_settings also enables the UserAgentMiddleware middleware by default:
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
Its source:
class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
Reading the source carefully, it does nothing more than read and set the User-Agent, so we can model our own middleware on it.
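One detail worth noting in spider_opened above: because of the getattr call, a spider can override the UA simply by defining a user_agent attribute, with no custom middleware at all. A minimal sketch:

class MySpider(Spider):
    name = 'my_spider'
    # Picked up by UserAgentMiddleware when the spider opens
    user_agent = 'Chrome'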
Here we use the fake_useragent library to generate a random User-Agent; for details see:
https://blog.csdn.net/mouday/article/details/80476409
In middlewares.py, write our own middleware:
from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def __init__(self):
        # Build the UserAgent database once instead of on every request
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Each access to ua.chrome returns a random Chrome User-Agent
        request.headers.setdefault(b'User-Agent', self.ua.chrome)
In settings.py, replace the default middleware with our own:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myscrapy.middlewares.UserAgentMiddleware': 500,
}
The output:
request headers: {
  b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
  b'Accept-Language': [b'en'],
  b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'],
  b'Accept-Encoding': [b'gzip,deflate']
}
For more on setting Scrapy request headers, see my earlier article: