1. Create the Scrapy project zhongjj, enter the zhongjj project directory, and generate the spider file zhongjjpc
scrapy startproject zhongjj
cd zhongjj
scrapy genspider zhongjjpc www.xxx.com
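These commands should leave you with roughly the following layout (a sketch; the exact set of generated files can vary slightly across Scrapy versions):

zhongjj/
    scrapy.cfg
    zhongjj/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            zhongjjpc.py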
2. Modify the configuration file (settings.py)
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
3. Add three target URLs to the spider; the last one is deliberately invalid
start_urls = ["https://www.baidu.com/","https://www.sina.com.cn/","https://wwwwww.sohu.com/"]
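For context, the spider file zhongjjpc.py then looks roughly like this (a sketch: the class name follows what genspider typically generates, and the empty parse body is an assumption, since this demo only exercises the downloader middleware):

import scrapy

class ZhongjjpcSpider(scrapy.Spider):
    name = "zhongjjpc"
    # allowed_domains is left out for this multi-site demo
    start_urls = ["https://www.baidu.com/", "https://www.sina.com.cn/", "https://wwwwww.sohu.com/"]

    def parse(self, response):
        # Placeholder: responses are only observed via the middleware prints
        pass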
4. Modify the middleware file (middlewares.py)
1) Delete the spider middleware class ZhongjjSpiderMiddleware
2) In ZhongjjDownloaderMiddleware, modify the methods that intercept requests, responses, and exceptions
class ZhongjjDownloaderMiddleware:
    def process_request(self, request, spider):
        print(request.url + " -- I am the request")
        return None

    def process_response(self, request, response, spider):
        print(request.url + " -- I am the response")
        return response

    def process_exception(self, request, exception, spider):
        print(request.url + " -- I am the exception info")
        pass
3) Enable the middleware in the settings file
DOWNLOADER_MIDDLEWARES = {
    "zhongjj.middlewares.ZhongjjDownloaderMiddleware": 543,
}
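The value 543 is the middleware's priority: lower numbers sit closer to the engine, so their process_request runs earlier and their process_response runs later in the chain.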
5. Run result: all three methods get invoked. The two reachable URLs go through process_request and process_response, while the invalid URL fails to download and lands in process_exception.
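To reproduce this, run the spider from the project root; with LOG_LEVEL = 'ERROR', the console output is mostly our own prints. Roughly (a sketch, since request order interleaves and Scrapy's retry middleware may repeat the failed request a few times):

scrapy crawl zhongjjpc

# For https://www.baidu.com/ and https://www.sina.com.cn/:
#   "... -- I am the request" followed by "... -- I am the response"
# For https://wwwwww.sohu.com/:
#   "... -- I am the request" followed by "... -- I am the exception info"
#   (possibly repeated due to automatic retries)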
6. Developing middleware
1) Proxy middleware
request.meta['proxy'] = 'https://ip:port'
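A fuller sketch of such a middleware, assuming a hypothetical hard-coded pool (PROXY_POOL and its addresses are illustrative placeholders, not working proxies):

import random

class ProxyDownloaderMiddleware:
    # Hypothetical pool; replace with real proxy addresses
    PROXY_POOL = ['https://1.2.3.4:8888', 'https://5.6.7.8:8888']

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = random.choice(self.PROXY_POOL)
        return None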
2) User-Agent middleware
request.headers['User-Agent'] = 'Mozilla/5.0 (Windows ......'
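Rotating the header per request looks roughly like this (the UA_POOL strings are abbreviated examples; in practice use full, current User-Agent strings):

import random

class RandomUADownloaderMiddleware:
    # Abbreviated example UA strings; extend with real ones
    UA_POOL = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for each request
        request.headers['User-Agent'] = random.choice(self.UA_POOL)
        return None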
3) Cookies middleware
request.headers['Cookie'] = 'xxx'
Alternatively, use the request's cookies attribute; note that it expects a dict (or list of dicts), not a plain string:
request.cookies = {'name': 'xxx'}
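Wrapped into a middleware, it looks roughly as follows (the cookie name and value are placeholders; also note that for request.cookies to take effect, this middleware must run before Scrapy's built-in CookiesMiddleware, which has a default priority of 700):

class CookieDownloaderMiddleware:
    def process_request(self, request, spider):
        # Option 1: set the raw Cookie header directly
        request.headers['Cookie'] = 'name=xxx'
        # Option 2: hand Scrapy's cookie machinery a dict instead
        # request.cookies = {'name': 'xxx'}
        return None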