Python爬虫:scrapy中间件及一些参数

本文涉及的产品
服务治理 MSE Sentinel/OpenSergo,Agent数量 不受限
注册配置 MSE Nacos/ZooKeeper,118元/月
Serverless 应用引擎免费试用套餐包,4320000 CU,有效期3个月
简介: Python爬虫:scrapy中间件及一些参数

scrapy中间件

from scrapy.settings import default_settings
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.downloadermiddlewares.httpauth import HttpAuthMiddleware
from scrapy.downloadermiddlewares.downloadtimeout import DownloadTimeoutMiddleware
from scrapy.downloadermiddlewares.defaultheaders import DefaultHeadersMiddleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.downloadermiddlewares.ajaxcrawl import AjaxCrawlMiddleware
from scrapy.downloadermiddlewares.redirect import MetaRefreshMiddleware
from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.downloadermiddlewares.stats import DownloaderStats
from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware

参数

# settings
USER_AGENT
ROBOTSTXT_OBEY {bool}
DOWNLOAD_TIMEOUT {float}
DEFAULT_REQUEST_HEADERS
RETRY_ENABLED {bool}
RETRY_TIMES {int}
RETRY_HTTP_CODES
RETRY_PRIORITY_ADJUST
AJAXCRAWL_ENABLED {bool}
AJAXCRAWL_MAXSIZE {int} 32768
REDIRECT_MAX_METAREFRESH_DELAY {int}
METAREFRESH_MAXDELAY {int}
COMPRESSION_ENABLED {bool}
COOKIES_ENABLED {bool}
COOKIES_DEBUG {bool}
HTTPPROXY_ENABLED
HTTPPROXY_AUTH_ENCODING
DOWNLOADER_STATS {bool}
HTTPCACHE_ENABLED {bool}
HTTPCACHE_POLICY
HTTPCACHE_STORAGE
HTTPCACHE_IGNORE_MISSING
CONCURRENT_REQUESTS 同时发出的请求
DOWNLOAD_DELAY 下载延时
HTTPERROR_ALLOWED_CODES
# request
dont_filter {bool}
# request.meta
dont_obey_robotstxt {bool}
download_timeout {float}
dont_retry {bool} False
retry_times {} 0
max_retry_times 
ajax_crawlable {bool}
dont_redirect {bool} False
dont_merge_cookies  {bool} False
cookiejar
proxy
dont_cache {bool} False
# request.headers
User-Agent
Cookie
Proxy-Authorization
# response.headers
Set-Cookie
# spider
http_user {str}
http_pass {str}
download_timeout {float}
user_agent
handle_httpstatus_list []


相关文章
|
29天前
|
数据采集 Web App开发 JavaScript
基于Selenium的Python爬虫抓取动态App图片
基于Selenium的Python爬虫抓取动态App图片
204 68
|
28天前
|
数据采集 存储 Web App开发
逆向音乐APP:Python爬虫获取音乐榜单
逆向音乐APP:Python爬虫获取音乐榜单
184 58
|
12天前
|
数据采集 Web App开发 JavaScript
无头浏览器技术:Python爬虫如何精准模拟搜索点击
无头浏览器技术:Python爬虫如何精准模拟搜索点击
|
14天前
|
数据采集 机器学习/深度学习 Web App开发
Python爬虫如何应对贝壳网的IP封禁与人机验证?
Python爬虫如何应对贝壳网的IP封禁与人机验证?
|
13天前
|
数据采集 Web App开发 JavaScript
Python爬虫解析动态网页:从渲染到数据提取
Python爬虫解析动态网页:从渲染到数据提取
|
21天前
|
数据采集 存储 前端开发
Python爬虫自动化:批量抓取网页中的A链接
Python爬虫自动化:批量抓取网页中的A链接
|
22天前
|
数据采集 存储 NoSQL
Python爬虫Cookie管理最佳实践:存储、清理与轮换
Python爬虫Cookie管理最佳实践:存储、清理与轮换
|
1月前
|
数据采集 前端开发 JavaScript
Python爬虫如何应对网站的反爬加密策略?
Python爬虫如何应对网站的反爬加密策略?
108 11
|
1月前
|
数据采集 Web App开发 前端开发
Python+Selenium爬虫:豆瓣登录反反爬策略解析
Python+Selenium爬虫:豆瓣登录反反爬策略解析
|
1月前
|
数据采集 Web App开发 JavaScript
Python爬虫如何获取JavaScript动态渲染后的网页内容?
Python爬虫如何获取JavaScript动态渲染后的网页内容?

推荐镜像

更多