After we have written a few spiders in a Scrapy project, how are they discovered, and how are they loaded? That is the job of the spider-loading API. In this post we walk through the spider-loading process and how to plug in a custom loader.
SpiderLoader API
This API is responsible for spider instantiation and is implemented by a single class, SpiderLoader.
class scrapy.spiderloader.SpiderLoader
This class locates and loads the spider classes defined in the project.
You can use a custom spider loader by specifying its path in the SPIDER_LOADER_CLASS project setting, but a custom loader must fully implement the scrapy.interfaces.ISpiderLoader interface to run without errors (a sketch of one appears at the end of this post).
The class provides the following methods:
from_settings(settings)
Scrapy uses this class method to create an instance of the loader. Using the current project settings, it recursively discovers and loads the spiders found in the modules listed in the SPIDER_MODULES setting; in a freshly generated project the settings file contains something like ['demo1.spiders'].
Parameters: settings (Settings instance) - the project settings
@classmethod
def from_settings(cls, settings):
    return cls(settings)
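For illustration, here is a minimal sketch of creating the loader by hand, assuming it runs inside a Scrapy project; normally Scrapy does this for you at startup:

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

# Read the project's settings (SPIDER_MODULES, etc.)
settings = get_project_settings()
# Equivalent to what Scrapy does internally at startup
loader = SpiderLoader.from_settings(settings)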
load(spider_name)
Returns the Spider class with the given name. It looks spider_name up among the previously loaded spider classes, and raises a KeyError if no such spider is found.
Parameters: spider_name (str) - the name of the spider
def load(self, spider_name):
    """
    Return the Spider class for the given spider name. If the spider
    name is not found, raise a KeyError.
    """
    try:
        return self._spiders[spider_name]
    except KeyError:
        raise KeyError("Spider not found: {}".format(spider_name))
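Continuing the sketch above, where 'demo1' is a hypothetical spider name:

try:
    # load() returns the spider class itself, not an instance
    spider_cls = loader.load('demo1')
except KeyError:
    print("No spider named 'demo1' in this project")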
list()
Returns the names of the spiders available in the project.
def list(self):
    """
    Return a list with the names of all spiders available in the project.
    """
    return list(self._spiders.keys())
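Continuing the same sketch:

# Enumerate every spider name discovered via SPIDER_MODULES
for name in loader.list():
    print(name)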
find_by_request(request)
Lists the names of the spiders that can handle the given request, by trying to match the request's URL against each spider's domains.
Parameters: request (Request instance) - the request to query
def find_by_request(self, request):
    """
    Return the list of spider names that can handle the given request.
    """
    return [name for name, cls in self._spiders.items()
            if cls.handles_request(request)]
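Continuing the sketch, with https://example.com as a placeholder URL:

from scrapy import Request

request = Request('https://example.com/some/page')
# Names of the spiders whose handles_request() matches this URL
print(loader.find_by_request(request))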
Full source:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from collections import defaultdict
import traceback
import warnings

from zope.interface import implementer

from scrapy.interfaces import ISpiderLoader
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


@implementer(ISpiderLoader)
class SpiderLoader(object):
    """
    SpiderLoader is a class which locates and loads spiders
    in a Scrapy project.
    """

    def __init__(self, settings):
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
        self._spiders = {}
        self._found = defaultdict(list)
        self._load_all_spiders()

    def _check_name_duplicates(self):
        dupes = ["\n".join("  {cls} named {name!r} (in {module})".format(
                               module=mod, cls=cls, name=name)
                           for (mod, cls) in locations)
                 for name, locations in self._found.items()
                 if len(locations) > 1]
        if dupes:
            msg = ("There are several spiders with the same name:\n\n"
                   "{}\n\n  This can cause unexpected behavior.".format(
                       "\n\n".join(dupes)))
            warnings.warn(msg, UserWarning)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._found[spcls.name].append((module.__name__, spcls.__name__))
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        for name in self.spider_modules:
            try:
                for module in walk_modules(name):
                    self._load_spiders(module)
            except ImportError as e:
                if self.warn_only:
                    msg = ("\n{tb}Could not load spiders from module '{modname}'. "
                           "See above traceback for details.".format(
                               modname=name, tb=traceback.format_exc()))
                    warnings.warn(msg, RuntimeWarning)
                else:
                    raise
        self._check_name_duplicates()

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    def load(self, spider_name):
        """
        Return the Spider class for the given spider name. If the spider
        name is not found, raise a KeyError.
        """
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

    def find_by_request(self, request):
        """
        Return the list of spider names that can handle the given request.
        """
        return [name for name, cls in self._spiders.items()
                if cls.handles_request(request)]

    def list(self):
        """
        Return a list with the names of all spiders available in the project.
        """
        return list(self._spiders.keys())
Configuring a custom loader class
Set SPIDER_LOADER_CLASS in the settings file.
Default: 'scrapy.spiderloader.SpiderLoader'. This is the class that will be used to load spiders, and it must implement the SpiderLoader API.
To recap, this API locates the spiders defined in a project and exposes operations on them: loading them all, looking one up by name, and finding which spiders can handle a given request. In most cases you don't need to write your own loader; the built-in implementation handles it internally.
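As a hedged illustration (not an official recipe), a custom loader can simply subclass the built-in SpiderLoader so the full ISpiderLoader interface stays implemented; the module path myproject.loaders and the logging behavior below are assumptions made for this example:

# myproject/loaders.py  (hypothetical module path, for illustration only)
import logging

from scrapy.spiderloader import SpiderLoader

logger = logging.getLogger(__name__)


class LoggingSpiderLoader(SpiderLoader):
    """Custom loader that logs every spider lookup.

    Subclassing SpiderLoader keeps from_settings(), load(), list()
    and find_by_request() intact, so the scrapy.interfaces.ISpiderLoader
    contract is still satisfied.
    """

    def load(self, spider_name):
        logger.debug("Loading spider %r", spider_name)
        return super(LoggingSpiderLoader, self).load(spider_name)

It is then enabled in settings.py:

# settings.py
SPIDER_LOADER_CLASS = 'myproject.loaders.LoggingSpiderLoader'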