
How does the scrapy-redis component deduplicate tasks?

珍宝珠 2019-11-22 14:07:38
1 answer
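  • a. The dupefilter reads the project settings internally and connects to Redis.
    b. Deduplication is done with a Redis set. The key of that set is:
       key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
       Default setting:
          DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
    c. The dedup rule converts each request (not just its URL) into a unique fingerprint, then checks in Redis whether that fingerprint is already in the set.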
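       # Newer Scrapy releases deprecate this helper in favor of
       # scrapy.utils.request.fingerprint(); it is shown here as in the original answer.
       from scrapy.utils.request import request_fingerprint
       from scrapy.http import Request
       req = Request(url='http://www.cnblogs.com/wupeiqi.html')
       result = request_fingerprint(req)  # hex digest string identifying the request
       print(result)  # 8ea4fd67887449313ccc12e5b6b92510cc53675c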
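
To make step c more concrete, here is a minimal sketch of what "turning a request into a unique identifier" can look like. It is not Scrapy's actual implementation: `simple_fingerprint` is a hypothetical helper, and the canonicalization rules (sorted query parameters, dropped fragment) are simplifying assumptions. The idea it illustrates is that equivalent requests should hash to the same value.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hypothetical, simplified fingerprint: SHA-1 over method, canonical URL and body."""
    parts = urlsplit(url)
    # Sort query parameters so their order does not change the fingerprint.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(canonical.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


fp_a = simple_fingerprint("GET", "http://www.example.com/page?a=1&b=2")
fp_b = simple_fingerprint("GET", "http://www.example.com/page?b=2&a=1#top")
print(fp_a == fp_b)  # True: same request, different parameter order and fragment
print(len(fp_a))     # 40 hex characters, like the SHA-1 digest printed above
```

Because the fingerprint is a short fixed-length string, it is cheap to store as a member of a set, whether that set lives in process memory or in Redis.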
    
    Dedup rules in scrapy and scrapy-redis (source code)
    1. How is the dedup rule implemented in scrapy?
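    # Abridged excerpt of Scrapy's source; imports and the file-persistence logic are omitted.
    class RFPDupeFilter(BaseDupeFilter):
        """Request Fingerprint duplicates filter"""

        def __init__(self, path=None, debug=False):
            # Fingerprints seen so far, kept in process memory.
            self.fingerprints = set()

        @classmethod
        def from_settings(cls, settings):
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(job_dir(settings), debug)

        def request_seen(self, request):
            # Convert the request object into a unique fingerprint.
            fp = self.request_fingerprint(request)
            # If the fingerprint is already in the set, return True: the request was seen before.
            if fp in self.fingerprints:
                return True
            # Not seen before: record the fingerprint and report the request as new.
            self.fingerprints.add(fp)
            return False

        def request_fingerprint(self, request):
            return request_fingerprint(request)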
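
The in-memory approach above can be tried without Scrapy. The following is a minimal sketch, with a hypothetical `InMemoryDupeFilter` class and plain strings standing in for request fingerprints. It shows the same check-then-add logic on a Python `set`.

```python
class InMemoryDupeFilter:
    """Hypothetical stand-in for a set-based dupefilter (not Scrapy's class)."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, fingerprint: str) -> bool:
        # Already recorded: report it as a duplicate.
        if fingerprint in self.fingerprints:
            return True
        # First time we see it: remember it and report it as new.
        self.fingerprints.add(fingerprint)
        return False


dupefilter = InMemoryDupeFilter()
incoming = ["fp-1", "fp-2", "fp-1", "fp-3", "fp-2"]
scheduled = [fp for fp in incoming if not dupefilter.request_seen(fp)]
print(scheduled)  # ['fp-1', 'fp-2', 'fp-3']
```

The set lives inside one Python process, so two crawler processes would each keep their own record and could fetch the same page twice. That limitation is what the Redis-backed filter in the next part addresses.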
    
            
    2. How is the dedup rule implemented in scrapy-redis?
    class RFPDupeFilter(BaseDupeFilter):
        """Redis-based request duplicates filter.
    
        This class can also be used with default Scrapy's scheduler.
    
        """
    
        logger = logger
    
        def __init__(self, server, key, debug=False):
            
            # self.server is the Redis connection
            self.server = server
            # self.key is the Redis set key, e.g. dupefilter:123912873234
            self.key = key
            
    
        @classmethod
        def from_settings(cls, settings):
            
            # Read the settings and connect to Redis.
            server = get_redis_from_settings(settings)
    
            # e.g. key = dupefilter:123912873234
            key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(server, key=key, debug=debug)
    
        @classmethod
        def from_crawler(cls, crawler):
            
            return cls.from_settings(crawler.settings)
    
        def request_seen(self, request):
            
            fp = self.request_fingerprint(request)
            # SADD returns the number of values added, zero if already exists.
            # self.server is the Redis connection.
            # Add fp to the Redis set: 1 means it was newly added; 0 means it already existed.
            added = self.server.sadd(self.key, fp)
            return added == 0
    
        def request_fingerprint(self, request):
            
            return request_fingerprint(request)
    
        def close(self, reason=''):
            
            self.clear()
    
        def clear(self):
            """Clears fingerprints data."""
            self.server.delete(self.key)
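
The key point of the scrapy-redis version is that a single `SADD` call both records the fingerprint and reports whether it was new, and that every worker pointing at the same key shares one record. Below is a minimal sketch of that idea which runs without a Redis server. `FakeRedisSets` and `SharedDupeFilter` are hypothetical names, and the fake only imitates the return values of `sadd` and `delete`. With a real deployment the `server` object would be a Redis client instead.

```python
import time

# Default key template quoted in the answer above.
DUPEFILTER_KEY = "dupefilter:%(timestamp)s"


class FakeRedisSets:
    """Hypothetical in-memory stand-in that imitates only SADD and DEL."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        # Like Redis SADD: the number of members that were actually added.
        return added

    def delete(self, key):
        # Like Redis DEL: the number of keys that were removed.
        return 1 if self._sets.pop(key, None) is not None else 0


class SharedDupeFilter:
    """Hypothetical dupefilter following the sadd-based check shown above."""

    def __init__(self, server, key):
        self.server = server
        self.key = key

    def request_seen(self, fingerprint: str) -> bool:
        # 0 members added means the fingerprint was already in the set.
        return self.server.sadd(self.key, fingerprint) == 0

    def clear(self):
        self.server.delete(self.key)


server = FakeRedisSets()
key = DUPEFILTER_KEY % {"timestamp": int(time.time())}

# Two workers sharing the same server and the same key.
worker_a = SharedDupeFilter(server, key)
worker_b = SharedDupeFilter(server, key)

print(worker_a.request_seen("fp-1"))  # False: first time anyone sees it
print(worker_b.request_seen("fp-1"))  # True: worker_a already recorded it
worker_a.clear()
print(worker_b.request_seen("fp-1"))  # False: the shared record was deleted
```

Doing the membership test and the insert in one `sadd` call avoids a separate check-then-add round trip, so two workers cannot both conclude that the same fingerprint is new.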
    
    2019-11-22 14:07:47