前文介绍过 历史对比法 : 对比代码的需求变化和版本历史,从而学习需求如何被实现。今天我们一起从 requests 源码开始,使用 历史对比法,深入其实现细节,考古一下远古爬虫的实现。本文主要包括下面几个部分:
- 如何使用vs-code进行历史对比
- 如何使用pycharm-ce进行历史对比
- v0.2.1 文件上传支持
- v0.2.2 cookie支持
- v0.2.3 response优化
- v0.2.4 改进request类
- 小结
- 小技巧
如何使用vs-code git历史对比
工欲善其事,必先利其器,我们先花点时间了解如何使用各种IDE工具进行git历史对比操作。先是vs-code中安装git-history插件:
对比前,先切换到初始的v0.2.0版本。然后进入插件视图,搜索v0.2.1:
版本代码差异:
如何使用pycharm-ce git历史对比
pycharm的社区版本也支持git历史对比。同样进入git视图:
对比代码差异:
v0.2.1 文件上传支持
我们之前已经阅读过requests的代码,没有看过朋友可以去翻翻前文 requests 源码阅读 , 今天我们就略过v0.2.0的初始化版本介绍,直接进入对比分析。
HISTORY文件中介绍了v0.2.1新增的功能:
0.2.1 (2011-02-14) ++++++++++++++++++ * Added file attribute to POST and PUT requests for multipart-encode file uploads. * Added Request.url attribute for context and redirects
文件上传功能涉及的代码主要是:
from .packages.poster.encode import multipart_encode from .packages.poster.streaminghttp import register_openers ... def send(self, anyway=False) if self.method == 'POST': if self.files: register_openers() # 1 datagen, headers = multipart_encode(self.files) # 2 req = _Request(self.url, data=datagen, headers=headers, method='POST') ...
核心在register_openers和multipart_encode两行代码,一个是支持文件上传的连接,一个是支持文件数据读取。先看register_openers:
def register_openers(): handlers = [StreamingHTTPHandler, StreamingHTTPRedirectHandler] if hasattr(httplib, "HTTPS"): handlers.append(StreamingHTTPSHandler) opener = urllib2.build_opener(*handlers) urllib2.install_opener(opener) return opener
主要在使用_StreamingHTTPMixin扩展了send方法:
class _StreamingHTTPMixin: def send(self, value): ... try: if hasattr(value, 'next'): for data in value: self.sock.sendall(data) ... except socket.error, v: if v[0] == 32: # Broken pipe self.close() raise
可以看到send的支持从迭代器中读取并发送到socket。multipart_encode主要是读取本地文件并生成对于的http头:
def multipart_encode(params, boundary=None, cb=None) boundary = gen_boundary() boundary = urllib.quote_plus(boundary) headers = get_headers(params, boundary) params = MultipartParam.from_params(params) return multipart_yielder(params, boundary, cb), headers
生成http头的方法:
def get_headers(params, boundary): """Returns a dictionary with Content-Type and Content-Length headers for the multipart/form-data encoding of ``params``.""" # boundary = uuid.uuid4().hex headers = {} boundary = urllib.quote_plus(boundary) headers['Content-Type'] = "multipart/form-data; boundary=%s" % boundary headers['Content-Length'] = str(get_body_size(params, boundary)) return headers def get_body_size(params, boundary): size = sum(p.get_size(boundary) for p in MultipartParam.from_params(params)) return size + len(boundary) + 6
boundary就是文件分段的分隔符,详情见: stackoverflow.com/questions/3…
本地文件的读取准备主要在下面的iter_encode函数:
def iter_encode(self, boundary, blocksize=4096): total = self.get_size(boundary) current = 0 block = self.encode_hdr(boundary) current += len(block) yield block last_block = "" encoded_boundary = "--%s" % encode_and_quote(boundary) boundary_exp = re.compile("^%s$" % re.escape(encoded_boundary), re.M) while True: block = self.fileobj.read(blocksize) if not block: current += 2 yield "\r\n" break last_block += block if boundary_exp.search(last_block): raise ValueError("boundary found in file data") last_block = last_block[-len(encoded_boundary)-2:] current += len(block) yield block
iter_encode函数将fileobj包装成一个迭代器返回,结合前面的send方法就实现了文件的http上传。
因为代码较少,我们还可以发现一些可以优化的代码, 比如下面的实现中MultipartParam.from_params执行了两次:
def multipart_encode(params, boundary=None, cb=None): ... headers = get_headers(params, boundary) params = MultipartParam.from_params(params) ... def get_body_size(params, boundary): size = sum(p.get_size(boundary) for p in MultipartParam.from_params(params)) return size + len(boundary) + 6 def get_headers(params, boundary): ... headers['Content-Length'] = str(get_body_size(params, boundary)) return headers
v0.2.2 cookie支持
v0.2.2主要功能是支持cookie:
0.2.2 (2011-02-14) ++++++++++++++++++ * Still handles request in the event of an HTTPError. (Issue #2) * Eventlet and Gevent Monkeypatch support. * Cookie Support (Issue #1)
比如put-API中,request对象上接收cookiejar
def put(url, data='', headers={}, files={}, cookies=None, auth=None): r = Request() r.url = url r.method = 'PUT' r.data = data r.files = files r.headers = headers r.cookiejar = cookies r.auth = _detect_auth(url, auth) r.send() return r.response
然后使用urllib2的HTTPCookieProcessor封装cookiejar
def _get_opener(self): _handlers = [] if self.auth or self.cookiejar: if self.cookiejar: cookie_handler = urllib2.HTTPCookieProcessor(cookiejar) _handlers.append(cookie_handler) opener = urllib2.build_opener(*_handlers) return opener.open
CookieJar的实现,就是在http头上增加Cookie字段,字段的值是一个字典序列化的字符串:
class CookieJar: def __init__(self, policy=None): if policy is None: policy = DefaultCookiePolicy() self._policy = policy self._cookies_lock = _threading.RLock() self._cookies = {} def add_cookie_header(self, request): self._cookies_lock.acquire() try: self._policy._now = self._now = int(time.time()) cookies = self._cookies_for_request(request) attrs = self._cookie_attrs(cookies) if attrs: if not request.has_header("Cookie"): request.add_unredirected_header( "Cookie", "; ".join(attrs)) # if necessary, advertise that we know RFC 2965 if (self._policy.rfc2965 and not self._policy.hide_cookie2 and not request.has_header("Cookie2")): for cookie in cookies: if cookie.version != 1: request.add_unredirected_header("Cookie2", '$Version="1"') break finally: self._cookies_lock.release() self.clear_expired_cookies()
配合下面cookie的示例,就很容易理解了:
Cookie: lang=zh-CN; i_like_gogs=a333326706d3bb1c; _csrf=KSQKMtYni1y4Zbi5aRpjYbW32t86MTYxNzY3NjIzNzg5MDg3NDY0MQ%3D%3D
同时代码进行了优化,增加了_build_response
用来构建响应:
def _build_response(self, resp): """Build internal Response object from given response.""" self.response.status_code = resp.code self.response.headers = resp.info().dict self.response.content = resp.read() self.response.url = resp.url
优化前后对比很明显:
# before resp = opener(req) self.response.status_code = resp.code self.response.headers = resp.info().dict if self.method == 'GET': self.response.content = resp.read() self.response.url = resp.url # after resp = opener(req) self._build_response(resp)
我们可以认为这个优化符合DRY(Don't repeat yourself)规则和迪米特法则LOD(Law of Demeter)。之前在V0.2.1中response的处理,GET,POST,PUT3个分支中重复;同时send函数还需要关心response的处理细节,重构后很好的解决了这2个问题。
v0.2.3 response优化
v0.2.3主要功能是response优化:
0.2.3 (2011-02-15) ++++++++++++++++++ * New HTTPHandling Methods - Reponse.__nonzero__ (false if bad HTTP Status) - Response.ok (True if expected HTTP Status) - Response.error (Logged HTTPError if bad HTTP Status) - Reponse.raise_for_status() (Raises stored HTTPError
response增加了一个ok状态,用来描述请求成功与否:
class Response(object) def __init__(self): self.content = None self.status_code = None self.headers = dict() self.url = None self.ok = False self.error = None
同样response更加内聚,可以反应出请求的成功和失败:
# before resp = opener(req) self._build_response(resp) success = True # after resp = opener(req) self._build_response(resp) self.response.ok = True
同时response还支持bool判断
def __nonzero__(self): """Returns true if status_code is 'OK'.""" return not self.error
__nonzero__
为python的魔法函数
object.__nonzero__(self) Called to implement truth value testing and the built-in operation bool(); should return False or True, or their integer equivalents 0 or 1. When this method is not defined, __len__() is called, if it is defined, and the object is considered true if its result is nonzero. If a class defines neither __len__() nor __nonzero__(), all its instances are considered true.
v0.2.4 改进request类
v0.2.4主要工作是改进Request:
0.2.4 (2011-02-15) ++++++++++++++++++ * Python 2.5 Support * PyPy-c v1.4 Support * Auto-Authentication tests * Improved Request object constructor
request改进了构造函数,支持更多参数:
class Request(object): def __init__(self, url=None, headers=dict(), files=None, method=None, params=dict(), data=dict(), auth=None, cookiejar=None): self.url = url self.headers = headers self.files = files self.method = method self.params = params self.data = data self.response = Response() self.auth = auth self.cookiejar = cookiejar self.sent = False
这样使用的时候:
# before r = Request() r.method = 'HEAD' r.url = url # return response object r.params = params r.headers = headers r.cookiejar = cookies r.auth = _detect_auth(url, auth) r.send() # after r = Request(method='HEAD', url=url, params=params, headers=headers, cookiejar=cookies, auth=_detect_auth(url, auth)) r.send()
另外重用认证信息,对于同一个url使用缓存中的url
def add_autoauth(url, authobject): global AUTOAUTHS AUTOAUTHS.append((url, authobject)) def _detect_auth(url, auth): return _get_autoauth(url) if not auth else auth def _get_autoauth(url): for (autoauth_url, auth) in AUTOAUTHS: if autoauth_url in url: return auth return None
我们可以思考,认证按照url进行处理并不太好,按照domain可能更合适。
小结
通过v0.2.0~v0.2.4四个版本的源码历史对比法阅读,我们可以知道下面2个功能的实现逻辑:
- 文件上传的功能就是将本地文件读取到流中再通过socket发送出去
- cookie是在http头中的一个特殊序列化的字典
同时也了解了Reqeust和Response两个关键对象的一些优化方法,提高代码可读性的同时,也增加代码稳定性。
历史对比的同时,我们还可以发现一些有趣的题外细节。比如在2011-02-14~2011-02-15日两天,作者Kenneth Reitz完成了4个小版本的提交,果然是单身狗的快乐。
比较一下代码第一个版本的API:
>>> import requests >>> r = requests.get('http://google.com') >>> r.status_code 401
下面是现在的API:
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) >>> r.status_code 200
可以看到长达10年的时间内,作者最开始设计的API基本没有变化,这个还是挺厉害的。
小技巧
应该避免使用硬编码的魔法数字,比如在v0.2.1中的get_body_size:
def get_body_size(params, boundary): size = sum(p.get_size(boundary) for p in MultipartParam.from_params(params)) return size + len(boundary) + 6
这里的数字6就比较难以理解。
没错,反例也是技巧 :)