Study notes for the Developer Academy course "Python Crawler in Practice: Using the urllib3 and requests Libraries in Python". The notes follow the course closely so you can pick up the material quickly.
Course address: https://developer.aliyun.com/learning/course/555/detail/7644
Using the urllib3 and requests Libraries in Python
Overview:
I. The urllib3 library
II. The requests library

I. The urllib3 library
(1) Introduction
https://urllib3.readthedocs.io/en/latest/
The standard library urllib is missing some key features, such as connection pool management, which the third-party library urllib3 provides.
(2) Installation
$ pip install urllib3

import urllib3

# Open a URL and get back a response object
url = 'https://movie.douban.com/'
ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'

# Connection pool manager
with urllib3.PoolManager() as http:
    response = http.request('GET', url, headers={
        'User-Agent': ua
    })
    print(type(response))
    print(response.status, response.reason)
    print(response.headers)
    print(response.data)
Run the following command in the terminal pane at the bottom of the IDE:
pip install urllib3
Ctrl-click urllib3 to jump to its source, which begins as follows (`__all__` lists the names the package exports, which is what we will use):
urllib3 - Thread-safe connection pooling and re-using.

import ...

try:  # Python 2.7+
    from logging import NullHandler
except ImportError:
    class NullHandler(logging.Handler):
        def emit(self, record):
            pass

__author__ = 'Andrey Petrov (andrey.petrov@shazow.net)'
__license__ = 'MIT'
__version__ = '1.23'

__all__ = (
    'HTTPConnectionPool',
    'HTTPSConnectionPool',
    'PoolManager',
    'ProxyManager',
    'HTTPResponse',
    'Retry',
    'Timeout',
    'add_stderr_logger',
    'connection_from_url',
    'disable_warnings',
    'encode_multipart_formdata',
    'get_host',
    'make_headers',
    'proxy_from_url',
)
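Several of the exported names above are small utilities that can be tried on their own, without any network access. A minimal sketch (the user-agent string and the timeout/retry values here are illustrative, not from the course):

```python
import urllib3
from urllib3.util import Timeout, Retry, make_headers

# make_headers builds a dict of common request headers from keyword flags
headers = make_headers(user_agent='my-crawler/1.0', accept_encoding=True)
print(headers)

# Timeout and Retry objects are accepted by PoolManager to control behaviour
timeout = Timeout(connect=2.0, read=5.0)
retries = Retry(total=3, redirect=2)
http = urllib3.PoolManager(timeout=timeout, retries=retries)
print(type(http))
```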
Run a small example:

import urllib3

with urllib3.PoolManager() as http:
    http.request('GET', 'https://movie.douban.com/')
Ctrl-click request in the code above to jump to the following definition:
def request(self, method, url, fields=None, headers=None, **urlopen_kw):
    """
    Make a request using :meth:`urlopen` with the appropriate encoding of
    ``fields`` based on the ``method`` used.

    This is a convenience method that requires the least amount of manual
    effort. It can be used in most situations, while still having the
    option to drop down to more specific methods when necessary, such as
    :meth:`request_encode_url`, :meth:`request_encode_body`,
    or even the lowest level :meth:`urlopen`.
    """
    method = method.upper()

    urlopen_kw['request_url'] = url

    if method in self._encode_url_methods:
        return self.request_encode_url(method, url, fields=fields,
                                       headers=headers,
                                       **urlopen_kw)
    else:
        return self.request_encode_body(method, url, fields=fields,
                                        headers=headers,
                                        **urlopen_kw)
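The dispatch above can be summarized in a small sketch; `ENCODE_URL_METHODS` mirrors urllib3's internal `_encode_url_methods` set, and the function name is made up for illustration:

```python
# Methods whose fields are encoded into the URL query string rather than the body
ENCODE_URL_METHODS = {'DELETE', 'GET', 'HEAD', 'OPTIONS'}

def choose_encoder(method):
    # request() uppercases the method, then picks the encoding path
    method = method.upper()
    if method in ENCODE_URL_METHODS:
        return 'request_encode_url'   # fields -> query string
    return 'request_encode_body'      # fields -> request body

print(choose_encoder('get'))    # request_encode_url
print(choose_encoder('post'))   # request_encode_body
```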
Run an example of another method:

import urllib3

with urllib3.PoolManager() as http:
    http.urlopen('GET', 'https://movie.douban.com/')

Ctrl-click urlopen in the code above to jump to the following definition:
def urlopen(self, method, url, redirect=True, **kw):
    """
    Same as :meth:`urllib3.connectionpool.HTTPConnectionPool.urlopen`
    with custom cross-host redirect logic and only sends the request-uri
    portion of the ``url``.

    The given ``url`` parameter must be absolute, such that an appropriate
    :class:`urllib3.connectionpool.ConnectionPool` can be chosen for it.
    """
    u = parse_url(url)
    conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)

    kw['assert_same_host'] = False
    kw['redirect'] = False

    if 'headers' not in kw:
        kw['headers'] = self.headers.copy()

    if self.proxy is not None and u.scheme == "http":
        response = conn.urlopen(method, url, **kw)
    else:
        response = conn.urlopen(method, u.request_uri, **kw)

    redirect_location = redirect and response.get_redirect_location()
    if not redirect_location:
        return response

    # Support relative URLs for redirecting.
    redirect_location = urljoin(url, redirect_location)

    # RFC 7231, Section 6.4.4
    if response.status == 303:
        method = 'GET'

    retries = kw.get('retries')
    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect)

    # Strip headers marked as unsafe to forward to the redirected location.
    # Check remove_headers_on_redirect to avoid a potential network call within
    # conn.is_same_host() which may use socket.gethostbyname() in the future.
    if (retries.remove_headers_on_redirect
            and not conn.is_same_host(redirect_location)):
        for header in retries.remove_headers_on_redirect:
            kw['headers'].pop(header, None)

    try:
        retries = retries.increment(method, url, response=response, _pool=conn)
    except MaxRetryError:
        if retries.raise_on_redirect:
            raise
        return response

    kw['retries'] = retries
    kw['redirect'] = redirect

    log.info("Redirecting %s -> %s", url, redirect_location)
    return self.urlopen(method, redirect_location, **kw)
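`urlopen` first splits the URL with `parse_url` so it can pick a connection pool for the host; that helper can be tried on its own:

```python
from urllib3.util import parse_url

# parse_url breaks an absolute URL into the parts urlopen needs
u = parse_url('https://movie.douban.com/j/search_subjects?type=movie')
print(u.scheme)       # https
print(u.host)         # movie.douban.com
print(u.request_uri)  # /j/search_subjects?type=movie
```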
(3) Example:

import urllib3
from urllib.parse import urlencode
from urllib3.response import HTTPResponse

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',  # the "hot" tag
    'page_limit': 10,
    'page_start': 10
}

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': ua
    })
    print(type(response))
    response: HTTPResponse = HTTPResponse()  # annotate for IDE completion
    response.status
Ctrl-click status to jump into the code below. Of the attributes set here, only the likes of status and reason are meant for users; many others, such as the pool and connection attributes, are internal and not for our use.
if isinstance(headers, HTTPHeaderDict):
    self.headers = headers
else:
    self.headers = HTTPHeaderDict(headers)
self.status = status
self.version = version
self.reason = reason
self.strict = strict
self.decode_content = decode_content
self.retries = retries
self.enforce_content_length = enforce_content_length
self._decoder = None
self._body = None
self._fp = None
self._original_response = original_response
self._fp_bytes_read = 0
self.msg = msg
self._request_url = request_url
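The constructor above shows which attributes are public. An HTTPResponse can even be built by hand, which is useful for seeing those attributes without making a request; the body and headers here are made up:

```python
from urllib3.response import HTTPResponse

# Construct a response directly; status, reason, headers and data are public
r = HTTPResponse(body=b'hello', status=200, reason='OK',
                 headers={'Content-Type': 'text/plain'})
print(r.status, r.reason)
print(r.headers['Content-Type'])
print(r.data)
```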
The following code prints status and data:
import urllib3
from urllib.parse import urlencode
from urllib3.response import HTTPResponse

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={
        'User-Agent': ua
    })
    print(type(response))
    # response: HTTPResponse = HTTPResponse()  # variable annotation, Python 3.6+ syntax
    print(response.status)
    print(response.data)
Different response objects expose different attributes, so the easiest way to explore them is to type `response.` in the IDE and look at the completion list.
Because `data` is used here, the result comes back as bytes. During the request a connection is taken from the pool, and everything received over it is packed into the response; we only need to care about what the response is and operate on it.
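Since `data` is bytes, a JSON response like the one from the douban interface has to be decoded and parsed by hand; a sketch with a made-up payload:

```python
import json

# response.data would be bytes like this (payload invented for illustration)
raw = b'{"subjects": [{"title": "demo", "rate": "8.0"}]}'
data = json.loads(raw.decode('utf-8'))
print(data['subjects'][0]['title'])
```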
urllib also requires a lot of manual wrapping, and the methods and attributes the connection pool manager offers are still fairly low-level, so for convenience we turn to the requests library next.
II. The requests library
(1) Introduction
requests wraps everything up very nicely. It uses urllib3 underneath, but its API is far friendlier, and it is the recommended choice.
import requests

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
url = 'https://movie.douban.com/'

response = requests.request('GET', url, headers={'User-Agent': ua})

with response:
    print(type(response))
    print(response.url)
    print(response.status_code)
    print(response.request.headers)  # request headers
    print(response.headers)          # response headers
    print(response.text[:200])       # the HTML content
    with open('o:/movie.html', 'w', encoding='utf-8') as f:
        f.write(response.text)       # save the file for later use
requests uses a Session object by default, so that session information such as cookies is kept across multiple exchanges with the server.
# Using a Session directly
import requests

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
urls = ['https://www.baidu.com/s?wd=maged',
        'https://www.baidu.com/s?wd=maged']

session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(type(response))
            print(response.url)
            print(response.status_code)
            print(response.request.headers)  # request headers
            print(response.cookies)          # cookies in the response
            print(response.text[:20])        # the HTML content
(2) Installation and examples
First check in the run pane at the bottom of the IDE whether it is installed, using: pip install requests. requests depends on idna, certifi, urllib3, and chardet.
import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': ua
})

with response:
    print(response.text)         # the content, decoded to str
    print(response.status_code)
    print(response.url)          # the final URL, after any redirect
    print(response.request)      # the prepared request

text is a property holding the content as str (Unicode), i.e. already decoded, which makes it much more convenient: the encoding is worked out by the library, so we need not care about it, only about the returned content. url is the final location; if the request redirects, this is the URL after the redirect. response.request is the prepared request, defined in models.py, which holds the initialization, the prepared method, url, headers, _cookies, body, and so on; preparing a request simply means assembling it, and all of these attributes of the prepared Request are accessible.
Ctrl-click into request to see the details; the key lines are:

with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)

This shows that by default a Session is used to manage the conversation: the session id travels with the session's requests, so state can be passed back and forth between the two sides.
After extending the code above, run the following:

import requests
from urllib.parse import urlencode

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'

url = '{}?{}'.format(jurl, urlencode(d))
response = requests.request('GET', url, headers={
    'User-Agent': ua
})

with response:
    print(response.text)
    print(response.status_code)
    print(response.url)
    print(response.headers, '~~~~~')
    print(response.request.headers)
After further improving the code:

import requests

ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
urls = ['https://www.baidu.com/s?wd=maged',
        'https://www.baidu.com/s?wd=maged']

session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-Agent': ua})
        with response:
            print(response.text[:50])
            print('-' * 30)
            print(response.cookies)
            print('-' * 30)
            print(response.headers, '~~~~~')
            print(response.request.headers)
Note: the information in the results changes from run to run; sometimes a Set-Cookie header appears.
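When a Set-Cookie header does arrive, the Session stores the cookie and sends it with later requests. That persistence can be seen without touching the network by putting a cookie into the jar by hand; the cookie name and value here are made up:

```python
import requests

session = requests.Session()
# Simulate what a Set-Cookie response header would do
session.cookies.set('demo_cookie', 'demo-value')
print(session.cookies.get('demo_cookie'))
```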