While writing a crawler, requests raised this error:
HTTPSConnectionPool(host='xxxxx', port=443): Max retries exceeded with
Upgrade the requests library:
python2 -m pip install --upgrade requests
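To confirm that the upgrade took effect, the installed version can be read back. This is a minimal sketch that assumes Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the version of requests seen by this interpreter.
print("requests:", installed_version("requests") or "not installed")
```

Run it with the same interpreter that runs the crawler, since each interpreter has its own set of installed packages.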
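If an SSL verification error is raised, disable certificate verification and suppress the resulting warning:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
# Silence the InsecureRequestWarning that verify=False would otherwise trigger.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Note: params= sends query-string parameters; use headers= if req_header holds HTTP headers.
requests.get(req_url, params=req_header, verify=False)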
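The same idea can be wrapped in a small helper so that the warning is silenced only for the one request instead of for the whole process. This is a sketch, not the only way to do it: the function name get_without_verification and the 10-second default timeout are assumptions, and it imports InsecureRequestWarning directly from urllib3, which requests depends on.

```python
import warnings

import requests
from urllib3.exceptions import InsecureRequestWarning


def get_without_verification(url, **kwargs):
    """GET a URL with certificate verification off, hiding only the related warning."""
    kwargs["verify"] = False          # skip TLS certificate verification
    kwargs.setdefault("timeout", 10)  # assumed default so a dead host cannot hang the crawler
    with warnings.catch_warnings():
        # The filter is restored when the block exits, so other warnings stay visible.
        warnings.simplefilter("ignore", InsecureRequestWarning)
        return requests.get(url, **kwargs)
```

A call such as get_without_verification(req_url, headers=req_header) then behaves like requests.get with verify=False. Turning verification off removes protection against man-in-the-middle attacks, so it fits best with hosts you already trust.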
-----------------------------------------------------------------------------------------------------------------------
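While the crawler was running, I noticed that the content returned by requests.get was different after BeautifulSoup had parsed it. Looking into it, I found the following parser table in the documentation: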
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
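Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; lenient | Slower than lxml; less lenient than html5lib
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | Requires an external C library
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only XML parser supported | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires an external Python package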
I recommend using lxml or html5lib, because the built-in parser (html.parser) is less reliable, especially on malformed markup.
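To see how much the parser choice changes the result, the same markup can be run through each parser side by side. This is a minimal sketch: the helper name compare_parsers and the sample markup are chosen for illustration, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return {parser name: serialized tree}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The library behind this parser is not installed; skip it.
            continue
    return results


if __name__ == "__main__":
    # Deliberately invalid markup: a stray </p> inside an unclosed <a>.
    for name, html in compare_parsers("<a></p>").items():
        print(name, "->", html)
```

Each parser repairs the broken markup in its own way, which is why parsed content can differ from the raw response. Passing the parser name explicitly, rather than letting BeautifulSoup pick whatever is installed, keeps the output the same across machines.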