Part 1: Building Scrapers
Chapter 1: Your First Web Scraper
1.1 Network Connections
# py3 urllib
from urllib.request import urlopen

url = "http://www.baidu.com"
html = urlopen(url)
print(html.read())
Official documentation: https://docs.python.org/3/library/urllib.html
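Note that urlopen() returns a response object whose read() yields raw bytes, so the print above shows a bytes literal. A minimal sketch of decoding it to text, assuming the page is UTF-8-encoded:

from urllib.request import urlopen

url = "http://www.baidu.com"
html = urlopen(url)
# read() returns bytes; decode before treating the result as text
# (assumes the page is UTF-8-encoded)
print(html.read().decode("utf-8"))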
1.2 BeautifulSoup
Installation:
pip install beautifulsoup4
Official documentation (Chinese):
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html
A virtual environment is recommended; see the companion post: Mac安装python环境以及虚拟环境 (installing Python and a virtual environment on macOS)
# py3 BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
html = urlopen(url)
# pass the parser name explicitly; omitting it triggers a warning
soup = BeautifulSoup(html.read(), "html.parser")
print(soup.title)
# <title>百度一下,你就知道</title>
Adding exception handling
# Exception handling:
# 1. the server does not exist
# 2. the page does not exist
# py3
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
try:
    html = urlopen(url)
except Exception as e:
    print(e)
else:
    soup = BeautifulSoup(html.read(), "html.parser")
    print(soup.title)
    # <title>百度一下,你就知道</title>
    tag = soup.xxxxx  # a tag that does not exist comes back as None
    print(type(tag))
    # <class 'NoneType'>
finally:
    pass
Reorganizing the code
# py3
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """
    Exception handling:
    1. the server does not exist
    2. the page does not exist
    3. the tag does not exist
    """
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    try:
        soup = BeautifulSoup(html.read(), "html.parser")
        title = soup.head.title
    except AttributeError:
        return None
    return title


url = "http://www.baidu.com"
title = getTitle(url)
if title is None:
    print("title is None")
else:
    print(title)
    # <title>百度一下,你就知道</title>
Chapter 2: Advanced HTML Parsing
2.1 Tag Filtering
find and find_all take the same parameters:
    name=None        tag name; a list of names is an OR match
    attrs={}         attribute dict; multiple attributes are an AND match
    recursive=True   whether to search descendants recursively
    text=None        match by text content
    limit=None       cap on the number of results; find <=> find_all(limit=1)
    **kwargs         keyword filters, e.g. class_ for the class attribute
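A quick sketch of these parameters against the book's sample page (the tag names and classes matched here are assumptions about that page's content):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/warandpeace.html"
soup = BeautifulSoup(urlopen(url), "html.parser")

# name as a list: OR across tag names
headers = soup.find_all(["h1", "h2"])

# attrs dict and the class_ keyword are two spellings of the same filter
greens = soup.find_all("span", attrs={"class": "green"})
also_greens = soup.find_all("span", class_="green")

# limit caps the result count; find() behaves like find_all(..., limit=1)
first_two = soup.find_all("span", class_="green", limit=2)

print(len(headers), len(greens) == len(also_greens), len(first_two))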
url = "http://www.pythonscraping.com/pages/warandpeace.html" html = urlopen(url) soup = BeautifulSoup(html, "html.parser") name_list = soup.find_all("span", {"class": "green"}) for name in name_list: print(name.get_text()) h1 = soup.find(text="Chapter 1") print(h1)
2.2 BeautifulSoup Objects
Four object types:
    BeautifulSoup      the document
    Tag                a tag
    NavigableString    the text inside a tag
    Comment            a comment
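A minimal sketch that surfaces all four types from a tiny hand-written document (the markup is invented for illustration):

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# a tiny document containing text and a comment inside one tag
soup = BeautifulSoup("<p>text<!--a comment--></p>", "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>

p = soup.p
print(type(p))  # <class 'bs4.element.Tag'>

# the tag's children are a NavigableString and a Comment
print([type(child) for child in p.children])
# [<class 'bs4.element.NavigableString'>, <class 'bs4.element.Comment'>]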
2.3 Navigating Trees
# Children and descendants
children
descendants

# Siblings
next_siblings
previous_siblings
next_sibling
previous_sibling

# Parents
parent
parents
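To make the children/descendants distinction concrete, a small sketch on invented markup: children stops one level down, while descendants recurses all the way:

from bs4 import BeautifulSoup

# invented markup: a table with one row and one cell
soup = BeautifulSoup("<table><tr><td>cell</td></tr></table>", "html.parser")
table = soup.table

# direct children only
print([child.name for child in table.children])  # ['tr']

# all descendants; text nodes have name == None, so filter them out
print([d.name for d in table.descendants if d.name])  # ['tr', 'td']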
url = "http://www.pythonscraping.com/pages/page3.html" html = urlopen(url) soup = BeautifulSoup(html, "html.parser") table = soup.find("table", {"id": "giftList"}) from bs4.element import Tag # for tr in table.children: # if isinstance(tr, Tag): # for td in tr.children: # print(td.get_text(), end="|") # print("\n") img = table.find("img", {"src": "../img/gifts/img1.jpg"}) price = img.parent.previous_sibling.get_text() print(price)
2.4 Regular Expressions
Official documentation: https://docs.python.org/3/library/re.html
Reference article: Python编程:re正则库 (the re regular-expression library)
An email regex: [A-Za-z0-9\._+]+@[A-Za-z0-9]+\.(com|cn|org|edu|net)
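A quick check of the pattern with the re module (the sample addresses are made up):

import re

# the email pattern from above; fullmatch anchors it to the whole string
email_regex = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z0-9]+\.(com|cn|org|edu|net)")

for addr in ["test_user@example.com", "not-an-email"]:
    print(addr, bool(email_regex.fullmatch(addr)))
# test_user@example.com True
# not-an-email False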
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/page3.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

regex = re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")
imgs = soup.find_all("img", {"src": regex})
for img in imgs:
    print(img.get("src"))  # get() reads a single attribute; attrs holds them all
2.5 lambda Expressions
# the lambda is called on every tag and must return True or False
tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
print(tags)
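The same mechanism can stand in for an attribute filter. A sketch equivalent to the earlier find_all("span", {"class": "green"}) call; note that tag.get("class") returns a list of class names:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/warandpeace.html"
soup = BeautifulSoup(urlopen(url), "html.parser")

# equivalent to soup.find_all("span", {"class": "green"});
# class is a multi-valued attribute, so get("class") returns a list
greens = soup.find_all(lambda tag: tag.name == "span" and tag.get("class") == ["green"])
print(len(greens))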
2.6 HTML Parsing Libraries
lxml: http://lxml.de/
html.parser: https://docs.python.org/3/library/html.parser.html
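Either parser plugs into the same BeautifulSoup call. A minimal sketch comparing the two (lxml is third-party and must be installed first, e.g. pip install lxml):

from bs4 import BeautifulSoup

markup = "<p>Hello"  # deliberately unclosed, to exercise the parsers

# html.parser ships with the standard library
print(BeautifulSoup(markup, "html.parser").p)

# lxml is a faster C-backed parser; requires: pip install lxml
print(BeautifulSoup(markup, "lxml").p)

Both print <p>Hello</p> here; the parsers differ mainly in speed and in how they repair badly broken HTML.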