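I. Preliminaries
1. How the Web works
"Web service" is short for the World Wide Web service offered over the Internet. The simplest Web service uses a two-tier architecture: a browser on the client side sends requests to a Web server, and the server returns responses.
This arrangement, in which a browser interacts with a Web server, is also called the B/S (Browser/Server) architecture. Text, images and other content are written in HTML and stored on the Web server as static pages before any request arrives. When an HTTP request arrives, the Web server responds by sending the page to the client's browser. This is static web page technology.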
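2. The Robots protocol for web crawlers
Robots protocol: a robots.txt file in a site's root directory tells web crawlers which pages may be fetched and which may not, for example http://baidu.com/robots.txt. The Robots protocol is advisory rather than binding: a crawler can ignore it, but doing so carries legal risk.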
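One way to honor robots.txt from Python is the standard library's urllib.robotparser. The sketch below is only an illustration: the rules, the crawler name "my-crawler" and the example.com URLs are made up, and the rules are parsed from an inline string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, used here only for illustration
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse rules directly instead of fetching a real robots.txt

# "my-crawler" is a hypothetical user-agent name
allowed = parser.can_fetch("my-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed, blocked)
```

For a real site, set_url() followed by read() would load the site's own robots.txt before calling can_fetch().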
II. Fetching a web page
1. Requesting the server and retrieving the page
Suppose we want to use the Requests library to fetch the page at http://httpbin.org/. The main steps are:
(1) Import the requests library
(2) Call requests.get() to retrieve the page
import requests
url = 'http://httpbin.org/'
response = requests.get(url=url)
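2. Checking the status code of the server's response
response.status_code
Output:
200
A status_code of 200 means the request succeeded and the client received the page sent by the server.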
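Instead of comparing status_code by hand, a script can let Requests raise an exception for error responses. The following is a minimal sketch, not the tutorial's own code: fetch_text is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the 404 response is built by hand so the example runs without network access.

```python
import requests


def fetch_text(url, timeout=10.0):
    """Hypothetical helper: return the page text, or raise on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text


# Offline illustration: a hand-built Response carrying an error status
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    error_raised = False
except requests.HTTPError:
    error_raised = True
print(error_raised)
```

Calling fetch_text('http://httpbin.org/') would perform the same request as above, with the status check built in.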
3. Printing the page content
print(response.text)
Output:
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>httpbin.org</title> <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700" rel="stylesheet"> <link rel="stylesheet" type="text/css" href="/flasgger_static/swagger-ui.css"> <link rel="icon" type="image/png" href="/static/favicon.ico" sizes="64x64 32x32 16x16" /> <style> html { box-sizing: border-box; overflow: -moz-scrollbars-vertical; overflow-y: scroll; } *, *:before, *:after { box-sizing: inherit; } body { margin: 0; background: #fafafa; } </style> </head> <body> <a href="https://github.com/requests/httpbin" class="github-corner" aria-label="View source on Github"> <svg width="80" height="80" viewBox="0 0 250 250" style="fill:#151513; color:#fff; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"> <path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path> <path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path> <path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path> </svg> </a> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" style="position:absolute;width:0;height:0"> <defs> <symbol viewBox="0 0 20 20" id="unlocked"> <path 
d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V6h2v-.801C8 3.754 8.797 3 10 3c1.203 0 2 .754 2 2.199V8H4c-.553 0-1 .646-1 1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.646 16.352 8 15.8 8z"></path> </symbol> <symbol viewBox="0 0 20 20" id="locked"> <path d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V8H4c-.553 0-1 .646-1 1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.646 16.352 8 15.8 8zM12 8H8V5.199C8 3.754 8.797 3 10 3c1.203 0 2 .754 2 2.199V8z" /> </symbol> <symbol viewBox="0 0 20 20" id="close"> <path d="M14.348 14.849c-.469.469-1.229.469-1.697 0L10 11.819l-2.651 3.029c-.469.469-1.229.469-1.697 0-.469-.469-.469-1.229 0-1.697l2.758-3.15-2.759-3.152c-.469-.469-.469-1.228 0-1.697.469-.469 1.228-.469 1.697 0L10 8.183l2.651-3.031c.469-.469 1.228-.469 1.697 0 .469.469.469 1.229 0 1.697l-2.758 3.152 2.758 3.15c.469.469.469 1.229 0 1.698z" /> </symbol> <symbol viewBox="0 0 20 20" id="large-arrow"> <path d="M13.25 10L6.109 2.58c-.268-.27-.268-.707 0-.979.268-.27.701-.27.969 0l7.83 7.908c.268.271.268.709 0 .979l-7.83 7.908c-.268.271-.701.27-.969 0-.268-.269-.268-.707 0-.979L13.25 10z" /> </symbol> <symbol viewBox="0 0 20 20" id="large-arrow-down"> <path d="M17.418 6.109c.272-.268.709-.268.979 0s.271.701 0 .969l-7.908 7.83c-.27.268-.707.268-.979 0l-7.908-7.83c-.27-.268-.27-.701 0-.969.271-.268.709-.268.979 0L10 13.25l7.418-7.141z" /> </symbol> <symbol viewBox="0 0 24 24" id="jump-to"> <path d="M19 7v4H5.83l3.58-3.59L8 6l-6 6 6 6 1.41-1.41L5.83 13H21V7z" /> </symbol> <symbol viewBox="0 0 24 24" id="expand"> <path d="M10 18h4v-2h-4v2zM3 6v2h18V6H3zm3 7h12v-2H6v2z" /> </symbol> </defs> </svg> <div id="swagger-ui"> <div data-reactroot="" class="swagger-ui"> <div> <div class="information-container wrapper"> <section 
class="block col-12"> <div class="info"> <hgroup class="main"> <h2 class="title">httpbin.org <small> <pre class="version">0.9.2</pre> </small> </h2> <pre class="base-url">[ Base URL: httpbin.org/ ]</pre> </hgroup> <div class="description"> <div class="markdown"> <p>A simple HTTP Request & Response Service. <br> <br> <b>Run locally: </b> <code>$ docker run -p 80:80 kennethreitz/httpbin</code> </p> </div> </div> <div> <div> <a href="https://kennethreitz.org" target="_blank">the developer - Website</a> </div> <a href="mailto:me@kennethreitz.org">Send email to the developer</a> </div> </div> <!-- ADDS THE LOADER SPINNER --> <div class="loading-container"> <div class="loading"></div> </div> </section> </div> </div> </div> </div> <div class='swagger-ui'> <div class="wrapper"> <section class="clear"> <span style="float: right;"> [Powered by <a target="_blank" href="https://github.com/rochacbruno/flasgger">Flasgger</a>] <br> </span> </section> </div> </div> <script src="/flasgger_static/swagger-ui-bundle.js"> </script> <script src="/flasgger_static/swagger-ui-standalone-preset.js"> </script> <script src='/flasgger_static/lib/jquery.min.js' type='text/javascript'></script> <script> window.onload = function () { fetch("/spec.json") .then(function (response) { response.json() .then(function (json) { var current_protocol = window.location.protocol.slice(0, -1); if (json.schemes[0] != current_protocol) { // Switches scheme to the current in use var other_protocol = json.schemes[0]; json.schemes[0] = current_protocol; json.schemes[1] = other_protocol; } json.host = window.location.host; // sets the current host const ui = SwaggerUIBundle({ spec: json, validatorUrl: null, dom_id: '#swagger-ui', deepLinking: true, jsonEditor: true, docExpansion: "none", apisSorter: "alpha", //operationsSorter: "alpha", presets: [ SwaggerUIBundle.presets.apis, // yay ES6 modules ↘ Array.isArray(SwaggerUIStandalonePreset) ? 
SwaggerUIStandalonePreset : SwaggerUIStandalonePreset.default ], plugins: [ SwaggerUIBundle.plugins.DownloadUrl ], // layout: "StandaloneLayout" // uncomment to enable the green top header }) window.ui = ui // uncomment to rename the top brand if layout is enabled // $(".topbar-wrapper .link span").replaceWith("<span>httpbin</span>"); }) }) } </script> <div class='swagger-ui'> <div class="wrapper"> <section class="block col-12 block-desktop col-12-desktop"> <div> <h2>Other Utilities</h2> <ul> <li> <a href="/forms/post">HTML form</a> that posts to /post /forms/post</li> </ul> <br /> <br /> </div> </section> </div> </div> </body> </html>
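III. Locating page elements with BeautifulSoup
The following fragment of a web page is used to demonstrate how to find the content you need on a page with BeautifulSoup.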
html = '''
<html>
<head>
<title>
 The Dormouse's story
</title>
</head>
<body>
<p class="title">
 <b>
  The Dormouse's story
 </b>
</p>
<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
  Elsie
 </a>
 ,
 <a class="sister" href="http://example.com/lacie" id="link2">
  Lacie
 </a>
 and
 <a class="sister" href="http://example.com/tillie" id="link2">
  Tillie
 </a>
 ; and they lived at the bottom of a well.
</p>
<p class="story">爱丽丝梦游仙境</p>
</body>
</html>
'''
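1. First, import the BeautifulSoup library
Parameters: html is the HTML document string defined above; 'html.parser' specifies that the document string is parsed with Python's built-in HTML parser.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')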
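Basic element | Description |
Tag | A tag, the most basic unit for organizing information; it is opened with <> and closed with </> |
Name | The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary. Access: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the string between <> and </>. Access: <tag>.string |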
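The basic elements in the table can be inspected on a tiny document. This is a small sketch; the one-line snippet and its id value are made up for illustration and are not part of the tutorial's html string.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to show the basic elements
snippet = '<p class="title" id="intro">Hello</p>'
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.p              # the first <p> Tag in the document
tag_name = tag.name       # the tag's name
tag_attrs = tag.attrs     # attributes as a dictionary; class is multi-valued, so it is a list
tag_string = tag.string   # the NavigableString inside the tag
print(tag_name, tag_attrs, tag_string)
```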
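2. Using the find/find_all functions to locate tag elements
(1) Understanding HTML tag elements
Take an img element of the form <img src="..." size="..."></img>: it consists of a start tag and an end tag, its tag name is img, and it has two attributes, src and size.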
(2) The find function returns the first tag that matches the given criteria
View the help for the find function (the trailing ? is IPython/Jupyter syntax):
soup.find?
Output:
Signature: soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)
Docstring: Return only the first child of this Tag matching the given criteria.
File: d:\dell\appdata\anaconda3\lib\site-packages\bs4\element.py
Type: method
Find the first <p> element (tag) in the document:
first_p = soup.find("p")
first_p
Output:
<p class="title"> <b> The Dormouse's story </b> </p>
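(3) Inspecting the type and attributes of the element found
# Print the type of the element found; it is bs4.element.Tag
print(type(first_p))
# Show the attributes of the element found; they form a dictionary
first_p.attrs
Output:
<class 'bs4.element.Tag'>
{'class': ['title']}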
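(4) The find_all function finds all tags that match the given criteria and puts them in a list
The signature of find_all is:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
self indicates that it is a method of the class;
name is the name of the tag to look for; the default is None, in which case all elements are matched;
attrs is a dictionary of attributes; the default is empty, and when it is given only elements with the specified attributes are matched;
recursive specifies whether the search covers the whole subtree under the element; the default is True;
the text, limit and kwargs parameters are more involved and are introduced later where they are used;
find_all returns a list of all matching elements, each of which is a bs4.element.Tag object.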
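The effect of the recursive and limit parameters is easiest to see on a small nested document. The sketch below uses made-up markup (a div with two direct p children and one nested p) rather than the tutorial's html string.

```python
from bs4 import BeautifulSoup

# Made-up nested markup for illustration
doc = ('<div><p class="a">one</p>'
       '<section><p class="b">two</p></section>'
       '<p class="a">three</p></div>')
soup = BeautifulSoup(doc, "html.parser")
div = soup.find("div")

# Default: search the whole subtree under div
all_p = [p.get_text() for p in div.find_all("p")]
# recursive=False: only direct children of div are considered
direct_p = [p.get_text() for p in div.find_all("p", recursive=False)]
# limit=2: stop after the first two matches
first_two = [p.get_text() for p in div.find_all("p", limit=2)]
# attrs: only p elements whose class is "a"
class_a = [p.get_text() for p in div.find_all("p", attrs={"class": "a"})]

print(all_p, direct_p, first_two, class_a)
```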
Find all <a> elements in the document:
a_ls = soup.find_all('a')
for a in a_ls:
    print(a)
Output:
<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>
<a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a>
<a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a>
(5) Find the p elements in the document with class='story'
p_story = soup.find_all('p', attrs={"class": "story"})
p_story
Output:
[<p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a> ; and they lived at the bottom of a well. </p>, <p class="story">爱丽丝梦游仙境</p>]
(6) Exercise: find the elements in the document with class='sister'
all_sister = soup.find_all(attrs={"class": "sister"})
all_sister
Output:
[<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>, <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a>, <a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a>]
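Besides the attrs dictionary, BeautifulSoup also accepts the class_ keyword argument (with a trailing underscore, because class is a reserved word in Python). The sketch below uses a made-up three-link snippet to suggest that both spellings select the same elements.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
doc = ('<p><a class="sister" href="/a">A</a>'
       '<a class="brother" href="/b">B</a>'
       '<a class="sister" href="/c">C</a></p>')
soup = BeautifulSoup(doc, "html.parser")

by_attrs = soup.find_all(attrs={"class": "sister"})   # dictionary form
by_keyword = soup.find_all("a", class_="sister")      # keyword form
same_result = by_attrs == by_keyword
hrefs = [a["href"] for a in by_keyword]
print(same_result, hrefs)
```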
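IV. Getting an element's attribute values
(1) Checking whether an element has a given attribute
# Check whether the first <p> element in the document has a class attribute
first_p.has_attr("class")
Output:
True
(2) Getting an attribute value
Because attribute names and values form a dictionary, attribute values are read with dictionary-style access.
# Print the href attribute value of every <a> element in the document:
a_ls = soup.find_all('a')
for a in a_ls:
    print(a["href"])
Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie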
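Dictionary-style access raises KeyError when the attribute is absent, while the get method returns None (or a supplied default). The sketch below illustrates the difference on a made-up snippet in which the second <a> has no href.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <a> deliberately has no href attribute
doc = '<a href="http://example.com/elsie">Elsie</a><a>no link</a>'
soup = BeautifulSoup(doc, "html.parser")
anchors = soup.find_all("a")

# get() returns None for a missing attribute instead of raising
links = [a.get("href") for a in anchors]
# get() also accepts a default value
with_default = anchors[1].get("href", "")

# Dictionary-style access raises KeyError for a missing attribute
try:
    anchors[1]["href"]
    key_error = False
except KeyError:
    key_error = True

print(links, repr(with_default), key_error)
```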
V. Getting the text contained in an element
First find the first p element with class='story'.
p_story_fst = soup.find('p', attrs={"class": "story"})
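1. Referencing get_text without calling it
get_text is a method, not an attribute. Printed without parentheses, it shows only the bound method object, whose representation includes the element's HTML; call p_story_fst.get_text() to obtain the text itself.
print(p_story_fst.get_text)
Output: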
<bound method Tag.get_text of <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a> ; and they lived at the bottom of a well. </p>>
2. Use the text attribute to view the text contained in the element and its descendants (it may include whitespace)
p_story_fst.text
Output:
'\n Once upon a time there were three little sisters; and their names were\n \n Elsie\n \n ,\n \n Lacie\n \n and\n \n Tillie\n \n ; and they lived at the bottom of a well.\n '
3. Use the stripped_strings attribute to view the text of the element and its descendants with surrounding whitespace removed
list(p_story_fst.stripped_strings)
Output:
['Once upon a time there were three little sisters; and their names were', 'Elsie', ',', 'Lacie', 'and', 'Tillie', '; and they lived at the bottom of a well.']
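When a single clean string is wanted rather than a list, get_text can be called with a separator and strip=True. The sketch below uses a made-up paragraph and suggests that the result matches joining stripped_strings by hand.

```python
from bs4 import BeautifulSoup

# Made-up markup with extra whitespace around each piece of text
doc = "<p> Hello <b> big </b> world </p>"
p = BeautifulSoup(doc, "html.parser").p

raw_text = p.get_text()                    # keeps the original whitespace
clean_text = p.get_text(" ", strip=True)   # strip each piece, join with one space
joined = " ".join(p.stripped_strings)      # the same result built by hand

print(repr(raw_text))
print(clean_text)
```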
VI. Traversing document elements
(1) First find the first p element with class='story'
p_story_fst = soup.find('p', attrs={"class": "story"})
p_story_fst
Output:
<p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a> ; and they lived at the bottom of a well. </p>
(2) Traverse downward to find the child elements
for child in p_story_fst.children:
    print(child)
Output:
Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link2"> Tillie </a> ; and they lived at the bottom of a well.
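Note that children yields text nodes as well as tags, and only goes one level deep; descendants walks the whole subtree. The sketch below, on made-up markup, filters both iterators down to Tag objects to show the difference.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup: <b> is a grandchild of <div>, not a child
doc = "<div>text<p>para <b>bold</b></p><span>s</span></div>"
div = BeautifulSoup(doc, "html.parser").div

# Direct children only, keeping Tag objects and skipping text nodes
child_tags = [c.name for c in div.children if isinstance(c, Tag)]
# Every node below div, in document order, again keeping only tags
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_tags, descendant_tags)
```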
(3) Traverse upward to find the parent element
parent = p_story_fst.parent
parent.name
Output:
'body'
(4) Traverse sideways to find the preceding sibling nodes
list(p_story_fst.previous_siblings)
Output:
['\n', <p class="title"> <b> The Dormouse's story </b> </p>, '\n']
(5) Traverse sideways to find the following sibling nodes
list(p_story_fst.next_siblings)
Output:
['\n', <p class="story">爱丽丝梦游仙境</p>, '\n']
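As the outputs show, sibling lists include the '\n' text nodes between tags. When only tags are of interest, find_next_sibling and find_previous_sibling skip over such text nodes. The sketch below uses a shortened, made-up body to illustrate this.

```python
from bs4 import BeautifulSoup

# Made-up, shortened document with newlines between the <p> elements
doc = ('<body>\n<p class="title">T</p>\n'
       '<p class="story">one</p>\n<p class="story">two</p>\n</body>')
soup = BeautifulSoup(doc, "html.parser")
first_story = soup.find("p", attrs={"class": "story"})

raw_next = first_story.next_sibling                 # the whitespace text node
next_tag = first_story.find_next_sibling("p")       # the next <p> tag
prev_tag = first_story.find_previous_sibling("p")   # the previous <p> tag

print(repr(raw_next), next_tag.get_text(), prev_tag["class"])
```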
VII. Exercises
test='''<html><head></head><body><span>1234 <a href="www.test.edu.cn">This is a test!<b>abc</b></a></span> </body></html>'''
(1) Write the code that imports the BeautifulSoup library and creates a BeautifulSoup object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(test, 'html.parser')
(2) Complete the code so that pos points to the span element node in the HTML above:
pos = soup.find('span')
pos
Output:
<span>1234 <a href="www.test.edu.cn">This is a test!<b>abc</b></a></span>
(3) Complete the code so that it prints all text contained in the span element (including the text of its descendants):
print(pos.get_text())
Output:
1234 This is a test!abc
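(4) Complete the code so that it prints the text directly contained in the span element (excluding the text of its descendants):
# pos.next_sibling is the whitespace after </span>, so stripping it would give an empty string;
# the span's own direct text is its first child node
print(pos.contents[0].strip())
Output:
1234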
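More generally, an element's direct text can be collected by keeping only the children that are NavigableString objects. The sketch below is one possible approach on a made-up variant of the exercise markup, with an extra piece of trailing text added so that more than one direct string is present.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up variant of the exercise markup, with extra text after the <a> element
doc = '<span>1234 <a href="www.test.edu.cn">This is a test!<b>abc</b></a> tail</span>'
span = BeautifulSoup(doc, "html.parser").span

# Keep only text nodes that are direct children of span
direct_text = [str(c).strip() for c in span.children
               if isinstance(c, NavigableString)]
print(direct_text)
```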
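(5) Find the names of the a element's child and parent nodes
# Locate the a element node
a_tag = soup.find('a')
# Print the names of the a element's child nodes (a text node has the name None)
for child in a_tag.children:
    print("Child node name:", child.name)
# Print the name of the a element's parent node
print("Parent node name:", a_tag.parent.name)
Output:
Child node name: None
Child node name: b
Parent node name: span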
(6) Find the hyperlink information contained in the a element
# Locate the a element node
a_tag = soup.find('a')
# Get the URL of the hyperlink
link_url = a_tag.get('href')
print("Link URL:", link_url)
# Get the text of the hyperlink
link_text = a_tag.get_text()
print("Link Text:", link_text)
Output:
Link URL: www.test.edu.cn
Link Text: This is a test!abc
(7) Find the sibling information of the a element
# Locate the a element node
a_tag = soup.find('a')
# Get the text of the next sibling node, or None if there is no next sibling
next_sibling_text = a_tag.next_sibling.string.strip() if a_tag.next_sibling else None
print("Next Sibling Text:", next_sibling_text)
# Get the text of the previous sibling node, or None if there is no previous sibling
prev_sibling_text = a_tag.previous_sibling.string.strip() if a_tag.previous_sibling else None
print("Previous Sibling Text:", prev_sibling_text)
Output:
Next Sibling Text: None
Previous Sibling Text: 1234