Extracting Information with the BeautifulSoup Module - Alibaba Cloud Developer Community


2023-05-17



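Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. It can save you hours or even days of work.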

Prerequisites

Install beautifulsoup4 and lxml:

pip install beautifulsoup4

pip install lxml

Creating a BeautifulSoup object

Import BeautifulSoup:

from bs4 import BeautifulSoup

Parse the HTML text (a local HTML file works as well) into a BeautifulSoup object:

import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen("http://www.baidu.com")
html = BeautifulSoup(response.read().decode(), "lxml")
print(type(html)) # <class 'bs4.BeautifulSoup'>
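The parenthetical above mentions local HTML files. A minimal sketch of both input styles follows; the sample markup and the file name sample.html are made up for illustration, and the built-in "html.parser" is used instead of lxml only so the sketch runs without an extra install.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
doc = "<html><head><title>Python</title></head><body><p>Python is great!</p></body></html>"

# 1) Parse directly from a string.
soup_from_text = BeautifulSoup(doc, "html.parser")

# 2) Parse from a local file by passing an open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(doc)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_text.title)  # <title>Python</title>
print(soup_from_file.p)      # <p>Python is great!</p>
```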

Kinds of objects

Tag
A Tag object corresponds to a tag in the HTML source; an HTML document is made up of such tags. For example:

<html>
    <head>
        <title>Python</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <p>Python真好！</p>
    </body>
</html>

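Everything enclosed in angle brackets here is a tag. Some tags come in pairs (an opening and a closing tag), others stand alone.

To get a tag from a BeautifulSoup object, use .tagname (this returns only the first tag with that name):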

print(html.title) # <title>百度一下，你就知道</title>
print(html.a)     # <a class="toindex" href="/">百度首页</a>
print(type(html.a)) # <class 'bs4.element.Tag'>
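Because the output above depends on the live Baidu page, here is a small offline sketch of the same idea. The markup is invented for illustration, and "html.parser" is an assumption chosen to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags.
doc = '<body><p class="intro">first</p><p>second</p></body>'
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p           # only the first <p> is returned
missing = soup.table       # no such tag in the document

print(first_p)             # <p class="intro">first</p>
print(first_p.name)        # p
print(missing)             # None
```

Accessing a tag name that does not occur in the document gives None rather than raising, so chained access such as soup.table.tr fails with AttributeError on the None.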

NavigableString

Once you have a specific tag, use .string to get the text inside it:

print(html.title.string) # 百度一下，你就知道
print(html.a.string) # 百度首页
print(type(html.a.string)) # <class 'bs4.element.NavigableString'>
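A NavigableString behaves like an ordinary Python string. The sketch below, on made-up markup and with "html.parser" as an assumed parser, shows that it is a str subclass and how to convert it to a plain str.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>Python</title>", "html.parser")

text = soup.title.string
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)

# str() yields a plain string that no longer references the parse tree.
plain = str(text)

print(is_navigable, is_str)  # True True
print(plain.upper())         # PYTHON
```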

Traversing the document tree

Direct children
A tag's .contents attribute returns the tag's child nodes as a list:

print(html.head.contents)

.children returns an iterator over the same child nodes instead of a list:

Iter = html.head.children
print(type(Iter)) # <class 'list_iterator'>
print(next(Iter)) # <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
print(next(Iter)) # <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>

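.contents and .children cover only a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants; like .children, it is consumed by iteration rather than returned as a list.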
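The difference between the three attributes can be sketched on a small made-up document (the markup and the "html.parser" choice are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical <head> with two direct children and no whitespace between tags.
doc = '<head><title>Python</title><meta charset="utf-8"/></head>'
soup = BeautifulSoup(doc, "html.parser")
head = soup.head

# Direct children as a list.
contents = head.contents
# Direct children via the iterator.
child_names = [child.name for child in head.children]
# All descendants: the two tags plus the string inside <title>.
descendants = list(head.descendants)

print(contents)          # [<title>Python</title>, <meta charset="utf-8"/>]
print(child_names)       # ['title', 'meta']
print(len(descendants))  # 3
```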
Node content

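.string
If a tag has exactly one child and that child is itself a tag, .string returns the innermost text. If the tag has more than one child, .string returns None.

.strings
Yields every string inside the tag; iterate over it to read them.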
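A short sketch of the two cases described above, using invented markup and the built-in "html.parser" (both assumptions):

```python
from bs4 import BeautifulSoup

doc = "<div><p><b>only text</b></p><ul><li>one</li><li>two</li></ul></div>"
soup = BeautifulSoup(doc, "html.parser")

# <p> has a single child <b>, which holds a single string: .string reaches it.
single = soup.p.string
# <ul> has two children, so .string is ambiguous and gives None.
ambiguous = soup.ul.string
# .strings walks every string under the tag.
all_strings = list(soup.ul.strings)

print(single)       # only text
print(ambiguous)    # None
print(all_strings)  # ['one', 'two']
```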

Parent node: .parent

Sibling nodes: .next_sibling, .previous_sibling

All siblings: .next_siblings, .previous_siblings
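The navigation attributes listed above can be tried on a small list. This is a sketch on hypothetical markup, parsed with "html.parser" as an assumption; note that whitespace between tags in real pages also counts as a sibling node.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
items = soup.find_all("li")
first, middle, last = items

parent_name = middle.parent.name
next_text = middle.next_sibling.get_text()
prev_text = middle.previous_sibling.get_text()

# The plural forms iterate outward from the starting node.
after_first = [s.get_text() for s in first.next_siblings]
before_last = [s.get_text() for s in last.previous_siblings]

print(parent_name)             # ul
print(next_text, prev_text)    # c a
print(after_first)             # ['b', 'c']
print(before_last)             # ['b', 'a']
print(first.previous_sibling)  # None
```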

Searching the document tree

find_all( name , attrs , recursive , text , limit , **kwargs )

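The name parameter: tag name
A. A tag name: finds all tags with that name
B. A regular expression: finds tags whose names match the expression
C. A list: finds tags whose names appear in the list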
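The three forms of the name argument listed above might look like this in practice. The markup is made up, and "html.parser" is an assumed parser.

```python
import re

from bs4 import BeautifulSoup

doc = "<body><b>bold</b><p>para</p><a href='/x'>link</a></body>"
soup = BeautifulSoup(doc, "html.parser")

# A. A tag name string.
by_string = [t.name for t in soup.find_all("p")]
# B. A regular expression: tag names starting with "b" (body and b).
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# C. A list: tags named either "a" or "b", in document order.
by_list = [t.name for t in soup.find_all(["a", "b"])]

print(by_string)  # ['p']
print(by_regex)   # ['body', 'b']
print(by_list)    # ['b', 'a']
```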
Keyword arguments: attributes

html.find_all(id="news_hot_data") # find tags whose id is "news_hot_data"
html.find_all("img", class_="index-logo-src") # find all img tags whose class is "index-logo-src"
html.find_all(class_="index-logo-src",id="news_hot_data") # find tags whose class is "index-logo-src" and whose id is "news_hot_data"

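Some attributes cannot be searched this way because their names are not valid Python keyword names (for example data-* attributes). Use the attrs parameter for those:

# html.find_all(data-foo="value") # does not work: SyntaxError, a keyword name cannot contain a hyphen
html.find_all(attrs={"data-foo": "value"}) # works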
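As a runnable sketch of attribute filtering, the snippet below uses invented markup (the data-foo attribute, class and id values are hypothetical) and "html.parser" as an assumed parser.

```python
from bs4 import BeautifulSoup

doc = (
    '<div data-foo="value">x</div>'
    '<div data-foo="other">y</div>'
    '<div class="box" id="main">z</div>'
)
soup = BeautifulSoup(doc, "html.parser")

# Attribute names that are not valid Python identifiers go through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

# class is a reserved word in Python, hence the trailing underscore in class_.
by_kwargs = soup.find_all("div", class_="box", id="main")
# The same filter written with attrs.
by_attrs = soup.find_all("div", attrs={"class": "box", "id": "main"})

print(by_data)    # [<div data-foo="value">x</div>]
print(by_kwargs)  # [<div class="box" id="main">z</div>]
```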

The text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True:

html.find_all(text="31省区市新增确诊52例 本土36例")
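The sketch below shows the accepted value types on made-up markup. It uses the keyword string, which is the name Beautiful Soup has offered for this argument since version 4.4.0 (text is the older spelling used in this article); "html.parser" is an assumption.

```python
import re

from bs4 import BeautifulSoup

doc = "<ul><li>Python 3</li><li>Python 2</li><li>Ruby</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

# Exact string match: returns the matching strings, not the tags.
exact = soup.find_all(string="Ruby")
# Regular expression match.
by_regex = soup.find_all(string=re.compile("^Python"))
# List of candidate strings, results in document order.
by_list = soup.find_all(string=["Ruby", "Python 2"])
# Combined with a tag name, the tags whose .string matches are returned.
tags = soup.find_all("li", string=re.compile("Python"))

print(exact)      # ['Ruby']
print(by_regex)   # ['Python 3', 'Python 2']
print(by_list)    # ['Python 2', 'Ruby']
print(len(tags))  # 2
```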

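The limit parameter
Limits the number of results returned.
The recursive parameter
Defaults to True, which means all descendants are searched. Set it to False to search only direct children.
find( name , attrs , recursive , text , **kwargs ) is used the same way as find_all; the difference is that find returns only the first match.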
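The effect of limit, recursive, and find can be compared side by side. This is a sketch on hypothetical markup with "html.parser" as an assumed parser.

```python
from bs4 import BeautifulSoup

doc = "<div><p>1</p><section><p>2</p></section><p>3</p></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

# Default: search all descendants.
all_p = [p.get_text() for p in div.find_all("p")]
# limit stops after the given number of matches.
first_two = [p.get_text() for p in div.find_all("p", limit=2)]
# recursive=False looks only at direct children, skipping the <p> inside <section>.
direct_only = [p.get_text() for p in div.find_all("p", recursive=False)]
# find returns a single tag, or None when nothing matches.
first = div.find("p")
nothing = div.find("table")

print(all_p)        # ['1', '2', '3']
print(first_two)    # ['1', '2']
print(direct_only)  # ['1', '3']
print(first)        # <p>1</p>
print(nothing)      # None
```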
Other (important)

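print('Type of the a tag:', type(html.a))   # type of the first a tag
print('Attributes of the first a tag:', html.a.attrs)  # all attributes of the a tag (note that this is a dict)
print('Type of the a tag attributes:', type(html.a.attrs))  # type of the attribute container
print('class attribute of the a tag:', html.a.attrs['class'])   # it is a dict, so look up class by key
print('href attribute of the a tag:', html.a.attrs['href'])   # likewise, look up href by key
print('href attribute of the a tag:', html.a['href']) # shorthand with the same result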
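Two details of attribute access are easy to miss: class comes back as a list because it can hold several values, and .get() avoids a KeyError for attributes that are absent. A sketch on invented markup, with "html.parser" as an assumed parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="nav link" href="/home">Home</a>', "html.parser")
a = soup.a

classes = a["class"]        # multi-valued attribute, returned as a list
href = a.get("href")        # same as a["href"] when the attribute exists
title = a.get("title")      # missing attribute: None instead of KeyError

print(a.attrs)   # {'class': ['nav', 'link'], 'href': '/home'}
print(classes)   # ['nav', 'link']
print(href)      # /home
print(title)     # None
```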

CSS selectors

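As the name suggests, this requires some CSS knowledge. The basics are:

A tag name needs no prefix
A class is prefixed with a dot: .
An id is prefixed with a hash sign: #
p:not(.haha) selects p tags whose class is not haha
selector1selector2selector3 (written with no spaces between them) selects tags that satisfy all three
parent > child, e.g. p > a selects a tags that are direct children of a p tag
ancestor descendant (separated by a space), e.g. p a selects all a tags anywhere inside a p tag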

Use the select() method with the selector as its argument, and call .get_text() on a selected tag to get its text:

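html.select("title")  # select by tag name
html.select(".haha") # select by class name haha
html.select("#box1") # select by id box1
html.select("p #box2") # select the element with id box2 inside a p tag
html.select('a[class="box"][id="p1"]') # add attribute conditions: selects tags that satisfy all three conditions (a, class="box", id="p1") at the same time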
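The selectors above can be checked against a small document. The markup, class names, and ids below are made up to mirror the examples, and "html.parser" is an assumed parser; select() relies on the soupsieve package, which is normally installed together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

doc = (
    '<div>'
    '<p class="haha" id="box1">one</p>'
    '<p>two <a class="box" id="p1" href="#">link</a></p>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

# select() always returns a list of matching tags.
by_class = [t.get_text() for t in soup.select("p.haha")]
not_haha = [t.get_text() for t in soup.select("p:not(.haha)")]
direct_child = [t.get_text() for t in soup.select("p > a")]
by_attrs = [t.get_text() for t in soup.select('a[class="box"][id="p1"]')]

# select_one() returns the first match, or None.
box1_text = soup.select_one("#box1").get_text()

print(by_class)      # ['one']
print(not_haha)      # ['two link']
print(direct_child)  # ['link']
print(by_attrs)      # ['link']
print(box1_text)     # one
```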