Beautiful Soup 4.12.0 Documentation (Part 2) - Alibaba Cloud Developer Community

Beautiful Soup 4.12.0 Documentation (Part 1): https://developer.aliyun.com/article/1515340

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)

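The find_all() method looks through a tag's descendants and retrieves every descendant that matches your filters. Several examples appeared in the section on kinds of filters; here are a few more:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or for id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().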

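The name argument

Pass in a value for name and Beautiful Soup will only consider tags with that name. Text strings are ignored, as are tags whose names don't match.

This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

Recall from the section on kinds of filters that the value of name can be a string, a regular expression, a list, a function, or the value True.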
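To make the five filter kinds concrete, here is a minimal sketch that runs each of them against a shortened copy of the "three sisters" document. The helper name has_class_but_no_id and the variable names are only illustrative; html.parser is assumed as the parser so that no extra dependency is needed.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Illustrative function filter: tags with a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# A string matches the exact tag name.
by_string = [t.name for t in soup.find_all("b")]
# A regular expression is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# A list matches any of the names it contains.
by_list = [t.name for t in soup.find_all(["a", "b"])]
# A function receives each tag and returns True to keep it.
by_function = [t.name for t in soup.find_all(has_class_but_no_id)]
# True matches every tag, but no text strings.
by_true = [t.name for t in soup.find_all(True)]

print(by_string, by_regex, by_list, by_function, len(by_true))
```

Results come back in document order, which is why the regular expression finds <body> before <b>.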

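The keyword arguments

Any keyword argument that isn't recognized is turned into a filter on one of a tag's attributes. If you pass in a value for an argument called id, Beautiful Soup filters against each tag's id attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup filters against each tag's href attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

This code finds all tags that have an id attribute, regardless of its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]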

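Some attributes, like the data-* attributes in HTML5, have names that can't be used as the names of keyword arguments:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can still use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]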

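You can't use a keyword argument to search for HTML's "name" attribute, because Beautiful Soup uses the name argument to hold the name of the tag itself. Instead, give a value to "name" in the attrs argument:

name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email")
# []

name_soup.find_all(attrs={"name": "email"})
# [<input name="email"/>]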

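Searching by CSS class

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument gives you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]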

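Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But an exact match fails if the class names appear in a different order than in the document:

css_soup.find_all("p", class_="strikeout body")
# []

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]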
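The order sensitivity described above is easy to trip over, so here is a small sketch that compares three ways of asking for a tag with both classes. The function has_both_classes and the second <p> tag are made up for illustration; they are not part of the original example.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="body"></p>', "html.parser"
)

# An exact string match compares against the attribute value as written,
# so the reversed order finds nothing.
wrong_order = css_soup.find_all("p", class_="strikeout body")

# A CSS selector checks for each class independently of order.
via_selector = css_soup.select("p.strikeout.body")


def has_both_classes(tag):
    """Illustrative function filter: <p> tags carrying both classes."""
    classes = tag.get("class", [])
    return tag.name == "p" and {"body", "strikeout"} <= set(classes)


# A function filter can express the same order-independent test.
via_function = css_soup.find_all(has_both_classes)

print(wrong_order, via_selector, via_function)
```

Both order-independent approaches return only the first paragraph, because the second one lacks the "strikeout" class.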

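In older versions of Beautiful Soup, which don't have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for "class" is the string (or regular expression, or whatever) you want to search for:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]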

The string argument

With the string argument, you can search for strings instead of tags. As with name and the attribute keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

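Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

soup.find_all("a", text="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]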

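The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL: it tells Beautiful Soup to stop gathering results once it has found that many.

There are three links in the "three sisters" document, but this code only finds the first two:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]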

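The recursive argument

If you call mytag.find_all(), Beautiful Soup examines all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

Here's that part of the document: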

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

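The <title> tag is beneath the <html> tag, but it's not directly beneath it: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it's allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag's immediate children, it finds nothing.

Beautiful Soup offers many tree-searching methods, and they mostly take the same arguments as find_all(): name, attrs, string, limit, and the attribute keyword arguments. The recursive argument is different: only find_all() and find() support it.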

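Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, there is a shortcut for it. If you treat a BeautifulSoup object or a Tag object as though it were a function, it's the same as calling find_all() on that object. These two lines of code are equivalent: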

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)

find()

find(name , attrs , recursive , string , **kwargs )

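The find_all() method scans the entire document looking for results, but sometimes you only want one result. If you know a document has only one <body> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all(), you can use the find() method. These two lines of code are nearly equivalent: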

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

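The only difference is that find_all() returns a list containing the single result, while find() returns the result itself.

If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None: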

print(soup.find("nosuchtag"))

# None
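Because find() can return None, chained calls such as soup.find("x").get_text() raise AttributeError when the tag is missing. The sketch below shows one way to guard against that; the helper text_of and its default parameter are hypothetical names chosen for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)


def text_of(tree, tag_name, default=""):
    """Return the text of the first matching tag, or a default if none exists."""
    tag = tree.find(tag_name)
    # find() returns None when nothing matches, so check before using it.
    return tag.get_text() if tag is not None else default


title_text = text_of(soup, "title")
missing_text = text_of(soup, "nosuchtag", default="(none)")
print(title_text, missing_text)
```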

Remember the soup.head.title trick from the section on navigating using tag names? That shortcut works by repeatedly calling find():

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

find_parents() 和 find_parent()

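find_parents(name, attrs, string, limit, **kwargs)

find_parent(name, attrs, string, **kwargs)

A lot of space above went to find_all() and find(). The Beautiful Soup API defines ten other methods for searching the tree. Five of them take basically the same arguments as find_all(), and the other five take basically the same arguments as find(). The only difference is which part of the tree they search.

First let's consider find_parents() and find_parent(). Remember that find_all() and find() work their way down the tree, looking at a tag's descendants. These two methods do the opposite: they work their way up the tree, looking at the parents of a tag or string. Let's try them out, starting from a string buried deep in the "three sisters" document: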

a_string = soup.find(string="Lacie")
a_string
# 'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []

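One of the three <a> tags is the direct parent of the string in question, so the search finds it. One of the three <p> tags is an indirect parent of the string, so the search finds that as well. There is also a <p> tag with the CSS class "title" in the document, but it is not one of this string's parents, so find_parents() can't find it.

You may have noticed the connection between find_parent() and find_parents() and the .parent and .parents attributes mentioned earlier. The connection is very strong: these search methods use .parents to iterate over all the parents, and check each one against the provided filter.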
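To illustrate that relationship, the sketch below runs find_parents() next to a hand-written loop over .parents on a one-paragraph document. The list comprehension is only an approximation of what the library does internally (it handles a plain tag-name filter and nothing else), so treat it as a mental model rather than the actual implementation.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">and their names were '
    '<a class="sister" id="link2">Lacie</a> and others</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
a_string = soup.find(string="Lacie")

# The library method.
via_method = a_string.find_parents("p")

# A rough equivalent: walk up through .parents and keep matching tags.
via_parents = [parent for parent in a_string.parents if parent.name == "p"]

# .parents yields the nearest parent first, then works its way up.
parent_names = [parent.name for parent in a_string.parents]
print(via_method == via_parents, parent_names)
```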

find_next_siblings() 和 find_next_sibling()

find_next_siblings(name, attrs, string, limit, **kwargs)

find_next_sibling(name, attrs, string, **kwargs)

These methods use .next_siblings to iterate over the siblings that come after an element in the tree. The find_next_siblings() method returns all the following siblings that match, and find_next_sibling() returns only the first one:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>

find_previous_siblings() 和 find_previous_sibling()

find_previous_siblings(name, attrs, string, limit, **kwargs)

find_previous_sibling(name, attrs, string, **kwargs)

These methods use .previous_siblings to iterate over the siblings that come before an element in the tree. The find_previous_siblings() method returns all the preceding siblings that match, and find_previous_sibling() returns only the first one:

last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
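Sibling searches are handy when related data sits side by side rather than nested. As a sketch, the following pairs each <dt> in a definition list with the <dd> that follows it; the markup and variable names are invented for this example.

```python
from bs4 import BeautifulSoup

html = "<dl><dt>Name</dt><dd>Elsie</dd><dt>Home</dt><dd>a well</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# For each term, take the first <dd> that follows it at the same level.
pairs = {
    dt.get_text(): dt.find_next_sibling("dd").get_text()
    for dt in soup.find_all("dt")
}

# Going backwards, the nearest sibling is returned first.
last_dd = soup.find_all("dd")[-1]
earlier_terms = [dt.get_text() for dt in last_dd.find_previous_siblings("dt")]
print(pairs, earlier_terms)
```

Note the reversed order of the second result: find_previous_siblings() walks away from the starting element, so the closest match comes first.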

find_all_next() 和 find_next()

find_all_next(name, attrs, string, limit, **kwargs)

find_next(name, attrs, string, **kwargs)

These methods use .next_elements to iterate over whatever tags and strings come after an element in the document. The find_all_next() method returns all matches, and find_next() returns only the first match:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_next(string=True)
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']

first_link.find_next("p")
# <p class="story">...</p>

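In the first example, the string "Elsie" showed up even though it is contained within the <a> tag we started from. In the second example, the last <p> tag in the document showed up even though it is not in the same part of the tree as the <a> tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.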

find_all_previous() 和 find_previous()

find_all_previous(name, attrs, string, limit, **kwargs)

find_previous(name, attrs, string, **kwargs)

These methods use .previous_elements to iterate over the tags and strings that come before an element in the document. The find_all_previous() method returns all matches, and find_previous() returns only the first match:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]

first_link.find_previous("title")
# <title>The Dormouse's story</title>

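The call to find_all_previous("p") found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the <p> tag that contains the <a> tag we started with. This shouldn't be too surprising: the search looks at every tag that shows up earlier in the document than the starting point, and a <p> tag that contains an <a> tag must have been opened before the <a> tag it contains.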

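CSS selectors

BeautifulSoup and Tag objects support CSS selectors through their .css property. The actual selector implementation is handled by the Soup Sieve package, available on PyPI as soupsieve. If you installed Beautiful Soup through pip, Soup Sieve was installed at the same time, so you don't have to do anything extra.

The Soup Sieve documentation lists all the currently supported CSS selectors, but here are some of the basics: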

soup.css.select("title")
# [<title>The Dormouse's story</title>]

soup.css.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags:

soup.css.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.css.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath other tags:

soup.css.select("head > title")
# [<title>The Dormouse's story</title>]

soup.css.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.css.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.css.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.css.select("body > a")
# []

Find the siblings of tags:

soup.css.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

soup.css.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.css.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.css.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

soup.css.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.css.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

soup.css.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

soup.css.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.css.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.css.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.css.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.css.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.css.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

As a convenience, you can call select() and select_one() directly on a BeautifulSoup or Tag object, omitting the .css property:

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

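CSS selector support is a convenience for people who already know the CSS selector syntax. Everything shown here can also be done with the Beautiful Soup API. If CSS selectors are all you need, you should parse the document with lxml instead, because it is much faster. What Soup Sieve adds is the ability to combine CSS selectors with the Beautiful Soup API.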
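As a small illustration of mixing the two styles, the sketch below starts with a CSS selector and continues with a regular search method, then does the reverse. It uses the select() and select_one() shortcuts so that it also runs on versions older than 4.12.0; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Start with a CSS selector, then continue with a search method.
later_sisters = soup.select_one("#link1").find_next_siblings("a")
later_ids = [a["id"] for a in later_sisters]

# Start with a search method, then narrow down with a CSS selector.
story = soup.find("p", class_="story")
lacie_links = story.select('a[href$="lacie"]')
print(later_ids, lacie_links)
```

Both styles return ordinary Tag objects, which is why the calls can be chained in either direction.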

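Soup Sieve advanced features

Soup Sieve offers a substantial API beyond the select() and select_one() methods, and you can access most of that API through the .css attribute of a Tag or BeautifulSoup object. What follows is just a list of the supported methods; see the Soup Sieve documentation for full details.

The iselect() method works the same as select(), but it returns a generator instead of a list: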

[tag['id'] for tag in soup.css.iselect(".sister")]

# ['link1', 'link2', 'link3']

The closest() method returns the nearest parent of a given Tag that matches a CSS selector, similar to Beautiful Soup's find_parent() method:

elsie = soup.css.select_one(".sister")
elsie.css.closest("p.story")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

The match() method returns a boolean indicating whether a specific Tag matches a selector:

elsie.css.match("#link1")
# True

elsie.css.match("#link2")
# False

The filter() method returns the subset of a tag's direct children that match a selector:

[tag.string for tag in soup.find('p', 'story').css.filter('a')]
# ['Elsie', 'Lacie', 'Tillie']

The escape() method escapes CSS identifiers that would otherwise be invalid:

soup.css.escape("1-strange-identifier")
# '\\31 -strange-identifier'

Namespaces in CSS selectors

If you've parsed XML that defines namespaces, you can use them in CSS selectors:

from bs4 import BeautifulSoup
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
<ns1:child>I'm in namespace 1</ns1:child>
<ns2:child>I'm in namespace 2</ns2:child>
</tag> """
namespace_soup = BeautifulSoup(xml, "xml")
namespace_soup.css.select("child")
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
namespace_soup.css.select("ns1|child")
# [<ns1:child>I'm in namespace 1</ns1:child>]

Beautiful Soup tries to use the namespace prefixes it saw while parsing the document, but you can always provide your own dictionary of abbreviations:

namespaces = dict(first="http://namespace1/", second="http://namespace2/")
namespace_soup.css.select("second|child", namespaces=namespaces)
# [<ns2:child>I'm in namespace 2</ns2:child>]

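History of CSS selector support

The .css property was added in Beautiful Soup 4.12.0. Before that, only the .select() and .select_one() convenience methods were available.

The Soup Sieve integration was added in Beautiful Soup 4.7.0. Earlier versions had the .select() method, but it supported only the most commonly used CSS selectors.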

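Modifying the tree

Beautiful Soup's main strength is searching the parse tree, but you can also modify the tree and write your changes out as a new HTML or XML document.

Changing tag names and attributes

This was covered earlier, in the section on Tag.attrs, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes: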

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

Modifying .string

If you set a tag's .string attribute to a new string, the tag's contents are replaced with that string:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all of their contents will be destroyed.

append()

You can add to a tag's contents with Tag.append(). It works just like calling .append() on a Python list:

soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

soup
# <a>FooBar</a>
soup.a.contents
# ['Foo', 'Bar']

extend()

Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which adds every element of a list to a tag, in order:

soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
soup.a.extend(["'s", " ", "on"])

soup
# <a>Soup's on</a>
soup.a.contents
# ['Soup', "'s", ' ', 'on']

NavigableString() 和 .new_tag()

If you need to add a string to a document, you can pass a Python string to append(), or you can call the NavigableString constructor:

from bs4 import NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']

If you want to create a comment or some other subclass of NavigableString, just call the constructor:

from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# ['Hello', ' there', 'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

If you need to create a whole new tag, the best solution is to call the factory method BeautifulSoup.new_tag():

soup = BeautifulSoup("<b></b>", 'html.parser')
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.

insert()

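Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the end of its parent's .contents. It is inserted at whatever numeric position you specify, just like .insert() on a Python list: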

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]

insert_before() 和 insert_after()

The insert_before() method inserts tags or strings immediately before something else in the parse tree:

soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>leave</b>

The insert_after() method inserts tags or strings immediately after something else in the parse tree:

div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
soup.b
# <b><i>Don't</i> you <div>ever</div>leave</b>
soup.b.contents
# [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

clear()

Tag.clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)
# None

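At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract() on a child of the element you extracted: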

my_string = i_tag.string.extract()
my_string
# 'example.com'

print(my_string.parent)
# None
i_tag
# <i></i>

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i

i_tag.decompose()
a_tag
# <a href="http://example.com/">I linked to </a>

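The behavior of a decomposed Tag or NavigableString is not defined, and you should not use it for anything. If you're not sure whether something has been decomposed, you can check its .decomposed property (new in Beautiful Soup 4.9.0):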

i_tag.decomposed
# True
a_tag.decomposed
# False
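A common use of decompose() is stripping unwanted elements before further processing. The sketch below removes a script tag and a paragraph with a hypothetical "ad" class; the markup is invented for illustration, and the .decomposed check assumes Beautiful Soup 4.9.0 or later.

```python
from bs4 import BeautifulSoup

markup = (
    "<div><p>Keep me</p><script>track()</script>"
    '<p class="ad">Buy now</p></div>'
)
soup = BeautifulSoup(markup, "html.parser")

# Collect the targets first, then destroy them one by one.
targets = list(soup.find_all("script")) + list(soup.select("p.ad"))
for tag in targets:
    tag.decompose()

remaining = str(soup)
# After decompose(), the only safe thing to do with a tag is check this flag.
flags = [tag.decomposed for tag in targets]
print(remaining, flags)
```

If you need the removed elements afterwards, extract() is the better choice, since it returns them intact.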

replace_with()

PageElement.replace_with() removes a tag or string from the tree and replaces it with one or more tags or strings of your choice:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.com"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.com</b></a>

bold_tag = soup.new_tag("b")
bold_tag.string = "example"
i_tag = soup.new_tag("i")
i_tag.string = "net"
a_tag.b.replace_with(bold_tag, ".", i_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>

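replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.

The ability to pass multiple arguments into replace_with() is new in Beautiful Soup 4.10.0.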

wrap()

PageElement.wrap() wraps an element in the Tag object you specify. It returns the new wrapper:

soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.
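As a sketch of "stripping out markup" in practice, the following unwraps every <b> and <i> tag in a paragraph and then merges the leftover string fragments with smooth() (which assumes Beautiful Soup 4.8.0 or later). The sample sentence is made up for this example.

```python
from bs4 import BeautifulSoup

markup = "<p>I <b>really</b> like <i>plain</i> <b>text</b>.</p>"
soup = BeautifulSoup(markup, "html.parser")

# find_all() returns a list, so it is safe to modify the tree while looping.
unwrapped = [tag.unwrap() for tag in soup.find_all(["b", "i"])]

# Unwrapping leaves several adjacent strings behind; merge them.
fragments_before = len(soup.p.contents)
soup.smooth()
fragments_after = len(soup.p.contents)

result = str(soup)
print(result, fragments_before, fragments_after)
```

The tags returned by unwrap() are now empty shells, since their contents stayed behind in the document.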

smooth()

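After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup has no problem with this, but because it can't happen in a freshly parsed document, you might not expect behavior like the following: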

soup = BeautifulSoup("<p>A one</p>", 'html.parser')
soup.p.append(", a two")
soup.p.contents
# ['A one', ', a two']
print(soup.p.encode())
# b'<p>A one, a two</p>'
print(soup.p.prettify())
# <p>
#  A one
#  , a two
# </p>

You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings:

soup.smooth()
soup.p.contents
# ['A one, a two']
print(soup.p.prettify())
# <p>
#  A one, a two
# </p>

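This method is new in Beautiful Soup 4.8.0.

Output

Pretty-printing

The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string: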

markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

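Since it adds whitespace (in the form of newlines and indentation), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with.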
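To see why prettify() is unsuitable for rewriting documents, the sketch below compares the text of a document before and after a round trip through prettify(). The markup is a made-up example; the exact whitespace in the pretty output may vary between versions, so the code only checks that the text changed.

```python
from bs4 import BeautifulSoup

markup = "<p>I linked to <i>example.com</i>.</p>"
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)       # round-trips the markup unchanged
pretty = soup.prettify()  # adds newlines and indentation

text_compact = soup.get_text()
# Re-parse the pretty output and extract its text again.
text_pretty = BeautifulSoup(pretty, "html.parser").get_text()

print(repr(text_compact))
print(repr(text_pretty))
```

The second string contains line breaks that were never in the source, including one between "example.com" and the period that follows it.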

Non-pretty printing

If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object or on a Tag within it:

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'

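In Python 3, str() returns an ordinary (Unicode) string. See the section on specifying encodings for other options.

You can also call encode() to get a bytestring, and decode() to get Unicode.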
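The relationship between str(), decode(), and encode() can be sketched with a string that contains one non-ASCII character. The sample text and the choice of Latin-1 as a second encoding are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

as_text = str(soup)                 # Unicode string
as_unicode = soup.decode()          # same content, also a Unicode string
as_utf8 = soup.encode("utf-8")      # bytes, two bytes for the accented letter
as_latin1 = soup.encode("latin-1")  # bytes, one byte for the accented letter

print(as_text, as_utf8, as_latin1)
```

The two byte strings differ in length even though they represent the same document, which is why the encoding has to be known when the bytes are read back.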

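Output formatters

If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they'll be converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
str(soup)
# '“Dammit!” he said.'

If you then convert the document to a bytestring, the Unicode characters are encoded as UTF-8. You won't get the HTML entities back:

soup.encode("utf8")
# b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'

By default, the only characters that are escaped on output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML: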

soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

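You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes five possible values for formatter.

The default is formatter="minimal". Strings are processed only enough to ensure that Beautiful Soup generates valid HTML/XML: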

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="minimal"))
# <p>
#  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
# </p>

If you pass in formatter="html", Beautiful Soup converts Unicode characters to HTML entities whenever possible:

print(soup.prettify(formatter="html"))
# <p>
#  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
# </p>

If you pass in formatter="html5", the result is similar to formatter="html", but Beautiful Soup omits the closing slash in HTML void tags like "br":

br = BeautifulSoup("<br>", 'html.parser').br
print(br.encode(formatter="html"))
# b'<br/>'
print(br.encode(formatter="html5"))
# b'<br>'

In addition, any attribute whose value is the empty string becomes an HTML-style boolean attribute:

option = BeautifulSoup('<option selected=""></option>', 'html.parser').option
print(option.encode(formatter="html"))
# b'<option selected=""></option>'
print(option.encode(formatter="html5"))
# b'<option selected></option>'

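(This behavior is new as of Beautiful Soup 4.10.0.)

If you pass in formatter=None, Beautiful Soup does not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples: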

print(soup.prettify(formatter=None))
# <p>
#  Il a dit <<Sacré bleu!>>
# </p>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
print(link_soup.a.encode(formatter=None))
# b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
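For a side-by-side view, here is a sketch that serializes the same paragraph with three of the formatter values. The sample sentence is invented so that a single string contains an accented letter, an ampersand, and an angle bracket.

```python
from bs4 import BeautifulSoup

# After parsing, the paragraph text is: Café & bar <3
soup = BeautifulSoup("<p>Caf\u00e9 &amp; bar &lt;3</p>", "html.parser")

# Escapes only what is needed for valid markup.
minimal = soup.p.decode(formatter="minimal")
# Additionally turns characters into named entities where possible.
html_entities = soup.p.decode(formatter="html")
# No escaping at all; the result is not valid HTML here.
raw = soup.p.decode(formatter=None)

print(minimal)
print(html_entities)
print(raw)
```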

Formatter objects

If you need more sophisticated control over your output, you can instantiate one of Beautiful Soup's formatter classes and pass that object in as formatter.

class bs4.HTMLFormatter

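Used to customize the formatting rules for HTML documents.

Here's a formatter that converts strings to uppercase, whether they occur in a text node or in an attribute value: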

from bs4.formatter import HTMLFormatter
def uppercase(str):
    return str.upper()
formatter = HTMLFormatter(uppercase)
print(soup.prettify(formatter=formatter))
# <p>
#  IL A DIT <<SACRÉ BLEU!>>
# </p>
print(link_soup.a.prettify(formatter=formatter))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
#  A LINK
# </a>

Here's a formatter that increases the indentation width when pretty-printing:

formatter = HTMLFormatter(indent=8)
print(link_soup.a.prettify(formatter=formatter))
# <a href="http://example.com/?foo=val1&bar=val2">
#         A link
# </a>
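
class bs4.XMLFormatter

Used to customize the formatting rules for XML documents.

Writing your own formatter

Subclassing HTMLFormatter or XMLFormatter gives you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:

attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
print(attr_soup.p.encode())
# b'<p a="3" m="2" z="1"></p>'

To change this, you can override the attributes() method, which controls which attributes are output and in what order. The implementation below keeps the original order and also filters out the attribute called "m" whenever it appears:
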
class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            if k == 'm':
                continue
            yield k, v
print(attr_soup.p.encode(formatter=UnsortedAttributes()))
# b'<p z="1" a="3"></p>'

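One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will still call your entity substitution function, just in case you've written a custom function that counts all the strings in the document or something similar, but it ignores the return value: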

from bs4.element import CData
soup = BeautifulSoup("<a></a>", 'html.parser')
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="html"))
# <a>
#  <![CDATA[one < three]]>
# </a>

get_text()

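If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, including text inside descendant tags, as a single Unicode string: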

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

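soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself: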

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

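As of Beautiful Soup 4.9.0, when lxml or html.parser is in use, the contents of <script>, <style>, and <template> tags are generally not considered to be "text", since those tags are not part of the human-visible content of the page.

As of Beautiful Soup 4.10.0, you can call get_text(), .strings, or .stripped_strings on a NavigableString object. It will return either the object itself or nothing, so the only reason to do this is when you're iterating over a mixed list.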
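The sketch below puts these pieces together on a small page that contains a <style> and a <script> tag. It assumes Beautiful Soup 4.9.0 or later with html.parser, so that the contents of those tags are skipped; the markup itself is made up.

```python
from bs4 import BeautifulSoup

markup = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello, <b>world</b>!</p>"
    "<script>alert('hi')</script></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# Join the stripped text fragments with a single space.
visible = soup.get_text(" ", strip=True)

# The same fragments as a list, for manual processing.
pieces = list(soup.stripped_strings)

print(repr(visible), pieces)
```

Joining with a separator inserts it between every pair of fragments, which is why a space appears before the exclamation mark. Working from the list of pieces gives you full control over how they are recombined.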

Beautiful Soup 4.12.0 Documentation (Part 3): https://developer.aliyun.com/article/1515386
