How to get text and corresponding tag with BeautifulSoup?




pre-commit suddenly started to fail installing the isort hook in our builds today with the following error

[INFO] Installing environment for https://github.com/pycqa/isort.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
An unexpected error has occurred: CalledProcessError: command: ('/builds/.../.cache/pre-commit/repo0_h0f938/py_env-python3.8/bin/python', '-mpip', 'install', '.')
return code: 1
expected return code: 0
[...]
stderr:
      ERROR: Command errored out with exit status 1:
[...]
        File "/tmp/pip-build-env-_3j1398p/overlay/lib/python3.8/site-packages/poetry/core/masonry/api.py", line 40, in prepare_metadata_for_build_wheel
          poetry = Factory().create_poetry(Path(".").resolve(), with_groups=False)
        File "/tmp/pip-build-env-_3j1398p/overlay/lib/python3.8/site-packages/poetry/core/factory.py", line 57, in create_poetry
          raise RuntimeError("The Poetry configuration is invalid:\n" + message)
      RuntimeError: The Poetry configuration is invalid:
        - [extras.pipfile_deprecated_finder.2] 'pip-shims<=0.3.4' does not match '^[a-zA-Z-_.0-9]+$'

It seems to be related to the Poetry configuration.



I have some text that contains HTML tags, something like:

from bs4 import BeautifulSoup
text = "<p>Some text</p> <h1>Some text</h1> ...."
soup = BeautifulSoup(text)

I parsed this text using BeautifulSoup. I would like to extract every element together with its text and its tag. I tried:

for sent in soup:
    print(sent.text)  # OK
    print(sent.tag)   # not OK: NavigableString does not have a tag attribute

I also tried soup.find_all() and got stuck at the same point: I have access to the text but not to the original tag.





Instead of tag, use name to get the element's tag name:

for tag in soup.find_all():
    print(tag.text, tag.name)

Pass the parameter 'html.parser' explicitly. If no parser is specified, BeautifulSoup picks the best one installed, which is often lxml, and lxml slightly reshapes the structure by wrapping partial HTML in <html> and <body>.

Example

from bs4 import BeautifulSoup
html = '''<p>Some text</p><h1>Some text</h1>'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
    print(tag.text, tag.name)

Output

Some text p
Some text h1
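
If the HTML contains nested tags, tag.text on an outer tag also includes the text of its children, so the same text can be reported more than once. One possible way around this, shown here only as a minimal sketch, is to walk the text nodes and look up each node's parent tag. The helper name text_with_tags and the sample HTML are made up for illustration; they are not part of the original question.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample input with a nested <b> tag.
html = "<div><p>Some <b>bold</b> text</p><h1>Title</h1></div>"
soup = BeautifulSoup(html, "html.parser")


def text_with_tags(soup):
    """Return (tag name, text) pairs, one per non-empty text node."""
    pairs = []
    for node in soup.descendants:
        # Text nodes are NavigableString objects; their enclosing
        # element is available as node.parent.
        if isinstance(node, NavigableString) and node.strip():
            pairs.append((node.parent.name, node.strip()))
    return pairs


pairs = text_with_tags(soup)
for name, text in pairs:
    print(name, text)
```

Each piece of text is attributed to its innermost enclosing tag, so "bold" is reported under b rather than being repeated for p and div. Note that comments are also NavigableString objects, so real pages may need an extra filter.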