GitHub: https://github.com/buriy/python-readability
PyPI: https://pypi.org/project/readability-lxml/
Installation
$ pip install readability-lxml
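Code example
# -*- coding: utf-8 -*-
from readability import Document
import requests

url = "https://blog.csdn.net/mouday/article/details/94021769"

# Download the page and decode it as UTF-8
response = requests.get(url)
response.encoding = "utf-8"

# Build a readability Document from the raw HTML
doc = Document(response.text)
print(doc.title())    # page title
print(doc.summary())  # main content, returned as cleaned-up HTML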
After trying it on several pages, I found that the main content is extracted correctly for some pages but incorrectly for others.
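
Since extraction does not work on every site, it may help to sanity-check the result before using it. The sketch below is only one possible heuristic and is not part of readability-lxml: it strips the tags from the HTML string returned by doc.summary() and treats a very short result as a likely failure. The names extracted_text and looks_like_article, and the 200-character threshold, are assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extracted_text(summary_html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(summary_html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def looks_like_article(summary_html, min_chars=200):
    """Hypothetical check: True if the extracted text has at least min_chars characters.

    The threshold is arbitrary and should be tuned for the sites in question.
    """
    return len(extracted_text(summary_html)) >= min_chars
```

With the example above, this could be called as looks_like_article(doc.summary()); when it returns False, the page is probably one where extraction failed and needs a different approach, such as a site-specific parser. A length check only catches empty or near-empty results; it cannot tell whether a long result is actually the right part of the page.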