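goose is a library for extracting the main body text of web articles. Its extraction quality is not great, but it is worth a try if your requirements are modest.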
GitHub: https://github.com/grangier/python-goose
Installation
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
Code example
# -*- coding: utf-8 -*-
# @File : pygoose_demo.py
# @Date : 2018-07-25
# @Author : Peng Shiyu

from goose import Goose
from goose.text import StopWordsChinese

# Use the Chinese stop-word list so Chinese pages are handled
g = Goose({'stopwords_class': StopWordsChinese})

url = "http://www.xinhuanet.com/"
article = g.extract(url=url)

print(article.title)
print(article.cleaned_text)
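
Since extraction quality varies, it can help to check whether any usable body text actually came back before relying on the result. The following is a minimal sketch, not part of goose itself: the helper name `extract_or_none` and the `min_chars` threshold are hypothetical choices made for illustration, and the helper only uses the `extract(url=...)` call and the `title` and `cleaned_text` attributes shown in the example above.

```python
def extract_or_none(extractor, url, min_chars=50):
    """Return (title, text) for url, or None if too little body text came back.

    `extractor` is any object with an extract(url=...) method whose result has
    `title` and `cleaned_text` attributes, such as the Goose instance `g` above.
    `min_chars` is an arbitrary illustrative threshold, not a goose setting.
    """
    article = extractor.extract(url=url)
    # cleaned_text may be empty when no article body was found
    text = (article.cleaned_text or "").strip()
    if len(text) < min_chars:
        return None
    return article.title, text
```

With the `g` and `url` defined earlier, this would be called as `extract_or_none(g, url)`. A `None` result signals that the page should be handled some other way.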