Installing and Using BeautifulSoup4 - Alibaba Cloud Developer Community

2017-11-27

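I. Installing BeautifulSoup4
   Method 1: open a command prompt (cmd) and run: easy_install beautifulsoup4 (or, with pip: pip install beautifulsoup4). Note that the package named "BeautifulSoup", without the 4, installs the legacy 3.x series.
   Method 2: download the source from http://www.crummy.com/software/BeautifulSoup/bs4/download/, open a command prompt, change into the downloaded directory, and run: python setup.py install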
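
After either method, it can help to confirm that the package is importable. The following is a minimal sketch, not part of the original article; the helper name bs4_installed is hypothetical, and it relies only on the standard library plus the bs4.__version__ attribute.

```python
import importlib.util


def bs4_installed():
    """Return True if the bs4 package can be found in this environment."""
    return importlib.util.find_spec("bs4") is not None


if bs4_installed():
    import bs4
    print("BeautifulSoup version:", bs4.__version__)
else:
    print("bs4 not found; install it with: pip install beautifulsoup4")
```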

II. Using BeautifulSoup4
1. Import
    from bs4 import BeautifulSoup
    Note: with BeautifulSoup 3.x, the import is instead: from BeautifulSoup import BeautifulSoup
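2. Example
    The HTML document:
    html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""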

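Code:
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise BeautifulSoup picks one and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')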

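From here the various features are available.

  soup.X (X is any tag name; returns the first matching tag in full, including its attributes and contents)

For example: soup.title

   # <title>The Dormouse's story</title>

   soup.p

   # <p class="title"><b>The Dormouse's story</b></p>

  soup.a (note: returns only the first match)

   # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

   soup.find_all('a') (find_all returns every match)

   # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

   #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

   #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]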
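
find_all also accepts filters beyond the tag name. The sketch below is an illustration only: the markup is a hypothetical variation on the "three sisters" example (the third link is given a different class on purpose), and it uses the class_ and limit arguments of find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the "three sisters" example
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="other" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute
sisters = soup.find_all("a", class_="sister")
sister_names = [a.get_text() for a in sisters]

# limit stops the search after the given number of matches
first_two = soup.find_all("a", limit=2)

print(sister_names)
print(len(first_two))
```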

   find can also search by attribute
   soup.find(id="link3")
   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
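
Unlike find_all, find returns None when nothing matches, so calling a method on the result without checking raises AttributeError. A minimal sketch of guarding against that follows; the helper link_text and the one-line markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document for illustration
soup = BeautifulSoup(
    '<a href="http://example.com/tillie" id="link3">Tillie</a>',
    "html.parser",
)


def link_text(soup, element_id):
    """Return the text of the element with the given id, or None if absent."""
    tag = soup.find(id=element_id)
    return tag.get_text() if tag is not None else None


print(link_text(soup, "link3"))
print(link_text(soup, "missing"))
```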

   To read an attribute from tags, combine find_all with get
   for link in soup.find_all('a'):
     print(link.get('href'))
   # http://example.com/elsie
   # http://example.com/lacie
   # http://example.com/tillie
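
When some anchors have no href, get returns None for them. One way to skip those, sketched here under the assumption that only real links are wanted, is to pass href=True to find_all so that only tags carrying the attribute are returned. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle <a> has no href attribute
html = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a name="anchor">no link here</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Without the filter, get() yields None for the anchor lacking href
all_values = [a.get("href") for a in soup.find_all("a")]

print(hrefs)
print(all_values)
```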

   To extract all the text in the HTML document, use get_text()
   print(soup.get_text())
   # The Dormouse's story
   # The Dormouse's story
   # Once upon a time there were three little sisters; and their names were
   # Elsie,
   # Lacie and
   # Tillie;
   # and they lived at the bottom of a well.
   # ...
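
The raw get_text() output keeps the document's own whitespace, which is why the text above is split across lines. As a sketch, get_text also takes a separator and a strip flag, and stripped_strings yields the same pieces one by one. The markup below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = "<p>Once upon a time <b>three</b> sisters</p>\n<p>...</p>"
soup = BeautifulSoup(html, "html.parser")

# Join the text fragments with a single space and strip each fragment
compact = soup.get_text(" ", strip=True)

# stripped_strings yields each non-empty fragment with whitespace removed
pieces = list(soup.stripped_strings)

print(compact)
print(pieces)
```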

   To parse an HTML file on disk:
   with open("index.html") as fp:
       soup = BeautifulSoup(fp, 'html.parser')
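
A self-contained way to try file parsing is sketched below. The helper load_soup and the temporary index.html are hypothetical; the encoding argument is an assumption that the file is UTF-8 and should be adjusted to match the real file.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Parse the HTML file at path; the file is closed once parsing is done."""
    with open(path, encoding=encoding) as fp:
        return BeautifulSoup(fp, "html.parser")


# Write a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>Demo page</title></head></html>")
    soup = load_soup(path)
    title = soup.title.string

print(title)
```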
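   Objects in BeautifulSoup
   tag (corresponds to a tag in the HTML document)
   tag.attrs (returns all of the tag's attributes as a dictionary)
   A tag's attributes can be added, removed, and modified directly, just like dictionary entries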

   tag['class'] = 'verybold'

   tag['id'] = 1

   tag

   # <blockquote class="verybold" id="1">Extremely bold</blockquote>

   del tag['class']

   del tag['id']

   tag

   # <blockquote>Extremely bold</blockquote>

   tag['class']

   # KeyError: 'class'

   print(tag.get('class'))

   # None
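
The attribute operations above can be tried end to end with the sketch below. The starting markup (a blockquote with class "boldest") is an assumption, since the article does not show how tag was created.

```python
from bs4 import BeautifulSoup

# Assumed starting markup; the article does not show where `tag` comes from
soup = BeautifulSoup(
    '<blockquote class="boldest">Extremely bold</blockquote>',
    "html.parser",
)
tag = soup.blockquote

# class is a multi-valued attribute, so the parser stores it as a list
original_attrs = dict(tag.attrs)

# Modify and add attributes exactly like dictionary entries
tag["class"] = "verybold"
tag["id"] = "1"
after_set = str(tag)

# Remove them again; get() then returns None instead of raising KeyError
del tag["class"]
del tag["id"]
after_delete = str(tag)
missing = tag.get("class")

print(original_attrs)
print(after_set)
print(after_delete)
```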

   X.contents (X is a tag; returns the tag's children as a list)

   For example:

   head_tag = soup.head

   head_tag

   # <head><title>The Dormouse's story</title></head>

   head_tag.contents

   # [<title>The Dormouse's story</title>]

   title_tag = head_tag.contents[0]

   title_tag

   # <title>The Dormouse's story</title>

   title_tag.contents

   # ["The Dormouse's story"]
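
Alongside .contents, a tag also exposes .children (an iterator over the same direct children) and .string (the single text child, when there is exactly one). The sketch below is an illustration using a hypothetical one-line version of the example document.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document based on the example above
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
head_tag = soup.head

# .children iterates over the same direct children that .contents lists
child_names = [child.name for child in head_tag.children]

# .string is a shortcut when a tag has exactly one text child
title_tag = head_tag.contents[0]
title_text = title_tag.string

print(child_names)
print(title_text)
```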

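   Fixing garbled characters (mojibake) when parsing a web page:
   import urllib.request
   from bs4 import BeautifulSoup

   page = urllib.request.urlopen('http://www.leeon.me')
   # from_encoding / original_encoding are the BeautifulSoup 4 names
   # (BeautifulSoup 3 spelled them fromEncoding / originalEncoding)
   soup = BeautifulSoup(page, 'html.parser', from_encoding="gb18030")

   print(soup.original_encoding)
   print(soup.prettify())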
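
The effect of from_encoding can be checked offline, without fetching a page. The sketch below encodes a hypothetical document as GB18030 bytes and parses it back; the sample title text is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as GB18030 bytes to mimic a raw response body
raw = "<html><head><title>中文标题</title></head></html>".encode("gb18030")

# Tell BeautifulSoup which encoding the bytes are in
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

title = soup.title.string
detected = soup.original_encoding

print(title)
print(detected)
```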

Installing from the source tarball on Linux (the leading # is the root shell prompt):
# python --version
# wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.6/beautifulsoup4-4.6.0.tar.gz
# tar -zxvf beautifulsoup4-4.6.0.tar.gz
# cd beautifulsoup4-4.6.0/
# python setup.py install

This article is reposted from guowang327's 51CTO blog. Original link: http://blog.51cto.com/guowang327/1927398. To repost it, please contact the original author.
