The Requests and BeautifulSoup Modules

Introduction:

1. Requests

Reference: http://www.python-requests.org/en/master/user/quickstart/#make-a-request

Requests is a very practical Python HTTP client library, frequently used when writing crawlers and when testing a server's responses. Requests fully meets the needs of today's web.

It is usually installed with pip install requests.
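Before the interactive walkthrough below, here is a minimal sketch of a typical request. The URL is just an example, and the timeout and raise_for_status() handling are my own hedged additions rather than anything the library requires.

import requests

# Minimal sketch: fetch a URL, fail loudly on HTTP errors, and avoid hanging forever.
# The URL and the 5-second timeout are illustrative choices.
response = requests.get('https://api.github.com/events', timeout=5)
response.raise_for_status()          # raises requests.exceptions.HTTPError on 4xx/5xx
print(response.status_code, response.headers.get('Content-Type'))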

In [1]: import requests
In [2]: response = requests.get('https://api.github.com/events')
In [3]: print(response)
<Response [200]>
In [4]: response = requests.post('http://httpbin.org/post', data={'key1': 'values1'})          # used when submitting a form
In [5]: print(response)
<Response [200]>
In [7]: response = requests.put('http://httpbin.org/put', data={'key1': 'values1'})
In [8]: print(response)
<Response [200]>
In [10]: response = requests.delete('http://httpbin.org/delete')
In [11]: print(response)
<Response [200]>
In [13]: response = requests.head('http://httpbin.org/get')
In [14]: print(response)
<Response [200]>
In [15]: response = requests.options('http://httpbin.org/get')
In [16]: print(response)
<Response [200]>
In [17]: payload = {'key1': 'value1', 'key2': 'value2'}
In [18]: response = requests.get('http://httpbin.org/get', params=payload)    # send a GET request with query parameters
In [19]: print(response)
<Response [200]>
In [20]: print(response.text)
{
  "args": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "103.215.2.233",
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
In [22]: print(response.url)
http://httpbin.org/get?key1=value1&key2=value2
In [23]: payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
In [24]: response = requests.get('http://httpbin.org/get', params=payload)
In [25]: print(response.url)
http://httpbin.org/get?key1=value1&key2=value2&key2=value3
In [27]: response = requests.get('http://api.github.com/events')
In [28]: response.encoding               # character-set encoding
Out[28]: 'utf-8'
In [29]: print(response.text)            # response body as text
[{"id":"6850814749","type":"CreateEvent","actor":{"id":679017,"login":......
In [30]: print(response.content)         # response body as raw bytes
b'[{"id":"6850814749","type":"CreateEvent","actor":{"id":679017,"login":".....
In [34]: response.json()
In [36]: response.status_code            # HTTP status code
Out[36]: 200
In [38]: headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8', 'Connection': 'keep-alive'}
In [39]: response = requests.get('https://api.github.com/events', headers=headers)
In [40]: print(response.headers)
{'Server': 'GitHub.com', 'Date': 'Tue, 14 Nov 2017 06:10:31 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '200 OK', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '58', 'X-RateLimit-Reset': '1510642339', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept', 'ETag': 'W/"34b51a08c5a8f4fa2400dd5c0d89221b"', 'Last-Modified': 'Tue, 14 Nov 2017 06:10:31 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'unknown, github.v3', 'Link': '<https://api.github.com/events?page=2>; rel="next", <https://api.github.com/events?page=10>; rel="last"', 'Access-Control-Expose-Headers': 'ETag, Link, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval', 'Access-Control-Allow-Origin': '*', 'Content-Security-Policy': "default-src 'none'", 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'X-Runtime-rack': '0.104190', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'D528:C0F5:6BAAA:E4CB6:5A0A88D6'}
In [43]: print(response.headers['Content-Type'])
application/json; charset=utf-8
In [44]: print(response.headers.get('Content-Type'))
application/json; charset=utf-8
In [45]: url = 'http://www.baidu.com'
In [46]: response = requests.get(url, headers=headers)            # baidu returns cookies; some sites do not set any
In [47]: print(response.cookies)                                  # print the whole cookie jar
<RequestsCookieJar[<Cookie H_PS_PSSID=1425_21088_24880 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
In [48]: for k, v in response.cookies.get_dict().items():         # iterate over the cookies
    ...:     print(k, v)
    ...:
H_PS_PSSID 1425_21088_24880
BDSVRTM 0
BD_HOME 0
In [49]: cookies = {'c1': 'v1', 'c2': 'v2'}
In [50]: response = requests.get('http://httpbin.org/cookies', cookies=cookies)   # send a request that carries cookies
In [52]: print(response.text)
{
  "cookies": {
    "c1": "v1",
    "c2": "v2"
  }
}
In [53]: jar = requests.cookies.RequestsCookieJar()
In [54]: jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
Out[54]: Cookie(version=0, name='tasty_cookie', value='yum', port=None, port_specified=False, domain='httpbin.org', domain_specified=True, domain_initial_dot=False, path='/cookies', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)
In [55]: jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
Out[55]: Cookie(version=0, name='gross_cookie', value='blech', port=None, port_specified=False, domain='httpbin.org', domain_specified=True, domain_initial_dot=False, path='/elsewhere', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)
In [56]: url = 'http://httpbin.org/cookies'
In [57]: response = requests.get(url, cookies=jar)
In [58]: print(response.text)
{
  "cookies": {
    "tasty_cookie": "yum"
  }
}

Cookies are returned in a RequestsCookieJar, which acts like a dict but also offers a more complete interface, suitable for use over multiple domains or paths. Cookie jars can also be passed in to requests.
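As a hedged illustration of that point, the sketch below reuses one Session so that cookies set by one response are sent back automatically on the next request; the httpbin endpoints are just convenient test URLs.

import requests

# Minimal sketch of cookie persistence with a Session (illustrative URLs).
sess = requests.Session()
sess.get('http://httpbin.org/cookies/set/sessioncookie/123456789')   # the server sets a cookie
r = sess.get('http://httpbin.org/cookies')                           # the cookie is sent back automatically
print(r.text)                                                        # expect {"cookies": {"sessioncookie": "123456789"}}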

In [62]: url = 'http://github.com'
In [64]: response = requests.get(url, allow_redirects=True)
In [65]: print(response.url)
https://github.com/
In [66]: response.history
Out[66]: [<Response [301]>]
In [69]: url = 'http://httpbin.org/post'
In [70]: files = {'file': open('test.txt', 'rb')}
In [71]: response = requests.post(url, files=files)                  # upload a file with a POST request
In [72]: response.text
Out[72]: '...contents of the file...'
In [73]: response = requests.get('https://github.com', timeout=5)    # request timeout, in seconds
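If the server does not answer within the timeout, requests raises an exception rather than returning a response. The snippet below is a hedged sketch of catching it; the URL and the unrealistically small 0.01-second timeout are only there to force the error.

import requests

try:
    requests.get('https://github.com', timeout=0.01)
except requests.exceptions.Timeout:
    print('the request timed out')
except requests.exceptions.RequestException as e:
    print('some other request error:', e)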


import json
import requests
from io import BytesIO
from PIL import Image

#1 Handling images

r = requests.get('http://img.jrjimg.cn/2013/11/20131105065502114.jpg')
image = Image.open(BytesIO(r.content))   # build an Image object from the binary content
image.save('mm.jpg')

#2 Handling JSON

r = requests.get('https://github.com/timeline.json')
print(type(r.json()))   # note the parentheses: json() is a method, not an attribute
print(r.json())
print(r.text)

#3 Handling raw response data

r = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1508166336374&di=ef1073a52a7582f29ffa27c47e95e74e&imgtype=0&src=http%3A%2F%2Fp3.gexing.com%2FG1%2FM00%2F3F%2FDD%2FrBACE1MaezngiEoIAADSr3bccSw151.jpg')
with open('mm2.jpg', 'wb+') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)
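For large files it is usually better not to load the whole body into memory first. Here is a hedged sketch using stream=True; the URL is the same illustrative image as in example #1.

# Sketch: stream the download so the body is fetched chunk by chunk.
r = requests.get('http://img.jrjimg.cn/2013/11/20131105065502114.jpg', stream=True)
with open('mm3.jpg', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:                 # skip keep-alive chunks
            f.write(chunk)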

#4 Handling forms

form = {'username': 'user', 'password': 'pwd'}
r = requests.post('http://httpbin.org/post', data=form)              # sent as form-encoded fields
print(r.text)
r = requests.post('http://httpbin.org/post', data=json.dumps(form))  # sent as a raw JSON string in the body
print(r.text)
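Since the only difference between the two posts above is how the body is encoded, it is worth noting that recent versions of Requests can do the JSON serialization for you via the json= parameter; a hedged sketch:

# Sketch: let Requests serialize the dict and set Content-Type: application/json.
r = requests.post('http://httpbin.org/post', json=form)
print(r.json()['json'])   # httpbin echoes the parsed JSON body back under the "json" key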

2. Scraping the Douban Top 250 Movie List and Ratings with Requests

[Screenshot: HTML source of the Douban Top 250 movie list page]

Based on that page structure, the scraping code is as follows:

import requests
from lxml import etree
sess = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8', 'Connection': 'keep-alive'}
for id in range(0, 250, 25):
    url = 'https://movie.douban.com/top250/?start=' + str(id)
    r = sess.get(url, headers=headers)
    r.encoding = 'utf-8'
    #fname = "movie" + str(id) + ".txt"
    #with open(fname, "wb+") as f:
    #    f.write(r.content)
    root = etree.HTML(r.content)   # parse the HTML document with the lxml parser
    items = root.xpath('//ol/li/div[@class="item"]')
    for item in items:
        title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
        name = title[0].encode('gb2312', 'ignore').decode('gb2312')
        rating = item.xpath('.//div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        print(name, rating)
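The same extraction can also be written with BeautifulSoup (covered in the next section) instead of raw XPath. This is only a hedged sketch: it reuses the sess and headers defined above and assumes the same class names (item, title, rating_num) that appear in the XPath expressions.

from bs4 import BeautifulSoup

# Sketch: parse one Top 250 page with BeautifulSoup instead of lxml XPath.
r = sess.get('https://movie.douban.com/top250/?start=0', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
for item in soup.select('div.item'):                   # assumes the class names used above
    name = item.select_one('span.title').get_text()
    rating = item.select_one('span.rating_num').get_text()
    print(name, rating)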

3. BeautifulSoup

The BeautifulSoup module takes an HTML or XML string and parses it into a tree, providing methods for quickly locating specific elements, which makes searching HTML/XML documents straightforward. Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers; if no third-party parser is installed, Python's built-in parser is used. Common parsers are lxml, html5lib, and html.parser; lxml is more powerful and faster, and is the recommended one to install.
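To make the parser choice concrete, here is a hedged sketch of constructing a soup with each parser on the same markup; the differences only really show up on malformed HTML.

from bs4 import BeautifulSoup

html = '<html><body><p>hello</p></body></html>'
soup_builtin = BeautifulSoup(html, 'html.parser')   # standard-library parser, no extra install
soup_lxml = BeautifulSoup(html, 'lxml')             # needs: pip install lxml (recommended, fastest)
soup_html5 = BeautifulSoup(html, 'html5lib')        # needs: pip install html5lib (most browser-like)
print(soup_lxml.p.string)                           # prints: hello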

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('test.html'))   # this form works for parsing a local file
print(soup.prettify())                    # pretty-printed output

#1 Handling Tag objects

print(type(soup.title))
print(soup.title)
print(soup.title.name)

#2 String

print(type(soup.title.string))
print(soup.title.string)

#3 Comment

print(type(soup.a.string))
print(soup.a.string)
for item in soup.body.contents:
    print(item.name)

#4 CSS query

print(soup.select('.sister'))
print(soup.select('#link1'))
print(soup.select('head > title'))
a_s = soup.select('a')
for a in a_s:
    print(a)

Example:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
     <div class="title">
         <b>The Dormouse's story总共</b>
         <h1>f</h1>
     </div>
<div class="story">Once upon a time there were three little sisters; and their names were
     <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
     <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
     <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, features="lxml")
tag1 = soup.find(name='a')          # find the first <a> tag
tag2 = soup.find_all(name='a')      # find all <a> tags
tag3 = soup.select('#link2')        # find the element with id="link2"
print(tag1.name)            # prints: a
print(tag1.attrs)           # prints the attribute dict {'class': ['sister0'], 'id': 'link1'}
tag1.attrs['id'] = 'link01'
print(tag1.attrs)           # prints {'class': ['sister0'], 'id': 'link01'}
print(tag1.has_attr('id'))  # prints: True
print(tag1.get_text('id'))  # prints: Elsidfidie (the text pieces joined with the separator 'id')
tag1.name = 'soup'          # set the tag's name
print(tag2)                 # prints [<a class="sister0" id="link1">Els<span>f</span>ie</a>, ......]
print(tag2[0].name)         # prints: soup

decode() converts a tag to a string (including the current tag); decode_contents() does the same but excludes the current tag.

print(tag2[1])                   # prints <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(type(tag2[1]))             # prints <class 'bs4.element.Tag'>
print(tag2[1].decode())          # prints <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(type(tag2[1].decode()))    # prints <class 'str'>

encode() converts a tag to bytes (including the current tag); encode_contents() does the same but excludes the current tag.

print(tag2[1].encode())          # prints b'<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
print(type(tag2[1].encode()))    # prints <class 'bytes'>
print(tag2[1].get_text())        # prints Lacie
body = soup.find(name='body')        # all direct child tags
childs = body.children
print(childs)                        # prints <list_iterator object at 0x10349b9e8>
for tag in childs:
    print(tag)
body = soup.find(name='body')          # all descendant tags, recursively
descs = body.descendants               # prints <generator object descendants at 0x106327360>
print(descs)
for des in descs:
    print(des)
body = soup.find(name='body')          # empty out all of the tag's children, keeping the tag itself
body.clear()
print(soup)
body = soup.find(name='body')
body.decompose()                     # recursively remove the tag and everything inside it
print(soup)
body = soup.find(name='body')
d = body.extract()                   # recursively remove the tag and return what was removed
print(soup)
print(d)
body = soup.find(name='body')
index = body.index(body.find('div'))   # prints 1; the position of a tag inside its parent
print(index)
br = soup.find(name='br')
test = br.is_empty_element      # prints True; whether the tag is a void element such as 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'
print(test)
span = soup.find('span')
print(span)                     # prints <span>f</span>
print(span.string)              # prints f
span.string = 'yeecall.com'     # set the string
print(span.string)              # prints yeecall.com
body = soup.find(name='body')
texts = body.stripped_strings   # recursively collect the text of all inner tags, with whitespace stripped
print(texts)                    # prints <generator object stripped_strings at 0x107311360>
for text in texts:
    print(text)

# Examples of CSS selectors with select()

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')
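select() always returns a list; when only the first match is needed, select_one() can be used instead. A hedged note: select_one() and the limit argument are available in reasonably recent Beautiful Soup 4 releases.

first_sister = soup.select_one('.sister')   # the first matching tag, or None if nothing matches
print(first_sister)
print(soup.select('.sister', limit=2))      # cap the number of results returned by select()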

4. Logging in to Douban with requests and BeautifulSoup

Part of the login page's HTML source is shown below:

[Screenshot: HTML source of the Douban login form]

Part of the captcha's HTML source is shown below:

[Screenshot: HTML source of the captcha element]

So the login code is as follows:

import requests
import html5lib
import re
from bs4 import BeautifulSoup
sess = requests.Session()
url_login = 'https://accounts.douban.com/login'
formdata = {
    'redir': 'https://www.douban.com',
    'source': 'index_nav',
    'form_email': '******@*****.com',
    'form_password': '*********',
    'login': u'登录'
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = sess.post(url_login, data=formdata, headers=headers)
content = r.text
soup = BeautifulSoup(content, 'html5lib')
captcha = soup.find('img', id='captcha_image')
if captcha:
    print(captcha)
    captcha_url = captcha['src']
    #re_captcha_id = r'id="(.*?)"&'
    #captcha_id = re.findall(re_captcha_id, captcha)
    captcha_id = re.findall(r'(id=)(.*)(&)', captcha_url)
    captcha_id = captcha_id[0][1]
    print(captcha_url)
    print(captcha_id)
    captcha_text = input('Please input the captcha:')
    formdata['captcha-solution'] = captcha_text
    formdata['captcha-id'] = captcha_id
    print(formdata)
    r = sess.post(url_login, data=formdata, headers=headers)
with open('contacts.txt', 'w+', encoding='utf-8') as f:
    f.write(r.text)
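In practice the user has to see the captcha image before typing it in. Below is a hedged sketch of fetching it with the same session and saving it locally (the filename is arbitrary); it would slot in right after captcha_url is extracted in the code above.

# Sketch: download the captcha image so a human can open it and read the characters.
img = sess.get(captcha_url, headers=headers)
with open('captcha.jpg', 'wb') as f:
    f.write(img.content)
print('captcha saved to captcha.jpg; open it to read the characters')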

The above are just my personal study notes; pointers and corrections from more experienced readers are welcome.

This article is reposted from meteor_hy's 51CTO blog. Original link: http://blog.51cto.com/caiyuanji/1981695. Please contact the original author before reposting.