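Before crawling this site, I tried crawling comics from other sites, but ran into a lot of anti-crawler restrictions. Some sites append a dynamic parameter to the image URL that changes every second, so an image link fetched one second is already invalid the next. Others keep the image address fixed but return 403 when requests come too frequently. I finally found a comic site without these restrictions, so here is a demonstration of a selenium crawler.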
# -*- coding:utf-8 -*-
# crawl kuku comics (kukudm.com); Python 2
__author__ = 'fengzhankui'
import os
import urllib2

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# chrom is a local helper module that is not shown here; pcUserAgent is
# presumably a dict mapping browser names to 'User-Agent:...' strings.
import chrom


class getManhua(object):
    def __init__(self):
        self.num = 5  # number of pages to download
        self.starturl = 'http://comic.kukudm.com/comiclist/2154/51850/1.htm'
        self.browser = self.getBrowser()
        self.getPic(self.browser)

    def getBrowser(self):
        # Give PhantomJS a desktop Chrome User-Agent.
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        )
        browser = webdriver.PhantomJS(desired_capabilities=dcap)
        # Wait up to 20 seconds for elements to appear on each lookup.
        browser.implicitly_wait(20)
        try:
            browser.get(self.starturl)
        except Exception:
            print 'open url fail'
        return browser

    def getPic(self, browser):
        # Use the part of the page title before '_' as the directory name.
        cartoonTitle = browser.title.split('_')[0]
        self.createDir(cartoonTitle)
        os.chdir(cartoonTitle)
        # Split the agent string into header name and header value.
        agent = chrom.pcUserAgent.get('Firefox 4.0.1 - Windows')
        name, value = agent.split(':', 1)
        for i in range(1, self.num + 1):
            i = str(i)
            imgurl = browser.find_element_by_tag_name('img').get_attribute('src')
            print imgurl
            request = urllib2.Request(imgurl)
            request.add_header(name, value)
            response = urllib2.urlopen(request)
            with open('page' + i + '.jpg', 'wb') as fp:
                fp.write(response.read())
            print 'page' + i + ' success'
            # Follow the last <a> on the page, which this code treats as the next-page link.
            NextTag = browser.find_elements_by_tag_name('a')[-1].get_attribute('href')
            browser.get(NextTag)

    def createDir(self, cartoonTitle):
        if os.path.exists(cartoonTitle):
            print 'exists'
        else:
            os.mkdir(cartoonTitle)


if __name__ == '__main__':
    getManhua()
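To cope with anti-crawler mechanisms, I added request parameters (a User-Agent) in both selenium and urllib2, so that the site does not filter the crawler out by inspecting its requests. Here I only crawl 5 pages starting from the start URL. To allow for image loading and network latency, I set a 20-second wait, so the script takes a little while when it first starts running and you need to be patient.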
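As a rough illustration of the header step described above, here is a minimal sketch in Python 3, using `urllib.request` in place of `urllib2`. It assumes the agent string has the form `Header-Name:value`, which is inferred from the `split(':', 1)` call in the script. The helper name `build_image_request`, the sample agent string, and the example URL are all hypothetical.

```python
import urllib.request


def build_image_request(img_url, agent_line):
    """Build a Request for img_url from a 'Header-Name:value' string."""
    # Split only on the first colon so colons inside the value are preserved.
    name, value = agent_line.split(':', 1)
    request = urllib.request.Request(img_url)
    request.add_header(name.strip(), value.strip())
    return request


# Hypothetical values, for illustration only; no network access happens here.
agent_line = 'User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
request = build_image_request('http://example.com/page1.jpg', agent_line)
print(request.header_items())
```

The first piece of the split is the header name and the second is the header value. Passing the same piece for both would send a header whose value is just the text `User-Agent`, which is unlikely to look like a real browser to the server.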
The running process is shown in the figure.
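This article is reposted from the 51CTO blog of 无心低语. Original link: http://blog.51cto.com/fengzhankui/1946775. If you wish to repost it, please contact the original author yourself.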