I. Acting out my own "Beijing Love Story"
Boarding the flight north to become a Beijing drifter, I began to play out my own Beijing love story.
II. Crawler 1
1. How a web crawler works
First: pick a URL, open it, and read its content.
Next: filter the content you read for keywords; this is the key step, and the right keywords can be found by inspecting the page source.
Finally: download the extracted HTML URLs, or image URLs, and save them locally (a sketch of these three steps follows).
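To make the three steps concrete, here is a minimal sketch in the same Python 2 style as the scripts further below. The URL, the keyword pattern and the save directory are placeholders only, not part of the original scripts:

#coding:utf-8
# Minimal sketch of the three steps: open a URL, filter the page by a
# keyword pattern found in its source, and save the matches locally.
import re
import urllib

url = "http://www.example.com/index.html"      # step 1: the chosen URL (placeholder)
content = urllib.urlopen(url).read()           # step 1: open it and read the page

# step 2: filter the content; the pattern comes from inspecting the source,
# here every <img src="..."> that points to a .jpg file
imgurls = re.findall(r'<img src="(http://[^"]+?\.jpg)"', content)

# step 3: save every matched image to disk (the directory must already exist)
for n, imgurl in enumerate(imgurls):
    urllib.urlretrieve(imgurl, "/tmp/demo/%d.jpg" % n)
    print n, imgurl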
2. Crawling the given URL
Analysis:
Step 1: there are roughly 4300 "next page" pages in total.
Step 2: each page shows 10 profile avatars.
Step 3: each profile contains roughly 100 personal photos.
The given Taobao MM URL is: http://mm.taobao.com/json/request_top_list.htm?type=0&page=1
By default this page has no "next page" button; we can move to the next page simply by changing the page parameter in the URL (see the sketch below).
The URL and rendering of the last page are shown in the screenshot below:
Click any avatar to enter that person's home page, as in the screenshot below.
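As a quick test of the pagination idea, the sketch below only rewrites the page parameter and fetches the first few list pages; the real upper bound is about 4300 pages:

#coding:utf-8
# Sketch: walk the list pages by changing only the "page" parameter in the URL.
import urllib

page = 1
while page <= 3:                     # use 4300 for a full run
    mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page=%d" % page
    content = urllib.urlopen(mmurl).read()
    print page, len(content)         # just show that each page was fetched
    page += 1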
3. The custom script
#!/usr/bin/env python
#coding:utf-8
#Author:Allentuns
#Email:zhengyansheng@hytyi.com

import urllib
import os
import sys
import time

# markers used to slice URLs and image links out of the raw HTML
ahref  = '<a href="'
ahrefs = '<a href="h'
ahtml  = ".htm"
atitle = "<img style"
ajpg   = ".jpg"
btitle = '<img src="'

page = 0
while page < 4300:      # this limit can be changed; the maximum is 4300, I used 3 while testing
    mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page=%d" % (page)
    content = urllib.urlopen(mmurl).read()

    # first profile link and its small avatar image on this list page
    href = content.find(ahref)
    html = content.find(ahtml)
    url = content[href + len(ahref) : html + len(ahtml)]
    print url
    imgtitle = content.find(btitle, html)
    imgjpg = content.find(ajpg, imgtitle)
    littleimgurl = content[imgtitle + len(btitle) : imgjpg + len(ajpg)]
    print littleimgurl
    urllib.urlretrieve(littleimgurl, "/www/src/temp/image/taobaomm/allentuns.jpg")

    # walk the remaining profile links on this list page
    s = 0
    while s < 18:
        href = content.find(ahrefs, html)
        html = content.find(ahtml, href)
        url = content[href + len(ahref) : html + len(ajpg)]
        print s, url
        imgtitle = content.find(btitle, html)
        imgjpg = content.find(ajpg, imgtitle)
        littleimgurl = content[imgtitle : imgjpg + len(ajpg)]
        littlesrc = littleimgurl.find("src")
        tureimgurl = littleimgurl[littlesrc + 5:]
        print s, tureimgurl

        if url.find("photo") == -1:
            # open the profile page and save every picture found in it
            content01 = urllib.urlopen(url).read()
            imgtitle = content01.find(atitle)
            imgjpg = content01.find(ajpg, imgtitle)
            littleimgurl = content01[imgtitle : imgjpg + len(ajpg)]
            littlesrc = littleimgurl.find("src")
            tureimgurl = littleimgurl[littlesrc + 5:]
            print tureimgurl

            imgcount = content01.count(atitle)
            i = 20
            try:
                while i < imgcount:
                    content01 = urllib.urlopen(url).read()
                    imgtitle = content01.find(atitle, imgjpg)
                    imgjpg = content01.find(ajpg, imgtitle)
                    littleimgurl = content01[imgtitle : imgjpg + len(ajpg)]
                    littlesrc = littleimgurl.find("src")
                    tureimgurl = littleimgurl[littlesrc + 5:]
                    print i, tureimgurl
                    time.sleep(1)
                    if tureimgurl.count("<") == 0:
                        imgname = tureimgurl[tureimgurl.index("T"):]
                        urllib.urlretrieve(tureimgurl, "/www/src/temp/image/taobaomm/%s-%s" % (page, imgname))
                    else:
                        pass
                    i += 1
            except IOError:
                print '\nWhy did you do an EOF on me?'
                break
            except:
                print '\nSome error/exception occurred.'

        s += 1
    else:
        print "---------------{ < 20; 1 page has 10 htm and pic }-------------------------"

    page = page + 1
    print "**************** %s page *******************************" % (page)
else:
    print "Download Finished."
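A note on how the script is put together: the outer while page < 4300 loop walks the list pages by rewriting the page parameter; the while s < 18 loop scans one list page for profile links and their small avatar images; and for each profile the while i < imgcount loop re-reads the profile page and saves every matched .jpg under /www/src/temp/image/taobaomm/, sleeping one second between downloads. Everything is done with plain string find() calls, so even a small change in the page HTML will shift the offsets and break the extraction.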
4. Image display (a few of the downloaded pictures)
5. Check the number of downloaded images
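To count how many files the script has saved, a quick sketch against the save directory used above is enough (assuming the same path as in the script):

import os
print len(os.listdir("/www/src/temp/image/taobaomm"))   # number of files saved so far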
III. Crawler 2
1. First, analyze the URL
Step 1: there are 7 list pages in total;
Step 2: each list page contains 20 articles;
Step 3: after checking, there are 317 articles in all (a sketch of collecting the links follows these steps).
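Before the full script, here is a minimal sketch of just the link-collection step: walk the seven list pages and pull out the article URLs. It uses a regular expression instead of the chained find() calls in the script below, and the blog_....html pattern is an assumption about how these article links look:

#coding: utf-8
# Sketch: collect the article links from the 7 list pages with a regex (Python 2).
import re
import urllib

links = []
for page in range(1, 8):
    listurl = "http://blog.sina.com.cn/s/articlelist_1191258123_0_%d.html" % page
    content = urllib.urlopen(listurl).read()
    # assumed pattern for the article URLs, e.g. .../s/blog_xxxxxxxx.html
    links += re.findall(r'href="(http://blog\.sina\.com\.cn/s/blog_\w+\.html)"', content)

print len(links)    # roughly 317 if few duplicates appear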
2. The Python script
What the script does: starting from the given URL, it downloads every article of this blog to the local machine.
#!/usr/bin/env python
#coding: utf-8

import urllib
import time

list00 = []
i = j = 0
page = 1

# pass 1: collect the article links from the 7 list pages
while page < 8:
    str = "http://blog.sina.com.cn/s/articlelist_1191258123_0_%d.html" % (page)   # note: shadows the built-in str
    content = urllib.urlopen(str).read()

    title = content.find(r"<a title")
    href = content.find(r"href=", title)
    html = content.find(r".html", href)
    url = content[href + 6 : html + 5]
    urlfilename = url[-26:]
    list00.append(url)
    print i, url

    while title != -1 and href != -1 and html != -1 and i < 350:
        title = content.find(r"<a title", html)
        href = content.find(r"href=", title)
        html = content.find(r".html", href)
        url = content[href + 6 : html + 5]
        urlfilename = url[-26:]
        list00.append(url)
        i = i + 1
        print i, url
    else:
        print "Link address Finished."
        print "This is %s page" % (page)
    page = page + 1
else:
    print "spage=", list00[50]
    print list00[:51]
    print list00.count("")
    print "All links address Finished."

# drop the empty entries left behind by the last slice on each page
x = list00.count('')
a = 0
while a < x:
    y1 = list00.index('')
    list00.pop(y1)
    print a
    a = a + 1
print list00.count('')

# pass 2: download every collected article and write it to /tmp/hanhan/
listcount = len(list00)
while j < listcount:
    content = urllib.urlopen(list00[j]).read()
    open(r"/tmp/hanhan/" + list00[j][-26:], 'a+').write(content)
    print "%2s is finished." % (j)
    j = j + 1
    #time.sleep(1)
else:
    print "Write to file End."
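The script works in two passes: the first pass collects every article URL from the seven list pages into list00 by repeatedly calling find() on the markers <a title, href= and .html, and then strips out the empty entries this slicing can leave behind with list00.count('') and pop(). The second pass opens each remaining URL and appends the raw HTML to a file under /tmp/hanhan/, named after the last 26 characters of the article URL, so that directory has to exist before the script is run.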
3. Screenshot after the articles are downloaded
4. Download the files from Linux to the local Windows machine, then open and check them; screenshot below
This article is reproduced from the 51CTO blog of zys467754239. Original link: http://blog.51cto.com/467754239/1574528. Please contact the original author if you want to republish it.