这个框架关注了很久,但是直到最近空了才仔细的看了下 这里我用的是scrapy0.24版本
先来个成品好感受这个框架带来的便捷性,等这段时间慢慢整理下思绪再把最近学到的关于此框架的知识一一更新到博客来。
最近想学git 于是把代码放到 git-osc上了:
https://git.oschina.net/1992mrwang/doubangroupspider
先说明下这个玩具爬虫的目的
能够将种子URL页面当中的小组进行爬取 并分析出有关联的小组连接 以及小组的组员人数 和组名等信息
出来的数据大概是这样的
{'RelativeGroups': [u'http://www.douban.com/group/10127/',
                    u'http://www.douban.com/group/seventy/',
                    u'http://www.douban.com/group/lovemuseum/',
                    u'http://www.douban.com/group/486087/',
                    u'http://www.douban.com/group/lovesh/',
                    u'http://www.douban.com/group/NoAstrology/',
                    u'http://www.douban.com/group/shanghaijianzhi/',
                    u'http://www.douban.com/group/12658/',
                    u'http://www.douban.com/group/shanghaizufang/',
                    u'http://www.douban.com/group/gogo/',
                    u'http://www.douban.com/group/117546/',
                    u'http://www.douban.com/group/159755/'],
 'groupName': u'\u4e0a\u6d77\u8c46\u74e3',
 'groupURL': 'http://www.douban.com/group/Shanghai/',
 'totalNumber': u'209957'}
What is this good for? With just this data you can analyze how closely groups are related to one another, and with a bit more effort you could scrape a lot more. I won't go into that here; this post is mainly about getting a quick feel for the framework.
The first step is to start a new project named douban:
# scrapy startproject douban
# cd douban
This is the directory layout of the completed project.
PS: when pushing to git-osc I renamed the top-level directory for neatness; it makes no difference once cloned.
mrwang@mrwang-ubuntu:~/student/py/douban$ tree
.
├── douban
│   ├── __init__.py
│   ├── items.py                  # item definitions
│   ├── pipelines.py              # item pipeline
│   ├── settings.py               # project settings
│   └── spiders
│       ├── BasicGroupSpider.py   # the spider that does the actual crawling
│       └── __init__.py
├── nohup.out                     # log produced by running the crawl in the background with nohup
├── scrapy.cfg
├── start.sh                      # trivial convenience script to start the crawl
├── stop.sh                       # trivial convenience script to stop the crawl
└── test.log                      # crawl log, specified in the start script
Write the item definitions in items.py, mainly so that the scraped data can be persisted conveniently.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DoubanItem(Item):
    # define the fields for your item here like:
    # name = Field()
    groupName = Field()
    groupURL = Field()
    totalNumber = Field()
    RelativeGroups = Field()
    ActiveUesrs = Field()
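An Item behaves like a dictionary, which is what makes the pipeline step further down (dict(item)) so simple. Here is a minimal illustration of my own, not part of the original project:

# -*- coding: utf-8 -*-
from douban.items import DoubanItem

item = DoubanItem()
item['groupName'] = u'上海豆瓣'
item['groupURL'] = 'http://www.douban.com/group/Shanghai/'
item['totalNumber'] = u'209957'

# an Item converts cleanly to a plain dict, which is exactly what the
# MongoDB pipeline below inserts
print dict(item)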
Write the spider and define some rules for processing the data.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/spiders/BasicGroupSpider.py
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from douban.items import DoubanItem
import re


class GroupSpider(CrawlSpider):
    # spider name
    name = "Group"
    allowed_domains = ["douban.com"]

    # seed URLs
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    # when a link matches a rule, it is handled by the function named in callback
    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )), follow=True,
             process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        m = re.search("^http://www.douban.com/group/([^/]+)/$", url)
        if (m):
            return m.group(1)
        else:
            return 0

    def add_cookie(self, request):
        # Request.replace() returns a new request rather than modifying the
        # original in place, so return the copy with the cookies cleared
        return request.replace(cookies=[])

    def parse_group_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)
        pass

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        # an XPath-based selector is used here
        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # get group name
        item['groupName'] = hxs.select('//h1/text()').re("^\s+(.*)\s+$")[0]

        # get group id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # get group members number
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)\)")
        item['totalNumber'] = members_text[0]

        # get relative groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)

        return item
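If you want to sanity-check the XPath expressions and regexes used in parse_group_home_page without hitting Douban, you can run them against a hand-written HTML fragment. This is a quick test of my own, not part of the original post; the HTML below is a made-up stand-in for a group home page:

# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# a fake, minimal group home page containing just the bits the spider looks for
body = ('<html><body>'
        '<h1> Shanghai Group </h1>'
        '<a href="http://www.douban.com/group/Shanghai/members">'
        'all members (209957)</a>'
        '</body></html>')
response = HtmlResponse(url='http://www.douban.com/group/Shanghai/', body=body)
hxs = HtmlXPathSelector(response)

# same extraction logic as in parse_group_home_page
print hxs.select('//h1/text()').re("^\s+(.*)\s+$")[0]            # Shanghai Group
members_url = "http://www.douban.com/group/Shanghai/members"
print hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)\)")[0]  # 209957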
Write the item pipeline; at this stage the data collected by the spider is stored in MongoDB.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class DoubanPipeline(object):
    def __init__(self):
        self.server = settings['MONGODB_SERVER']
        self.port = settings['MONGODB_PORT']
        self.db = settings['MONGODB_DB']
        self.col = settings['MONGODB_COLLECTION']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),
                level=log.DEBUG, spider=spider)
        return item
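Once the crawl has been running for a while, it is easy to peek at what the pipeline has written. A quick check of my own, assuming the MongoDB parameters used in settings.py below (localhost:27017, database douban, collection doubanGroup):

# -*- coding: utf-8 -*-
import pymongo

# same connection parameters as in settings.py below
col = pymongo.Connection('localhost', 27017)['douban']['doubanGroup']

print col.count(), 'groups stored so far'
for doc in col.find().limit(3):
    print doc['groupName'], doc['totalNumber'], doc['groupURL']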
In the settings module, configure the item pipeline to use, the MongoDB connection parameters, and a User-Agent so the crawler is less likely to get banned.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# add a delay between requests to ease the load on the server and keep a low profile
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

# item pipeline to use
ITEM_PIPELINES = ['douban.pipelines.DoubanPipeline']

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'doubanGroup'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
OK, that is the toy spider done. Start it with:
nohup scrapy crawl Group --logfile=test.log &
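The start.sh and stop.sh scripts from the project tree are not shown in the post; something along the following lines would do the job. This is my guess at their contents, not the author's actual scripts, and the spider.pid file is my own addition:

#!/bin/sh
# start.sh: run the crawl in the background and remember its PID
nohup scrapy crawl Group --logfile=test.log &
echo $! > spider.pid

#!/bin/sh
# stop.sh: stop the crawl started by start.sh
kill `cat spider.pid`
rm -f spider.pid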
=========================== Update 2014/12/02 ===================================
I found that someone on GitHub had already had the same idea and rewritten the scheduler to store the pages to be visited next in MongoDB, so I modelled mine on theirs and wrote one to use.
mrwang@mrwang-ThinkPad-Edge-E431:~/student/py/douban$ cat douban/scheduler.py
from scrapy.utils.reqser import request_to_dict, request_from_dict

import pymongo
import datetime


class Scheduler(object):
    def __init__(self, mongodb_server, mongodb_port, mongodb_db, persist, queue_key, queue_order):
        self.mongodb_server = mongodb_server
        self.mongodb_port = mongodb_port
        self.mongodb_db = mongodb_db
        self.queue_key = queue_key
        self.persist = persist
        self.queue_order = queue_order

    def __len__(self):
        return self.collection.count()

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        mongodb_server = settings.get('MONGODB_QUEUE_SERVER', 'localhost')
        mongodb_port = settings.get('MONGODB_QUEUE_PORT', 27017)
        mongodb_db = settings.get('MONGODB_QUEUE_DB', 'scrapy')
        persist = settings.get('MONGODB_QUEUE_PERSIST', True)
        queue_key = settings.get('MONGODB_QUEUE_NAME', None)
        queue_type = settings.get('MONGODB_QUEUE_TYPE', 'FIFO')

        if queue_type not in ('FIFO', 'LIFO'):
            raise ValueError('MONGODB_QUEUE_TYPE must be FIFO (default) or LIFO')

        if queue_type == 'LIFO':
            queue_order = -1
        else:
            queue_order = 1

        return cls(mongodb_server, mongodb_port, mongodb_db, persist, queue_key, queue_order)

    def open(self, spider):
        self.spider = spider
        if self.queue_key is None:
            self.queue_key = "%s_queue" % spider.name

        connection = pymongo.Connection(self.mongodb_server, self.mongodb_port)
        self.db = connection[self.mongodb_db]
        self.collection = self.db[self.queue_key]

        # notice if there are requests already in the queue
        size = self.collection.count()
        if size > 0:
            spider.log("Resuming crawl (%d requests scheduled)" % size)

    def close(self, reason):
        if not self.persist:
            self.collection.drop()

    def enqueue_request(self, request):
        data = request_to_dict(request, self.spider)
        self.collection.insert({
            'data': data,
            'created': datetime.datetime.utcnow()
        })

    def next_request(self):
        entry = self.collection.find_and_modify(sort={"$natural": self.queue_order}, remove=True)
        if entry:
            request = request_from_dict(entry['data'], self.spider)
            return request
        return None

    def has_pending_requests(self):
        return self.collection.count() > 0
The scheduler has defaults for everything; if you want to customize it, you can do so in douban/settings.py (a sketch follows the table below).
The configurable settings are:

Parameter               Default     Meaning
MONGODB_QUEUE_SERVER    localhost   MongoDB server
MONGODB_QUEUE_PORT      27017       port
MONGODB_QUEUE_DB        scrapy      database name
MONGODB_QUEUE_PERSIST   True        keep the task queue in MongoDB after the crawl finishes (False drops it when the spider closes)
MONGODB_QUEUE_NAME      None        queue collection name; None defaults to "<spider name>_queue"
MONGODB_QUEUE_TYPE      FIFO        first in, first out; or LIFO for last in, first out
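To actually use this scheduler instead of Scrapy's built-in one, it has to be registered through the SCHEDULER setting. A minimal sketch of the relevant part of douban/settings.py; the queue values shown simply repeat the defaults from the table above:

# douban/settings.py (excerpt): register the MongoDB-backed scheduler
SCHEDULER = 'douban.scheduler.Scheduler'

# queue settings read by Scheduler.from_crawler()
MONGODB_QUEUE_SERVER = 'localhost'
MONGODB_QUEUE_PORT = 27017
MONGODB_QUEUE_DB = 'scrapy'
MONGODB_QUEUE_PERSIST = True
MONGODB_QUEUE_NAME = None      # None means "<spider name>_queue"
MONGODB_QUEUE_TYPE = 'FIFO'    # or 'LIFO'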
With the task queue split out like this, it will be straightforward to later turn the spider into a distributed crawler and get past the limits of a single machine; the git-osc repo has been updated accordingly.
Some people will worry about the efficiency of such a task queue. On my own machine I let the queue grow to nearly a million entries and ran a fairly complex query against MongoDB with no indexes at all, and the results were still decent: 8 GB of RAM and an i5, memory not exhausted, with plenty of other programs open at the same time. If you are reading this, feel free to run a test of your own; it is really not too bad.
This article is reposted from 拖鞋崽's 51CTO blog. Original link: http://blog.51cto.com/1992mrwang/1583539