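The system needs to visualize the crawled data
The data collected by this system is stored in a database. Viewing it there is inconvenient, especially when the volume of crawled listings is large, and a database browsing interface is too plain to bring out the characteristics of the data. The system therefore presents the crawled listing data through a user interface in visualized form.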
3.2 Functional Requirements Analysis
3.2.1 Data Crawling Function
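The data crawling function downloads housing listing data from the source website. This system performs distributed crawling of second-hand housing information; the raw data comes from the Guangzhou second-hand housing section of Lianjia (链家网). Distributed crawling uses one Master server and multiple Slave servers to crawl pages in parallel, which increases crawling speed and efficiency. The Master is responsible for crawling and storing the URLs found on the listing (index) pages, and the Slaves are responsible for crawling the detail-page URLs and storing the results.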
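The division of labor described above can be illustrated with a minimal sketch. The text does not say how the Master and the Slaves exchange URLs, so the shared queue here is an assumption: `UrlQueue` is a hypothetical in-memory stand-in (a real deployment would need a store reachable from every server), and `master_collect`, `slave_work`, and `fetch_detail` are illustrative names rather than parts of the system.

```python
from collections import deque


class UrlQueue:
    """Hypothetical in-memory stand-in for a queue shared by all servers."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


def master_collect(listing_pages, queue):
    """Master role: walk the listing pages and store each detail URL once.

    listing_pages maps a listing-page URL to the detail URLs found on it
    (assumed to be extracted already).
    """
    added = 0
    for detail_urls in listing_pages.values():
        for url in detail_urls:
            if queue.push(url):
                added += 1
    return added


def slave_work(queue, fetch_detail, batch_size):
    """Slave role: take up to batch_size detail URLs and fetch each one."""
    results = []
    for _ in range(batch_size):
        url = queue.pop()
        if url is None:
            break
        results.append(fetch_detail(url))
    return results
```

Because every Slave pulls from the same queue, adding Slaves increases throughput without any Slave needing to know which URLs the others have taken. The duplicate check in `push` keeps a listing that appears on two index pages from being fetched twice.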
import hashlib
from datetime import datetime

import scrapy
from scrapy.http.response.text import TextResponse

from scrapy_lianjia_ershoufang.items import ScrapyLianjiaErshoufangItem


class ErshoufangSpider(scrapy.Spider):
    name = 'ErshoufangSpider'

    def __init__(self, name=None, **kwargs):
        super().__init__(name=name, **kwargs)
        # Fall back to 'sz' when no city code is passed (e.g. -a city=gz).
        if getattr(self, 'city', None) is None:
            self.city = 'sz'
        self.allowed_domains = ['%s.lianjia.com' % self.city]

    def start_requests(self):
        # Listing pages pg1 to pg100 of the selected city.
        urls = ['https://%s.lianjia.com/ershoufang/pg%d/' % (self.city, i)
                for i in range(1, 101)]
        for url in urls:
            yield scrapy.Request(url, self.parse, headers={'Referer': url})

    def parse(self, response: TextResponse):
        # Each <li> under ul.sellListContent is one listing.
        for li in response.css('ul.sellListContent li'):
            item = ScrapyLianjiaErshoufangItem()
            item['title'] = (li.css('div.title a::text').get()
                             .replace(':', '').replace(',', ' ').replace('\n', ''))

            # houseInfo text: "| room | area平米 | orientation | decoration | elevator"
            house_infos = li.css('div.address .houseInfo::text').re(
                r'\|\s+(.*)\s+\|\s+(.*)平米\s+\|\s+(.*)\s+\|\s+(.*)\s+\|\s+(.*)')
            item['room'] = house_infos[0]
            item['area'] = house_infos[1]
            item['orientation'] = house_infos[2]
            item['decoration'] = house_infos[3]
            item['elevator'] = house_infos[4]

            item['xiaoqu'] = li.css('div.address a::text').get()
            item['flood'] = (li.css('div.flood .positionInfo::text').get()
                             .replace('-', '').strip())
            item['location'] = li.css('div.flood .positionInfo a::text').get()

            # followInfo text: followers / number of viewings / time since publication
            follow_infos = li.css('div.followInfo::text').re(
                r'(.*)人关注\s+/\s+共(.*)次带看\s+/\s+(.*)发布')
            item['follow_number'] = follow_infos[0]
            item['look_number'] = follow_infos[1]
            item['pub_duration'] = follow_infos[2]

            item['total_price'] = li.css('div.priceInfo div.totalPrice span::text').get()
            unit_price = li.css('div.priceInfo .unitPrice span::text').re(r'单价(.*)元/平米')
            item['unit_price'] = unit_price[0]
            item['total_unit'] = li.css('div.totalPrice::text').get()
            item['crawl_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

            # house_id is the MD5 of the fields that describe the listing itself.
            fingerprint = ''.join(str(item[key]) for key in (
                'title', 'room', 'area', 'orientation',
                'elevator', 'xiaoqu', 'flood', 'location'))
            item['house_id'] = self.generate_md5(fingerprint)
            yield item

    def generate_md5(self, text):
        # Create an MD5 object and return the hex digest of the UTF-8 encoded text.
        hl = hashlib.md5()
        hl.update(text.encode(encoding='utf-8'))
        return hl.hexdigest()
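The spider above derives `house_id` from the fields that describe the listing itself and leaves out price and follow counts. One plausible use, which the text does not state and is assumed here, is to recognize the same listing when it is crawled again. The sketch below shows that idea in isolation; `house_fingerprint` and `drop_duplicates` are hypothetical helpers, and plain dictionaries stand in for the Scrapy items.

```python
import hashlib

# Same fields, in the same order, that the spider concatenates for house_id.
FINGERPRINT_FIELDS = ('title', 'room', 'area', 'orientation',
                      'elevator', 'xiaoqu', 'flood', 'location')


def house_fingerprint(item):
    """Return the MD5 hex digest of the listing's descriptive fields."""
    text = ''.join(str(item[field]) for field in FINGERPRINT_FIELDS)
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def drop_duplicates(items):
    """Keep the first item seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for item in items:
        fingerprint = house_fingerprint(item)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(item)
    return unique
```

Since the price is not part of the fingerprint, a listing whose price changes between two crawls keeps the same identifier, so a later stage could update the stored record instead of inserting a second one.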