Scrapy和Django实现蚌埠医学院手机新闻网站制作-阿里云开发者社区

Scrapy和Django实现蚌埠医学院手机新闻网站制作

2018-11-19 1667

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介：

最终效果（不看效果就讲过程都是耍流氓）：
列表页、详情页展示

实现过程如下:

框架：

Scrapy：数据采集
Django：数据呈现

目标网站：
蚌埠医学院学院新闻列表:http://www.bbmc.edu.cn/index.php/view/viewcate/0/

蚌埠医学院学院新闻列表截图

第一步：数据抓取

新建爬虫项目

在终端中执行命令

srapy startproject bynews

执行完毕，自动新建好项目文件
爬虫项目目录

编写爬虫代码

在爬虫目录spider中，新建具体爬虫的by_spider.py文件：

by_spider文件

爬虫代码功能说明：

新闻列表自动翻下一页，直到结束，每个新闻列表页提取进如新闻详情页的url
逐个新闻详情页面进入，提取新闻名称，发文机构，发文时间，新闻内容

爬虫源代码内容：

import scrapy
from bynews.items import BynewsItem
import re
class BynewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://www.bbmc.edu.cn/index.php/view/viewcate/0/']
    allowed_domains = ['bbmc.edu.cn']

    def parse(self, response):
        news = response.xpath("//div[@class='cate_body']//ul/li")
        next_page = response.xpath("//div[@class='pagination']/a[contains(text(),'>')]/@href").extract_first()
        href_urls = response.xpath("//div[@class='cate_body']/ul/li/a/@href").extract()
        for href in href_urls:
            yield scrapy.Request(href,callback=self.parse_item)
        if next_page is not None:
            yield scrapy.Request(next_page)

    def parse_item(self, response):
        item = BynewsItem() 
        item['title'] = response.xpath("//div[@id='news_main']/div/h1/text()").extract_first()
        # 获取 作者和发文日期
        string = response.xpath("//div[@class='title_head']/span/text()").extract_first()
        # 通过分割获取文章发布日期
        full_date = string.split('/')
        year = full_date[0][-4:]
        month = full_date[1]
        day = full_date[2]
        item['post_date'] = year + month +day
        # 通过正则表达式获取作者
        matchObj = re.search(r'\[.*\]',string)
        # 去除两边的中括号
        string = matchObj.group()
        string = string[1:]
        string = string[:-1]
        item['author'] = string
        contents = response.xpath("//div[@id='news_main']/div/div/span/text()").extract_first()
        text = ''
        for content in contents :
            text = text + content
        item['content'] = text
        yield item

提取的数据结构items.py文件：

import scrapy
class BynewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    post_date = scrapy.Field()
    content = scrapy.Field()

爬虫项目的settings.py文件，增加请求头，关闭爬虫协议，仅放修改的部分，其他的默认即可：

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

保存爬取数据

准备工作就绪后，执行scrapy crawl news -o news.csv命令开始抓取数据，并且报存到文件news.csv中，方便后续建站导入数据：
爬虫抓取数据过程图

采集好的数据：
csv文件图

爬虫过程结束，准备建站啦~

第二步：数据呈现

新建项目

在终端中执行命令：django-admin startproject 项目名

新建应用

执行命令：python manage.py startapp bynews新建应用
Django目录截图

Django采用MVC架构，需要写的内容比较多，就不一一放图了，直接甩上代码：

项目目录下

urls.py文件：

from django.contrib import admin
from django.urls import include,path
urlpatterns = [
    path('admin/', admin.site.urls),
    path('blog/', include('blog.urls')),
    # 蚌医新闻
    path('bynews/',include('bynews.urls')),
]

settings.py文件(仅放修改部分)：

INSTALLED_APPS = [
    'blog.apps.BlogConfig',
    'bynews.apps.BynewsConfig',
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
]

bynews应用目录下：

模型models.py文件(模型文件弄好后，需要执行数据迁移什么的命令，Django保证了所有操作都是基于面向对象，十分强大)：

from django.db import models
class Bynews(models.Model):
    title = models.CharField(max_length=30, default='Title')
    author = models.CharField(null = True,max_length=30)
    content= models.TextField(null = True)
    def __str__(self):
        return self.title

视图views.py文件：

from django.shortcuts import render,get_object_or_404
from .models import Bynews

def index(request):
    news = Bynews.objects.order_by('id')[:30]
    return render(request, 'bynews/newslist.html',{'news_list':news})

def detail(request,news_id):
    news = get_object_or_404(Bynews,pk=news_id)
    return render(request, 'bynews/detail.html', {'news':news})

urls.py文件：

from django.urls import path 
from . import views

app_name = 'bynews'

urlpatterns = [
    # ex:/bynews/
    path('', views.index, name='index'),
    # ex:/bynews/4/
    path('<int:news_id>/', views.detail, name = 'detail'),
    #path('<int:news_id>/vote/', views.form ,name='form'),
]

在应用目录下新建template文件夹用来存放模板文件：

放上一个新闻列表页的模板newslist文件，需要注意其中的url生成方式，static的生成方式:

{% load static %}
<!DOCTYPE html>
<html lang="en" class="app">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
  <meta name="description" content="" />
  <meta name="keywords" content="" />
  <link rel="stylesheet" type="text/css" href="{% static 'bynews/css/style.css' %}">
  <title>蚌埠医学院 - 新闻列表</title>
</head>
<body>
    <header class="header">
            <h1>蚌埠医学院</h1>
            <h2>新闻列表</h2>
        </header>
    <ul class="list">
    {% for news in news_list %}
          <li>
            <a href="default.htm">
              <i class="fl"></i>
              <span class="fl"><a href="{% url 'bynews:detail' news.id %}" >{{news.title}}</a></span>
              <em class="fl"> </em>
            </a>
          </li>
    {% endfor %}
    </ul>


    <script src="{% static 'bynews/js/vue.js' %}" type="text/javascript"></script>
        <script>
            
        </script>
</body>
</html>

最终效果(请忽视正文内容不全问题，这是采集没弄好，懒得弄了，基本想法实现了就行)：

列表页、详情页展示

Scrapy和Django实现蚌埠医学院手机新闻网站制作

框架：

第一步：数据抓取

新建爬虫项目

编写爬虫代码

保存爬取数据

第二步：数据呈现

新建项目

新建应用

项目目录下

bynews应用目录下：

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Scrapy和Django实现蚌埠医学院手机新闻网站制作

框架：

第一步：数据抓取

新建爬虫项目

编写爬虫代码

保存爬取数据

第二步：数据呈现

新建项目

新建应用

项目目录下

bynews应用目录下：

热门文章

最新文章

相关课程

相关电子书

相关实验场景