Scraping Toutiao Images and Articles with Python Requests - Alibaba Cloud Developer Community


2023-08-29



This script uses Python's Requests library to scrape images and articles from Toutiao and stores the results in MongoDB.

config.py

MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
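GROUP_START = 1   # first page index; the request offset is index * 20
GROUP_END = 20    # last page index (inclusive)
KEYWORD = '原油'  # search keyword ("crude oil")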

toutiao.py

import json
import os
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import pymongo
import requests
from requests.exceptions import ConnectionError

from config import *

client = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
db = client[MONGO_DB]


def get_page_index(offset, keyword):
    """Yield [article_url, abstract, large_image_url] for each search result."""
    params = {
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'app_name': 'web_search',
        'format': 'search_tab',
        'keyword': keyword,
        'offset': offset,
    }
    url = 'https://www.toutiao.com/api/search/content?' + urlencode(params)
    try:
        response = requests.get(url)
    except ConnectionError:
        print('Error occurred')
        return
    if response.status_code != 200:
        return
    data = json.loads(response.text)
    # The payload, or its 'data' field, may be missing or null
    for item in (data or {}).get('data') or []:
        if item is not None:
            yield [item.get('article_url'), item.get('abstract'), item.get('large_image_url')]
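

def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except (requests.RequestException, OSError) as e:
        # Report network and file-system failures instead of silently ignoring them
        print('Failed to download', url, e)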
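

def save_image(content):
    # Name the file after the MD5 of its bytes so identical images are stored once
    directory = os.path.join(os.getcwd(), 'picture')
    os.makedirs(directory, exist_ok=True)  # create the output directory on first use
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    print(file_path)
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)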
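

def save_to_mongo(result):
    # insert_one raises on failure rather than returning a falsy value
    try:
        db[MONGO_TABLE].insert_one(result)
    except pymongo.errors.PyMongoError as e:
        print('Failed to save to Mongo', e)
        return False
    print('Successfully Saved to Mongo', result)
    return True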


def main(offset):
    for article_url, abstract, image_url in get_page_index(offset, KEYWORD):
        if image_url:
            download_image(image_url)
        # Store the record if at least one field is non-empty
        if article_url or abstract or image_url:
            record = {
                'article_url': article_url,
                'abstract': abstract,
                'large_image_url': image_url,
            }
            save_to_mongo(record)


if __name__ == '__main__':
    # One offset per result page: 20, 40, ... with the default config
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()
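
The field extraction and offset calculation used by the script can be checked without touching the network. The sketch below assumes a response shaped the way the script expects (a top-level `data` list whose entries may be null). `extract_items`, `page_offsets` and `SAMPLE_RESPONSE` are hypothetical names introduced only for illustration, and the sample values are made up.

```python
import json

# Hypothetical sample payload; a real response may carry many more fields.
SAMPLE_RESPONSE = json.dumps({
    "data": [
        {
            "article_url": "https://example.com/a1",
            "abstract": "first",
            "large_image_url": "https://example.com/a1.jpg",
        },
        None,  # null entries are skipped
        {"abstract": "no url or image"},
    ]
})


def extract_items(text):
    """Return [article_url, abstract, large_image_url] for each non-null entry."""
    payload = json.loads(text)
    rows = []
    for item in (payload or {}).get("data") or []:
        if item is not None:
            rows.append([
                item.get("article_url"),
                item.get("abstract"),
                item.get("large_image_url"),
            ])
    return rows


def page_offsets(group_start, group_end, page_size=20):
    """Offsets requested by the script: one per page index, inclusive."""
    return [x * page_size for x in range(group_start, group_end + 1)]


rows = extract_items(SAMPLE_RESPONSE)
offsets = page_offsets(1, 20)
```

With the values from config.py, `offsets` runs from 20 to 400, so the results at offset 0 are not requested unless GROUP_START is set to 0.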

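Further reading:

1. Maintaining a cookie pool with Flask + Redis

Why use a cookie pool

Some sites can only be scraped after logging in, for example Sina Weibo.

Scraping too frequently can get an account banned.

Large-scale scraping therefore requires maintaining a pool of cookies from multiple accounts.

Requirements for a cookie pool

Automatic login and refresh

Periodic validation and filtering

An external interface

Code: https://github.com/Python3WebSpider/CookiesPool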
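
The requirements above can be illustrated with a minimal in-memory sketch. It is not the API of the CookiesPool project linked above: a plain dict stands in for Redis, there is no Flask interface, and the `CookiePool` class and its method names are assumptions made for this example.

```python
import random


class CookiePool:
    """In-memory stand-in for a Redis-backed cookie pool (illustrative only)."""

    def __init__(self):
        self._cookies = {}  # account name -> cookie dict

    def add(self, account, cookies):
        """Store or refresh the cookies for an account."""
        self._cookies[account] = cookies

    def remove(self, account):
        self._cookies.pop(account, None)

    def random(self):
        """Return the cookies of a randomly chosen account, or None if empty."""
        if not self._cookies:
            return None
        account = random.choice(list(self._cookies))
        return self._cookies[account]

    def validate(self, is_valid):
        """Drop every account whose cookies fail the caller-supplied check."""
        for account in list(self._cookies):
            if not is_valid(self._cookies[account]):
                self.remove(account)

    def __len__(self):
        return len(self._cookies)


cookie_pool = CookiePool()
cookie_pool.add("user_a", {"session": "abc"})
cookie_pool.add("user_b", {"session": ""})  # pretend this session has expired
cookie_pool.validate(lambda cookies: bool(cookies.get("session")))
```

In a real scraper the dict returned by `random()` could be passed to Requests as `requests.get(url, cookies=...)`, and the validity check would typically issue a test request against the target site on a schedule.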

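2. Maintaining a proxy pool with Flask + Redis

Why use a proxy pool?

Many sites have dedicated anti-scraping measures, so you may run into problems such as IP bans.

Large numbers of free proxies are published on the internet and are a resource worth using.

Checking them on a schedule likewise yields multiple usable proxies.

Requirements for a proxy pool

Scrape multiple sources, check asynchronously

Filter on a schedule, update continuously

Provide an interface for easy retrieval

Code: https://github.com/Python3WebSpider/ProxyPool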
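
One way to picture the "filter on a schedule" requirement is a score per proxy that rises when a check succeeds and falls when it fails. The sketch below is an assumption for illustration, not the ProxyPool project's implementation: the `ProxyPool` class, its scoring rule and the proxy addresses are all hypothetical, and a dict stands in for Redis.

```python
class ProxyPool:
    """In-memory stand-in for a Redis-backed proxy pool (illustrative only)."""

    MAX_SCORE = 10

    def __init__(self, initial_score=5):
        self._scores = {}  # "host:port" -> score
        self._initial_score = initial_score

    def add(self, proxy):
        """Register a proxy with the initial score unless it is already known."""
        self._scores.setdefault(proxy, self._initial_score)

    def report(self, proxy, ok):
        """Promote a working proxy to MAX_SCORE; demote a failing one, dropping it at 0."""
        if proxy not in self._scores:
            return
        if ok:
            self._scores[proxy] = self.MAX_SCORE
        else:
            self._scores[proxy] -= 1
            if self._scores[proxy] <= 0:
                del self._scores[proxy]

    def best(self):
        """Return the highest-scoring proxy, or None if the pool is empty."""
        if not self._scores:
            return None
        return max(self._scores, key=self._scores.get)

    def as_requests_proxies(self):
        """Format the best proxy as the mapping accepted by requests' proxies argument."""
        proxy = self.best()
        if proxy is None:
            return None
        return {"http": "http://" + proxy, "https": "http://" + proxy}


proxy_pool = ProxyPool(initial_score=2)
proxy_pool.add("10.0.0.1:8080")
proxy_pool.add("10.0.0.2:8080")
proxy_pool.report("10.0.0.2:8080", ok=True)   # passed a check
proxy_pool.report("10.0.0.1:8080", ok=False)  # failed twice, so it is dropped
proxy_pool.report("10.0.0.1:8080", ok=False)
```

The mapping from `as_requests_proxies()` can be passed as `requests.get(url, proxies=...)`. A production pool would run the checks asynchronously against many candidate proxies, which this sketch leaves out.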

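3. VirtualEnv

The main benefit of virtualenv is that every Python project can use its own environment, which affects neither the system Python environment nor the environments of other projects.

Installation: virtualenv is itself a Python package, so install it with pip: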

pip install virtualenv

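Create a virtual environment in the working directory (by default it is created in the current directory). Note that you must give the environment a name:

~$ virtualenv TestEnv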
New python executable in ~/TestEnv/bin/python
Installing setuptools, pip, wheel...done.

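By default, a virtual environment does not include the system site-packages. To make them available, add this option:

Syntax: virtualenv --system-site-packages TestEnv

To create a virtual environment with virtualenv's default Python version and without the system site-packages:

Syntax: virtualenv ubuntu_env

Older virtualenv releases also accepted --no-site-packages to state this default explicitly. The flag was removed in virtualenv 20, so omit it on current versions.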
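
The same isolation can be tried from Python itself. This sketch swaps in the standard library's `venv` module rather than the virtualenv package described above, so it is an analogue and not the same tool. It creates an environment in a temporary directory (the name `TestEnv` is reused from the example above) and reads back the generated `pyvenv.cfg`.

```python
import os
import tempfile
import venv

# Build the environment in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "TestEnv")
    # system_site_packages=False mirrors the isolated default;
    # with_pip=False skips bootstrapping pip to keep creation fast.
    venv.create(env_dir, system_site_packages=False, with_pip=False)
    with open(os.path.join(env_dir, "pyvenv.cfg")) as f:
        cfg = f.read()
```

Passing `system_site_packages=True` instead plays the role of the `--system-site-packages` option.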

4. URL deduplication strategies
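
The article names this topic without elaborating. One common approach, shown here as an assumption rather than as the author's method, is to keep a set of fixed-size fingerprints of the URLs already processed, in the same spirit as the MD5-based file names used for images above. `SeenUrls` is a hypothetical helper name.

```python
from hashlib import md5


class SeenUrls:
    """Remember MD5 fingerprints of URLs so each one is processed once."""

    def __init__(self):
        self._fingerprints = set()

    def add(self, url):
        """Return True if the URL is new, False if it was seen before."""
        fingerprint = md5(url.encode("utf-8")).hexdigest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


seen = SeenUrls()
results = [seen.add(u) for u in [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate of the first URL
]]
```

Storing fingerprints instead of full URLs keeps memory use per entry constant. For crawls too large for a single process, the same set could live in Redis or be replaced by a Bloom filter, at the cost of extra infrastructure or a small false-positive rate.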
