Python Async Crawlers: How Coroutines Work (Part 2)

Continued from Python Async Crawlers: How Coroutines Work (Part 1): https://developer.aliyun.com/article/1620696

Multi-task coroutines
What if you want to issue several requests? Define a list of tasks and pass it to the wait method of the asyncio package, as shown below:

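import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    # requests.get is a blocking call that returns a Response object
    response = requests.get(url)
    return response

# Create the event loop explicitly; relying on an implicit loop is deprecated in recent Python versions
loop = asyncio.new_event_loop()

# Wrap five coroutines in Task objects bound to that loop
tasks = [asyncio.ensure_future(request(), loop=loop) for _ in range(5)]
print('Tasks:', tasks)

# asyncio.wait completes once every task has finished
loop.run_until_complete(asyncio.wait(tasks))

for task in tasks:
    print('Task Result:', task.result())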

The output is as follows:
Tasks: [<Task pending name='Task-1' coro=<request() running at /Users/bruce_liu/PycharmProjects/崔庆才--爬虫/6章异步爬虫/多任务协程.py:5>>, <Task pending name='Task-2' coro=<request() running at /Users/bruce_liu/PycharmProjects/崔庆才--爬虫/6章异步爬虫/多任务协程.py:5>>, <Task pending name='Task-3' coro=<request() running at /Users/bruce_liu/PycharmProjects/崔庆才--爬虫/6章异步爬虫/多任务协程.py:5>>, <Task pending name='Task-4' coro=<request() running at /Users/bruce_liu/PycharmProjects/崔庆才--爬虫/6章异步爬虫/多任务协程.py:5>>, <Task pending name='Task-5' coro=<request() running at /Users/bruce_liu/PycharmProjects/崔庆才--爬虫/6章异步爬虫/多任务协程.py:5>>]
Task Result: <Response [200]>
Task Result: <Response [200]>
Task Result: <Response [200]>
Task Result: <Response [200]>
Task Result: <Response [200]>
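
On recent Python versions, the same task-list pattern is often written with asyncio.run and asyncio.gather, which returns the results in the order the tasks were passed in. The sketch below is only a minimal illustration of that style. The fake_request coroutine is a hypothetical stand-in, not part of the original code: it uses asyncio.sleep instead of a real HTTP call so that it runs without network access.

```python
import asyncio


async def fake_request(index):
    # Hypothetical stand-in for a network call: yield to the event loop briefly.
    await asyncio.sleep(0.01)
    return f'result-{index}'


async def main():
    # create_task schedules each coroutine on the running loop right away.
    tasks = [asyncio.create_task(fake_request(i)) for i in range(5)]
    # gather waits for all tasks and returns their results in submission order.
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
for result in results:
    print('Task Result:', result)
```

asyncio.run creates and closes the event loop itself, so there is no loop object to manage by hand.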

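Implementing coroutines
Coroutines are particularly well suited to IO-bound tasks. Time-consuming waits are usually IO operations, such as file reads and network requests. When a wait is needed, the program can suspend the current task and do other work instead of wasting time.
Take https://www.httpbin.org/delay/5, which responds only after a 5-second delay, as an example to see coroutines in action. The sample code is as follows: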

import asyncio
import requests
import time

start = time.time()

async def request():
    url = 'https://www.httpbin.org/delay/5'
    print('waiting for', url)
    # requests.get blocks until the response arrives; the coroutine never yields here
    response = requests.get(url)
    print('Get response from', url, 'response', response)

# Create the event loop explicitly and schedule ten tasks on it
loop = asyncio.new_event_loop()
tasks = [asyncio.ensure_future(request(), loop=loop) for _ in range(10)]
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Cost time:', end - start)

The output is as follows:
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
...
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <Response [200]>
Cost time: 63.61974787712097

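As the output shows, this is no different from issuing the requests sequentially: ten requests of about five seconds each took more than 60 seconds in total. So where is the benefit of asynchronous processing? To get it, a task must actually be suspended. When a task has to wait for an IO result, it is suspended so that other tasks can run, and only then are resources used effectively. Here requests.get is a blocking call that is never awaited, so no task ever yields control.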
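
The difference between a blocking wait and a suspending wait can be seen without any network access. The sketch below is an illustration under stated assumptions: blocking_job, suspending_job, and timed are hypothetical names, and the 0.2-second sleeps stand in for the IO wait of a real request.

```python
import asyncio
import time


async def blocking_job():
    # time.sleep blocks the whole thread, so no other task can run meanwhile.
    time.sleep(0.2)


async def suspending_job():
    # await hands control back to the event loop while this task waits.
    await asyncio.sleep(0.2)


async def timed(job, count):
    # Run `count` copies of the job concurrently and measure the elapsed time.
    start = time.perf_counter()
    await asyncio.gather(*(job() for _ in range(count)))
    return time.perf_counter() - start


blocking_cost = asyncio.run(timed(blocking_job, 3))
suspending_cost = asyncio.run(timed(suspending_job, 3))
print('Blocking cost:', round(blocking_cost, 2))
print('Suspending cost:', round(suspending_cost, 2))
```

The three blocking jobs run one after another, so their waits add up, while the three suspending jobs wait at the same time and finish in roughly the time of one.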

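Using aiohttp
aiohttp is a library that supports asynchronous requests. Used together with asyncio, it makes asynchronous requests very convenient.
aiohttp has two parts: a Client and a Server.
Let's put aiohttp to work and change the code as follows: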

import asyncio
import aiohttp
import time

start = time.time()

async def get(url):
    # async with closes the session and releases the response even if an error occurs
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Read the body before the connection is released
            await response.text()
            return response

async def request():
    url = 'https://www.httpbin.org/delay/5'
    print('Waiting for', url)
    # await suspends this task until get() finishes, letting other tasks run
    response = await get(url)
    print('Get response from', url, 'response', response)

# Create the event loop explicitly and schedule ten tasks on it
loop = asyncio.new_event_loop()
tasks = [asyncio.ensure_future(request(), loop=loop) for _ in range(10)]
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Cost time:', end - start)

The output is as follows:
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
...
Get response from https://www.httpbin.org/delay/5 response <ClientResponse(https://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Sat, 23 Mar 2024 13:42:05 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>

Get response from https://www.httpbin.org/delay/5 response <ClientResponse(https://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Sat, 23 Mar 2024 13:42:05 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
...
Get response from https://www.httpbin.org/delay/5 response <ClientResponse(https://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Sat, 23 Mar 2024 13:42:05 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>

Get response from https://www.httpbin.org/delay/5 response <ClientResponse(https://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Sat, 23 Mar 2024 13:42:05 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>

Cost time: 6.868626832962036

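Here the request library was switched from requests to aiohttp, and each request is made with the get method of aiohttp's ClientSession class. The ten requests now finish in about 6.9 seconds in total, because each task is suspended at await while it waits for its response.

Next, test how long the requests take at concurrency levels of 1, 3, 5, 10, ..., 500. The code is as follows: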

import asyncio
import aiohttp
import time

# One explicitly created event loop, reused for every test run
loop = asyncio.new_event_loop()

def test(number):
    start = time.time()

    async def get(url):
        # async with closes the session and releases the response automatically
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                await response.text()
                return response

    async def request():
        url = 'https://www.baidu.com/'
        await get(url)

    # Schedule `number` concurrent requests and wait for all of them
    tasks = [asyncio.ensure_future(request(), loop=loop) for _ in range(number)]
    loop.run_until_complete(asyncio.wait(tasks))

    end = time.time()
    print('Number:', number, 'Cost time:', end - start)

for number in [1, 3, 5, 10, 15, 30, 50, 75, 100, 200, 500]:
    test(number)

The output is as follows:
Number: 1 Cost time: 0.23929095268249512
Number: 3 Cost time: 0.19086170196533203
Number: 5 Cost time: 0.20035600662231445
Number: 10 Cost time: 0.21305394172668457
Number: 15 Cost time: 0.25495195388793945
Number: 30 Cost time: 0.769071102142334
Number: 50 Cost time: 0.3470029830932617
Number: 75 Cost time: 0.4492309093475342
Number: 100 Cost time: 0.586918830871582
Number: 200 Cost time: 1.0910720825195312
Number: 500 Cost time: 2.4768006801605225