Web Scraping with Splash (Part 2) - Alibaba Cloud Developer Community


2024-10-09


Continued from Web Scraping with Splash (Part 1): https://developer.aliyun.com/article/1617947

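HTTP API in Splash
Splash provides an HTTP API through which Python code can interact with it. The most commonly used endpoints and how to use them are described below:

1. render.html
This endpoint returns the HTML of a page after its JavaScript has been rendered. The request address is:

http://localhost:8050/render.html

The code is as follows: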

# -*- coding: utf-8 -*-
# Author   : liuxiaowei
# Created  : 2/10/22 9:57 PM
# File     : 使用render.html接口获取百度首页图片链接.py
# IDE      : PyCharm

# Import the HTTP request module
import requests
# Import the HTML parsing module
from bs4 import BeautifulSoup

# Address of Splash's render.html endpoint
splash_url = 'http://localhost:8050/render.html'

# Address of the page to scrape
args = {'url': 'https://www.baidu.com'}

# Request the Baidu home page through the render.html endpoint
resp = requests.get(splash_url, params=args)
# Request the Baidu home page directly, without render.html
# resp = requests.get('https://www.baidu.com/')
# Set the response encoding
resp.encoding = 'utf-8'
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(resp.text, 'html.parser')

# Get the link to the logo image on the Baidu home page
img_url = 'https:' + soup.select('div[class="s-p-top"]')[0].select('img')[0].attrs['src']

# Print the link
print(img_url)

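The program prints:

https://www.baidu.com/img/PC_880906d2a4ad95f5fafb2e540c5cdad7.png

Process finished with exit code 0

If you request the Baidu home page directly, without going through render.html, the program raises an error. The link to the logo image is produced by JavaScript rendering, so it cannot be extracted from HTML that has not been rendered by Splash. The error message is as follows: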

Traceback (most recent call last):
  File "/Users/liuxiaowei/PycharmProjects/爬虫练习/明日科技/爬取动态渲染的信息/搭建和运行Splash环境/使用render.html接口获取百度首页图片链接.py", line 26, in <module>
    img_url = 'https:' + soup.select('div[class="s-p-top"]')[0].select('img')[0].attrs['src']
IndexError: list index out of range

Process finished with exit code 1

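Besides the basic url parameter, render.html accepts a number of other parameters. The commonly used ones are listed in the table below:

Common render.html parameters and their descriptions

Parameter    Description
timeout      Timeout for rendering the page, in seconds
proxy        Address of the proxy server to use
wait         Time to wait for updates after the page has loaded, in seconds
images       Whether to download images: 1 (the default) downloads them, 0 does not
js_source    Custom JavaScript code to run in the page context, after the page has loaded but before the result is rendered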

Note

For the other parameters of the Splash API, see the official documentation:

https://splash.readthedocs.io/en/stable/api.html
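
The parameters in the table are passed as ordinary query-string arguments alongside url. The following is a minimal sketch of how that might look, assuming a local Splash instance at http://localhost:8050 as in the example above. The helper name build_render_url and the chosen values (wait=2, timeout=30, images=0) are hypothetical and not from the original article. The sketch only builds the request URL with the standard library, so it can be inspected without a running Splash service.

```python
from urllib.parse import urlencode

# Assumed address of the local Splash service (same as in the examples above)
SPLASH_BASE = 'http://localhost:8050'


def build_render_url(endpoint, page_url, **options):
    """Build a Splash request URL such as .../render.html?url=...&wait=2.

    `endpoint` is e.g. 'render.html'; `options` are Splash arguments
    like wait, timeout, images or proxy. Options set to None are skipped.
    """
    params = {'url': page_url}
    params.update({k: v for k, v in options.items() if v is not None})
    return f'{SPLASH_BASE}/{endpoint}?{urlencode(params)}'


# Hypothetical values: wait 2 s after load, give up after 30 s, skip images
request_url = build_render_url(
    'render.html', 'https://www.baidu.com',
    wait=2, timeout=30, images=0,
)
print(request_url)
```

The resulting URL can be passed directly to requests.get(request_url). Equivalently, the same options can be added to the args dictionary used in the example above.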

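2. render.png
This endpoint returns a screenshot of the target page. The request address is:

http://localhost:8050/render.png

Compared with render.html, render.png has two additional important parameters, "width" and "height", which set the width and height of the screenshot. The following example takes a screenshot of the Baidu home page.

Sample code: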

# -*- coding: utf-8 -*-
# Author   : liuxiaowei
# Created  : 2/10/22 10:51 PM
# File     : 使用render.png接口获取百度首页截图.py
# IDE      : PyCharm

# Import the HTTP request module
import requests

# Address of Splash's render.png endpoint
splash_url = 'http://localhost:8050/render.png'
# Address of the page to capture, plus the screenshot width and height
args = {'url': 'https://www.baidu.com/', 'width': 1280, 'height': 800}

# Send the request
resp = requests.get(splash_url, params=args)
# Open the output file in binary write mode
with open('baidu.png', 'wb') as f:
    # Save the returned binary data as an image
    f.write(resp.content)

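Running the program creates an image file named "baidu.png" in the current directory; opening it shows a screenshot of the Baidu home page.

Note

Splash also provides a render.jpeg endpoint. It works like render.png, except that it returns binary data in JPEG format.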
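
The screenshot example writes resp.content to disk without checking it. If rendering fails (for example on a timeout), the response body is an error message rather than image data, and the saved file will not open. The following sketch shows one way to guard against that by checking the leading bytes of the data: PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A, and JPEG files start with FF D8 FF. The helper names detect_image_type and save_screenshot are hypothetical and not part of Splash or the original article.

```python
# Leading "magic" bytes that identify each image format
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
JPEG_SIGNATURE = b'\xff\xd8\xff'


def detect_image_type(data):
    """Return 'png', 'jpeg', or None based on the leading bytes of `data`."""
    if data.startswith(PNG_SIGNATURE):
        return 'png'
    if data.startswith(JPEG_SIGNATURE):
        return 'jpeg'
    return None


def save_screenshot(data, path):
    """Write `data` to `path` only if it looks like a PNG or JPEG image.

    Returns True if the file was written, False otherwise.
    """
    if detect_image_type(data) is None:
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True
```

In the example above, the with open(...) block could then be replaced by a call such as save_screenshot(resp.content, 'baidu.png'), with the return value indicating whether a usable image was received.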

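3. render.json
This endpoint returns information about the JavaScript-rendered page as JSON. Depending on the parameters passed, the result can include the HTML, a PNG screenshot, and other information. The request address is:

http://localhost:8050/render.json

By default, render.json returns JSON containing the request URL, the page title, and the page geometry. The code is as follows: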

# -*- coding: utf-8 -*-
# Author   : liuxiaowei
# Created  : 2/10/22 11:11 PM
# File     : 获取请求页面的JSON信息.py
# IDE      : PyCharm

# Import the HTTP request module
import requests

# Address of Splash's render.json endpoint
splash_url = 'http://localhost:8050/render.json'

# Address of the page to scrape
args = {'url': 'https://www.baidu.com/'}
# Send the request
resp = requests.get(splash_url, params=args)

# Print the returned JSON information
print(resp.json())

Output:

{'url': 'https://www.baidu.com/', 'requestedUrl': 'https://www.baidu.com/', 'geometry': [0, 0, 1024, 768], 'title': '百度一下，你就知道'}

Process finished with exit code 0
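
As noted above, render.json can return more than the default fields. Per the Splash API documentation, passing html=1 adds an 'html' field and passing png=1 adds a base64-encoded 'png' field to the result. The following sketch shows how such a result might be unpacked. The helper name extract_page_assets and the sample dictionary are hypothetical stand-ins used so the code runs without a Splash service; they are not real output.

```python
import base64


def extract_page_assets(result):
    """Split a render.json result into (html_text, png_bytes).

    Either element is None if the corresponding field is absent,
    i.e. if html=1 or png=1 was not passed in the request.
    """
    html_text = result.get('html')
    png_b64 = result.get('png')
    png_bytes = base64.b64decode(png_b64) if png_b64 is not None else None
    return html_text, png_bytes


# Hypothetical stand-in for resp.json(); a real response would come from
# requests.get(splash_url, params={'url': ..., 'html': 1, 'png': 1})
sample = {
    'url': 'https://www.baidu.com/',
    'title': 'demo',
    'html': '<html><head><title>demo</title></head></html>',
    'png': base64.b64encode(b'\x89PNG\r\n\x1a\n').decode('ascii'),
}

html_text, png_bytes = extract_page_assets(sample)
print(len(html_text), png_bytes[:4])
```

With a real response, png_bytes can be written to a file opened in 'wb' mode, just as in the render.png example.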

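4. Running custom Lua scripts
Splash also provides a very powerful execute endpoint, which lets Python code run Lua scripts. When using it, you must supply the lua_source parameter, which contains the Lua script to run; when the script finishes, Splash returns the result to Python. The following example gets the rendered HTML of the Baidu home page: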

# -*- coding: utf-8 -*-
# Author   : liuxiaowei
# Created  : 2/10/22 11:22 PM
# File     : 获取百度渲染后的HTML代码.py
# IDE      : PyCharm

# Import the HTTP request module
import requests
# Import quote() to URL-encode the Lua script
from urllib.parse import quote

# Custom Lua script
lua_script = '''
function main(splash)
    splash:go("https://www.baidu.com/")
    splash:wait(0.5)
    return splash:html()
end
'''
# Address of Splash's execute endpoint, with the URL-encoded script as lua_source
splash_url = 'http://localhost:8050/execute?lua_source=' + quote(lua_script)

# Define the headers for the request sent to Splash
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36"
}
# Send the request
resp = requests.get(splash_url, headers=headers)

# Print the rendered HTML
print(resp.text)

The program prints:

<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#ffffff"><meta name="description" content="全球领先的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索"><link rel="icon" sizes="any" mask="" href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg"><link rel="dns-prefetch" href="//dss0.bdstatic.com"><link rel="dns-prefetch" href="//dss1.bdstatic.com"><link rel="dns-prefetch" href="//ss1.bdstatic.com"><link rel="dns-prefetch" href="//sp0.baidu.com"><link rel="dns-prefetch" href="//sp1.baidu.com"><link rel="dns-prefetch" href="//sp2.baidu.com"><title>百度一下，你就知道</title><style index="newi" type="text/css">#form .bdsug{
   top:39px}.bdsug{
   display:none;position:absolute;width:535px;background:#fff;border:1px solid #ccc!important;_overflow:hidden;box-shadow:1px 1px 3px #ededed;-webkit-box-shadow:1px 1px 3px #ededed;-moz-box-shadow:1px 1px 3px #ededed;-o-box-shadow:1px 1px 3px #ededed}.bdsug li{
   width:519px;color:#000;font:14px arial;line-height:25px;padding:0 8px;position:relative;cursor:default}.bdsug li.bdsug-s{
   background:#f0f0f0}.bdsug-store span,.bdsug-store b{
   color:#7A77C8}.bdsug-store-del{
   font-size:12px;color:#666;text-decoration:underline;position:absolute;right:8px;top:0;cursor:pointer;display:none}.bdsug-s .bdsug-store-del{
   display:inline-block}.bdsug-ala{
   display:inline-block;border-bottom:1px solid #e6e6e6}.bdsug-ala h3{
   line-height:14px;background:url(//www.baidu.com/img/sug_bd.png?v=09816787.png)

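With Lua scripts, Splash can carry out a sequence of rendering operations, so it can simulate a browser and extract data from web pages. The Lua syntax involved is fairly simple: methods of the splash object are called as "splash:method()", and its properties are accessed as "splash.property". In the script above, "function main(splash)" is the script's entry point; "splash:go("https://www.baidu.com/")" calls the go() method to visit the Baidu home page; "splash:wait(0.5)" waits for 0.5 seconds; "return splash:html()" returns the rendered HTML; and "end" closes the main function.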
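
The script above hard-codes the URL and the wait time. As a variation, Splash also makes the request arguments available to the script (as the second argument of main, or through splash.args), and a Lua table returned from main is sent back as JSON. The following is a minimal sketch under those assumptions; the helper name build_execute_request and the argument names url and wait are illustrative choices, not from the original article. The sketch builds a POST request with a JSON body but does not send it, so it runs without a Splash service.

```python
import json
from urllib.request import Request

# Lua script that reads its inputs from the request arguments
# instead of hard-coding the URL and the wait time
LUA_SCRIPT = '''
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {
        title = splash:evaljs("document.title"),
        html = splash:html(),
    }
end
'''


def build_execute_request(page_url, wait=0.5,
                          endpoint='http://localhost:8050/execute'):
    """Build (but do not send) a POST request for the execute endpoint."""
    payload = {'lua_source': LUA_SCRIPT, 'url': page_url, 'wait': wait}
    return Request(
        endpoint,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_execute_request('https://www.baidu.com/')
print(req.get_method(), req.full_url)
```

The request can be sent with urllib.request.urlopen(req), or the same payload can be posted with requests.post(endpoint, json=payload). Sending the script in a JSON body also avoids the manual quote() step needed when it is placed in the query string.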

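Common properties and methods in Lua scripts:

[Table: meanings of commonly used Lua script properties and methods]

Note

Lua scripts in Splash offer many more properties and methods. If you are interested in learning more, see the official API documentation:

https://splash.readthedocs.io/en/stable/scripting-ref.html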

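Summary

Web Scraping with Splash

Setting up the Splash environment
HTTP API in Splash
§ Getting the JavaScript-rendered HTML
§ Taking a screenshot of the target page
§ Getting JSON information about the JavaScript-rendered page
§ Running custom Lua scripts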
