Handling Pop-up Verification in Selenium Crawlers

2023-04-28



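When building crawlers, we often run into commercial websites that restrict crawler programs heavily and apply several kinds of verification to crawler requests during data collection. The crawler author then has to analyze the target site's anti-crawling strategy in depth and update and maintain the crawler regularly, which adds development time and cost. In that situation, driving a browser (optionally headless) through an automation tool such as Selenium and simulating a real user's requests is a quicker and more convenient way to collect data. To avoid IP restrictions on the target site, pair it with a crawler proxy that switches IP automatically on every request, which keeps long-running collection stable. Taking a Python demo as an example: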
from selenium import webdriver
import string
import zipfile
# Proxy server (product website)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# Proxy credentials
proxyUser = "username"
proxyPass = "password"
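def create_proxy_auth_extension(proxy_host, proxy_port,
                              proxy_username, proxy_password,
                              scheme='http', plugin_path=None):
   # Build a zipped Chrome extension that sets the proxy and answers its
   # authentication challenge, then return the path of the zip file.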
   if plugin_path is None:
       plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)
   manifest_json = """
   {
       "version": "1.0.0",
       "manifest_version": 2,
       "name": "16YUN Proxy",
       "permissions": [
           "proxy",
           "tabs",
           "unlimitedStorage",
           "storage",
           "<all_urls>",
           "webRequest",
           "webRequestBlocking"
       ],
       "background": {
           "scripts": ["background.js"]
       },
       "minimum_chrome_version":"22.0.0"
   }
   """
   background_js = string.Template(
       """
       var config = {
           mode: "fixed_servers",
           rules: {
               singleProxy: {
                   scheme: "${scheme}",
                   host: "${host}",
                   port: parseInt(${port})
               },
               bypassList: ["foobar.com"]
           }
         };
       chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
       function callbackFn(details) {
           return {
               authCredentials: {
                   username: "${username}",
                   password: "${password}"
               }
           };
       }
       chrome.webRequest.onAuthRequired.addListener(
           callbackFn,
           {urls: ["<all_urls>"]},
           ['blocking']
       );
       """
   ).substitute(
       host=proxy_host,
       port=proxy_port,
       username=proxy_username,
       password=proxy_password,
       scheme=scheme,
   )
   with zipfile.ZipFile(plugin_path, 'w') as zp:
       zp.writestr("manifest.json", manifest_json)
       zp.writestr("background.js", background_js)
   return plugin_path

proxy_auth_plugin_path = create_proxy_auth_extension(
   proxy_host=proxyHost,
   proxy_port=proxyPort,
   proxy_username=proxyUser,
   proxy_password=proxyPass)

option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")
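# If Chrome reports a chrome-extensions error
# option.add_argument("--disable-extensions")
option.add_extension(proxy_auth_plugin_path)
# Hide some of the webdriver automation flags
# option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Selenium 4 takes the options object through the `options` keyword
driver = webdriver.Chrome(options=option)
# Override the getter of navigator.webdriver
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
# get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
# The target URL is missing from the source; "https://example.com" is only a placeholder
driver.get("https://example.com")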

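Note that the directory in which the plugin_path file is stored must exist, and the program must have read and write permission on it. Otherwise the browser fails to read the proxy credentials and forces an authentication pop-up asking for the proxy username and password, which interrupts the running program.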
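
The directory check described above could be done before the extension is written. The following is only a minimal sketch: `ensure_plugin_path`, the `proxy_plugins` folder and the `proxy_auth_plugin.zip` file name are hypothetical names introduced for illustration and are not part of the original demo.

```python
import os
import tempfile


def ensure_plugin_path(plugin_path):
    """Return plugin_path after checking that its directory exists and is usable.

    Hypothetical helper, not part of the original demo.
    """
    directory = os.path.dirname(os.path.abspath(plugin_path))
    # Create the directory (and any missing parents) if it is not there yet.
    os.makedirs(directory, exist_ok=True)
    # Fail early with a clear error instead of ending up at the auth pop-up.
    if not os.access(directory, os.R_OK | os.W_OK):
        raise PermissionError("No read/write access to {}".format(directory))
    return plugin_path


# Assumed location: a folder under the system temp directory rather than a fixed drive.
plugin_path = ensure_plugin_path(
    os.path.join(tempfile.gettempdir(), "proxy_plugins", "proxy_auth_plugin.zip")
)
```

The returned path can then be passed as the plugin_path argument of create_proxy_auth_extension, so the zip file is written to a location that is known to be usable.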
