Implementing an IP Proxy Pool for a Python Crawler - Alibaba Cloud Developer Community

2017-11-14


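Introduction:

When crawling pages with multiple threads, or simply to get around anti-scraping measures, we often need to send requests through proxy IPs. Below is a basic way to do it.

Many websites offer proxy IP lists. Reliability is roughly proportional to how much you pay. Free IPs from providers in China are mostly unusable, and a reliable proxy there has to be paid for. Overseas providers are somewhat better, and some of their free IPs are fairly dependable.

A quick search turned up the site below. I had planned to scrape the IPs from the page by hand, but it turns out a ready-made txt file can be downloaded directly:
http://www.thebigproxylist.com/

After downloading the file, try fetching the Baidu home page through each proxy in turn.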

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li

import urllib.request

# Read the downloaded proxy list, one "host:port" entry per line
with open("c:\\temp\\thebigproxylist-17-12-20.txt", 'r') as fp:
    lines = fp.readlines()

for ip in lines:
    # Drop the trailing newline and skip blank lines
    ip = ip.strip()
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        # Route HTTP requests through this proxy
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The result is as follows:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 137.74.168.174:80
Passed
-----------------------------
Current proxy IP 103.28.161.68:8080
Passed
-----------------------------
Current proxy IP 91.151.106.127:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.136.252.7:3128
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP 47.89.22.200:80
Passed
-----------------------------
Current proxy IP 118.69.61.57:8888
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 192.241.190.167:8080
Passed
-----------------------------
Current proxy IP 185.124.112.130:80
Passed
-----------------------------
Current proxy IP 83.65.246.181:3128
Passed
-----------------------------
Current proxy IP 79.137.42.124:3128
Passed
-----------------------------
Current proxy IP 95.0.217.32:8080
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP 104.131.94.221:8080
Passed
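The run above shows that only some proxies respond, so a crawler would normally keep the working ones in a pool, pick one per request, and drop any that later fail. Here is a minimal sketch of such a pool. The `ProxyPool` class and its method names are hypothetical and do not appear in the original article.

```python
import random


class ProxyPool:
    """Hypothetical in-memory pool of "host:port" proxy strings."""

    def __init__(self, proxies=()):
        # Strip newlines and ignore blank entries; a set removes duplicates
        self._proxies = {p.strip() for p in proxies if p.strip()}

    def __len__(self):
        return len(self._proxies)

    def add(self, proxy):
        """Add a proxy that has passed a check."""
        proxy = proxy.strip()
        if proxy:
            self._proxies.add(proxy)

    def discard(self, proxy):
        """Remove a proxy that has stopped working (no error if absent)."""
        self._proxies.discard(proxy.strip())

    def get(self):
        """Return a randomly chosen proxy from the pool."""
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        return random.choice(list(self._proxies))
```

One way to use it: add each proxy that passes the check to the pool, call `get()` before every request, and call `discard()` when a request through that proxy raises an error.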

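However, the approach above only suits fairly stable IP sources. If the IPs are unstable, the downloaded text file may go stale quickly, so it is better to fetch the latest IP addresses dynamically. Many sites provide an API for real-time queries.
Using the same site as before, this time we call its API. The request has to disguise itself as a browser for the crawl to succeed.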

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li

import urllib.request

# Send a browser User-Agent so the request looks like it comes from a browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install as the global opener
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf").read().decode('utf8')
ippool = data.split('\n')

for ip in ippool:
    # Each line is comma-separated; the first field is "host:port"
    ip = ip.split(',')[0].strip()
    if not ip:
        continue
    try:
        print("Current proxy IP " + ip)
        # Route HTTP requests through this proxy
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The result is as follows:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP 213.233.57.134:80
HTTP Error 403: Forbidden
-----------------------------
Current proxy IP 144.76.81.79:3128
Passed
-----------------------------
Current proxy IP 45.55.132.29:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 180.254.133.124:8080
Passed
-----------------------------
Current proxy IP 5.196.215.231:3128
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP 177.99.175.195:53281
HTTP Error 503: Service Unavailable
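The API script takes the first comma-separated field of each response line as the proxy address. That parsing step can be pulled out into a small helper that also skips blank or malformed lines. This is only a sketch: the function name `parse_proxy_list` is hypothetical, and the extra fields in the sample text are made up for illustration, since the article does not show the raw API response.

```python
def parse_proxy_list(text):
    """Extract "host:port" entries from a line-based, comma-separated response."""
    proxies = []
    for line in text.splitlines():
        # The first comma-separated field is assumed to be "host:port"
        first = line.split(",")[0].strip()
        host, sep, port = first.rpartition(":")
        # Keep only entries that have a host and a numeric port
        if host and sep and port.isdigit():
            proxies.append(first)
    return proxies


# Hypothetical sample; the fields after the first comma are invented
sample = "213.233.57.134:80,field2,field3\n144.76.81.79:3128,field2\n\nnot-a-proxy\n"
print(parse_proxy_list(sample))
```

Using `splitlines()` rather than `split('\n')` also copes with responses that use `\r\n` line endings.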

Reading the list sequentially in a plain for loop is really too slow, so I tried switching to multiple threads, which is much faster.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li

import queue
import threading
import urllib.request

# Number of worker threads
n_thread = 10
# Queue of proxies waiting to be tested (named so it does not shadow the queue module)
proxy_queue = queue.Queue()

class ThreadClass(threading.Thread):
    def __init__(self, proxy_queue):
        super().__init__()
        # Each worker pulls its jobs from the shared queue
        self.queue = proxy_queue

    def run(self):
        while True:
            # Get the next proxy from the queue
            host = self.queue.get()
            print(self.name + ":" + host)
            try:
                proxy = urllib.request.ProxyHandler({"http": host})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                url = "http://www.baidu.com"
                # Use this thread's own opener: install_opener() is global,
                # so threads would overwrite each other's proxy
                data = opener.open(url).read().decode('utf-8', 'ignore')
                print("Passed")
                print("-----------------------------")
            except Exception as err:
                print(err)
                print("-----------------------------")

            # Signal to the queue that this job is done
            self.queue.task_done()

# Start the worker threads
for i in range(n_thread):
    t = ThreadClass(proxy_queue)
    # Daemon threads exit when the main thread finishes
    t.daemon = True
    t.start()

# Read the file line by line and queue each proxy
with open("c:\\temp\\thebigproxylist-17-12-20.txt", "r") as hostfile:
    for line in hostfile:
        host = line.strip()
        if host:
            proxy_queue.put(host)
# Wait until every queued proxy has been processed
proxy_queue.join()
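The same concurrent check can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which manages the worker threads and the job queue itself. This is an alternative sketch, not the author's method. The names `check_proxy` and `filter_proxies` and the 5-second timeout are assumptions made for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy, url="http://www.baidu.com", timeout=5):
    """Return True if `url` can be fetched through `proxy` ("host:port")."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    # A separate opener per call keeps threads from sharing proxy settings
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def filter_proxies(proxies, checker=check_proxy, max_workers=10):
    """Check proxies concurrently; return those that pass, in input order."""
    cleaned = [p.strip() for p in proxies if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the same order as the inputs
        results = list(executor.map(checker, cleaned))
    return [p for p, ok in zip(cleaned, results) if ok]
```

Passing the checker in as a parameter makes the function easy to try without network access, for example `filter_proxies(lines, checker=lambda p: True)`.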

This article is reposted from beanxyz's 51CTO blog. Original link: http://blog.51cto.com/beanxyz/2052791. To repost it, please contact the original author.
