
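#!C:\Python27\python.exe
#coding=utf8
# Python 2 script: crawls the article index at cb.drops.wiki and saves each
# article as a PDF with pdfkit (which needs the wkhtmltopdf binary installed).

import os
import socket
import sys
import urllib2
from multiprocessing import Pool

import pdfkit
from bs4 import BeautifulSoup

# Apply a 60-second default timeout to every new socket.
socket.setdefaulttimeout(60)

# Python 2 only: make UTF-8 the implicit str/unicode conversion encoding.
reload(sys)
sys.setdefaultencoding('utf-8')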


def url_open(url):
    """Fetch url and return it parsed as BeautifulSoup, or 1 on failure."""
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url=url, headers=headers)
    try:
        page = urllib2.urlopen(request, timeout=60)
        contents = page.read()
    except (urllib2.URLError, socket.timeout):
        # URLError also covers HTTPError; callers test for the sentinel 1.
        return 1
    soup = BeautifulSoup(contents.decode('utf-8', 'ignore'), "lxml")
    return soup


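def retrieve_pdf(dir, link):
    """Render the page at link to the PDF file dir, with sys.stderr redirected to errlog.txt."""
    savedStderr = sys.stderr
    with open('errlog.txt', 'w+') as file:
        sys.stderr = file
        try:
            pdfkit.from_url(link, dir)
        finally:
            # Restore stderr even when pdfkit raises.
            sys.stderr = savedStderr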


def strip_char(string):
    """Return string without the characters Windows forbids in file names."""
    char = ['*', '/', '\\', ':', '"', '?', '<', '>', '|']
    return ''.join(i for i in string if i not in char)


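def crawler(root, url, num):
    """Download every article listed on one index page as a PDF."""
    soup = url_open(url)  # fetch the index page once
    if soup == 1:
        return
    for tr in soup.find_all("tr"):
        td = tr.find_all('td')
        # Skip rows without the four expected cells, and the header row,
        # whose first cell reads u"提交时间" ("submission time").
        if len(td) < 4 or td[0].get_text() == u"提交时间":
            continue
        onclick = tr.get('onclick')
        if not onclick:
            continue
        date = td[0].get_text()
        title = td[1].get_text()
        type = td[2].get_text()
        poster = td[3].get_text()
        print date + "  " + title + "   " + type + "    " + poster
        # The article path is the quoted argument of the onclick handler,
        # minus everything up to and including its first '.'.
        link = root + '.'.join(onclick.split('\'')[1].split('.')[1:])
        print link
        print "Retrieving PDF..."
        # Drop characters that are not allowed in Windows file names.
        dir = strip_char(title + '.pdf')
        print dir
        # Each page number gets its own temp file so workers do not collide.
        temp_name = 'temp' + str(num) + '.pdf'
        try:
            retrieve_pdf(temp_name, link)
        except Exception:
            # pdfkit can raise even though a PDF was written, so the file
            # check below decides whether the download succeeded.
            pass
        if os.path.exists(temp_name):
            print "Retrieved Successfully!"
            os.rename(temp_name, dir)
        else:
            print 'Retrieve failed!'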


def single_func(num):
    root = 'http://cb.drops.wiki'
    url = "http://cb.drops.wiki/search.php?kind=drops&keywords=&page=" + str(num)
    crawler(root, url, num)


if __name__ == '__main__':
    # single_func(1) #func test
    # for page in range(1, 86):
    #     single_func(page)
    pool = Pool(processes=4)
    for i in range(1, 86):
        result = pool.apply_async(single_func, (i,))
    pool.close()
    pool.join()
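
The link-building expression inside crawler is dense, so a small standalone sketch may help. It assumes the onclick attribute looks roughly like the value below; that value is a made-up example, since the site's real markup is not shown here, and extract_link is an illustrative helper name rather than part of the script above. The sketch avoids print statements so it runs on both Python 2 and Python 3.

```python
def extract_link(root, onclick):
    """Build an absolute URL from a row's onclick handler text."""
    # Take the text between the first pair of single quotes.
    relative = onclick.split("'")[1]
    # Drop everything up to and including the first '.', which removes
    # the leading './' dot while keeping later dots such as '.html'.
    path = '.'.join(relative.split('.')[1:])
    return root + path


# Hypothetical onclick value, for illustration only.
example_onclick = "window.open('./static/drops/example-1.html')"
link = extract_link('http://cb.drops.wiki', example_onclick)
print(link)
```

With this input the relative path './static/drops/example-1.html' loses its leading dot and is appended to the root. Note that the expression relies on the path starting with './'; a path with no leading dot would lose its first dot-separated segment instead.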
