How to "Get Into" Graduate School with Python? Lesson Learned! - Alibaba Cloud Developer Community

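Basic principle: study materials published on web pages are a good fit for collection with a Python crawler.

We took a close look at the page structure of the postgraduate entrance exam "sentence of the day" index (https://bj.wendu.com/zixun/yingyu/6697.html). You can inspect it with the Chrome developer tools (F12), or use a more specialized tool such as Fiddler to capture the page requests and view the HTML DOM. The page is a table filled with more than three hundred a tags, each linking to an article that holds the actual content. We first collect these three-hundred-plus links, then visit them one by one, parse each page, extract the content, and write it to a Word document.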
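
The link-collection step described above can be tried offline before touching the real site. The sketch below is only an illustration: the sample HTML and the example.com URLs are made up, and the real index page may be structured differently. It uses the same XPath expression as the scripts in this article.

```python
from lxml import etree

# Hypothetical stand-in for the index page: a table of links plus one
# unrelated link outside the table.
SAMPLE_INDEX_HTML = """
<html><body>
<table>
  <tr>
    <td><a href="https://example.com/day-1.html">Day 1</a></td>
    <td><a href="https://example.com/day-2.html">Day 2</a></td>
  </tr>
  <tr>
    <td><a href="https://example.com/day-3.html">Day 3</a></td>
  </tr>
</table>
<a href="https://example.com/about.html">About</a>
</body></html>
"""


def collect_links(index_html):
    """Return the href of every a tag that sits inside a table row."""
    tree = etree.HTML(index_html)
    return tree.xpath("//tr//a/@href")


links = collect_links(SAMPLE_INDEX_HTML)
print("Found {} links".format(len(links)))
for link in links:
    print(link)
```

Because the expression starts from tr, links outside the table (navigation, footers) are left out, which is why the scripts can treat the result as the list of articles.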

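Three Python modules are used here:

requests handles HTTP and performs the page requests.
lxml parses the page DOM and processes the HTML through XPath queries.
python-docx handles Word styles and writes the Word document.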

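Version 1: DailyEnglishGet-v1.py

Version 1 does not use multithreading. It simply runs sequentially and writes each sentence to its own file. This is slower, so collecting the 300-plus sentences takes a while. The results are shown in Figure 1 and Figure 2:

(Figure 1)

(Figure 2)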

#!/usr/bin/env python
######################################################################################
#
# Author: zuoguocai@126.com
#
# Function: Get daily english TO DOCX file  Version 1.0
#
# Modified Time:  May 25, 2021
#
# Help:  need install  requests,lxml and python-docx
#        pip3 install requests lxml python-docx -i https://pypi.douban.com/simple
#        Test successfully in WIN10  and python3.9
#
######################################################################################
import random
import time

import requests
from docx import Document
from lxml import etree

# Suppress the warning raised for unverified HTTPS requests (verify=False)
requests.packages.urllib3.disable_warnings()

url = "https://bj.wendu.com/zixun/yingyu/6697.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
}

# Collect the links of all a tags in the table
r = requests.get(url, headers=headers, verify=False, timeout=120)
html = etree.HTML(r.text)
html_urls = html.xpath("//tr//a/@href")
num_count = len(html_urls)
print("Found {} sentences in total".format(num_count))

# Fetch the content behind each link
for i in html_urls:
    r = requests.get(i, headers=headers, verify=False, timeout=120)
    result_html = etree.HTML(r.content, parser=etree.HTMLParser(encoding='utf8'))
    html_data = result_html.xpath('//div[@class="article-body"]/p//text()')
    # Title
    head = html_data[1]
    Message = "Processing ===> " + head + "  " + i + "  please wait..."
    print(Message)
    # Sentence and question
    juzi = '\n'.join(html_data[2:4])
    # Answer options
    xuanxiang = '\n'.join(html_data[4:10])
    # Analysis
    fengxi = '\n'.join(html_data[10:-4])
    # Merge into one body of text, separated by blank lines
    content = "\n\n\n".join((head, juzi, xuanxiang, fengxi))
    file_name = 'C:\\Users\\zuoguocai\\Desktop\\pachong\\docs\\' + head + '.docx'
    # Write one Word document per sentence
    document = Document()
    document.add_paragraph(content)
    document.save(file_name)
    # Throttle the requests with a random pause of 3 to 10 seconds
    myrandom = random.randint(3, 10)
    time.sleep(myrandom)
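
The script splits each article purely by position in the list of text nodes: index 1 is the title, 2 to 3 the sentence and question, 4 to 9 the options, and everything from 10 up to the last four nodes the analysis. The sketch below isolates that slicing in a helper so it can be checked without any network access. The function name split_sections, the guard for short lists, and the sample nodes are assumptions made for illustration, not part of the original script.

```python
def split_sections(text_nodes):
    """Split the text nodes of one article page by position.

    Returns None when there are too few nodes to contain a title,
    which is what an empty or blocked response would produce.
    """
    if len(text_nodes) < 2:
        return None
    return {
        "head": text_nodes[1],
        "juzi": "\n".join(text_nodes[2:4]),
        "xuanxiang": "\n".join(text_nodes[4:10]),
        "fengxi": "\n".join(text_nodes[10:-4]),
    }


# Made-up text nodes laid out the way the slices expect them.
sample_nodes = (
    ["banner", "Sentence of the day 001", "The sentence.", "The question?"]
    + ["option {}".format(n) for n in range(1, 7)]
    + ["analysis line 1", "analysis line 2"]
    + ["trailer 1", "trailer 2", "trailer 3", "trailer 4"]
)

sections = split_sections(sample_nodes)
print(sections["head"])
print(sections["fengxi"])
print(split_sections([]))
```

Returning None for a short list lets the caller skip or retry the page instead of failing with an IndexError on html_data[1].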

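Version 2: DailyEnglishGet-v2.py

Version 2 uses multithreading and still writes each sentence to its own file. It is far more efficient, running at least ten times faster than version 1. However, because the requests are sent too frequently, some of the documents come out empty.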


#!/usr/bin/env python
######################################################################################
#
# Author: zuoguocai@126.com
#
# Function: Get daily english TO DOCX file Version 2.0
#
# Modified Time:  May 25, 2021
#
# Help:  need install  requests,lxml and python-docx
#        pip3 install requests lxml python-docx -i https://pypi.douban.com/simple
#        Test successfully in WIN10  and python3.9
#
######################################################################################
import concurrent.futures

import requests
from docx import Document
from lxml import etree

# Suppress the warning raised for unverified HTTPS requests (verify=False)
requests.packages.urllib3.disable_warnings()

url = "https://bj.wendu.com/zixun/yingyu/6697.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
}

# Collect the links of all a tags in the table
r = requests.get(url, headers=headers, verify=False, timeout=120)
html = etree.HTML(r.text)
html_urls = html.xpath("//tr//a/@href")
num_count = len(html_urls)
print("Found {} sentences in total".format(num_count))


# Fetch the content behind one link, leaving out the advertising text
def craw(url):
    r = requests.get(url, headers=headers, verify=False, timeout=120)
    result_html = etree.HTML(r.content, parser=etree.HTMLParser(encoding='utf8'))
    html_data = result_html.xpath('//div[@class="article-body"]/p//text()')
    # Title
    head = html_data[1]
    # Sentence and question
    juzi = '\n'.join(html_data[2:4])
    # Answer options
    xuanxiang = '\n'.join(html_data[4:10])
    # Analysis
    fengxi = '\n'.join(html_data[10:-4])
    # Merge into one body of text, separated by blank lines
    content = "\n\n\n".join((head, juzi, xuanxiang, fengxi))
    file_name = 'C:\\Users\\zuoguocai\\Desktop\\pachong\\docs\\' + head + '.docx'
    # Write the Word document
    document = Document()
    document.add_paragraph(content)
    document.save(file_name)
    Message = "Processed ===> " + head + "  " + url + "  done"
    return Message


# Use a thread pool to speed up the I/O. Drawback: network problems or
# rate limiting by the site can leave some files empty.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(craw, url) for url in html_urls]
    for future in futures:
        print(future.result())
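
One possible middle ground between version 1 and version 2 is to keep the thread pool but cap its size and retry a page when it comes back empty. The sketch below is a hedged illustration rather than the author's method: fetch_with_retry, crawl_all, and the simulated flaky_fetch are hypothetical names, the URLs are placeholders, and whether a small pool actually avoids the empty documents on the real site has not been verified here.

```python
import concurrent.futures
import threading
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) until it returns non-empty content or attempts run out."""
    for attempt in range(1, attempts + 1):
        content = fetch(url)
        if content:
            return content
        if attempt < attempts:
            # Back off a little longer after each failed attempt.
            time.sleep(delay * attempt)
    return None


def crawl_all(fetch, urls, max_workers=4):
    """Fetch every URL with a bounded thread pool; map each URL to its content."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_with_retry, fetch, u): u for u in urls}
        for future in concurrent.futures.as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Simulated site: the first request for each URL returns an empty body,
# as if the server had throttled it; later requests succeed.
_calls = {}
_lock = threading.Lock()


def flaky_fetch(url):
    with _lock:
        _calls[url] = _calls.get(url, 0) + 1
        first_call = _calls[url] == 1
    return "" if first_call else "body of " + url


urls = ["https://example.com/{}.html".format(n) for n in range(1, 6)]
results = crawl_all(flaky_fetch, urls)
for url in urls:
    print(url, "->", results[url])
```

Against a real site, delay would be set to a few seconds and fetch would wrap requests.get; the default of 0.0 only keeps this demonstration fast.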

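Version 3: DailyEnglishGet-v3.py

Version 3 goes back to a single thread, mainly to avoid the empty documents seen in version 2. Slower is acceptable as long as the collected content is complete. This time the sentences are not written to separate files; they all go into one document, set in Times New Roman, the font used on the exam papers. The result is shown in Figure 3: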

#!/usr/bin/env python
######################################################################################
#
# Author: zuoguocai@126.com
#
# Function: Get daily english TO DOCX file  Version 3.0
#
# Modified Time:  May 25, 2021
#
# Help:  need install  requests,lxml and python-docx
#        pip3 install requests lxml python-docx -i https://pypi.douban.com/simple
#        Test successfully in WIN10  and python3.9
#
######################################################################################
import random
import time

import requests
from docx import Document
from lxml import etree

# Suppress the warning raised for unverified HTTPS requests (verify=False)
requests.packages.urllib3.disable_warnings()

url = "https://bj.wendu.com/zixun/yingyu/6697.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
}

# Collect the links of all a tags in the table
r = requests.get(url, headers=headers, verify=False, timeout=120)
html = etree.HTML(r.text)
html_urls = html.xpath("//tr//a/@href")
num_count = len(html_urls)
print("Found {} sentences in total".format(num_count))

# Fetch the content behind each link
for i in html_urls:
    r = requests.get(i, headers=headers, verify=False, timeout=120)
    result_html = etree.HTML(r.content, parser=etree.HTMLParser(encoding='utf8'))
    html_data = result_html.xpath('//div[@class="article-body"]/p//text()')
    # Title
    head = html_data[1]
    #print(head)
    Message = "Processing ===> " + head + "  " + i + "  please wait..."
    print(Message)
    # Sentence and question
    juzi = '\n'.join(html_data[2:4])
    # Answer options
    xuanxiang = '\n'.join(html_data[4:10])
    # Analysis
    fengxi = '\n'.join(html_data[10:-4])
    # Merge into one body of text, separated by blank lines
    content = "\n\n\n".join((juzi, xuanxiang, fengxi))
    # Write all 300-plus sentences into one file, reopening it and appending each time.
    # An empty Word file must be created beforehand, with a title written and the heading style set.
    # python-docx can only use styles that are defined in the document, so the style
    # must be added to the template document first; otherwise an error is raised.
    file_name = 'C:\\Users\\zuoguocai\\Desktop\\pachong\\docs\\' + '何凯文每日一句.docx'
    document = Document(file_name)
    # Set the font to Times New Roman, the font used on the exam papers
    # (in Word: select all and set SimSun, then select all and set Times New Roman)
    document.styles['Normal'].font.name = u'Times New Roman'
    # Use Heading 1 so the entries are easy to browse in Word (View > Navigation Pane)
    # and a table of contents can be generated later (References > Table of Contents > Automatic Table).
    # Alternatively, consider the win32com.client package for working with the table of contents.
    document.add_heading(head, level=1)
    document.add_paragraph(content)
    document.add_page_break()
    document.save(file_name)
    # Throttle the requests with a random pause of 3 to 10 seconds
    myrandom = random.randint(3, 10)
    time.sleep(myrandom)

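In summary, we tried to refine the code's logic so that the collected content is more complete and the script runs more efficiently. The result is still not perfect and leaves room for further exploration, but the process of optimizing it was enjoyable in its own right. Using Python to improve quality and efficiency in future work and study lets us automate repetitive, tedious tasks and frees up more time to think about the future, which is well worth doing. A friendly reminder: crawlers are useful, but use them appropriately. This example is for non-commercial use only; please respect the original author's copyright. The source code for this example is available at https://github.com/ZuoGuocai/DailyEnglishGet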