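1. Introduction
Tech stack: python3, requests, re, BeautifulSoup
Target site: https://www.umei.net/p/gaoqing/cn/
On 2021-6-11, during some slack time at work, I added a [multithreaded version]. Happy learning!
Disclaimer: for learning purposes only. Do not use commercially!
2. Getting started
2.1 Steps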
- Fetch the html
- Clean the data (extract the image tags)
- Read the src from each image tag
- Request each src and save the image
2.2 Implementation
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.umei.net/p/gaoqing/cn/'
r = requests.get(url)
# Uncomment to dump the page to disk for inspection:
# with open('./meinv.html', 'wb+') as f:
#     f.write(r.content)

if r.status_code == 200:
    imgs = []
    soup = BeautifulSoup(r.content, 'html5lib')
    # Thumbnail <img> tags on the listing page
    img_list = soup.select('.TypeBigPics img')
    for tag in img_list:
        # Pull the src value out of the serialized tag
        res = re.search('src="(.*?)"', str(tag), re.M | re.I)
        if res:  # skip tags without a src instead of crashing
            imgs.append(res.group(1))

    for i, src in enumerate(imgs):
        ans = requests.get(src)
        if ans.status_code == 200:
            # Saved as 0.jpg, 1.jpg, ... in the current directory
            with open(str(i) + '.jpg', 'wb+') as f:
                f.write(ans.content)
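The src-extraction step can be tried offline before hitting the site. Below is a minimal sketch that uses only `re`; the HTML fragment, the `example.com` URLs, and the helper name `extract_srcs` are made up for illustration, and the real page markup may differ.

```python
import re

# Made-up fragment imitating the listing page; the real markup may differ.
SAMPLE_HTML = '''
<div class="TypeBigPics">
  <a href="/p/gaoqing/cn/1001.htm"><img SRC="http://example.com/thumbs/a.jpg" alt="a"></a>
  <a href="/p/gaoqing/cn/1002.htm"><img src="http://example.com/thumbs/b.jpg" alt="b"></a>
</div>
'''


def extract_srcs(html):
    """Return the src value of every <img> tag found in html (hypothetical helper)."""
    # [^>]*? keeps the match inside a single tag; re.I also accepts SRC= or <IMG.
    return re.findall(r'<img\b[^>]*?\bsrc="(.*?)"', html, re.I)


srcs = extract_srcs(SAMPLE_HTML)
print(srcs)
```

Running the pattern on a saved copy of the page first makes it easy to see what the regex captures without sending repeated requests.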
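2.3 Results
2.4 Analysis of the results
- The images download successfully, but they are not sharp enough, so there is room for improvement
2.5 Optimization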
- Optimization analysis
Inspecting the page shows that following the a tag that wraps each image leads to the full-size image.
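Following the a tag boils down to turning each href into a full detail-page URL. Here is a minimal offline sketch of that step, assuming the hrefs may be either root-relative or bare file names; the sample fragment, the `.htm` names, and the helper `detail_urls` are hypothetical.

```python
import re
from urllib.parse import urljoin

LIST_URL = 'https://www.umei.net/p/gaoqing/cn/'

# Made-up fragment; the href values are assumptions about the page.
SAMPLE_HTML = '''
<div class="TypeBigPics"><a href="/p/gaoqing/cn/1001.htm"><img src="t1.jpg"></a></div>
<div class="TypeBigPics"><a href="1002.htm"><img src="t2.jpg"></a></div>
'''


def detail_urls(html, base_url):
    """Collect every <a href> in html and resolve it against base_url."""
    hrefs = re.findall(r'<a\b[^>]*?\bhref="(.*?)"', html, re.I)
    # urljoin handles root-relative, relative, and absolute hrefs alike.
    return [urljoin(base_url, h) for h in hrefs]


urls = detail_urls(SAMPLE_HTML, LIST_URL)
for u in urls:
    print(u)
```

Using `urljoin` instead of plain string concatenation means the result stays valid whichever form of href the page happens to use.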
2.6 Optimized code
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.umei.net/p/gaoqing/cn/'
r = requests.get(url)
# Uncomment to dump the page to disk for inspection:
# with open('./meinv.html', 'wb+') as f:
#     f.write(r.content)

if r.status_code == 200:
    imgs = []
    soup = BeautifulSoup(r.content, 'html5lib')
    # Get the a tag that wraps each img
    aList = soup.select('.TypeBigPics')
    for item in aList:
        # Capture whatever follows /cn/ in the href (the detail page name)
        obj = re.search(r'.*?/cn/(.*?)".*', str(item), re.M | re.I)
        if obj:
            imgs.append(obj.group(1))

    ans_imgs = []
    for k in imgs:
        # Open the detail page behind each link
        ans = requests.get(url + k)
        if ans.status_code == 200:
            soup1 = BeautifulSoup(ans.content, 'html5lib')
            imgBody = soup1.select('.ImageBody img')
            # Get the src of the full-size image
            obj = re.search(r'.*?src="(.*?)"', str(imgBody), re.M | re.I)
            if obj:  # skip pages where no image was found
                ans_imgs.append(obj.group(1))

    # Save the full-size images
    for i, k in enumerate(ans_imgs):
        b = requests.get(k)
        if b.status_code == 200:
            with open('./' + str(i) + '.jpg', 'wb+') as f:
                f.write(b.content)
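The final save loop downloads one image at a time, which makes it the natural place for the multithreaded variant mentioned in the introduction. That version is not reproduced here; what follows is only a rough sketch of the idea using `concurrent.futures`. The names `save_all`, `fetch`, and `fake_fetch` are hypothetical, and the demo uses a stand-in fetcher so it runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_all(urls, fetch, out_dir, workers=4):
    """Fetch every URL in a thread pool and write <index>.jpg; return the saved paths."""
    def job(item):
        index, url = item
        data = fetch(url)  # hypothetical callable: returns bytes, or None on failure
        if data is None:
            return None
        path = os.path.join(out_dir, f'{index}.jpg')
        with open(path, 'wb') as f:
            f.write(data)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input
        results = list(pool.map(job, enumerate(urls)))
    return [p for p in results if p is not None]


# Offline demo: a stand-in fetcher instead of real HTTP requests.
def fake_fetch(url):
    return None if url.endswith('missing.jpg') else url.encode()


demo_dir = tempfile.mkdtemp()
saved = save_all(
    ['http://example.com/a.jpg', 'http://example.com/missing.jpg', 'http://example.com/c.jpg'],
    fake_fetch,
    demo_dir,
)
print([os.path.basename(p) for p in saved])
```

In real use, `fetch` could wrap `requests.get` and return `response.content` only when `status_code == 200`. Keeping `workers` small is kinder to the target site.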
2.7 Results