3. Extracting a tag's text value (e.g. the text of span and a tags)
Here I work with the text of a tags. Since we are not using a parsing framework, doing this purely with regular expressions is a little more cumbersome, but the approach itself is simple: after grabbing the full content of each a tag, stripping the tag markup on the left and right leaves just the inner text. If you need span tags instead, simply replace a with span in the patterns.
Method 1:
```python
import re
import requests

'''
Fetch a page, grab everything under a given class element as a string,
then match the text content of the hyperlinks in that string.
'''
url = "https://book.zongheng.com/showchapter/1243826.html"
context = requests.get(url).content.decode("utf-8")
result1 = re.findall(r"<ul class=\"chapter-list clearfix\">.*?</ul>", context, re.S)
a_href = re.findall(r'<a.*?>.*?</a>', result1[0])
data_list = []
left_text = ">"
right_text = "</a>"
for item in a_href:
    result = re.findall(r">.*?</a>", item)
    for item1 in result:
        # Strip the leading ">" and the trailing "</a>" to keep only the text.
        item1 = item1.replace(left_text, "", 1).replace(right_text, "", 1)
        data_list.append(item1)
for item in data_list:
    print(item)
print("Scraped successfully:", len(data_list))
```
From the output you can see that all 138 entries were extracted.
Method 2:
Use a capture group `(...)` to extract the needed content directly.
```python
import re
import requests

'''
Fetch a page, grab everything under a given class element as a string,
then match the text of the hyperlinks with a capture group.
'''
url = "https://book.zongheng.com/showchapter/1243826.html"
context = requests.get(url).content.decode("utf-8")
result1 = re.findall(r"<ul class=\"chapter-list clearfix\">.*?</ul>", context, re.S)
a_href = re.findall(r'<a.*?>.*?</a>', result1[0])
data_list = []
for item in a_href:
    # With a capture group, findall returns only the text between the tags.
    result = re.findall(r">(.*?)</a>", item)
    data_list.append(result[0])
for item in data_list:
    print(item)
print("Scraped successfully:", len(data_list))
```
This method yields the needed content directly:
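The difference between the two methods comes down to how `re.findall` treats capture groups: with no group it returns the whole match, tags included, while with one group it returns only that group's content. A minimal standalone sketch (the sample HTML string is made up for illustration):

```python
import re

# Hypothetical sample tag, standing in for one entry of the chapter list.
html = '<a href="/chapter/1">第一章 起点</a>'

# Without a capture group, findall returns the entire match,
# so the tag markers still have to be stripped manually.
whole = re.findall(r">.*?</a>", html)
print(whole)   # ['>第一章 起点</a>']

# With a capture group, findall returns only the group's content,
# so no post-processing is needed.
inner = re.findall(r">(.*?)</a>", html)
print(inner)   # ['第一章 起点']
```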
4. Data in key:value format
The figure below shows data in dictionary format, {"adv_type":"bookDirectory00","adv_res":"zongheng","pos":""}. To get the value of "adv_type", we need a different kind of pattern:
```python
import re
import requests

'''
Extract key:value data from the page source.
'''
url = "https://book.zongheng.com/showchapter/1243826.html"
context = requests.get(url).content.decode("utf-8")
# Capture whatever sits between the quotes after "adv_type":
result1 = re.findall('"adv_type":"(.*?)"', context)
for item in result1:
    print(item)
```
Sample result:
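When the embedded data is well-formed JSON, a more robust variant is to grab the whole `{...}` object with one coarse pattern and let the standard json module do the field parsing, instead of writing one regex per key. A sketch using an inline sample string (assumed here) rather than the live page:

```python
import json
import re

# Hypothetical fragment, shaped like the attribute seen in the page source.
context = 'data-log=\'{"adv_type":"bookDirectory00","adv_res":"zongheng","pos":""}\''

# Grab the whole {...} object first, then let json.loads handle
# quoting and escaping instead of per-field regexes.
raw = re.findall(r"\{.*?\}", context)[0]
info = json.loads(raw)
print(info["adv_type"])   # bookDirectory00
print(info["adv_res"])    # zongheng
```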
5. Matching URLs
1. Matching short links
r'(https?://[^\s]*/)'
```python
import re
import requests

'''
Extract the short link from a string, resolve it, and fetch the video title.
'''
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url_info = """6.69 Cuf:/ 售价888一碗的面!自己在家做需要花多少钱?# 百蟹面 # 六虾面 https://v.douyin.com/rhkGMfh/ 复制此链接,打开Dou音搜索,直接观看视频!"""
short_url = re.findall(r'(https?://[^\s]*/)', url_info)[0]
print(short_url)
# requests follows the redirect; .url holds the final resolved address.
url = requests.get(url=short_url, headers=headers).url
item_id = url.split('/')[4][0:19]
url = "https://www.iesdouyin.com/web/api/v2/aweme/iteminfo/?item_ids={0}".format(item_id)
html = requests.get(url, headers=headers)
title = html.json()['item_list'][0]['desc']  # get the title
print(title)
```
Result:
2. Matching long links
r"""(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"""
```python
import re
import requests

'''
Fetch a page and extract all the long links in it.
'''
url = "https://book.zongheng.com/showchapter/1243826.html"
context = requests.get(url).content.decode("utf-8")
result1 = re.findall(r"<ul class=\"chapter-list clearfix\">.*?</ul>", context, re.S)
# result1[0] contains multiple long links
rule = r"""(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"""
result2 = re.findall(rule, result1[0], re.S)
data_list = []
for item in result2:
    # findall returns one tuple per match (one element per capture group),
    # so the scheme and the path are re-joined with the site host.
    new_url = "{0}://book.zongheng.com{1}".format(item[0], item[2])
    data_list.append(new_url)
for item in data_list:
    print(item)
print("Links found:", len(data_list))
```
The matched pieces need to be joined back together:
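Note that when a pattern contains several capture groups, `re.findall` returns a tuple per match, which is why the loop above indexes `item[0]` and `item[2]`. If you only need the full URL as one string, the groups can be made non-capturing with `(?:...)`. A small sketch with made-up sample text:

```python
import re

# Hypothetical sample text containing two URLs.
text = 'see https://book.zongheng.com/chapter/1243826.html and ftp://files.example.com/a_b'

# Capturing groups: findall yields one tuple per match.
with_groups = re.findall(r"(http|ftp|https)://([\w\-]+(?:\.[\w\-]+)+)(/[\w\-./]*)?", text)
print(with_groups)

# Non-capturing groups (?:...): findall yields the whole match as a string.
no_groups = re.findall(r"(?:http|ftp|https)://[\w\-]+(?:\.[\w\-]+)+(?:/[\w\-./]*)?", text)
print(no_groups)   # ['https://book.zongheng.com/chapter/1243826.html', 'ftp://files.example.com/a_b']
```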
3. URL lists for various sites
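As a quick sanity check, a general URL pattern can be tested against several common URL shapes (the sample URLs below are made up for illustration):

```python
import re

# Hypothetical sample URLs covering a few common shapes.
samples = [
    "https://book.zongheng.com/showchapter/1243826.html",
    "http://example.com/path?query=1&page=2",
    "ftp://files.example.com/pub/readme.txt",
    "not a url at all",
]

# A simplified variant of the long-link rule above.
rule = r"(?:http|ftp|https)://[\w\-]+(?:\.[\w\-]+)+[\w\-.,@?^=%&:/~+#]*"
for s in samples:
    # fullmatch requires the entire string to be a URL.
    print(s, "->", bool(re.fullmatch(rule, s)))
```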
4. Extracting Chinese text from a page
r"[\u4E00-\u9FA5]"
```python
import re
import requests

'''
Fetch a page and extract every Chinese character in it.
'''
url = "https://book.zongheng.com/showchapter/1243826.html"
context = requests.get(url).content.decode("utf-8")
rule = r"[\u4E00-\u9FA5]"
result2 = re.findall(rule, context, re.S)
count = 0
for item in result2:
    # Start a new line every 10 characters for readability.
    if count % 10 == 0:
        print()
    print(item, end="")
    count += 1
print("\nExtracted", len(result2), "Chinese characters in total")
```
Result:
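The character class above matches one Chinese character at a time; adding a `+` quantifier returns whole runs of Chinese text instead, which is often more readable. A small sketch with a made-up sample string:

```python
import re

# Hypothetical mixed-language sample text.
text = "第一章 起点 Chapter 1 <a>正文内容</a> 2024"

# One character per match:
chars = re.findall(r"[\u4E00-\u9FA5]", text)
print("".join(chars))   # 第一章起点正文内容

# Whole runs per match:
runs = re.findall(r"[\u4E00-\u9FA5]+", text)
print(runs)             # ['第一章', '起点', '正文内容']
```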
Summary:
That's all for now; I'll add more here as new cases come up.