字体反爬案例分析与爬取实战
该案例将真实的数据隐藏到字体文件里,即使获取了页面源代码,也没法直接提取数据的真实值。
案例介绍
案例网站https://antispider4.scrape.center/,爬取电影标题、类别、评分等,代码实现如下:
from selenium import webdriver
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.service import Service
options = webdriver.ChromeOptions()
services = Service('../Selenium/chromedriver')
browser = webdriver.Chrome(service=services, options=options)
browser.get('<https://antispider4.scrape.center/>')
WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
name = item('.name').text()
categories = [o.text() for o in item('.categories button').items()]
score = item('.score').text()
print(f'name:{name} categories: {categories} score: {score}')
browser.close()
先用Selenium打开案例网站,等待所有电影加载出来,然后获取页面源代码,并通过pyquery提取和解析每一个电影的信息,得到名称、类别和评分,之后输出,运行结果如下:
name:霸王别姬 - Farewell My Concubine categories: ['剧情', '爱情'] score:
name:这个杀手不太冷 - Léon categories: ['剧情', '动作', '犯罪'] score:
name:肖申克的救赎 - The Shawshank Redemption categories: ['剧情', '犯罪'] score:
name:泰坦尼克号 - Titanic categories: ['剧情', '爱情', '灾难'] score:
name:罗马假日 - Roman Holiday categories: ['剧情', '喜剧', '爱情'] score:
name:唐伯虎点秋香 - Flirting Scholar categories: ['喜剧', '爱情', '古装'] score:
name:乱世佳人 - Gone with the Wind categories: ['剧情', '爱情', '历史', '战争'] score:
name:喜剧之王 - The King of Comedy categories: ['剧情', '喜剧', '爱情'] score:
name:楚门的世界 - The Truman Show categories: ['剧情', '科幻'] score:
name:狮子王 - The Lion King categories: ['动画', '歌舞', '冒险'] score:
很奇怪,结果中的score字段不包含任何信息,怎么回事?观察分析,对应的源代码并不包含数字信息,如图所示:
span节点就是什么信息都没有,提取不出来自然也不足为奇了,那页面上的评分结果是怎么显示出来的呢?
其实也是CSS的原因。
案例分析
观察源码,各个span节点的不同之处在于内部i节点的class取值不太一样。可以看到一共有3个span节点,对应的class取值分别为icon-789、icon-981、icon-504,这和评分9.5有啥关系呢?
发现i节点内部有一个::before字段,在CSS中,该字段用于创建一个伪节点,即这个节点和i节点或者span节点不一样。::before可以往特定的节点中插入内容,同时在CSS中使用content字段定义这个内容。第一个i节点看到了9,另外两个i节点,看到.和5,组合起来为9.5。
实战
那class取值和content字段值的映射关系是怎么定义的?可以在浏览器中追踪CSS源代码,代码如下图所示:
进入文件后,可以看到这个CSS源代码都在一行放着,点击“{}”按钮格式化代码,如图所示:
可以从中找出如下内容:
.icon-437:before {
content: "3"
}
.icon-378:before {
content: "4"
}
.icon-504:before {
content: "5"
}
.icon-203:before {
content: "6"
}
.icon-102:before {
content: "7"
原来class对应的值就是一个个评分结果。这样我们只要解析对应的结果再做转换即可。这里需要读取CSS文件并提取映射关系,这个CSS文件是https://antispider4.scrape.center/css/app.654ba59e.css,其部分内容如图所示:
我们可以试着用requests库读取结果,并通过正则表达式将映射关系提取出来,代码如下:
import re
import requests
url = '<https://antispider4.scrape.center/css/app.654ba59e.css>'
response = requests.get(url)
pattern = re.compile('.icon-(.*?):before\\{content:"(.*?)"\\}')
results = re.findall(pattern, response.text)
icon_map = {
item[0]: item[1] for item in results}
这里首先使用requests库提取了CSS文件的内容,然后使用正则表达式进行文本匹配,表达式写作.icon-(.?):before{content:”(.?)”},这个表达式并没有考虑空格,因为CSS源代码本身就是一行放着而去去除了所有空格。
例如,对于如下CSS样式:
.icon-789:before{
content:"9"}
就会提取得到两个group,第一个是789,第二个是9。
这里使用re里的findall方法进行内容匹配,得到结果如下:
[....., ('asterisk', '*'), ('plus', '+'), ('comma', ','), ('hyphen', '-'), ('981', '.'), ('slash', '/'), ('272', '0'), ('643', '1'), ('180', '2'), ('437', '3'), ('378', '4'), ('504', '5'), ('203', '6'), ('102', '7'), ('281', '8'), ('789', '9'), ('colon', ':'), ('semicolon', ';'), ('less', '<'), ('equal', '='), ('greater', '>'), ('question', '?'), ('at', '@'), ('A', 'A'), ('B', 'B'), ('C', 'C'), ('D', 'D'), ('E', 'E'), ('F', 'F'), ('G', 'G'), ('H', 'H'), ('I', 'I'), ('J', 'J'), ('K', 'K'), ('L', 'L'), ('M', 'M'), ('N', 'N'), ('O', 'O'), ('P', 'P'), ('Q', 'Q'), ('R', 'R'), ('S', 'S'), ('T', 'T'), ('U', 'U'), ('V', 'V'), ('W', 'W'), ('X', 'X'), ('Y', 'Y'), ('Z', 'Z'), ('bracketleft', '['), ('backslash', '\\\\\\\\'), ('bracketright', ']'), ('asciicircum', '^'), ('underscore', '_'), ('grave', '`'), ('a', 'a'), ('b', 'b'), ('c', 'c'), ('d', 'd'), ('e', 'e'), ('f', 'f'), ('g', 'g'), ('h', 'h'), ('i', 'i'), ('j', 'j'), ('k', 'k'), ('l', 'l'), ('m', 'm'), ('n', 'n'), ('o', 'o'), ('p', 'p'), ('q', 'q'), ('r', 'r'), ('s', 's'), ('t', 't'), ('u', 'u'), ('v', 'v'), ('w', 'w'), ('x', 'x'), ('y', 'y'), ('z', 'z'), ('braceleft', '{'), ('bar', '|'), ('braceright', '}'), ('asciitilde', '~'), ('Adieresis', '\\\\80,\\\\C4'), ('Aring', '\\\\81,\\\\C5'), ('Ccedilla', '\\\\82,\\\\C7'), ('Eacute', '\\\\83,\\\\C9'), ('Ntilde', '\\\\84,\\\\D1'), ('Odieresis', '\\\\85,\\\\D6'), ('Udieresis', '\\\\86,\\\\DC'), ('aacute', '\\\\87,\\\\E1'), ('agrave', '\\\\88,\\\\E0'), ('acircumflex', '\\\\89,\\\\E2'), ('adieresis', '\\\\8A,\\\\E4'), ('atilde', '\\\\8B,\\\\E3'), ('aring', '\\\\8C,\\\\E5'), ('ccedilla', '\\\\8D,\\\\E7'), ('eacute', '\\\\8E,\\\\E9'), ('egrave', '\\\\8F,\\\\E8'), ('ecircumflex', '\\\\90,\\\\EA'), ('edieresis', '\\\\91,\\\\EB'), ('iacute', '\\\\92,\\\\ED'), ('igrave', '\\\\93,\\\\EC'), ('icircumflex', '\\\\94,\\\\EE'), ('idieresis', '\\\\95,\\\\EF'), ('ntilde', '\\\\96,\\\\F1'), ('oacute', '\\\\97,\\\\F3'), ('ograve', '\\\\98,\\\\F2'), ('ocircumflex', '\\\\99,\\\\F4'), ('odieresis', '\\\\9A,\\\\F6'), ('otilde', '\\\\9B,\\\\F5'), ('uacute', '\\\\9C,\\\\FA'), ('ugrave', '\\\\9D,\\\\F9'), ('ucircumflex', '\\\\9E,\\\\FB'), ('udieresis', '\\\\9F,\\\\FC'), ('dagger', '\\\\2020'), ('degree', '\\\\B0'), ('cent', '\\\\A2'), ('sterling', '\\\\A3'), ('section', '\\\\A7'), ('bullet', '\\\\2022'), ('paragraph', '\\\\B6'), ('germandbls', '\\\\DF'), ('registered', '\\\\AE'), ('copyright', '\\\\A9'), ('trademark', '\\\\2122'), ('acute', '\\\\B4'), ('dieresis', '\\\\A8'), ('notequal', '\\\\AD,\\\\2260'), ('AE', '\\\\C6'), ('Oslash', '\\\\D8'), ('infinity', '\\\\221E'), ('plusminus', '\\\\B1'), ('lessequal', '\\\\2264'), ('greaterequal', '\\\\2265'), ('yen', '\\\\A5'), ('mu', '\\\\B5'), ('partialdiff', '\\\\2202'), ('summation', '\\\\2211'), ('product', '\\\\220F'), ('pi', '\\\\3C0'), ('integral', '\\\\222B'), ('ordfeminine', '\\\\AA'), ('ordmasculine', '\\\\BA'), ('Omega', '\\\\2126'), ('ae', '\\\\E6'), ('oslash', '\\\\F8'), ('questiondown', '\\\\BF'), ('exclamdown', '\\\\A1'), ('logicalnot', '\\\\AC'), ('radical', '\\\\221A'), ('florin', '\\\\192'), ('approxequal', '\\\\2248'), ('Delta', '\\\\2206'), ('guillemotleft', '\\\\AB'), ('guillemotright', '\\\\BB'), ('ellipsis', '\\\\2026'), .....]
这个结果由很多二元组组成的列表。我们遍历这个列表,将其赋值成字典即可,最后icon_map就变成了如下这样:
{
...
'at':'@',
'A':'A',
'B':'B',
....
'789':'9',
....
'bar':'|',
...
)
例如使用789索引,得到结果就是9。
运行测试一下:
print(icon_map['789'])
print(icon_map['437'])
运行结果:
9
3
和源代码保持一致。
所以,只需要修改一下提取逻辑即可,实现代码如下:
import re
import requests
from selenium.webdriver.support.wait import WebDriverWait
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
url = '<https://antispider4.scrape.center/css/app.654ba59e.css>'
response = requests.get(url)
pattern = re.compile('.icon-(.*?):before\\{content:"(.*?)"\\}')
results = re.findall(pattern, response.text)
icon_map = {
item[0]: item[1] for item in results}
def parse_score(item):
elements = item('.icon')
icon_values = []
for element in elements.items():
class_name = (element.attr('class'))
icon_key = re.search('icon-(\\d+)', class_name).group(1)
icon_value = icon_map.get(icon_key)
icon_values.append(icon_value)
return ''.join(icon_values)
options = webdriver.ChromeOptions()
services = Service('../Selenium/chromedriver')
browser = webdriver.Chrome(service=services, options=options)
browser.get('<https://antispider4.scrape.center/>')
WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
name = item('.name').text()
categories = [o.text() for o in item('.categories button').items()]
score = parse_score(item)
print(f'name:{name} categories: {categories} score:{score}')
browser.close()
运行结果如下:
name:霸王别姬 - Farewell My Concubine categories: ['剧情', '爱情'] score:9.5
name:这个杀手不太冷 - Léon categories: ['剧情', '动作', '犯罪'] score:9.5
name:肖申克的救赎 - The Shawshank Redemption categories: ['剧情', '犯罪'] score:9.5
name:泰坦尼克号 - Titanic categories: ['剧情', '爱情', '灾难'] score:9.5
name:罗马假日 - Roman Holiday categories: ['剧情', '喜剧', '爱情'] score:9.5
name:唐伯虎点秋香 - Flirting Scholar categories: ['喜剧', '爱情', '古装'] score:9.5
name:乱世佳人 - Gone with the Wind categories: ['剧情', '爱情', '历史', '战争'] score:9.5
name:喜剧之王 - The King of Comedy categories: ['剧情', '喜剧', '爱情'] score:9.5
name:楚门的世界 - The Truman Show categories: ['剧情', '科幻'] score:9.0
name:狮子王 - The Lion King categories: ['动画', '歌舞', '冒险'] score:9.0