使用Python爬取华为市场APP应用进行分析

简介: 这个网站也是作者最近接触到的一个APP应用市场类网站。讲实话,还是蛮适合新手朋友去动手学习的。毕竟爬虫领域要想进步,还是需要多实战、多分析!该网站中的一些小细节也是能够锻炼分析能力的,也有反爬虫处理。甚至是下载APP的话在Web端是无法拿到APK下载的直链,需要去APP端接口数据获取

这个网站也是作者最近接触到的一个APP应用市场类网站。讲实话,还是蛮适合新手朋友去动手学习的。毕竟爬虫领域要想进步,还是需要多实战、多分析!该网站中的一些小细节也是能够锻炼分析能力的,也有反爬虫处理。甚至是下载APP的话在Web端是无法拿到APK下载的直链,需要去APP端接口数据获取

接口分析

需要抓取的内容为整个游戏板块(当然可以是所有板块甚至是关键词去搜素命中)。游戏板块包含了所有分类与子分类下APP信息,如下所示:

1717828849485.jpg

首先我们打开控制台发个包先,监测一下请求内容,如下所示:

1717828880712.jpg

这里可以直接把请求CURL出来,转换成Python代码,如下所示:

import requests

headers = {
    "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://appgallery.huawei.com/",
    "Interface-Code": "6bc2bec970e616747dc0a99b57aa6730_1716973000358",
    "sec-ch-ua-mobile": "?0",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "sec-ch-ua-platform": "\"macOS\""
}
url = "https://web-drcn.hispace.dbankcloud.com/edge/uowap/index"
params = {
    "method": "internal.getTabDetail",
    "serviceType": "20",
    "reqPageNum": "1",
    "uri": "b2b4752f0a524fe5ad900870f88c11ed",
    "maxResults": "25",
    "zone": "",
    "locale": "en"
}
response = requests.get(url, headers=headers, params=params)

print(response.json())

这里直接请求,会失败!因为其中有一个细节的反爬虫检测,代码运行会提示接口验证失败,如下所示:

{'rtnCode': 1002, 'rtnDesc': 'InterfaceCode Verification failed.'}

这个是什么原因?根据返回提示中接口检验的问题,回看请求头内携带有Interface-Code参数,大概率是这个参数的问题,重放失败!这个参数是动态的,不过好在并非算法加密生成,


Interface-Code参数其实相当于一个动态注册的令牌,在我们请求接口的时候。需要去固定的接口请求并拿到值,然后携带这个参数的值进行后续的任何请求。


注意!这个动态的值也是有时效性的,大概在一分钟时间不等~

1717828960032.jpg

如上图,动态获取Interface-Code参数的接口,可每次在提交页面请求之前,从这个接口拿动态值携带!

爬虫开发

这里我们经过分析可以找到属于游戏板块全量分类的一个唯一ID,拿它可以获取到所有的子分类ID,如下所示:

1717829079480.jpg

完整链接地址如下:

https://web-drcn.hispace.dbankcloud.com/edge/uowap/index?method=internal.getTabDetail&serviceType=20&reqPageNum=1&uri=33ef450cbac34770a477cfa78db4cf8c&maxResults=25&zone=&locale=en

这个接口请求之后,去解析拿到tabInfo下面的tabId字段即可!这是子分类板块的ID,用来后续请求细分栏目下的所有游戏类APP所需要的字段,同样放到上面链接替换uri,如下所示:

1717829169496.jpg

接下来就是详情页数据抓取,找到详情页面的接口,如下所示:

1717829204009.jpg

这里我们构造出对详情接口请求的具体实现功能,代码如下:

# appDetail是从子分类接口过来的结构化数据
appid = appDetail.get('appid', '')
if appid:
    detail_link = f"https://appgallery.huawei.com/#/app/{appid}"
    encoded_appid = quote(f"app|{appid}")
    encoded_detail_link = quote(quote(detail_link))
    detail_json_link = (
        'https://web-drcn.hispace.dbankcloud.cn/uowap/index?'
        'method=internal.getTabDetail&serviceType=20&reqPageNum=1&maxResults=25&'
        f'uri={encoded_appid}&shareTo=&currentUrl={encoded_detail_link}&accessId=&'
        f'appid={appid}&zone=&locale=zh'
    )
    response = requests.get(detail_json_link, headers=headers)

对详情页面接口构造完请求后同样将会得到JSON数据,需要对结构化数据进行解析,实现代码如下:

import json
import dateparser

item = {}
app_keys = [
    "icoUintrori", "intro", "name", "sizeDesc", "versionName", "downurl", 
    "package", "commentCount", "appid", "md5", "kindName", "images", 
    "releaseDate", "developer", "appIntro", "authority", "portalUrl"
]
appdata = {}
response_data = json.loads(response)
layoutDatas = response_data.get('layoutData', [])

for layoutData in layoutDatas:
    # 详情里面数据分支结构不一样,在3、6中
    if layoutData.get('dataList-type') in (3, 6):
        appinfo = layoutData.get('dataList', [])[0]
        appdata.update({key: value for key, value in appinfo.items() if key in app_keys and key not in appdata})

item['apk_name'] = appdata.get('name')
item['developer'] = appdata.get("developer")
item["apk_size"] = appdata.get("sizeDesc")
item['apk_version_name'] = appdata.get("versionName")
pub_time = appdata.get("releaseDate")
item['apk_update_time'] = dateparser.parse(pub_time).strftime('%Y-%m-%d %H:%M:%S') if pub_time else None
item['apk_screenshots'] = appdata.get("images")
item['apk_introduction'] = appdata.get("appIntro")
item['apk_download_times'] = process_downloads_times(appdata.get('intro', ''))

如上,拿到完整的APP信息数据(包括名称、版本、发布时间、大小、下载量、开发者...)其中process_downloads_times方法是一个自定义方法,对下载量数据进行清洗,页面下载量如下:

1717829278747.jpg

所以需要一个对该下载量数据清洗的方法,实现代码如下所示:

import re

def process_downloads_times(data):
    if not data:
        return None
    
    try:
        return int(data)
    except (ValueError, TypeError):
        pass

    data = re.sub(r"[\+\+,|次|,|<|>]", "", data).strip()
    pa = re.compile(r"[\d+\.]+")
    units = {
        "百亿": 10000000000,
        "千万": 10000000,
        "百万": 1000000,
        "十万": 100000,
        "亿": 100000000,
        "十": 10,
        "百": 100,
        "千": 1000,
        "万": 10000,
    }

    for unit, number in units.items():
        if unit in data:
            num = "".join(pa.findall(data))
            try:
                return str(int(float(num) * number))
            except ValueError:
                return data

    return data

下载链接获取

接下来,最重要的就是对APK包的下载了,在Web端可以看到有一个getAppDownloadUrl的接口,但是它并不是APK包的下载链接,它是华为应用商店的APP下载链接,所以注意不要被误导了,如下所示:

1717829336243.jpg

要获取APK包的下载链接,需要从APP端入手,Web网站内不再提供APK的下载直链!以前是可以的,早期的下载链接直接拼appid即可,早期链接如下所示:

ttps://appgallery.cloud.huawei.com/appdl/C100210735

这里需要对手机端的华为应用商店APP进行抓包分析,拿到下载链接接口!最终APK在移动端的下载接口如下所示:

https://store-drcn.hispace.dbankcloud.com/hwmarket/api/clientApi

这里下载接口的请求参数巨长,好在没有加密的参数字段,如下所示:

data = 'apsid=1705374325915&brand=google&channelId=background&clientPackage=com.huawei.appmarket&cno=4010001&code=0200&contentPkg=&dataFilterSwitch=&deviceId=a7a6dcc273d12e3f63865060533dec04f24e5042e48a621c1b5e601953e1bb69&deviceIdRealType=4&deviceIdType=9&deviceSpecParams=%7B%22abis%22%3A%22arm64-v8a%2Carmeabi-v7a%2Carmeabi%22%2C%22deviceFeatures%22%3A%22U%2Ccom.verizon.hardware.telephony.lte%2Ccom.verizon.hardware.telephony.ehrpd%2CP%2CB%2C0c%2Ce%2C0J%2Cp%2Ca%2Cb%2C04%2Cm%2Candroid.hardware.wifi.rtt%2Ccom.google.android.feature.PIXEL_2017_EXPERIENCE%2C08%2C03%2CC%2CS%2C0G%2Cq%2Ccom.google.android.feature.PIXEL_2018_EXPERIENCE%2CL%2C2%2C6%2Ccom.google.android.feature.GOOGLE_BUILD%2CY%2C0M%2Candroid.hardware.vr.high_performance%2Cf%2C1%2C07%2C8%2C9%2Candroid.hardware.sensor.hifi_sensors%2Candroid.hardware.strongbox_keystore%2CO%2CH%2Ccom.google.android.feature.TURBO_PRELOAD%2Candroid.hardware.vr.headtracking%2CW%2Cx%2CG%2Co%2C06%2C0N%2Ccom.google.android.feature.PIXEL_EXPERIENCE%2C3%2CR%2Cd%2CQ%2Cn%2Candroid.hardware.telephony.carrierlock%2Cy%2CT%2Ci%2Cr%2Cu%2Ccom.google.android.feature.WELLBEING%2Cl%2C4%2C0Q%2CN%2Candroid.software.device_id_attestation%2CM%2C01%2C09%2CV%2C7%2C5%2C0H%2Cg%2Cs%2Cc%2CF%2Ct%2C0L%2C0W%2Ccom.google.hardware.camera.easel_2018%2Ccom.google.android.apps.dialer.SUPPORTED%2C0X%2Ck%2C00%2Ccom.google.android.feature.GOOGLE_EXPERIENCE%2Ccom.google.android.feature.EXCHANGE_6_2%2Ccom.google.android.apps.photos.PIXEL_2018_PRELOAD%2Candroid.hardware.sensor.assist%2Ccom.google.android.feature.DREAMLINER%2Candroid.hardware.audio.pro%2CK%2CE%2C02%2CI%2CJ%2Cj%2CD%2Ch%2Candroid.hardware.wifi.aware%2CX%2Cv%22%2C%22dpi%22%3A560%2C%22openglExts%22%3A%221%2C0y%2C1d%2C1e%2C1f%2C0c%2CM%2CO%2C0q%2C0r%2C0s%2CB%2CA%2C5%2C4%2C0p%2CD%2CE%2C0m%2C0n%2C7%2C0i%2C0j%2C0k%2C8%2C0t%2C0w%2C1g%2C1o%2C1m%2C1n%2CH%2C0u%2C1i%2C1l%22%2C%22preferLan%22%3A%22zh%22%2C%22usesLibrary%22%3A%225%2C6%2Ccom.vzw.apnlib%2C09%2C0D%2Ccom.google.android.camera.experimental2018%2CA%2C0E%2C0L%2Ccom.google.android.poweranomalydatamodeminterface%2C9%2C2%2Cb%2C0C%2Ccom.android.ims.rcsmanager%2CE%2C1%2Ccom.verizon.embms%2Ccom.qualcomm.qti.uim.uimservicelibrary%2Ccom.google.android.lowpowermonitordevicefactory%2Ccom.google.android.lowpowermonitordeviceinterface%2Ccom.google.android.poweranomalydatafactory%2C0H%2C08%2C7%2Ccom.google.android.dialer.support%2Ccom.google.vr.platform%2CF%2Ccom.verizon.provider%2Cd%2C0A%2Ccom.google.android.hardwareinfo%2CB%2C0K%2CC%2C07%22%7D&fid=0&globalTrace=null&gradeLevel=0&gradeType=&hardwareType=0&isSupportPage=1&manufacturer=Google&maxResults=25&method=client.getTabDetail&net=1&outside=0&recommendSwitch=1&reqPageNum=1&roamingTime=0&runMode=2&serviceType=0&shellApkVer=0&sid=1705386105840&sign=h9001090fp00010920000000000001000a0000000600100000011190000010000040240b0100001001000%40000F6C248B294FAD99F665A13B23795D&thirdPartyPkg=com.huawei.appmarket&translateFlag=1&ts={ts}&uri=app%7C{appid}__HiAd__fed345ef91e24079b46e93c92d7e7d4e__cds_111122701__21__null____null%3Faglocation%3D%257B%2522cres%2522%253A%257B%2522lPos%2522%253A0%252C%2522lid%2522%253A%2522905310%2522%252C%2522pos%2522%253A21%252C%2522relResId%2522%253A%2522{appid}%2522%252C%2522relType%2522%253A%2522app%2522%252C%2522resid%2522%253A%2522{appid}%2522%252C%2522rest%2522%253A%2522app%2522%252C%2522tid%2522%253A%2522dist_83ae03cda9994714a499347e25bbccd1%2522%257D%252C%2522ftid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%252C%2522pres%2522%253A%257B%2522lPos%2522%253A0%252C%2522pos%2522%253A1%252C%2522resid%2522%253A%252283ae03cda9994714a499347e25bbccd1%2522%252C%2522rest%2522%253A%2522tab%2522%252C%2522tid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%257D%257D%26templateId%3D1428744958c44b80aad5091114d7fce4%26maple%3D0%26trackId%3D0%26attribution%3D%257B%2522taskid%2522%253A%2522910297447%2522%252C%2522subTaskId%2522%253A%25229102974470000%2522%252C%2522RTAID%2522%253A%2522%2522%252C%2522callback%2522%253A%2522security%253A44D8410ED7A32A50E8AABFD6%253A03EC36808D57EC58A89C207497BB1CED2B6192ED09D44BAC9D825C50BF49E14A95A9E0E6281F83A1DA4C1F2E6D7A8A%2522%252C%2522channel%2522%253A%25220%2522%257D%26appoid%3Ddde9f2d763154f37acc76b8620d195a9%26listId%3Dcds_111122701%26listIdType%3Dcds%26adFlag%3DHiAd%26cdrInfo%3D20240116142810el21s692259%255E%257BopType%257D%255E910297447%255E{appid}%255Ecds_111122701%255E21%255E608743720197185117_p21%255E608743720197185117%255E14effd3e76d34d76d6aa4185d8d08e5bb340dfdcb8ce0227285d09b8c9001ce2f315fce6f842b5449b4eac5ebebbd78bf7f6f470baaadc081ae820e43b6653825f66a6e7b6f2892cd760d063df840dd9954254fb48fa6a9f19259d68c7170c54%255Ectr%255Efn5-Y29tLmh1YXdlaS5hcHBtYXJrZXR-LTF-MTcwNTMxMDUxODQ0OQ%255E2024-01-16%2B14%253A28%253A10%255E%255EU8800%255E0.000126%255E0%255E2.0%255E2.0%255E9102974470000%255E900011964%255E%255E%255E%255E%255E11.4.1.303%255E1705374325914%255E0%255E1%255E%255E%255E%255E9%255E00000000-0000-0000-0000-000000000000%255E1%255E%255E%255ECN%255ECNY%255E243728192482956951%255Eappgallery%255E%255E%255E%255E2ea67221765846d69b880f43a78a7ab3%255ECPD%255E69%255E%255E%255E243728192570833732%255EWIFI%255EMA%255E%255EZXhwU3RyYXRlZ3k9I2V4cFBhcmFtcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjYnVja2V0SWQ9I2V4cERlcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjZXhwSWQ9MTMxN19WMi5hZ18xMzE3X1YyX2FsbF90YXJnZXRfb3B0aW1pemF0aW9uX29uZWZvcmFsbHwxMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnl8MTMwN19WMi5hZ18xMzA3X1YyX2NvbnRyb2xsZXI%255E%255E%255E%255E1%255E1%255E4010001%255E%255E3G%255E%255E0%255E%255E%255E%255E%255E%255E0%255E1%255E1027162%255E%255E1%255EP8%255E%255E%255Ecom.huawei.appmarket%255E%255E%255E%255E%255E%255E%255E%255E%255Ecom.huawei.appmarket%255E0.000252%255E%255E%255E20240116142810el21s692259%255E20240116142810el21s692182%255E1%255E0%255E%255E%255Esign%253A987064c954dab704e2b30dfe17f5ca63dae4245392d4fff32e75670e44da8998%26version%3D2%26phase%3D0%26requestId%3D2ea67221765846d69b880f43a78a7ab3%26algExp%3D%257B%2522expStrategy%2522%253A%2522246353%2522%252C%2522expParams%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522bucketId%2522%253A%252236%2522%252C%2522expDes%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522expId%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%257D&ver=1.1'

这里我们拿一个APP的appid对移动端的详情进行请求,可以看到接口数据内包含APK的下载链接,如下所示:

1717829410152.jpg

最终,完整的APP类爬虫运行及数据抓取效果如下所示:

1717829419687.jpg

相关文章
|
1天前
|
算法 搜索推荐 开发者
解锁Python代码的速度之谜:性能瓶颈分析与优化实践
探索Python性能优化,关注解释器开销、GIL、数据结构选择及I/O操作。使用cProfile和line_profiler定位瓶颈,通过Cython减少解释器影响,多进程避开GIL,优化算法与数据结构,以及借助asyncio提升I/O效率。通过精准优化,Python可应对高性能计算挑战。【6月更文挑战第15天】
|
2天前
|
数据采集 存储 数据挖掘
Python网络爬虫实战:抓取并分析网页数据
使用Python的`requests`和`BeautifulSoup`,本文演示了一个简单的网络爬虫,抓取天气网站数据并进行分析。步骤包括发送HTTP请求获取HTML,解析HTML提取温度和湿度信息,以及计算平均温度。注意事项涉及遵守robots.txt、控制请求频率及处理动态内容。此基础爬虫展示了数据自动收集和初步分析的基础流程。【6月更文挑战第14天】
|
3天前
|
数据采集 机器学习/深度学习 数据可视化
数据挖掘实战:Python在金融数据分析中的应用案例
Python在金融数据分析中扮演关键角色,用于预测市场趋势和风险管理。本文通过案例展示了使用Python库(如pandas、numpy、matplotlib等)进行数据获取、清洗、分析和建立预测模型,例如计算苹果公司(AAPL)股票的简单移动平均线,以展示基本流程。此示例为更复杂的金融建模奠定了基础。【6月更文挑战第13天】
|
4天前
|
机器学习/深度学习 数据采集 分布式计算
如何用Python处理大数据分析?
【6月更文挑战第14天】如何用Python处理大数据分析?
18 4
|
4天前
|
机器学习/深度学习 存储 监控
基于YOLOv8深度学习的无人机视角高精度太阳能电池板检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标分割
基于YOLOv8深度学习的无人机视角高精度太阳能电池板检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标分割
|
5天前
|
机器学习/深度学习 存储 计算机视觉
基于YOLOv8深度学习的智能道路裂缝检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标检测、目标分割(2)
基于YOLOv8深度学习的智能道路裂缝检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标检测、目标分割
|
5天前
|
机器学习/深度学习 存储 安全
基于YOLOv8深度学习的智能道路裂缝检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标检测、目标分割(1)
基于YOLOv8深度学习的智能道路裂缝检测与分析系统【python源码+Pyqt5界面+数据集+训练代码】深度学习实战、目标检测、目标分割
|
6天前
|
存储 XML 数据处理
Python网络实践:去哪儿旅游数据爬取指南
Python网络实践:去哪儿旅游数据爬取指南
|
7天前
|
JSON 数据挖掘 API
数据分析实战丨基于pygal与requests分析GitHub最受欢迎的Python库
数据分析实战丨基于pygal与requests分析GitHub最受欢迎的Python库
18 2
|
8天前
|
存储 算法 数据挖掘
LeetCode 题目 43:字符串相乘 多种算法分析对比 【python】
LeetCode 题目 43:字符串相乘 多种算法分析对比 【python】