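⭐ Preface
Hi everyone, I'm yma16. This article combines selenium and echarts to visualize which cities the CSDN New Star track contestants come from and to check whether their entry information is valid.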
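Articles in this series:
Python crawler: visualizing CSDN user quality scores with django + vue3
Python crawler: fetching the weather forecast with regular expressions and showing it in an echarts line chart
Python crawler: fetching the bilibili Swordsmith Village arc subtitles with requests, segmenting them into words, and showing a word cloud
Python crawler: logging in to a personal markdown blog site with selenium
Python crawler: saving Minions stickers to a folder with requests
Python: getting the cities of CSDN New Star track contestants with selenium and showing them on an echarts map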
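⭐ Opening the track sign-up page with selenium to collect contestants' profile pages
The target URL is again the sign-up page of the track I recently opened: https://bbs.csdn.net/topics/616574177
Straight to the point: first the approach, then the implementation.
Live result: https://yma16.inscode.cc/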
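💖 Approach for collecting contestants' profile pages
Basic logic:
- Get the table row elements
- Within each row, get the user id and the submitted content
- When a page is done, click the next-page button
Implementation: locate the parent element (a single table row) by className, then extract the user id and the user's submission record from that row.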
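Table row: tr elements with class el-table__row
User id element: a link with class set-ellipsis def-color
User submission record: a link with class set-ellipsis link
Next-page button: a button with class btn-next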
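The per-row extraction can be isolated into a small function that only relies on `text` and `get_attribute`, the two `WebElement` members the scraper uses. This is a minimal sketch rather than the author's exact code: `classify_links` and the `FakeLink` stand-in are hypothetical names introduced for illustration, and the demo URLs are placeholders.

```python
USER_ID_CLASS = "set-ellipsis def-color"  # class of the user id link
POST_CLASS = "set-ellipsis link"          # class of the submission link


def classify_links(links):
    """Build one user record from the <a> elements of a single table row.

    `links` is any iterable of objects exposing `.text` and
    `.get_attribute(name)`, for example Selenium WebElement instances.
    """
    record = {"uid": "", "realUrl": "", "postUrl": ""}
    for link in links:
        class_name = link.get_attribute("class")
        if class_name == USER_ID_CLASS:
            record["uid"] = link.text
            record["realUrl"] = link.get_attribute("href")
        elif class_name == POST_CLASS:
            record["postUrl"] = link.get_attribute("href")
    return record


class FakeLink:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, text, attrs):
        self.text = text
        self._attrs = attrs

    def get_attribute(self, name):
        return self._attrs.get(name)


# Placeholder row: one user id link and one submission link.
row = [
    FakeLink("example_user", {"class": USER_ID_CLASS,
                              "href": "https://example.com/u/example_user"}),
    FakeLink("my post", {"class": POST_CLASS,
                         "href": "https://example.com/post/1"}),
]
print(classify_links(row))
```

Keeping the matching logic free of any driver calls makes it easy to check without opening a browser; with a real page, the list of `<a>` elements found inside one table row would be passed in instead of `row`.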
💖 Selenium code for collecting the contestants
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
import json
import time

# Path to the local Edge WebDriver binary (raw string keeps the backslashes literal)
dir_path = r'C:\Users\MY\PycharmProjects\Spider_python\study2021\day07\dirver\msedgedriver.exe'
# Selenium 4 style; on Selenium 3 pass executable_path=dir_path to webdriver.Edge instead
driver = webdriver.Edge(service=Service(executable_path=dir_path))
url = 'https://bbs.csdn.net/topics/616574177'
driver.get(url)

userUrlObj = {}     # uid -> user record
userUidArray = []   # all user records in page order


# Collect uid, profile URL and submission URL from every row on the current page
def getUid():
    # table rows
    cells = driver.find_elements(By.XPATH, '//tr[@class="el-table__row"]')
    for i in cells:
        uid = ''
        realUrl = ''
        postUrl = ''
        aDom = i.find_elements(By.TAG_NAME, 'a')
        for aItem in aDom:
            aItemClassName = aItem.get_attribute('class')
            print(aItem.text, aItemClassName)
            # user id
            if aItemClassName == 'set-ellipsis def-color':
                realUrl = aItem.get_attribute('href')
                uid = aItem.text
            # user submission
            elif aItemClassName == 'set-ellipsis link':
                postUrl = aItem.get_attribute('href')
        userItem = {
            'uid': uid,
            'realUrl': realUrl,
            'postUrl': postUrl,
        }
        userUidArray.append(userItem)
        userUrlObj[uid] = userItem
        print(userUrlObj[uid], len(userUidArray))
    time.sleep(5)


# Click the next-page button; return False on the last page or on any error
def nextBtn():
    try:
        nextBtnDom = driver.find_element(By.XPATH, '//button[@class="btn-next"]')
        disabled = nextBtnDom.get_attribute('disabled')
        print(disabled, 'disabled')
        if str(disabled) != 'true':
            nextBtnDom.click()
            return True
        return False
    except Exception as e:
        print(e)
        return False


# Scrape the current page, then recurse while a next page exists
def work():
    time.sleep(2)
    getUid()
    nextFlag = nextBtn()
    if nextFlag is True:
        time.sleep(1)
        return work()
    # last page reached: save the result
    return writeJson()


def writeJson():
    with open("./joinUserProfile.json", 'w', encoding='utf-8') as write_f:
        write_f.write(json.dumps(userUrlObj, indent=4, ensure_ascii=False))


if __name__ == '__main__':
    work()
    driver.close()
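Because `work()` calls itself once per page, a very long sign-up list could in principle run into Python's recursion limit. The same flow can be written as a loop. The sketch below is one possible variant, not the author's code: `crawl_all_pages` and its `max_pages` safety cap are hypothetical, the waits are left out, and the demo uses stand-in callables instead of a browser.

```python
def crawl_all_pages(scrape_page, go_next, max_pages=1000):
    """Call scrape_page() on every page until go_next() returns False.

    max_pages is a hypothetical safety cap so that a next-page button
    that never becomes disabled cannot keep the loop running forever.
    Returns the number of pages visited.
    """
    pages = 0
    while pages < max_pages:
        scrape_page()
        pages += 1
        if not go_next():
            break
    return pages


# Demo with stand-in callables that simulate a three-page table.
seen = []
remaining = [True, True, False]  # results of go_next(): two more pages, then stop

pages_visited = crawl_all_pages(
    scrape_page=lambda: seen.append(len(seen) + 1),
    go_next=lambda: remaining.pop(0),
)
print(pages_visited, seen)
```

With the real driver this would be called as `crawl_all_pages(getUid, nextBtn)`, followed by `writeJson()`.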
Resulting user JSON:
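Each entry in `joinUserProfile.json` is keyed by user id and carries the three fields built in `getUid()`. The sketch below writes and reads back a single record to show that shape; the uid and URLs are placeholders made up for illustration, not scraped data, and a temporary directory is used so nothing real is overwritten.

```python
import json
import os
import tempfile

# Placeholder record with the same keys the scraper produces.
sample = {
    "example_user": {
        "uid": "example_user",
        "realUrl": "https://example.com/profile/example_user",
        "postUrl": "https://example.com/article/1",
    }
}

path = os.path.join(tempfile.mkdtemp(), "joinUserProfile.json")

# Write with the same settings as writeJson(): UTF-8, non-ASCII kept readable.
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, indent=4, ensure_ascii=False))

# Read back with an explicit encoding so non-ASCII names survive on any platform.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(json.dumps(loaded, indent=4, ensure_ascii=False))
```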
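💖 Getting each contestant's city from their profile page with selenium
Implementation logic:
- IP location shown on the profile page: located by class name
- User avatar: located by className
- User nickname: located by className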
Rendered HTML of a personal profile page:
User avatar HTML
Python code that scans the data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
import json
import time

# Path to the local Edge WebDriver binary (raw string keeps the backslashes literal)
dir_path = r'C:\Users\MY\PycharmProjects\Spider_python\study2021\day07\dirver\msedgedriver.exe'
# Selenium 4 style; on Selenium 3 pass executable_path=dir_path to webdriver.Edge instead
driver = webdriver.Edge(service=Service(executable_path=dir_path))

# Load the records saved by the previous script (written as UTF-8)
with open('joinUserProfile.json', 'r', encoding='utf-8') as f:
    joinJson = json.load(f)

userIpInfo = {}        # uid -> profile info
userIpInfoArray = []   # all profile info records in visit order


def getUserInfo():
    for key in joinJson.keys():
        print(key, 'userIpInfo')
        requestUserInfo(key, joinJson[key]['realUrl'])
    writeJson()


# Open one profile page and read avatar, nickname and IP location
def requestUserInfo(key, url):
    time.sleep(3)
    try:
        userIpInfoItem = {}
        driver.get(url)
        imgDom = driver.find_element(By.XPATH, '//div[@class="user-profile-avatar"]')
        imgSrc = imgDom.find_element(By.TAG_NAME, 'img').get_attribute('src')
        nameDom = driver.find_element(By.XPATH, '//div[@class="user-profile-head-name"]')
        # the first child div holds the nickname
        nickName = nameDom.find_element(By.TAG_NAME, 'div').text
        ip = driver.find_element(By.XPATH, '//span[@class="address el-popover__reference"]').text
        userIpInfoItem['uid'] = key
        userIpInfoItem['name'] = nickName
        userIpInfoItem['imgSrc'] = imgSrc
        userIpInfoItem['ip'] = ip
        userIpInfoItem['url'] = url
        userIpInfoItem['postUrl'] = joinJson[key]['postUrl']
        userIpInfo[key] = userIpInfoItem
        userIpInfoArray.append(userIpInfoItem)
    except Exception as e:
        # a profile that fails to load or lacks an element is skipped
        print(e)
    print(userIpInfo, len(userIpInfoArray))


def writeJson():
    with open("./joinUserInfo.json", 'w', encoding='utf-8') as write_f:
        write_f.write(json.dumps(userIpInfo, indent=4, ensure_ascii=False))


if __name__ == '__main__':
    getUserInfo()
    driver.close()
Result:
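The scraped profiles end up in `joinUserInfo.json`, keyed by user id. To put them on a chart, the records can be grouped by their `ip` field. This is a minimal sketch under assumptions: `count_by_location` is a hypothetical helper, the output is a list of `{name, value}` objects (the usual shape for ECharts series data), and the sample records are placeholders rather than scraped results. Whether the raw `ip` text needs cleaning first depends on what the page actually renders, so it is used unchanged here.

```python
from collections import Counter


def count_by_location(user_info):
    """Count users per `ip` value and return ECharts-style data items.

    Records with a missing or empty `ip` field are skipped.
    """
    counts = Counter(
        item["ip"] for item in user_info.values() if item.get("ip")
    )
    return [{"name": name, "value": value} for name, value in counts.most_common()]


# Placeholder records using the same keys as joinUserInfo.json.
sample_info = {
    "user_a": {"uid": "user_a", "ip": "Location A"},
    "user_b": {"uid": "user_b", "ip": "Location B"},
    "user_c": {"uid": "user_c", "ip": "Location A"},
    "user_d": {"uid": "user_d", "ip": ""},
}
print(count_by_location(sample_info))
```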
selenium & echarts: visualizing CSDN New Star track contestants with avatars, IP cities, and assertions on the validity of entry information (advanced) (part 2): https://developer.aliyun.com/article/1492676