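It's been a while since I last wrote a crawler, so today I'm back to share one with you: it scrapes the Douban short reviews of the film 《八佰》 (The Eight Hundred).
The full code is as follows: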
import time

import bs4
import pandas as pd
import requests

# Request headers (User-Agent, Accept, Cookie) copied from a browser session
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Cookie': 'bid=ggoU9ogRZTI; __gads=ID=cf4b76203c51526a:T=1585391454:S=ALNI_Mangm1-lZDdaHGhDsZDd87LK4ajEQ; douban-fav-remind=1; ll="118159"; _vwo_uuid_v2=DA05DCADC910BEDC1D1D3D0773318CF78|c22091fa3bd072eb5c0220641c8b64d8; __yadk_uid=xabOiJQfYcXeppS6VjvXA4HVrPFvOuqf; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1599530216%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D57aywD0Q6WTnl7XKbIHuEyjr2fhNhzUbukAYUnqmpctowMOR8Q6mCG95WAPj1uJY%26wd%3D%26eqid%3Df1c6d005000db872000000045f56e401%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.361671231.1587465102.1599446059.1599530217.5; __utmb=30149280.0.10.1599530217; __utmc=30149280; __utmz=30149280.1599530217.5.5.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=223695111.48667557.1599446059.1599446059.1599530217.2; __utmb=223695111.0.10.1599530217; __utmc=223695111; __utmz=223695111.1599530217.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; _pk_id.100001.4cf6=ae0e18afc045976b.1599446058.2.1599530449.1599446537.',
}


def crawl(url):
    # Fetch one page of comments and return a list of [user name, comment text] pairs
    html = requests.get(url, headers=HEADERS)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(html.text, 'html.parser')
    data = []
    web_name = soup.select('span.comment-info > a')
    short_text = soup.select('div.comment > p > span.short')
    # Also save a txt file that contains only the comment text, one comment per line
    with open('<八佰>影评.txt', 'a+', encoding='utf-8') as f:
        for i, j in zip(web_name, short_text):
            name = i.get_text()
            text = j.get_text()
            data.append([name, text])
            f.write(text + '\n')
    return data


# Collect the rows from all pages first
rows = []
for i in range(0, 6):
    url = f'https://movie.douban.com/subject/26754233/comments?start={i*20}&limit=20&sort=new_score&status=P'
    print(url)
    time.sleep(0.5)  # pause briefly between requests
    rows.extend(crawl(url))

# Build the DataFrame once (DataFrame.append was removed in pandas 2.0) and write a single CSV
inidata = pd.DataFrame(rows, columns=['name', 'comment'])
inidata.to_csv('《八佰》豆瓣影评.csv', index=False)
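The page loop in the code above relies on the `start` offset growing by 20 per page. As a minimal sketch of that pagination scheme on its own, the helper below builds the same URLs with the standard library; the function name `comment_page_urls` and its parameters are my own and hypothetical, not part of any Douban API.

```python
from urllib.parse import urlencode

# URL pattern taken from the crawler above; subject 26754233 is the film's page.
BASE = "https://movie.douban.com/subject/{subject_id}/comments"


def comment_page_urls(subject_id, pages, page_size=20):
    """Return the URLs of the first `pages` comment pages (hypothetical helper)."""
    urls = []
    for page in range(pages):
        query = urlencode({
            "start": page * page_size,  # offset of the first comment on this page
            "limit": page_size,
            "sort": "new_score",
            "status": "P",
        })
        urls.append(BASE.format(subject_id=subject_id) + "?" + query)
    return urls


urls = comment_page_urls(26754233, pages=6)
print(urls[0])
print(urls[-1])
```

Six pages of 20 comments each cover offsets 0 through 100, which matches `range(0, 6)` in the crawler.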
The data obtained looks like this:
Let's make a word cloud. The original image:
The resulting word cloud:
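The post does not show how the word cloud was produced, so here is only a rough sketch of one common way to do it, assuming the third-party packages `jieba`, `wordcloud`, `numpy` and `Pillow` are installed. The function names, the mask image path and the font path are all hypothetical; a font with Chinese glyphs is needed for the characters to render.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=(), min_len=2):
    """Count tokens, skipping stopwords and tokens shorter than min_len."""
    stop = set(stopwords)
    return Counter(t for t in tokens if len(t) >= min_len and t not in stop)


def render_word_cloud(text_path, mask_path, font_path, out_path):
    """Segment the saved comments and draw a word cloud shaped by a mask image."""
    # Third-party imports stay local so the counting helper works without them.
    import jieba
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    with open(text_path, encoding="utf-8") as f:
        tokens = jieba.lcut(f.read())  # split Chinese text into words
    freqs = word_frequencies(tokens)
    mask = np.array(Image.open(mask_path))  # pure-white areas of the mask stay empty
    cloud = WordCloud(font_path=font_path, mask=mask, background_color="white")
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(out_path)


# Small check of the counting step with already-segmented tokens
freqs = word_frequencies(["电影", "的", "历史", "电影", "了"])
print(freqs.most_common(1))
```

Dropping single-character tokens is a crude stand-in for a proper stopword list; it removes particles such as 的 and 了 that would otherwise dominate the cloud.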
That's it for this crawler. Bye!