Recently, in order to scrape some content I wanted, I started brushing up on Python crawlers again.
First, find the URL of the login page:
https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn
Fetching the page source with plain urllib reveals the form that gets submitted:
<form id="fm1" action="/account/verify;jsessionid=78D8B598F6A7667130715F7491D6AFDD.tomcat1" method="post">
<input id="username" name="username" tabindex="1" placeholder="输入用户名/邮箱/手机号" class="user-name" type="text" value=""/>
<div class="mobile-auth" style="display:none"><span>该手机已绑定账号,可使用 </span><a href="" id="mloginurl" class="mobile-btn" >手机验证码登录</a></div>
<input id="password" name="password" tabindex="2" placeholder="输入密码" class="pass-word" type="password" value="" autocomplete="off"/>
<div class="error-mess" style="display:none;">
<span class="error-icon"></span><span id="error-message"></span>
</div>
<div class="row forget-password">
<span class="col-xs-6 col-sm-6 col-md-6 col-lg-6">
<input type="checkbox" name="rememberMe" id="rememberMe" value="true" class="auto-login" tabindex="4"/>
<label for="rememberMe">下次自动登录</label>
</span>
<span class="col-xs-6 col-sm-6 col-md-6 col-lg-6 forget tracking-ad" data-mod="popu_26">
<a href="/account/fpwd?action=forgotpassword&service=http%3A%2F%2Fmy.csdn.net%2Fmy%2Fmycsdn" tabindex="5">忘记密码</a>
</span>
</div>
<!-- 该参数可以理解成每个需要登录的用户都有一个流水号。只有有了webflow发放的有效的流水号,用户才可以说明是已经进入了webflow流程。否则,没有流水号的情况下,webflow会认为用户还没有进入webflow流程,从而会重新进入一次webflow流程,从而会重新出现登录界面。 -->
<input type="hidden" name="lt" value="LT-9098-f0M45K9ONcaHCXC7e00ykfOpTxPheC" />
<input type="hidden" name="execution" value="e1s1" />
<input type="hidden" name="_eventId" value="submit" />
<input class="logging" accesskey="l" value="登 录" tabindex="6" type="button" />
</form>
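Note that the form's action contains a jsessionid.
In addition, the login request captured with Fiddler contains the fields username, password, lt, execution, and _eventId.
In the fetched source, CSDN has also left a comment about lt and the related fields: it describes the value as a serial number that webflow issues to each user who needs to log in. Without a valid serial number, webflow treats the user as not having entered the flow yet and shows the login page again.
The URL the form data is posted to is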
https://passport.csdn.net/account/verify
So my guess was that these parameters are the key to logging in. How can they be obtained?
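One possible way to pull these values out of the page is sketched below. It uses the standard-library html.parser instead of a regular expression and reads the jsessionid from the form's action attribute. The names LoginFormParser and parse_login_form are made up for this illustration, and the sample markup is a trimmed copy of the form shown above, so treat it as a sketch under those assumptions rather than the approach used in the rest of this post.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the form action and the hidden <input> fields of a login page."""

    def __init__(self):
        super().__init__()
        self.action = None  # value of the first <form action="..."> seen
        self.hidden = {}    # name -> value for every hidden input

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here by default.
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")


def parse_login_form(html):
    """Return (form action, dict of hidden fields) for the given page source."""
    parser = LoginFormParser()
    parser.feed(html)
    return parser.action, parser.hidden


# Trimmed copy of the form captured above (illustration only).
sample = '''
<form id="fm1" action="/account/verify;jsessionid=78D8B598F6A7667130715F7491D6AFDD.tomcat1" method="post">
<input type="hidden" name="lt" value="LT-9098-f0M45K9ONcaHCXC7e00ykfOpTxPheC" />
<input type="hidden" name="execution" value="e1s1" />
<input type="hidden" name="_eventId" value="submit" />
</form>
'''
action, hidden = parse_login_form(sample)
```

Here action carries the path with the jsessionid suffix, and hidden holds lt, execution, and _eventId, which are the fields that have to be sent back together with the username and password.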
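As for jsessionid, reloading the login page several times showed that it is different every time, and lt and execution change as well, so these values presumably have to be fetched dynamically. The code was therefore changed as follows: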
import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar
import re
import sys


class CsdnCookie:
    def __init__(self):
        self.login_url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
        self.verify_url = 'https://passport.csdn.net/account/verify'
        self.my_url = 'https://my.csdn.net/'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        self.user_headers = {
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            # Accept-Encoding is deliberately not sent: urllib does not decompress
            # response bodies, so decode("utf-8") would fail on a compressed page.
            'Connection': "Keep-Alive",
            'User-Agent': self.user_agent
        }
        # File the cookies are saved to and later loaded from.
        self.cookie_dir = 'C:/Users/ecaoyng/Desktop/PPT/cookie_csdn.txt'

    def get_lt_execution(self):
        """Fetch the login page, extract jsessionid/lt/execution, log in, save the cookies."""
        cookie = http.cookiejar.MozillaCookieJar(self.cookie_dir)
        handler = urllib.request.HTTPCookieProcessor(cookie)
        opener = urllib.request.build_opener(handler)
        # request = urllib.request.Request(self.login_url, headers=self.user_headers)
        try:
            response = opener.open(self.login_url)
            page_src = response.read().decode(encoding="utf-8")
            # Group 1: jsessionid, group 2: lt, group 3: execution.
            pattern = re.compile(
                r'login.css;jsessionid=(.*?)".*?name="lt" value="(.*?)" />.*?name="execution" value="(.*?)" />', re.S)
            items = re.findall(pattern, page_src)
            print(items)
            print('=' * 80)
            if not items:
                # Avoid an IndexError below if the page layout has changed.
                print('Could not find jsessionid/lt/execution in the login page')
                return
            values = {
                'username': "username",  # replace with the real account name
                'password': "password",  # replace with the real password
                'lt': items[0][1],
                'execution': items[0][2],
                '_eventId': "submit"
            }
            post_data = urllib.parse.urlencode(values)
            post_data = post_data.encode('utf-8')
            opener.addheaders = [('User-Agent', self.user_agent)]
            # Use a local variable so repeated calls do not keep appending to self.verify_url.
            verify_url = self.verify_url + ';jsessionid=' + items[0][0]
            print('=' * 80)
            print(verify_url)
            print('=' * 80)
            response_login = opener.open(verify_url, post_data)
            print(response_login.read().decode(encoding="utf-8"))
            for i in cookie:
                print('Name: %s' % i.name)
                print('Value: %s' % i.value)
            print('=' * 80)
            # Also keep session cookies and expired cookies in the file.
            cookie.save(ignore_discard=True, ignore_expires=True)
            my_page = opener.open(self.my_url)
            print(my_page.read().decode(encoding='utf-8'))
        except urllib.error.URLError as e:
            print('Error msg: %s' % e.reason)

    def access_other_page(self):
        """Reuse the saved cookie file to open a page that requires login."""
        try:
            cookie = http.cookiejar.MozillaCookieJar()
            cookie.load(self.cookie_dir, ignore_discard=True, ignore_expires=True)
            get_request = urllib.request.Request(self.my_url, headers=self.user_headers)
            access_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
            get_response = access_opener.open(get_request)
            print('=' * 80)
            print(get_response.read().decode(encoding="utf-8"))
        except Exception as e:
            # Not every exception has a .reason attribute, so print the exception itself.
            print('Error msg when entering other pages: %s' % e)


if __name__ == '__main__':
    print(sys.getdefaultencoding())
    print('=' * 80)
    cookie_obj = CsdnCookie()
    # Run get_lt_execution() once first to log in and create the cookie file;
    # after that, access_other_page() can reuse the saved cookies.
    # cookie_obj.get_lt_execution()
    cookie_obj.access_other_page()
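One detail worth knowing when sending browser-like headers: urllib does not decompress response bodies on its own, so sending an Accept-Encoding: gzip, deflate header means the body may need to be decoded by hand before it can be turned into text. A minimal sketch of such a helper follows; the name decode_body is hypothetical and it only handles gzip and deflate, since Brotli ("br") has no standard-library decoder and should not be advertised without a third-party package.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress a response body according to Content-Encoding, then decode it to str."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    # Any other value (including None) is treated as an uncompressed body.
    return raw.decode(charset)
```

With a response object from opener.open(), it would be called roughly as decode_body(response.read(), response.headers.get("Content-Encoding")).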
Below is the CSDN cookie file that was obtained:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.
.csdn.net TRUE / FALSE AU 1FF
.csdn.net TRUE / FALSE xxx BT xxx
.csdn.net TRUE / FALSE UD Python%E7%88%B1%E5%A5%BD%E8%80%85
.csdn.net TRUE / FALSE 1543915400 UE "username@163.com"
.csdn.net TRUE / FALSE 1543915398 UN username
.csdn.net TRUE / FALSE UserInfo 2MufVKKubW9%2FasttTNA6s3WQr%2BaBa08G3ijawR7NBVftvqoXgXWKvKxjvv2g3YMtJINvNyXOlJM%2FMpWjlo3nxZmMLRQbY5D51X2sJgag7QtsKAGN6NBORCEWVZ1W0BzbQ%2FFZwXUiAjK7CwakS5fGJg%3D%3D
.csdn.net TRUE / FALSE UserName username
.csdn.net TRUE / FALSE UserNick nickname
.csdn.net TRUE / FALSE access-token 012f25c0-3341-444d-864c-3a1f497948c9
.csdn.net TRUE / FALSE 1735689600 dc_session_id 10_1512379399618.695892
.csdn.net TRUE / FALSE 1735689600 uuid_tt_dd 10_9929133460-1512379399943-185666
passport.csdn.net FALSE / TRUE CASTGC TGT-151925-6sfuatdEVoSydhGYa1jjSpaMxCzxmKVOI9dfLECPmqfdMqxDuT-passport.csdn.net
passport.csdn.net FALSE / FALSE JSESSIONID 42458334234E0ED744ED275311851BE4.tomcat1
passport.csdn.net FALSE / TRUE 1514971397 LSSC LSSC-1192363-wi3f7cybBifLidkrZxeiGMpCjUyOkE-passport.csdn.net
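To check what such a file contains without opening it by hand, it can be loaded back with http.cookiejar.MozillaCookieJar. The sketch below is only an illustration: load_cookie_names is a hypothetical helper, and the demo file uses made-up values written to a temporary path rather than the real cookie file. In the real file the columns are separated by tabs, and an empty expiry column marks a session cookie.

```python
import http.cookiejar
import os
import tempfile


def load_cookie_names(path):
    """Return {cookie name: domain} for every cookie in a Netscape-format cookie file."""
    jar = http.cookiejar.MozillaCookieJar()
    # ignore_discard keeps session cookies (no expiry); ignore_expires keeps expired ones.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return {c.name: c.domain for c in jar}


# Demo file with made-up values. Tab-separated columns:
# domain, domain flag (TRUE when the domain starts with a dot), path,
# secure flag, expiry (empty for session cookies), name, value
demo = (
    "# Netscape HTTP Cookie File\n"
    ".csdn.net\tTRUE\t/\tFALSE\t1735689600\tuuid_tt_dd\tdemo-value\n"
    ".csdn.net\tTRUE\t/\tFALSE\t\tUserName\tusername\n"
    "passport.csdn.net\tFALSE\t/\tFALSE\t\tJSESSIONID\tdemo-session\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(demo)
names = load_cookie_names(f.name)
os.remove(f.name)
```

If a cookie such as JSESSIONID is missing from the result of a real file, the save call most likely ran without ignore_discard=True, because session cookies are skipped by default.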