Step 1: Analyze the page flow and identify the target data
Open the Baihe.com homepage and work out which data we want to grab.
From the homepage, go to the recommended-profiles page and set the filter conditions; more profiles are then displayed, and these are what we will scrape next.
Clicking any photo opens a detailed profile page; the data we want to scrape includes name, age, height, education, marital history, self-introduction, and so on.
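Before writing any scraping code, it can help to pin down the shape of one record. A minimal sketch, where the class and field names are my own choice rather than anything from the site:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """One scraped profile record (field names are illustrative)."""
    name: str
    age: str              # kept as page text, e.g. "25岁"
    height: str
    education: str
    city: str
    marital_history: str
    intro: str

# Example record built by hand
p = Profile('示例', '25岁', '165cm', '本科', '北京', '未婚', '自我介绍……')
print(p.name)
```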
Step 2: Request the site and get the page data
```python
import requests
import json
from lxml import etree
import pandas as pd

# Cookies copied from a logged-in browser session
cookies = {
    'orderSource': '10130301',
    'accessID': '20220525160212300807',
    'lastLoginDate': 'Wed%20May%2025%202022%2016%3A02%3A12%20GMT+0800%20%28%u4E2D%u56FD%u6807%u51C6%u65F6%u95F4%29',
    'accessToken': 'BH1653465732565956999',
    'Hm_lvt_5caa30e0c191a1c525d4a6487bf45a9d': '1653465735',
    'tempID': '2328584815',
    'NTKF_T2D_CLIENTID': 'guest46B34559-18E8-1BB5-1B84-FA3D400112EF',
    'AuthCookie': '4BFFD62B611D896E561FCDFCA2AC50840C888FC5354A7C8F02C453B1F486849BA543E98E4E32B2920B4F70C256EF513E1B711A8BC10FC1DF9BF608CB30C4F468740A0A3FA06C20992D2ABCEAC15741654D3542C75E463CD2',
    'AuthMsgCookie': 'DF8460C627701442B7016AED70C6828D82D7E467447C72E69E49D2A6B8815481FFB474631E375FBDED51DD0A0BDFCBB06E580573940DA59132515B2BEA677360104068C9C41BBE1272765712500DB8532613ED82D5EDBD2F',
    'GCUserID': '307535896',
    'OnceLoginWEB': '307535896',
    'LoginEmail': '15565222558%40mobile.baihe.com',
    'userID': '307535896',
    'spmUserID': '307535896',
    'nTalk_CACHE_DATA': '{uid:kf_9847_ISME9754_307535896,tid:1653465759744600}',
    'AuthTokenCookie': 'bh.1653466069969_1800.04BE6CEADC682D6A9E8630E1E15B85A5190CC534.bhkOo8o.6',
    'noticeEvent_307535896': '25',
    'hasphoto': '1',
    'AuthCheckStatusCookie': '745B4B48B7EA1CC7BFDEB8D850A700D714D1D3015978705726A127B203C4931405E9D878DCC812E7',
    'tgw_l7_route': '0dd999c63b312678b82b8668ba91d54d',
    'Hm_lpvt_5caa30e0c191a1c525d4a6487bf45a9d': '1653466330',
    '_fmdata': 'Ewvc1t%2BSwfMTVNcjWwP%2B0uotvg7udIoQjotCEf9E17Cze%2FAmFlYoO9ck5kXksZIVSuxOcxl68ouCRX%2FOxc4Mjv2RNI6Ek7XQ8L5kbQaMQeA%3D',
}

# Request headers copied from the browser's developer tools
headers = {
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'https://search.baihe.com',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://search.baihe.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Form data: the filter conditions set on the search page
data = {
    'minAge': '19',
    'maxAge': '28',
    'minHeight': '155',
    'maxHeight': '170',
    'education': '1-7',
    'loveType': '',
    'marriage': '',
    'income': '1-6',
    'city': '8611',
    'nationality': '',
    'occupation': '',
    'children': '',
    'bloodType': '',
    'constellation': '',
    'religion': '',
    'online': '',
    'isPayUser': '',
    'isCreditedByAuth': '',
    'hasPhoto': '1',
    'housing': '',
    'car': '',
    'homeDistrict': '',
    'page': '1',
    'sorterField': '1',
}

response = requests.post(
    'https://search.baihe.com/Search/getUserID?&jsonCallBack=jQuery183046819546986330085_1653466343209',
    cookies=cookies,
    headers=headers,
    data=data,
).text

# The response is JSONP: callback_name({...});  Strip the wrapper to get JSON.
# Note: str.lstrip() removes a *set of characters*, not a prefix, so slicing
# off the callback name by length is safer.
callback = 'jQuery183046819546986330085_1653466343209('
data = response[len(callback):].rstrip(');')
data = json.loads(data)
data = data['data']
print(data)
```
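The callback name in the URL changes between sessions, so slicing off a hard-coded prefix is brittle. A more general unwrap, under the assumption that the response always has the shape `callback({...});`, can use a regex (this helper is my own addition):

```python
import json
import re

def unwrap_jsonp(text: str):
    """Extract and parse the JSON payload from a JSONP response
    of the form callback_name({...}); regardless of the callback name."""
    match = re.match(r'^[\w$.]+\((.*)\);?\s*$', text.strip(), re.S)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))

payload = unwrap_jsonp('jQuery183_123({"data": ["1001", "1002"]});')
print(payload['data'])  # → ['1001', '1002']
```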
Step 3: Filter and parse the data, then extract the fields
```python
# Lists to collect each field
name = []
age = []
hg = []
x_l = []
city = []
h_p = []
content = []

for item_id in data:
    # Build the profile URL for this user ID
    url = 'https://profile1.baihe.com/?oppID=' + item_id
    response_2 = requests.get(url, headers=headers, cookies=cookies).text
    html = etree.HTML(response_2)
    # Name
    name.append(html.xpath('//div[@class="name"]/span[2]/text()')[0])
    # Age, height, education, city and marital status share one container,
    # so query it once and index into the result
    info = html.xpath('//div[@class="inter"]/p/text()')
    age.append(info[0])    # age
    hg.append(info[1])     # height
    x_l.append(info[2])    # education
    city.append(info[3])   # city
    h_p.append(info[4])    # marital status
    # Self-introduction
    content.append(html.xpath('//div[@class="intr"]/text()')[0])

print(name, age, hg, x_l, city, h_p, content)
```
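Any profile page that is missing one of these elements will make the bare `[0]` indexing raise an IndexError and abort the whole loop. A small defensive helper (my own addition, not part of the original script) returns a placeholder instead:

```python
from lxml import etree

def first_text(html, xpath_expr, default=''):
    """Return the first text node matched by xpath_expr,
    or `default` when nothing matches."""
    results = html.xpath(xpath_expr)
    return results[0].strip() if results else default

# Minimal HTML fragment standing in for a profile page
html = etree.HTML('<div class="name"><span>icon</span><span>小雨</span></div>')
print(first_text(html, '//div[@class="name"]/span[2]/text()'))   # → 小雨
print(first_text(html, '//div[@class="intr"]/text()', default='无'))  # → 无
```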
Step 4: Persist the data
```python
df = pd.DataFrame()
df['网名'] = name
df['年龄'] = age
df['身高'] = hg
df['学历'] = x_l
df['所在城市'] = city
df['是否婚配'] = h_p
df['自我介绍'] = content
# Newer pandas versions dropped the `encoding` argument and legacy .xls
# support, so write .xlsx (requires openpyxl) instead
df.to_excel('百合网Demo.xlsx', index=False)
```
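If no Excel writer engine is installed, a CSV with a UTF-8 BOM is an alternative that Excel still opens with Chinese text intact (the frame and file name below are illustrative):

```python
import pandas as pd

# Illustrative frame standing in for the scraped results
df = pd.DataFrame({'网名': ['示例'], '年龄': ['25岁']})

# utf-8-sig prepends a BOM so Excel auto-detects the encoding
df.to_csv('baihe_demo.csv', index=False, encoding='utf-8-sig')

with open('baihe_demo.csv', encoding='utf-8-sig') as f:
    print(f.read().splitlines()[0])  # → 网名,年龄
```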
That is how to scrape all the profile data; you can follow the same approach to fetch more fields yourself!
If this was helpful, please give it a like and a follow.
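One caveat about "all" the data: the search request in Step 2 only posts `page=1`, so it returns a single page of user IDs. Fetching more means re-posting with the `page` field incremented. A sketch of generating the per-page form payloads (the page count here is an arbitrary assumption):

```python
def paged_payloads(base_data: dict, pages: int):
    """Yield one copy of the search form data per page (1-based),
    overriding only the 'page' field."""
    for page in range(1, pages + 1):
        payload = dict(base_data)
        payload['page'] = str(page)
        yield payload

# Each payload would be POSTed to the getUserID endpoint in turn
payloads = list(paged_payloads({'minAge': '19', 'page': '1'}, 3))
print([p['page'] for p in payloads])  # → ['1', '2', '3']
```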