Python Web Crawler Basics

2024-06-24


Web page downloader (urllib)
Downloads the page at a given URL to the local machine and stores it as a file or a string.

Basic usage
Create baidu.py with the following content:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
buff = response.read()
html = buff.decode("utf8")
print(html)
Run python baidu.py on the command line to print the fetched page.
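
The basic call above has no timeout and assumes the page is UTF-8. A minimal sketch of a more defensive variant follows. The helper name fetch and its default values are illustrative assumptions, not part of the original script. It reads the charset from the Content-Type header when one is declared and returns None if the request fails.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10, default_charset="utf-8"):
    """Return the body at `url` as text, or None if the request fails.

    `fetch` is a hypothetical helper written for this example.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            buff = response.read()
            # Prefer the charset declared in Content-Type, if there is one
            charset = response.headers.get_content_charset() or default_charset
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("request failed:", err)
        return None
    # errors="replace" keeps going even if a few bytes do not decode
    return buff.decode(charset, errors="replace")
```

Calling fetch('http://www.baidu.com/') would then give back the page text, or None when the request cannot be completed.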

Constructing a Request
The code above can be rewritten as:

import urllib.request

request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
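
A Request object can be inspected before it is sent, which helps when checking what urllib is about to do. The sketch below sends nothing over the network. The User-Agent value is a placeholder chosen for this example.

```python
import urllib.request

request = urllib.request.Request('http://www.baidu.com/')
# Placeholder User-Agent, used only for illustration
request.add_header('User-Agent', 'Mozilla/5.0')

print(request.full_url)      # the URL the request targets
print(request.host)          # host part parsed from the URL
print(request.get_method())  # GET, because no data was supplied
# urllib stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))
```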
Passing parameters
Create baidu2.py with the following content:

import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
values = {'name': 'voidking','language': 'Python'}
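# urlencode() builds 'name=voidking&language=Python'; encode() converts it to bytes
data = urllib.parse.urlencode(values).encode(encoding='utf-8',errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
# With data set, urllib defaults to POST; method='GET' overrides that,
# so the parameters are sent in the body of a GET request
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')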
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
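
For comparison, here is a sketch of the two more conventional ways to pass the same parameters: in the query string of a GET request, or in the body of a POST request. The variable names get_request and post_request are assumptions made for this example, and nothing is sent over the network.

```python
import urllib.parse
import urllib.request

base_url = 'http://www.baidu.com/'
values = {'name': 'voidking', 'language': 'Python'}
query = urllib.parse.urlencode(values)

# GET: the parameters become part of the URL
get_request = urllib.request.Request(base_url + '?' + query)

# POST: the parameters go in the body as bytes;
# urllib chooses POST automatically when data is supplied
post_request = urllib.request.Request(base_url, data=query.encode('utf-8'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Either request object can then be passed to urllib.request.urlopen() as in the scripts above.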
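Monitoring traffic with Fiddler
To check whether the request really carries the parameters, we use Fiddler.
With Fiddler running, however, the code above unexpectedly fails with a 504 error, for both baidu.py and baidu2.py.

Although Python reports an error, the request is visible in Fiddler, and it does carry the parameters.

Some searching suggested that Request in older Python versions did not support HTTPS access through a proxy, although recent versions should. The simplest workaround, then, is to crawl a URL that uses plain HTTP, for example http://www.csdn.net. That still failed, only now with a 400 error.

Then came the twist.
Changing the URL to http://www.csdn.net/ made the request succeed. The only difference is the trailing slash / at the end of the address. Likewise, changing http://www.baidu.com to http://www.baidu.com/ also made the request succeed.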
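
The trailing-slash workaround can be applied automatically. The sketch below uses a hypothetical helper, ensure_path, that adds / only when the URL has an empty path. It reproduces the fix described above and makes no claim about why the proxy rejected the original URL.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_path(url):
    """Return `url` with '/' as its path if the path is empty.

    `ensure_path` is a hypothetical helper written for this example.
    """
    parts = urlsplit(url)
    if parts.path == '':
        # SplitResult is a named tuple, so _replace returns a modified copy
        parts = parts._replace(path='/')
    return urlunsplit(parts)


print(ensure_path('http://www.baidu.com'))
print(ensure_path('http://www.csdn.net/'))
```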

Adding a handler

import urllib.request
import http.cookiejar

# Create a cookie container
cj = http.cookiejar.CookieJar()

# Create an opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Install the opener for urllib.request
urllib.request.install_opener(opener)

# Send the request
request = urllib.request.Request('http://www.baidu.com/')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
print(cj)
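
install_opener() changes the behavior of every later urlopen() call. A sketch of an alternative keeps the opener local and calls opener.open() directly. The helper names make_cookie_opener and cookie_pairs, and the default User-Agent string, are assumptions made for this example.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener(user_agent='Mozilla/5.0'):
    """Build an opener that stores cookies in its own CookieJar."""
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # These headers are added to every HTTP request made through this opener
    opener.addheaders = [('User-Agent', user_agent)]
    return opener, cj


def cookie_pairs(cj):
    """List the cookies in the jar as (name, value) tuples."""
    return [(cookie.name, cookie.value) for cookie in cj]


opener, cj = make_cookie_opener()
print(cookie_pairs(cj))  # empty until a response sets cookies
```

After a call such as opener.open('http://www.baidu.com/'), cookie_pairs(cj) would list whatever cookies the server set.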
Web page parser (BeautifulSoup)
Extracts valuable data and a list of new URLs from a web page.

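Choosing a parser
A parser can be built with regular expressions, html.parser, BeautifulSoup, lxml, and so on. Here we choose BeautifulSoup.
Regular expressions work by fuzzy pattern matching on the raw text, whereas the other three parse the page into a structured DOM tree.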
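
The difference between the two approaches can be seen on a small example. The sketch below extracts link targets from the same snippet twice: once with a regular expression, once with the standard library's html.parser. The HTML snippet and the class name LinkCollector are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up snippet used only for this comparison
html = '<p><a href="/a">A</a> <a class="x" href="/b">B</a></p>'

# Regular expression: matches a text pattern, with no knowledge of tags
regex_links = re.findall(r'href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)

print(regex_links)
print(collector.links)
```

Both give the same result here, but the regular expression would also match href attributes on other tags or inside comments, while the parser-based version only looks at real a tags.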

BeautifulSoup
Installation and test
1. Install: run pip install beautifulsoup4 on the command line.
2. Test

import bs4
print(bs4)
Usage

Basic usage
1. Create a BeautifulSoup object

import bs4
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string

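html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""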
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
2. Access nodes

print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)

print(soup.p)
print(soup.p['class'])
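
Since the parser's job is to extract data and new URLs, a natural next step is to collect every link on the page. The sketch below is one way to do it with find_all() and find(). The example.com addresses are placeholders, and the variable names links and second are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Small sample page; the example.com URLs are placeholders
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all('a') returns every <a> tag; collect (link text, target URL) pairs
links = [(a.get_text(), a['href']) for a in soup.find_all('a')]

# find() returns the first match; here the tag is selected by its id attribute
second = soup.find(id='link2')

for text, href in links:
    print(text, href)
print(second.get_text())
```

The URLs collected this way are what a crawler would feed back into the downloader for the next round.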
