Writing a Web Crawler in Python from Scratch to Scrape Web Page Data and Download Images - Developer Community - Alibaba Cloud

Automating Web Crawlers with Python: From Basics to Practice

2024-11-08

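Summary: This article shows how to write web crawlers in Python, starting from basic requests and parsing and moving on to automated crawling and handling of more complex data. Worked examples cover common crawler tasks: fetching page content, parsing data, and downloading image files.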

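1. Basics: Fetching Web Pages with `requests`

The requests library is one of the most commonly used foundations for web crawlers: it sends an HTTP request to a page and returns the response content.

Example: fetching page content

The following code requests a site's HTML and prints the page title.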

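import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Return the page HTML, or None if the request fails."""
    try:
        # A timeout keeps the crawler from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    if response.status_code == 200:
        return response.text
    print("Request failed, status code:", response.status_code)
    return None

def get_page_title(url):
    html = fetch_page(url)
    if html:
        soup = BeautifulSoup(html, "html.parser")
        # soup.title is None when the page has no <title> tag
        title = soup.title.string if soup.title else None
        print("Page title:", title)

# Usage example
get_page_title("https://example.com")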
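
Because `fetch_page` returns `None` on failure, transient errors can be handled by a small retry wrapper. The sketch below is one possible approach and is not part of the original article: `fetch_with_retries`, its `retries` and `base_delay` parameters, and the `sleep` hook are hypothetical names, and the wrapper works with any fetch function that returns `None` on failure.

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, doubling the wait after each failure.

    `fetch` is any callable that returns the page text, or None on failure
    (for example the fetch_page function above). Hypothetical helper.
    """
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        # Wait before the next attempt, but not after the last one
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

A call such as `fetch_with_retries(fetch_page, "https://example.com")` would then try the page up to three times before giving up.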

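2. Parsing Page Content with `BeautifulSoup`

BeautifulSoup is an HTML and XML parsing library that makes it straightforward to extract specific elements from a page.

Example: fetching news titles and links

Suppose we want to collect the title and link of every news item on a news site's home page. The code below does this with BeautifulSoup: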

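import requests
from bs4 import BeautifulSoup

def fetch_news_titles(url):
    """Return a list of {"title", "link"} dicts; empty if the page cannot be fetched."""
    news_list = []
    html = fetch_page(url)  # fetch_page is defined in section 1
    if html:
        soup = BeautifulSoup(html, "html.parser")

        # Assumes each news title is in an <h2> tag that contains the <a> link
        for news in soup.find_all("h2"):
            anchor = news.find("a")
            # Skip headings that have no link or no href attribute
            if anchor is None or not anchor.get("href"):
                continue
            title = news.get_text(strip=True)
            news_list.append({"title": title, "link": anchor["href"]})

    return news_list

# Usage example
news = fetch_news_titles("https://example-news-website.com")
for item in news:
    print(item)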

3. Data Cleaning and Storage

With pandas we can process the scraped data and save it as an Excel or CSV file for later analysis.

Example: saving data to Excel

import pandas as pd

def save_to_excel(data, filename="news_data.xlsx"):
    # Writing .xlsx files requires an Excel engine such as openpyxl to be installed
    df = pd.DataFrame(data)
    df.to_excel(filename, index=False)
    print(f"Data saved to {filename}")

# Usage example
news_data = fetch_news_titles("https://example-news-website.com")
save_to_excel(news_data)
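
The section title mentions cleaning, so a minimal cleaning step may be useful before saving. The sketch below is an illustration under stated assumptions, not the article's method: `clean_news` and `to_csv_text` are hypothetical helpers, the items are assumed to be dicts with `title` and `link` keys, and the standard-library `csv` module is used in place of pandas so the example runs without extra dependencies.

```python
import csv
import io

def clean_news(items):
    """Trim whitespace, drop entries missing a title or link, and remove duplicate links."""
    seen = set()
    cleaned = []
    for item in items:
        title = (item.get("title") or "").strip()
        link = (item.get("link") or "").strip()
        if not title or not link or link in seen:
            continue
        seen.add(link)
        cleaned.append({"title": title, "link": link})
    return cleaned

def to_csv_text(items):
    """Serialize cleaned items to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "link"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

The cleaned list has the same shape as the original data, so it could equally be passed to `save_to_excel`.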

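4. Scraping Content with Images

Many pages contain images. Images can be downloaded by combining requests with ordinary file operations and saved locally.

Example: scraping and saving images

Suppose we want to crawl a page that contains images. The following code downloads each image to a local folder.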

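import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def fetch_images(url, folder="images"):
    os.makedirs(folder, exist_ok=True)
    html = fetch_page(url)  # fetch_page is defined in section 1
    if html:
        soup = BeautifulSoup(html, "html.parser")

        for i, img in enumerate(soup.find_all("img")):
            src = img.get("src")
            if not src:
                continue  # skip <img> tags without a src attribute
            # Resolve relative paths such as "/static/a.jpg" against the page URL
            img_url = urljoin(url, src)
            try:
                response = requests.get(img_url, timeout=10)
            except requests.RequestException:
                continue  # skip images that cannot be requested
            if response.status_code != 200:
                continue  # skip images that fail to download
            # Every file is saved with a .jpg name, whatever its real format
            path = os.path.join(folder, f"image_{i}.jpg")
            with open(path, "wb") as f:
                f.write(response.content)
            print(f"Saved image: {path}")

# Usage example
fetch_images("https://example-website-with-images.com")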
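
In practice, `src` attributes are often relative paths, and images are not always JPEGs. The helpers below sketch one way to turn a `src` value into an absolute URL and to choose a local file name that keeps the extension found in the URL. `resolve_image_url` and `image_filename` are hypothetical names introduced for illustration; only the standard library is used.

```python
import os
from urllib.parse import urljoin, urlparse

def resolve_image_url(page_url, src):
    """Turn a possibly relative src attribute into an absolute URL."""
    return urljoin(page_url, src)

def image_filename(img_url, index, default_ext=".jpg"):
    """Build a local file name that keeps the extension found in the URL path."""
    path = urlparse(img_url).path  # the query string is not part of the path
    ext = os.path.splitext(path)[1].lower()
    return f"image_{index}{ext or default_ext}"
```

Falling back to `.jpg` when the URL has no extension is an arbitrary choice here; inspecting the response's `Content-Type` header would be a more reliable alternative.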

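5. Automating Multi-Page Crawling

Many sites paginate their data, so the crawler has to fetch several pages automatically. A loop that builds the URL for each page is enough to crawl every page in turn.

Example: fetching multiple pages automatically

The following code crawls the titles and links from the first `pages` pages (5 by default) of a paginated news site.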

def fetch_paginated_news(base_url, pages=5):
    all_news = []
    for page in range(1, pages + 1):
        url = f"{base_url}?page={page}"
        news = fetch_news_titles(url)
        # Guard against a failed page returning None
        all_news.extend(news or [])
        print(f"Crawled page {page}")

    return all_news

# Usage example
all_news_data = fetch_paginated_news("https://example-news-website.com")
save_to_excel(all_news_data, "all_news_data.xlsx")
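
Building the URL with `f"{base_url}?page={page}"` works only when `base_url` has no query string of its own. The sketch below shows one way to set the page parameter while keeping any existing parameters; `build_page_url` and its `param` argument are hypothetical names, and the assumption that the site paginates through a query parameter comes from the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def build_page_url(base_url, page, param="page"):
    """Return base_url with the page parameter set, keeping other query parameters."""
    parts = urlparse(base_url)
    # Drop any existing page parameter, then append the requested one
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, str(page)))
    return urlunparse(parts._replace(query=urlencode(query)))
```

For example, `build_page_url("https://example-news-website.com/list?cat=tech&page=1", 3)` keeps `cat=tech` and replaces the page number.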

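6. Simulating Browser Requests

Some sites restrict bare requests, so it may be necessary to imitate a browser by adding headers to the request.

Example: adding headers to a request

The following code adds headers so the request resembles one from a real browser: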

import requests

def fetch_page_with_headers(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    return response.text if response.status_code == 200 else None

# Usage example
html_content = fetch_page_with_headers("https://example-website.com")
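
Browsers send more than a `User-Agent`, so it can help to build the header dictionary in one place. The helper below is a minimal sketch: `build_headers` is a hypothetical name, and the choice of `Accept-Language` and `Referer` as extra headers is an assumption about what a given site might check.

```python
def build_headers(user_agent, referer=None, accept_language="en-US,en;q=0.9"):
    """Return a header dict suitable for the headers= argument of requests.get."""
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": accept_language,
    }
    # Only send a Referer when the caller supplies one
    if referer:
        headers["Referer"] = referer
    return headers
```

The result can be passed straight to `requests.get(url, headers=build_headers(...), timeout=10)`.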

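Summary

Libraries such as requests, BeautifulSoup, and pandas make it practical to automate web crawling in Python, covering page content extraction, image download, and data cleaning and storage. These techniques apply to scenarios such as automated data collection and public opinion monitoring.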
