Python Scrapy Cross-Platform Crawling in Practice: XPath Parsing and Structured Data Extraction

2026-06-29 20


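A crawler is, at its most basic, a four-stage pipeline: request, download, parse, store. Requesting and downloading look much the same in every language; the parsing layer is where productivity really diverges. BeautifulSoup gets unwieldy with deeply nested markup and conditional filtering, and regular expressions are hard to read and expensive to maintain. XPath is a W3C-standard query language designed for tree structures, and paired with Scrapy's asynchronous engine it is a strong fit for large crawls that span many sites.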
1. Scrapy Project Initialization
pip install scrapy
scrapy startproject multispider && cd multispider
scrapy genspider technews example.com
Declare the structured fields in items.py:
import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    author = scrapy.Field()
    publish_date = scrapy.Field()
    content = scrapy.Field()
    tags = scrapy.Field()
    source = scrapy.Field()
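Scrapy also accepts plain dataclass objects as items (supported through itemadapter since Scrapy 2.2), which gives the fields type hints and default values. The following is a minimal sketch of the same fields as a dataclass; the class name NewsItemData and the sample values are hypothetical and not part of the original project:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class NewsItemData:
    """Hypothetical dataclass twin of NewsItem, for illustration only."""
    title: str = ""
    url: str = ""
    author: Optional[str] = None
    publish_date: Optional[str] = None
    content: str = ""
    # default_factory gives every item its own list instead of a shared one
    tags: List[str] = field(default_factory=list)
    source: str = ""


item = NewsItemData(title="Example headline", url="https://example.com/news/1")
item.tags.append("python")
record = asdict(item)  # plain dict, ready for JSON export
```

Whether to use scrapy.Item or a dataclass is mostly a matter of taste; the rest of this article keeps scrapy.Item.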
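2. XPath Quick Reference
| Scenario | Expression | Notes |
| --- | --- | --- |
| Global search | //div[@class="list"] | Matches at any depth |
| Relative lookup | .//h2/a/@href | Rooted at the current node; the most important form in practice |
| Fuzzy match | contains(@class, "active") | Needed when an element carries several classes |
| Position limit | //li[position()<=3] | Takes the first N items |
| Axis traversal | //h2/following-sibling::p | Selects sibling nodes |
| Exclusion | //p[not(contains(@class,"ad"))] | Filters out ads natively in XPath |
Core rule: when looping over list items, every child XPath must start with . (that is, .//). An expression that starts with // is evaluated from the document root even when it is called on a sub-selector, so each iteration searches the whole page and fields end up attached to the wrong item.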
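The effect of the leading dot can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (an assumption made to keep the example dependency-free); searching from the root inside the loop plays the role of calling article.xpath('//...'). The page markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed list page used only for this illustration.
PAGE = """
<div class="article-list">
  <article><h2><a href="/news/1">First</a></h2></article>
  <article><h2><a href="/news/2">Second</a></h2></article>
  <article><h2><a href="/news/3">Third</a></h2></article>
</div>
""".strip()

root = ET.fromstring(PAGE)
articles = root.findall("./article")

# Wrong: searching from the document root inside the loop returns
# the first matching link on every iteration.
global_hits = [root.find(".//h2/a").get("href") for _ in articles]

# Right: a path evaluated on the current item stays inside that item.
relative_hits = [article.find(".//h2/a").get("href") for article in articles]

print(global_hits)
print(relative_hits)
```

The first list repeats the same link three times, while the second yields one link per article, which is the behavior the loop was meant to have.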
3. The Core Spider: Two-Level Parsing from List Page to Detail Page
Edit spiders/technews.py:
import scrapy
from multispider.items import NewsItem


class TechNewsSpider(scrapy.Spider):
    name = 'technews'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/news']

    def parse(self, response):
        # List page: locate every article entry
        for article in response.xpath('//div[@class="article-list"]/article'):
            detail_url = article.xpath('.//h2/a/@href').get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={'list_title': article.xpath('.//h2/a/text()').get(default='').strip()}
                )

        # Pagination
        next_page = response.xpath('//a[contains(@class,"next")]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = NewsItem()
        # Prefer the detail-page title; fall back to the one seen on the list page
        item['title'] = (
            response.xpath('//h1[@class="article-title"]/text()').get(default='').strip()
            or response.meta.get('list_title', '')
        )
        item['url']          = response.url
        # '匿名' means "anonymous"
        item['author']       = response.xpath('//span[@class="author-name"]/text()').get(default='匿名').strip()
        item['publish_date'] = response.xpath('//time[@class="publish-date"]/@datetime').get()
        item['tags']         = response.xpath('//div[@class="tags"]//a/text()').getall()
        item['source']       = 'technews'

        # Body text: skip ad and recommendation nodes
        paragraphs = response.xpath(
            '//div[@class="article-body"]'
            '//p[not(contains(@class,"ad")) and not(contains(@class,"recommend"))]'
            '/text()'
        ).getall()
        item['content'] = '\n'.join(p.strip() for p in paragraphs if p.strip())

        yield item

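Four key techniques:

.// relative paths: inside the loop body, expressions must start with . so they cannot match nodes in other entries
get(default=''): returns a fallback instead of None, so a chained .strip() cannot raise AttributeError
response.follow(): resolves relative URLs against the current page, so there is no need to join the domain by hand
meta pass-through: carries list-page data to the detail callback, where it serves as a fallback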
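To make the relative-URL point concrete: response.follow() resolves an href against the URL of the page it was found on, following the same rules as urllib.parse.urljoin. A small sketch with a hypothetical page URL and hrefs:

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, chosen to cover the common cases.
page_url = "https://example.com/news/list?page=1"
hrefs = ["/news/42", "detail/42.html", "?page=2", "https://cdn.example.com/a.html"]

resolved = {href: urljoin(page_url, href) for href in hrefs}

for href, absolute in resolved.items():
    print(f"{href!r:35} -> {absolute}")
```

Root-relative, directory-relative, and query-only links all come out as absolute URLs, and a link that is already absolute passes through unchanged.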
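4. Multi-Site Adaptation: Decoupling Rule Configuration from Spider Logic
Different sites have different HTML structures, but the data model and cleaning logic can be reused as they are. The core idea is to pull the XPath rules out into a configuration dictionary: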
import scrapy
from multispider.items import NewsItem

SITE_RULES = {
    'siteA': {
        'start_urls':    ['https://site-a.com/news'],
        'list_item':     '//div[@class="news-item"]',
        'detail_link':   './/a[@class="title"]/@href',
        'title':         '//h1[@class="post-title"]/text()',
        'author':        '//span[@itemprop="author"]/text()',
        'publish_date':  '//meta[@property="article:published_time"]/@content',
        'content':       '//div[@class="post-content"]//p/text()',
        'tags':          '//div[@class="tag-list"]//a/text()',
        'next_page':     '//a[@rel="next"]/@href',
    },
    'siteB': {
        # ... rules for another site
    },
}


class MultiSiteSpider(scrapy.Spider):
    name = 'multisite'

    def start_requests(self):
        for site_name, rules in SITE_RULES.items():
            for url in rules['start_urls']:
                yield scrapy.Request(url, callback=self.parse_list,
                                     meta={'site_name': site_name, 'rules': rules})

    def parse_list(self, response):
        rules = response.meta['rules']
        for article in response.xpath(rules['list_item']):
            link = article.xpath(rules['detail_link']).get()
            if link:
                yield response.follow(link, callback=self.parse_detail, meta=response.meta)
        # Pagination
        next_page = response.xpath(rules['next_page']).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_list, meta=response.meta)

    def parse_detail(self, response):
        rules = response.meta['rules']
        item = NewsItem()
        item['url']    = response.url
        item['source'] = response.meta['site_name']
        item['title']  = response.xpath(rules['title']).get(default='').strip()
        # '匿名' means "anonymous"
        item['author'] = response.xpath(rules['author']).get(default='匿名').strip()
        # The rule set defines publish_date, so extract it as well
        item['publish_date'] = response.xpath(rules['publish_date']).get()
        item['content'] = '\n'.join(p.strip() for p in response.xpath(rules['content']).getall() if p.strip())
        item['tags']   = response.xpath(rules['tags']).getall()
        yield item

Adding a site then takes only one more rule block, with no change to the spider code. This is the engineering payoff of a rule-driven design.
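Because the spider indexes each rule set by key, an incomplete entry only fails once the crawl reaches it. A small pre-flight check can report such entries up front. This is a sketch under assumptions: the helper name find_incomplete_sites and the REQUIRED_KEYS set are hypothetical, with the keys taken from the siteA example, and the sample rules are shortened placeholders:

```python
REQUIRED_KEYS = {
    "start_urls", "list_item", "detail_link", "title", "author",
    "publish_date", "content", "tags", "next_page",
}


def find_incomplete_sites(site_rules):
    """Return {site_name: sorted list of missing keys} for incomplete rule sets."""
    problems = {}
    for site_name, rules in site_rules.items():
        missing = sorted(REQUIRED_KEYS - set(rules))
        if missing:
            problems[site_name] = missing
    return problems


# Hypothetical rule sets: one complete, one still a stub.
sample_rules = {
    "siteA": {key: "placeholder" for key in REQUIRED_KEYS},
    "siteB": {"start_urls": ["https://site-b.example/news"]},
}

problems = find_incomplete_sites(sample_rules)
for site_name, missing in problems.items():
    print(f"{site_name}: missing {', '.join(missing)}")
```

One place to call such a check would be at the top of start_requests, skipping or rejecting sites that appear in the result.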
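5. Adding Proxy IPs: Dealing with Anti-Crawling Blocks
A large crawl across many sites will sooner or later run into per-IP rate limits on the target sites. Taking the Yiniu Cloud (亿牛云) crawler proxy as an example, wiring a proxy into Scrapy takes only one downloader middleware.
Edit middlewares.py (scrapy startproject already generates this file):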
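import base64
import random


def base64ify(bytes_or_str):
    """Base64-encode credentials for the Proxy-Authorization header."""
    input_bytes = bytes_or_str.encode('utf8') if isinstance(bytes_or_str, str) else bytes_or_str
    # Basic auth uses the standard Base64 alphabet, not the URL-safe one
    return base64.b64encode(input_bytes).decode('ascii')


class ProxyMiddleware:
    def process_request(self, request, spider):
        # Yiniu Cloud crawler proxy parameters (official site: www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        proxyUser = "username"    # replace with your username
        proxyPass = "password"    # replace with your password

        # Set the proxy address. The credentials go in the URL so that Scrapy's
        # built-in HttpProxyMiddleware (2.6.2+) can derive the auth header itself.
        request.meta['proxy'] = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"

        # Add the auth header explicitly (only needed before Scrapy 2.6.2;
        # newer versions set it from the credentials in the proxy URL)
        request.headers['Proxy-Authorization'] = 'Basic ' + base64ify(f"{proxyUser}:{proxyPass}")

        # Proxy-Tunnel: requests carrying the same number share the same exit IP
        # (useful when a login session must be kept). A fresh random number per
        # request, as here, rotates the IP.
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)

        # To force a new IP on every request, disable connection reuse
        request.headers['Connection'] = "Close"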
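One detail of the auth header is easy to get wrong: HTTP Basic credentials are defined over the standard Base64 alphabet, so a helper that builds Proxy-Authorization should use base64.b64encode rather than the URL-safe variant. The two only differ when the output contains + or /, which makes the mistake hard to spot. A minimal sketch, with a hypothetical helper name and hypothetical credentials picked so that the difference shows:

```python
import base64


def basic_proxy_auth(user, password):
    """Build a Basic credential string with the standard Base64 alphabet."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical credentials whose encoding contains '/', a character that
# the URL-safe alphabet would rewrite to '_'.
header = basic_proxy_auth("abc", "???")
urlsafe_token = base64.urlsafe_b64encode(b"abc:???").decode("ascii")

print(header)
print("Basic " + urlsafe_token)
```

A proxy that decodes the header with the standard alphabet would reject the second form, typically with a 407.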

Enable the middleware and configure the retry policy in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'multispider.middlewares.ProxyMiddleware': 100,
}

# Retry automatically when proxy authentication fails (407)
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 407, 408, 429]

# Concurrency and throttling
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
DOWNLOAD_TIMEOUT = 15
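Proxy IP usage notes:
| Scenario | Configuration | Notes |
| --- | --- | --- |
| New IP on every request | Connection: Close plus a random Proxy-Tunnel | Most common; suits bulk crawling |
| Keep the same IP | Fixed Proxy-Tunnel value | Suits flows that depend on a login or cached cookies |
| HTTPS sites | Use the library's native proxy authentication | Avoids a hand-set Proxy-Authorization header being forwarded to the target site |
| 407 errors | Check host, port, username, and password | Authentication details are wrong |
| 429 errors | Lower concurrency or increase the delay | Request rate exceeds the plan's limit |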
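For the "keep the same IP" case, the tunnel value has to stay stable for a logical session, across requests and across restarts. One way to get that is to derive it from a session key with a hash. This is only a sketch: the helper name tunnel_for, the 1..10000 range (borrowed from the random example in the middleware), and the account names are assumptions, not part of the provider's API:

```python
import hashlib


def tunnel_for(session_key, slots=10000):
    """Map a session key to a stable tunnel number in 1..slots."""
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then fold it into the range
    return int.from_bytes(digest[:8], "big") % slots + 1


# Hypothetical session keys, e.g. one per logged-in account.
first = tunnel_for("account-alice")
again = tunnel_for("account-alice")
other = tunnel_for("account-bob")

print(first, again, other)
```

hashlib is used instead of the built-in hash() because string hashes are randomized per process, which would change the tunnel on every run. In the middleware, the result would be written to the Proxy-Tunnel header in place of the random number.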
6. Data Cleaning Pipeline
Edit pipelines.py to keep the cleaning logic separate from the spider logic:
import re
import json
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DataCleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Strip control characters (keeping \t and \n so paragraph breaks survive),
        # turn non-breaking spaces into plain spaces, then trim
        for field in ['title', 'author', 'content']:
            val = adapter.get(field, '')
            if val:
                val = re.sub(r'[\x00-\x08\x0b-\x1f\x7f-\x9f]', '', val)
                val = val.replace('\u00a0', ' ').strip()
                adapter[field] = val if val else None

        # De-duplicate tags case-insensitively, keeping first-seen order
        tags = adapter.get('tags') or []
        seen, cleaned = set(), []
        for tag in (t.strip() for t in tags if t.strip()):
            key = tag.lower()
            if key not in seen:
                seen.add(key)
                cleaned.append(tag)
        adapter['tags'] = cleaned[:10]  # keep at most 10 tags

        # Required-field check
        if not adapter.get('title'):
            raise DropItem("Missing title")
        return item


class JsonExportPipeline:
    def open_spider(self, spider):
        self.file = open('output.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines)
        self.file.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

# settings.py
ITEM_PIPELINES = {
    'multispider.pipelines.DataCleaningPipeline': 100,
    'multispider.pipelines.JsonExportPipeline': 200,
}
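The publish_date field arrives as whatever string the page exposes, typically an ISO 8601 timestamp from a datetime attribute or a meta tag. If a uniform date is wanted, a normalization step could be added to the cleaning pipeline. The function below is a sketch under that assumption; the name normalize_publish_date is hypothetical, and formats other than ISO 8601 simply yield None:

```python
from datetime import datetime


def normalize_publish_date(raw):
    """Return YYYY-MM-DD for an ISO 8601 string, or None if it cannot be parsed."""
    if not raw:
        return None
    text = raw.strip()
    # datetime.fromisoformat accepts a trailing 'Z' only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text).date().isoformat()
    except ValueError:
        return None


for sample in ["2024-05-01T08:30:00Z", "2024-05-01T08:30:00+08:00", "2024-05-01", "yesterday", None]:
    print(repr(sample), "->", normalize_publish_date(sample))
```

Returning None instead of raising keeps one badly formatted date from dropping an otherwise usable item.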
7. Debugging and Running

# Validate XPath in the Scrapy shell (always do this before writing code)
scrapy shell 'https://example.com/news'
>>> response.xpath('//h1[@class="article-title"]/text()').get()
'Python 3.12 新特性解析'

# Run the spider
scrapy crawl technews -o results.json
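After a run, it helps to check how often each field actually came back filled, since a silently failing XPath shows up as a column of empty values rather than as an error. The sketch below counts empty fields in JSON Lines data such as the output.jsonl written by JsonExportPipeline; the function name and the inline sample records are hypothetical:

```python
import io
import json
from collections import Counter


def missing_field_counts(lines, fields):
    """Count, per field, how many JSON Lines records leave it empty."""
    missing = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        for name in fields:
            if not record.get(name):  # None, '', [] and absent keys all count
                missing[name] += 1
    return total, dict(missing)


# Hypothetical sample standing in for output.jsonl.
sample = io.StringIO(
    '{"title": "A", "author": "x", "content": "text", "tags": ["a"]}\n'
    '{"title": "B", "author": null, "content": "", "tags": []}\n'
)
total, missing = missing_field_counts(sample, ["title", "author", "content", "tags"])
print(total, missing)
```

Against a real file, the same function can be fed an open file handle, for example open('output.jsonl', encoding='utf-8').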
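8. XPath Pitfalls
| Pitfall | Wrong | Right |
| --- | --- | --- |
| Global search grabs other items | article.xpath('//h2/text()') | article.xpath('.//h2/text()') |
| Multi-class mismatch | @class="item active" | contains(@class, "active") |
| Missing values and whitespace unhandled | .get() used directly | .get(default='').strip() |
| Garbled encoding | Default encoding | FEED_EXPORT_ENCODING='utf-8' |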
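One caveat on the multi-class row: contains() is a substring test, so contains(@class, "ad") also matches class values such as "lead" or "article-header". When that matters, the usual XPath idiom is to pad the attribute with spaces and match a whole token. The sketch below mimics both behaviors in plain Python so they can be compared without an XPath engine; the function names and sample class values are hypothetical:

```python
def xpath_contains(class_attr, needle):
    """Mimic XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def has_class_token(class_attr, token):
    """Mimic the token-safe idiom: match one whole whitespace-separated class."""
    return token in class_attr.split()


# XPath form of the token-safe test, for use in a real selector.
TOKEN_SAFE_XPATH = 'contains(concat(" ", normalize-space(@class), " "), " ad ")'

samples = ["ad", "ad banner", "lead", "article-header", "content"]
substring_matches = [c for c in samples if xpath_contains(c, "ad")]
token_matches = [c for c in samples if has_class_token(c, "ad")]

print(substring_matches)
print(token_matches)
```

The substring version flags four of the five class values, while the token version flags only the two that really carry the ad class.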
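9. Summary
The engineering value of Scrapy + XPath shows at three levels:

Parsing layer: XPath queries the document as a tree, so deep nesting, multi-condition filters, and axis traversal are built in rather than hand-coded, as they tend to be with BeautifulSoup
Architecture layer: the asynchronous engine, middlewares, and pipelines scale naturally to large multi-site crawls. With rules decoupled from spider logic, the marginal cost of a new site is close to zero
Anti-blocking layer: a proxy middleware (for example with the Yiniu Cloud crawler proxy) plugs an IP pool in transparently, the Proxy-Tunnel header controls exactly when the exit IP changes, and retrying on 407 keeps the crawl stable
In a real project: validate XPath expressions in scrapy shell before writing code, keep all cleaning logic in pipelines, and choose rotating or fixed IP mode in the proxy middleware according to the use case. Getting these three right makes a crawler noticeably easier to maintain and more stable.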
