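1. Why are you still blocked after adding a UA and proxies?
Many developers run into this problem: the crawler spoofs its User-Agent and rotates through a proxy IP pool, yet it still gets blocked almost as soon as it starts.
The core reason: roughly 90% of blocked crawlers are blocked not because they have too few IPs or an unconvincing UA, but because they neglect request header completeness and device fingerprint consistency. Modern anti-bot systems moved long ago from "single-feature detection" to "multi-dimensional behavior verification", so one correct field does not help when the fields around it are missing or contradict each other.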
2. Request header fingerprints in detail
A real browser request carries dozens of header fields, whereas most crawlers set only the UA:
python
# Incomplete headers: easy to identify
headers = {"User-Agent": "Mozilla/5.0..."}

# More complete headers: closer to a real browser
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,ja;q=0.8,en;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Cache-Control": "max-age=0",
}
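One way to catch an incomplete header set before sending anything is to compare it against a list of headers that a browser navigation request usually carries. The sketch below is only an illustration: the `EXPECTED_NAVIGATION_HEADERS` list and the `missing_browser_headers` helper are hypothetical names introduced here, and the list is an assumption rather than an official or exhaustive set.

```python
# Headers that a desktop Chrome navigation request typically carries.
# This list is an illustrative assumption, not an exhaustive or official set.
EXPECTED_NAVIGATION_HEADERS = (
    "User-Agent",
    "Accept",
    "Accept-Language",
    "Accept-Encoding",
    "Upgrade-Insecure-Requests",
    "Sec-Fetch-Dest",
    "Sec-Fetch-Mode",
    "Sec-Fetch-Site",
)


def missing_browser_headers(headers: dict) -> list:
    """Return the expected header names that are absent from `headers`.

    HTTP header names are case-insensitive, so the comparison uses lowercase.
    """
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_NAVIGATION_HEADERS if name.lower() not in present]


if __name__ == "__main__":
    ua_only = {"User-Agent": "Mozilla/5.0 ..."}
    # Lists every expected header except User-Agent.
    print(missing_browser_headers(ua_only))
```

Running a check like this in a unit test makes it harder for a later refactor to silently drop a header.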
3. Implementing device fingerprint consistency
python
import hashlib
import random
import time
from typing import Optional


class DeviceFingerprint:
    """Device fingerprint generator: values are chosen once and reused for every request."""

    def __init__(self, device_id: Optional[str] = None):
        self.device_id = device_id or self._generate_device_id()
        # Pick the Chrome major version once so User-Agent and sec-ch-ua always agree.
        self.chrome_major = random.randint(110, 122)
        self.user_agent = self._generate_ua()
        self.fingerprint = self._build_fingerprint()

    def _generate_device_id(self) -> str:
        """Generate a unique device ID."""
        base = f"{time.time()}{random.randint(1000, 9999)}"
        return hashlib.md5(base.encode()).hexdigest()[:16]

    def _build_fingerprint(self) -> dict:
        """Build the full device fingerprint."""
        # Choose width and height as a pair so impossible sizes such as 1920x768 cannot occur.
        width, height = random.choice([(1920, 1080), (1366, 768), (1440, 900)])
        return {
            "device_id": self.device_id,
            "screen_resolution": f"{width}x{height}",
            "timezone": "Asia/Tokyo",
            "language": "ja-JP",
            "platform": "Win32",
            "webgl_vendor": random.choice(["Google Inc.", "Intel"]),
            "webgl_renderer": random.choice(["ANGLE", "Mesa"]),
            "fonts": self._get_font_list(),
        }

    def _get_font_list(self) -> str:
        """Simulate the system font list."""
        fonts = ["MS Gothic", "MS PGothic", "Meiryo", "Yu Gothic", "Arial", "Helvetica"]
        return ",".join(random.sample(fonts, 4))

    def get_headers(self) -> dict:
        """Return the full request headers; identical on every call for this device."""
        return {
            "User-Agent": self.user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "ja-JP,ja;q=0.9,en-US;q=0.8,en;q=0.7",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Cache-Control": "max-age=0",
            "sec-ch-ua": f'"Not_A Brand";v="99", "Chromium";v="{self.chrome_major}"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
        }

    def _generate_ua(self) -> str:
        """Generate a realistic Chrome UA (called once from __init__)."""
        chrome_version = (
            f"{self.chrome_major}.0.{random.randint(5000, 6500)}.{random.randint(100, 300)}"
        )
        return (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            f"(KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
        )
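Because the `User-Agent` and `sec-ch-ua` headers both encode the browser version, and `sec-ch-ua-platform` repeats the operating system, a disagreement between them is exactly the kind of inconsistency a server could look for. The following is a minimal self-check sketch, assuming headers shaped like the ones above; `find_inconsistencies` is a hypothetical helper name, and the two rules it applies are illustrative, not a description of any particular site's detection logic.

```python
import re


def find_inconsistencies(headers: dict) -> list:
    """Return human-readable descriptions of cross-header mismatches."""
    problems = []
    ua = headers.get("User-Agent", "")
    ch_ua = headers.get("sec-ch-ua", "")
    ch_platform = headers.get("sec-ch-ua-platform", "").strip('"')

    # Major version claimed in the User-Agent, e.g. "Chrome/120.0.6099.110" -> "120".
    ua_major = re.search(r"Chrome/(\d+)\.", ua)
    # Major version claimed in the client hint, e.g. '"Chromium";v="120"' -> "120".
    ch_major = re.search(r'"Chromium";v="(\d+)"', ch_ua)
    if ua_major and ch_major and ua_major.group(1) != ch_major.group(1):
        problems.append(
            f"Chrome version mismatch: UA says {ua_major.group(1)}, "
            f"sec-ch-ua says {ch_major.group(1)}"
        )

    # A Windows UA should not be paired with a non-Windows platform hint, and vice versa.
    if ch_platform and ("Windows" in ua) != (ch_platform == "Windows"):
        problems.append(f"Platform mismatch: UA vs sec-ch-ua-platform={ch_platform!r}")

    return problems
```

Calling `find_inconsistencies(DeviceFingerprint().get_headers())` before a crawl starts gives a cheap sanity check that the generated headers tell one coherent story.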
4. The complete request flow
python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class SmartCrawler:
    """Smart crawler that carries a device fingerprint."""

    def __init__(self):
        # DeviceFingerprint is the class defined in section 3.
        self.fingerprint = DeviceFingerprint()
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        session = requests.Session()
        # Retry failed requests up to 3 times with exponential backoff.
        retry = Retry(total=3, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def get(self, url: str) -> requests.Response:
        """Send a request that carries the full fingerprint headers."""
        # Every request uses the same device fingerprint.
        headers = self.fingerprint.get_headers()
        response = self.session.get(url, headers=headers, timeout=10)
        return response
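When a proxy pool is involved, consistency also has to hold per exit IP and across runs: the same proxy should not appear to switch devices between requests. One possible approach, sketched below under stated assumptions, is to derive the profile deterministically from the proxy URL. The `profile_for` function, the `SCREEN_SIZES` list, and the example proxy URL are all hypothetical and exist only for illustration.

```python
import hashlib
import random

# Hypothetical profile choices; real values should come from observed browsers.
SCREEN_SIZES = ["1920x1080", "1366x768", "1440x900"]


def profile_for(proxy_url: str) -> dict:
    """Derive a repeatable browser profile from a proxy URL.

    The same proxy URL always yields the same profile, so a given exit IP
    never appears to change devices between requests or between runs.
    """
    device_id = hashlib.md5(proxy_url.encode()).hexdigest()[:16]
    # Seeding a private generator makes every choice below deterministic.
    rng = random.Random(device_id)
    return {
        "device_id": device_id,
        "chrome_major": rng.randint(110, 122),
        "screen_resolution": rng.choice(SCREEN_SIZES),
    }


if __name__ == "__main__":
    # Placeholder proxy address, not a real endpoint.
    print(profile_for("http://proxy-a.example:8080"))
```

The resulting `device_id` could be passed to `DeviceFingerprint(device_id=...)`, which already accepts an externally supplied ID.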
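5. Results in practice
After this fingerprint-spoofing scheme was applied to a Japanese e-commerce data collection project, the request rejection rate fell from 35% to under 3%. The key lesson: what matters in working around anti-bot checks is imitating human behavior completely and coherently, not disguising a single dimension.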