1. Install lxml
Command: pip install lxml
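A quick way to confirm the install worked is to import the package and print its version. This is only a sanity-check sketch; the version numbers you see depend on your environment.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact values depend on the installed release.
version_text = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", version_text)
```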
2. Syntax
from lxml import etree
tree = etree.parse(filepath)
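As a minimal sketch of the parsing step, the snippet below feeds etree.parse a file-like object instead of a path so it runs without any file on disk. The sample XML and the variable names are made up for illustration.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document standing in for a file on disk.
xml_bytes = b"<root><item>hello</item></root>"

# etree.parse accepts a file path or a file-like object and returns an ElementTree.
tree = etree.parse(BytesIO(xml_bytes))
root_tag = tree.getroot().tag

# etree.fromstring parses in-memory data and returns the root Element directly.
same_root = etree.fromstring(xml_bytes)
item_text = same_root.findtext("item")

print(root_tag, item_text)
```

For HTML that may not be well-formed XML, etree.HTML(text) is the more forgiving entry point.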
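/xxx/text()
: the text inside node xxx
/xxx//yyy/text()
: the text inside every yyy node among the descendants of xxx
/xxx/*/yyy/text()
: the text inside yyy nodes that sit one level below xxx, wrapped by any intermediate node
/xxx/*/yyy[n]/text()
: the text inside the n-th yyy node under any intermediate node of xxx (XPath positions start at 1, not 0)
/xxx/*/yyy[@attr_name="attr_value"]/text()
: the text inside yyy nodes under any intermediate node of xxx whose attribute attr_name equals attr_value
./
: a relative path, evaluated from the current node
.../xxx/@attr_name
: the value of the attribute attr_name on node xxx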
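The expressions above can be tried on a small in-memory document. The following is a sketch only: the shop/shelf/book/name tags and the lang attribute are invented for illustration and stand in for xxx, yyy, and attr_name.

```python
from lxml import etree

# Hypothetical document; tag and attribute names are made up for this example.
doc = etree.fromstring(
    "<shop>"
    "<shelf><book lang='en'>Dune</book><book lang='zh'>Santi</book></shelf>"
    "<name>Corner Books</name>"
    "</shop>"
)

# /xxx/text(): text of a direct child
shop_name = doc.xpath("/shop/name/text()")
# /xxx//yyy/text(): text of every matching descendant
all_books = doc.xpath("/shop//book/text()")
# /xxx/*/yyy/text(): matching nodes wrapped by any one intermediate node
wrapped_books = doc.xpath("/shop/*/book/text()")
# /xxx/*/yyy[n]/text(): positions are 1-based, so [1] is the first book
first_book = doc.xpath("/shop/*/book[1]/text()")
# /xxx/*/yyy[@attr_name="attr_value"]/text(): filter by attribute value
zh_books = doc.xpath('/shop/*/book[@lang="zh"]/text()')

# ./ : relative path starting from a node that was already selected
shelf = doc.xpath("/shop/shelf")[0]
# .../@attr_name : attribute values instead of text
langs = shelf.xpath("./book/@lang")

print(shop_name, all_books, first_book, zh_books, langs)
```

Note that xpath() always returns a list, even when only one node matches, which is why the demo code indexes results with [0].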
3. A handy Google tool trick
[Image: lazy shortcut.png]
4. Demo: scraping product information from the ZBJ (猪八戒) website
from lxml import etree
import requests

url = 'https://wuhan.zbj.com/search/service/?kw=saas'
content = requests.get(url)
content.encoding = 'utf-8'
html = etree.HTML(content.text)

# Each div matched here is one service listing on the results page
oDivs1 = html.xpath('//*[@id="__layout"]/div/div[3]/div/div[4]/div/div[2]/div[1]/div')
for div in oDivs1:
    # Relative paths (./) are evaluated from the current listing div
    price = div.xpath('./div[1]/div[3]/div[1]/span/text()')[0].strip('¥')
    title = div.xpath('./div/div[3]/div[2]/a/text()')[0]
    rate = div.xpath('./div/div[3]/div[4]/div[1]/span[1]/span/text()')[0]
    print(price, title, rate)
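Indexing an XPath result with [0] raises IndexError whenever a listing is missing a field or the page layout changes. One possible way to guard against that is a small helper that returns a default instead. The sketch below runs on a made-up HTML snippet rather than the live site; first_text, SAMPLE_HTML, and the class names are hypothetical and not part of the original demo.

```python
from lxml import etree

# Hypothetical markup, far simpler than the real page; the second card has no price.
SAMPLE_HTML = """
<div id="list">
  <div class="card"><span class="price">¥100</span><a>Logo design</a></div>
  <div class="card"><a>SaaS consulting</a></div>
</div>
"""

def first_text(node, path, default=""):
    """Return the first XPath text result under node, stripped, or default if none."""
    results = node.xpath(path)
    return results[0].strip() if results else default

html = etree.HTML(SAMPLE_HTML)
rows = []
for card in html.xpath('//div[@id="list"]/div'):
    # Fall back to "n/a" instead of crashing when the price span is absent.
    price = first_text(card, './span[@class="price"]/text()', default="n/a").strip("¥")
    title = first_text(card, "./a/text()")
    rows.append((price, title))

print(rows)
```

The same helper could wrap the price, title, and rate lookups in the demo, at the cost of silently skipping fields that fail to match.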