前言
selenium有一个爬虫特别喜欢的功能,就是driver.page_source功能,它可以打印整个html页面的内容,我们可以从整个页面的内容中提取出我们想要的内容,playwright同样支持打印整个html页面的内容。
获取完整页面html内容
playwright提供了page.content()
方法来获取页面内容,示例如下:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
page.goto("https://ceshiren.com/")
print(page.content())
运行脚本,结果如下图:
获取部分HTML内容
page.content() 是获取整个页面的HTML,但是有时候我们不需要获取完整的HTML内容,例如下面的页面,我们只取部分的内容:
playwright提供了locator().inner_html()
方法获取页面内容
- inner_html() 获取元素的整个html源码内容
- inner_text() 获取元素的文本内容
示例代码如下:
from playwright.sync_api import Playwright, sync_playwright, expect
def run(playwright: Playwright) -> None:
browser = playwright.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
page.goto("https://blog.csdn.net/Tester_muller?type=lately")
# 获取某个元素的HTML
blog = page.locator('.user-profile-head-info-r-c')
print(blog.inner_html())
print('-----------------------------------')
print(blog.inner_text())
# ---------------------
context.close()
browser.close()
with sync_playwright() as playwright:
run(playwright)
----------------------------------------------
打印的inner_text如下:
119,102
总访问量
545
原创
3,577
排名
53
粉丝
0
铁粉
学习成就
获取页面文本
text_content() 用来获取某个元素内所有文本内容,包含子元素内容,隐藏元素也能获取。
inner_text() 的返回值会被格式化 ,但是text_content()的返回值不会被格式化
最重要的区别 inner_text()返回的值, 依赖于页面的显示, text_content()依赖于代码的内容
示例代码如下:
from playwright.sync_api import Playwright, sync_playwright, expect
def run(playwright: Playwright) -> None:
browser = playwright.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
page.goto("https://blog.csdn.net/Tester_muller?type=lately")
# 获取某个元素的HTML
blog = page.locator('.user-profile-head-info-r-c')
# print(blog.inner_html())
print('-----------------------------------')
print(blog.inner_text())
print('-----------------------------------')
print(blog.text_content())
--------------------------------
输出结果如下:
119,126
总访问量
545
原创
3,577
排名
53
粉丝
0
铁粉
学习成就
-----------------------------------
博客:119,124 视频:2 119,126 总访问量 545 原创 3,577 排名 53 粉丝 0 铁粉 学习成就
all_inner_texts() 与 all_text_contents()
all_inner_texts() 和 all_text_contents() 也是用于获取页面上的文本,但是返回的是list列表,示例如下:
from playwright.sync_api import Playwright, sync_playwright, expect
def run(playwright: Playwright) -> None:
browser = playwright.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
page.goto("https://blog.csdn.net/Tester_muller?type=lately")
# 获取某个元素的HTML
blog = page.locator('.user-profile-head-info-r-c')
print(blog.all_inner_texts())
print('-----------------------------------')
print(blog.all_text_contents())
# ---------------------
context.close()
browser.close()
-----------------------------------------
输出结果如下:
['119,131\n总访问量\n545\n原创\n3,577\n排名\n53\n粉丝\n0\n铁粉\n学习成就']
-----------------------------------
['博客:119,129 视频:2 119,131 总访问量 545 原创 3,577 排名 53 粉丝 0 铁粉 学习成就']
总结
本文主要介绍了playwright打印页面内容的方法,playwright相比selenium的一大优点就是,playwright能够打印部分页面内容,还可以提取文本等信息,我们熟练使用playwright,能够解决一些使用selenium无法解决的问题。