PyPDF2:使用Python操作PDF文件

简介: PDF是文档常用格式,使用Python包PyPDF2可以对PDF文档实现批量、迅速的操作,包括提取文字、切分或合并PDF文件、创建annotation、加密和解密等。本文将介绍PyPDF2包的安装及简单使用方式。PyPDF的GitHub项目官网:py-pdf/PyPDF2: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

1. 使用pip安装PyPDF2


PyPDF2支持如下版本的Python解释器:

image.png

直接使用pip即可安装:pip install PyPDF2


2. 使用PyPDF2提取PDF文档内容的简单示例


以一篇论文文档为例,展示PyPDF2如何提取PDF文件中的内容。


论文《ImageNet Classification with Deep Convolutional Neural Networks》,一共9页,其首页布局为:

08dfbd7a21604f7198cb741b1e680f6b.png


Python脚本代码:


from PyPDF2 import PdfReader
#早期版本里叫PdfFileReader,已经过时,改名为PdfReader了,见:https://pypdf2.readthedocs.io/en/latest/_modules/PyPDF2/_reader.html?highlight=PdfFileReader#
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
#1.28.0版本之前用numPages,已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.numPages
print(number_of_pages)  #打印页数
page = reader.pages[0]
#1.28.0版本之前用getPage(pageNumber),已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.getPage
print(page)  #打印“PDF第一页”这个Page<PyPDF2._page.Page>对象
text = page.extract_text()
#1.28.0版本之前用extractText(),已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extractText
print(text)  #提取出第一页的文字


输出:


9
{'/Contents': IndirectObject(13, 0), '/Parent': IndirectObject(1, 0), '/Type': '/Page', '/Resources': IndirectObject(14, 0), '/MediaBox': [0, 0, 612, 792]}
ImageNet Classication with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of ve convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a nal 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efcient GPU implemen-
tation of the convolution operation. To reduce overtting in the fully-connected
layers we employed a recently-developed regularization method called fidropoutfl
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overtting. Until recently, datasets of labeled images were relatively
small Š on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of tho usands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specied even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don't have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
1


可以看到页数和PDF中的文字都能正确提取出来。


相关文章
|
6月前
|
数据可视化 Linux iOS开发
Python脚本转EXE文件实战指南:从原理到操作全解析
本教程详解如何将Python脚本打包为EXE文件,涵盖PyInstaller、auto-py-to-exe和cx_Freeze三种工具,包含实战案例与常见问题解决方案,助你轻松发布独立运行的Python程序。
1629 2
|
5月前
|
监控 机器人 编译器
如何将python代码打包成exe文件---PyInstaller打包之神
PyInstaller可将Python程序打包为独立可执行文件,无需用户安装Python环境。它自动分析代码依赖,整合解释器、库及资源,支持一键生成exe,方便分发。使用pip安装后,通过简单命令即可完成打包,适合各类项目部署。
1069 68
|
7月前
|
安全 JavaScript 开发者
Python 自动化办公神器|一键转换所有文档为 PDF
本文介绍一个自动化批量将 Word、Excel、PPT、TXT、HTML 及图片转换为 PDF 的 Python 脚本。支持多格式识别、错误处理与日志记录,适用于文档归档、报告整理等场景,大幅提升办公效率。仅限 Windows 平台,需安装 Office 及相关依赖。
398 0
|
6月前
|
机器学习/深度学习 文字识别 Java
Python实现PDF图片OCR识别:从原理到实战的全流程解析
本文详解2025年Python实现扫描PDF文本提取的四大OCR方案(Tesseract、EasyOCR、PaddleOCR、OCRmyPDF),涵盖环境配置、图像预处理、核心识别与性能优化,结合财务票据、古籍数字化等实战场景,助力高效构建自动化文档处理系统。
1708 0
|
7月前
|
程序员 数据安全/隐私保护 Python
1行Python代码,实现PDF的加密、解密
程序员晚枫分享使用python-office库实现PDF批量加密与解密的新方法。只需一行代码,即可完成单个或多个PDF文件的加密、解密操作,支持文件路径与正则筛选,适合自动化办公需求。更新至最新版,适配性更佳,操作更简单。
281 8
1行Python代码,实现PDF的加密、解密
|
8月前
|
C#
【PDF提取内容改名】批量提取PDF指定区域内容重命名PDF文件,PDF自动提取内容命名的方案和详细步骤
本工具可批量提取PDF中的合同编号、日期、发票号等关键信息,支持PDF自定义区域提取并自动重命名文件,适用于合同管理、发票处理、文档归档和数据录入场景。基于iTextSharp库实现,提供完整代码示例与百度、腾讯网盘下载链接,助力高效处理PDF文档。
1072 40
|
7月前
|
缓存 数据可视化 Linux
Python文件/目录比较实战:排除特定类型的实用技巧
本文通过四个实战案例,详解如何使用Python比较目录差异并灵活排除特定文件,涵盖基础比较、大文件处理、跨平台适配与可视化报告生成,助力开发者高效完成目录同步与数据校验任务。
254 0
|
7月前
|
监控 Linux 数据安全/隐私保护
Python实现Word转PDF全攻略:从入门到实战
在数字化办公中,Python实现Word转PDF自动化,可大幅提升处理效率,解决格式兼容问题。本文详解五种主流方案,包括跨平台的docx2pdf、Windows原生的pywin32、服务器部署首选的LibreOffice命令行、企业级的Aspose.Words,以及轻量级的python-docx+pdfkit组合。每种方案均提供核心代码与适用场景,并涵盖中文字体处理、表格优化、批量进度监控等实用技巧,助力高效办公自动化。
1683 0
|
8月前
|
安全 Linux 网络安全
Python极速搭建局域网文件共享服务器:一行命令实现HTTPS安全传输
本文介绍如何利用Python的http.server模块,通过一行命令快速搭建支持HTTPS的安全文件下载服务器,无需第三方工具,3分钟部署,保障局域网文件共享的隐私与安全。
2119 0
|
8月前
|
数据管理 开发工具 索引
在Python中借助Everything工具实现高效文件搜索的方法
使用上述方法,你就能在Python中利用Everything的强大搜索能力实现快速的文件搜索,这对于需要在大量文件中进行快速查找的场景尤其有用。此外,利用Python脚本可以灵活地将这一功能集成到更复杂的应用程序中,增强了自动化处理和数据管理的能力。
694 0

热门文章

最新文章

推荐镜像

更多