PyPDF2:使用Python操作PDF文件

简介: PDF是文档常用格式,使用Python包PyPDF2可以对PDF文档实现批量、迅速的操作,包括提取文字、切分或合并PDF文件、创建annotation、加密和解密等。本文将介绍PyPDF2包的安装及简单使用方式。PyPDF的GitHub项目官网:py-pdf/PyPDF2: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

1. 使用pip安装PyPDF2


PyPDF2支持如下版本的Python解释器:

image.png

直接使用pip即可安装:pip install PyPDF2


2. 使用PyPDF2提取PDF文档内容的简单示例


以一篇论文文档为例,展示PyPDF2如何提取PDF文件中的内容。


论文《ImageNet Classification with Deep Convolutional Neural Networks》,一共9页,其首页布局为:

08dfbd7a21604f7198cb741b1e680f6b.png


Python脚本代码:


from PyPDF2 import PdfReader
#早期版本里叫PdfFileReader,已经过时,改名为PdfReader了,见:https://pypdf2.readthedocs.io/en/latest/_modules/PyPDF2/_reader.html?highlight=PdfFileReader#
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
#1.28.0版本之前用numPages,已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.numPages
print(number_of_pages)  #打印页数
page = reader.pages[0]
#1.28.0版本之前用getPage(pageNumber),已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.getPage
print(page)  #打印“PDF第一页”这个Page<PyPDF2._page.Page>对象
text = page.extract_text()
#1.28.0版本之前用extractText(),已经过时,见:https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extractText
print(text)  #提取出第一页的文字


输出:


9
{'/Contents': IndirectObject(13, 0), '/Parent': IndirectObject(1, 0), '/Type': '/Page', '/Resources': IndirectObject(14, 0), '/MediaBox': [0, 0, 612, 792]}
ImageNet Classication with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of ve convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a nal 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efcient GPU implemen-
tation of the convolution operation. To reduce overtting in the fully-connected
layers we employed a recently-developed regularization method called fidropoutfl
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overtting. Until recently, datasets of labeled images were relatively
small Š on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of tho usands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specied even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don't have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
1


可以看到页数和PDF中的文字都能正确提取出来。


相关文章
|
1月前
|
安全 Linux 数据安全/隐私保护
python知识点100篇系列(15)-加密python源代码为pyd文件
【10月更文挑战第5天】为了保护Python源码不被查看,可将其编译成二进制文件(Windows下为.pyd,Linux下为.so)。以Python3.8为例,通过Cython工具,先写好Python代码并加入`# cython: language_level=3`指令,安装easycython库后,使用`easycython *.py`命令编译源文件,最终生成.pyd文件供直接导入使用。
python知识点100篇系列(15)-加密python源代码为pyd文件
|
16天前
|
开发者 Python
Python中__init__.py文件的作用
`__init__.py`文件在Python包管理中扮演着重要角色,通过标识目录为包、初始化包、控制导入行为、支持递归包结构以及定义包的命名空间,`__init__.py`文件为组织和管理Python代码提供了强大支持。理解并正确使用 `__init__.py`文件,可以帮助开发者更好地组织代码,提高代码的可维护性和可读性。
17 2
|
1月前
|
Linux 区块链 Python
Python实用记录(十三):python脚本打包exe文件并运行
这篇文章介绍了如何使用PyInstaller将Python脚本打包成可执行文件(exe),并提供了详细的步骤和注意事项。
55 1
Python实用记录(十三):python脚本打包exe文件并运行
|
1月前
|
Java Python
> python知识点100篇系列(19)-使用python下载文件的几种方式
【10月更文挑战第7天】本文介绍了使用Python下载文件的五种方法,包括使用requests、wget、线程池、urllib3和asyncio模块。每种方法适用于不同的场景,如单文件下载、多文件并发下载等,提供了丰富的选择。
|
1月前
|
数据安全/隐私保护 流计算 开发者
python知识点100篇系列(18)-解析m3u8文件的下载视频
【10月更文挑战第6天】m3u8是苹果公司推出的一种视频播放标准,采用UTF-8编码,主要用于记录视频的网络地址。HLS(Http Live Streaming)是苹果公司提出的一种基于HTTP的流媒体传输协议,通过m3u8索引文件按序访问ts文件,实现音视频播放。本文介绍了如何通过浏览器找到m3u8文件,解析m3u8文件获取ts文件地址,下载ts文件并解密(如有必要),最后使用ffmpeg合并ts文件为mp4文件。
|
1月前
|
Java Apache Maven
将word文档转换成pdf文件方法
在Java中,将Word文档转换为PDF文件可采用多种方法:1) 使用Apache POI和iText库,适合处理基本转换需求;2) Aspose.Words for Java,提供更高级的功能和性能;3) 利用LibreOffice命令行工具,适用于需要开源解决方案的场景。每种方法都有其适用范围,可根据具体需求选择。
|
1月前
|
Java Apache Maven
Java将word文档转换成pdf文件的方法?
【10月更文挑战第13天】Java将word文档转换成pdf文件的方法?
161 1
|
1月前
|
JSON 数据格式 Python
Python实用记录(十四):python统计某个单词在TXT/JSON文件中出现的次数
这篇文章介绍了一个Python脚本,用于统计TXT或JSON文件中特定单词的出现次数。它包含两个函数,分别处理文本和JSON文件,并通过命令行参数接收文件路径、目标单词和文件格式。文章还提供了代码逻辑的解释和示例用法。
44 0
Python实用记录(十四):python统计某个单词在TXT/JSON文件中出现的次数
|
1月前
|
JavaScript 前端开发 容器
Vue生成PDF文件攻略:html2canvas与jspdf联手,中文乱码与自动换行难题攻克
Vue生成PDF文件攻略:html2canvas与jspdf联手,中文乱码与自动换行难题攻克
90 0
|
1月前
|
Python
Python实用记录(十二):文件夹下所有文件重命名以及根据图片路径保存到新路径下保存
这篇文章介绍了如何使用Python脚本对TTK100_VOC数据集中的JPEGImages文件夹下的图片文件进行批量重命名,并将它们保存到指定的新路径。
33 0
下一篇
无影云桌面