PyPDF2: Working with PDF Files in Python

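Summary: PDF is a widely used document format. With the Python package PyPDF2 you can process PDF documents quickly and in batches: extracting text, splitting or merging PDF files, creating annotations, encrypting and decrypting, and more. This article covers how to install PyPDF2 and the basics of using it. The PyPDF2 project page on GitHub: py-pdf/PyPDF2: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files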

1. Installing PyPDF2 with pip


PyPDF2 supports the following versions of the Python interpreter:

[Image: supported Python versions]

It can be installed directly with pip: pip install PyPDF2
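To confirm that the installation worked, one option is to ask the standard library's importlib.metadata (Python 3.8+) which version is installed. The following is a minimal sketch; the helper name installed_version is made up for this illustration and is not part of PyPDF2.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print(f"PyPDF2 {version} is installed")
```

Knowing the version matters for the example in the next section, because the method names it uses differ before and after version 1.28.0.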


2. A simple example: extracting the contents of a PDF document with PyPDF2


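This section uses a research paper to show how PyPDF2 extracts the contents of a PDF file.


The paper, "ImageNet Classification with Deep Convolutional Neural Networks", is 9 pages long. Its first page is laid out as follows:

[Image: first page of the paper]


Python script: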


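from PyPDF2 import PdfReader
# Earlier versions called this class PdfFileReader; that name is deprecated and has been renamed to PdfReader. See: https://pypdf2.readthedocs.io/en/latest/_modules/PyPDF2/_reader.html?highlight=PdfFileReader#
pdf_path = "paper.pdf"  # placeholder path (not in the original script); point it at your local copy of the paper
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
# Before version 1.28.0 this was numPages, now deprecated. See: https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.numPages
print(number_of_pages)  # print the number of pages
page = reader.pages[0]
# Before version 1.28.0 this was getPage(pageNumber), now deprecated. See: https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.getPage
print(page)  # print the page object (PyPDF2._page.PageObject) for the first page of the PDF
text = page.extract_text()
# Before version 1.28.0 this was extractText(), now deprecated. See: https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extractText
print(text)  # print the text extracted from the first page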


Output:


9
{'/Contents': IndirectObject(13, 0), '/Parent': IndirectObject(1, 0), '/Type': '/Page', '/Resources': IndirectObject(14, 0), '/MediaBox': [0, 0, 612, 792]}
ImageNet Classication with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of ve convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a nal 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efcient GPU implemen-
tation of the convolution operation. To reduce overtting in the fully-connected
layers we employed a recently-developed regularization method called fidropoutfl
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overtting. Until recently, datasets of labeled images were relatively
small Š on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of tho usands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specied even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don't have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
1


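The page count is correct, and the text of the first page is extracted largely intact. The output is not a perfect copy, though: ligatures such as "fi" are dropped (for example, "Classication" and "efcient"), and words hyphenated at line ends stay split across lines (for example, "dif-" / "ferent").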
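The same approach extends from the first page to the whole document, and a small post-processing step can rejoin words that the PDF layout split with a hyphen at the end of a line. The following is a sketch under stated assumptions: the helper names (extract_all_text, join_hyphenated_breaks, print_pdf_text) are made up for this illustration, and the only PyPDF2 calls used are PdfReader, reader.pages, and extract_text() from the script above.

```python
import re
from typing import Iterable, List


def extract_all_text(pages: Iterable) -> List[str]:
    """Return the extracted text of every page, in order.

    `pages` is any iterable of objects with an extract_text() method,
    for example PdfReader(pdf_path).pages.
    """
    # Fall back to "" when a page yields no text (e.g. a scanned image page).
    return [page.extract_text() or "" for page in pages]


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words split as "dif-" + newline + "ferent" into "different".

    Limitation: a genuine hyphen at a line end (e.g. "current-" / "best")
    is also removed, so treat the result as a heuristic.
    """
    # Match a hyphen that sits between a letter, a newline, and a lowercase letter.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


def print_pdf_text(pdf_path: str) -> None:
    """Print the cleaned text of every page of the PDF at pdf_path."""
    # Imported here so the two helpers above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    for number, text in enumerate(extract_all_text(reader.pages), start=1):
        print(f"--- page {number} ---")
        print(join_hyphenated_breaks(text))
```

With PyPDF2 installed, calling print_pdf_text("paper.pdf") (a placeholder path) prints each page in turn. The dropped ligatures cannot be restored this way, since the missing characters are absent from the extracted text.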

