PyMuPDF 1.24.4 Chinese Documentation (Part 1) (4)

Previous part: PyMuPDF 1.24.4 Chinese Documentation (Part 1) (3) https://developer.aliyun.com/article/1559433

Extract text from a PDF

To extract all the text from a PDF file, do the following:

import pymupdf
doc = pymupdf.open("a.pdf") # open a document
out = open("output.txt", "wb") # create a text output
for page in doc: # iterate the document pages
    text = page.get_text().encode("utf8") # get plain text (is in UTF-8)
    out.write(text) # write text of page
    out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
out.close()

Of course, it is not just PDF that can have its text extracted. Every supported document format, such as MOBI, EPUB or TXT, can be processed the same way.
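
Because the example above writes a form feed (0x0C) after each page, the output file can later be split back into per-page strings. The following is a minimal sketch; the helper name split_pages is an assumption made for illustration and is not part of PyMuPDF.

```python
def split_pages(data: bytes) -> list[str]:
    """Split UTF-8 text written by the example above into one string per page."""
    text = data.decode("utf8")
    pages = text.split("\f")  # "\f" is the form feed character, 0x0C
    # Every page is followed by a delimiter, so the last piece is empty.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

For instance, calling split_pages() on the bytes read from "output.txt" would give a list whose first element is the text of the first page.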

Note

Taking it further

If your document contains image-based text content, use OCR on the page for the subsequent text extraction:

tp = page.get_textpage_ocr()
text = page.get_text(textpage=tp)
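
OCR is comparatively slow, so one possible approach is to run it only on pages where normal extraction finds no text. The sketch below uses only the two calls shown above and assumes a working Tesseract installation, which Page.get_textpage_ocr() relies on. The function name is hypothetical.

```python
def text_with_ocr_fallback(page) -> str:
    """Return the page text, using OCR only if plain extraction yields nothing."""
    text = page.get_text()
    if text.strip():
        return text  # the page has a regular text layer
    # No extractable text: assume image-based content and OCR the page.
    tp = page.get_textpage_ocr()
    return page.get_text(textpage=tp)
```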

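There are many more examples that explain how to extract text from specific areas or how to extract tables from documents. Please refer to the How-to Guide for Text.

You can now also extract text in Markdown format.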

API reference

Page.get_text()

Extract images from a PDF

To extract all the images from a PDF file, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open a document
for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    image_list = page.get_images()
    # print the number of images found on the page
    if image_list:
        print(f"Found {len(image_list)} images on page {page_index}")
    else:
        print("No images found on page", page_index)
    for image_index, img in enumerate(image_list, start=1): # enumerate the image list
        xref = img[0] # get the XREF of the image
        pix = pymupdf.Pixmap(doc, xref) # create a Pixmap
        if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
        pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
        pix = None
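
The example above converts every image to PNG. As an alternative sketch, Document.extract_image() returns the image data in its embedded format together with a suggested file extension. The function name and the output naming scheme below are assumptions chosen for illustration. Since the same image can be referenced on several pages, each xref is written only once.

```python
from pathlib import Path


def save_images_native(doc, out_dir) -> list:
    """Write each distinct embedded image once and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()  # xrefs that were already written
    written = []
    for page in doc:
        for img in page.get_images():
            xref = img[0]  # first item of each entry is the image xref
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict with "ext" and "image" keys
            path = out / f"image_{xref}.{info['ext']}"
            path.write_bytes(info["image"])
            written.append(path)
    return written
```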

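Note

Taking it further

There are many more examples that explain how to extract text from specific areas or how to extract tables from documents. Please refer to the How-to Guide for Text.

API reference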

Page.get_images()
Pixmap

Extracting vector graphics

To extract all the vector graphics from a document page, do the following:

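import pymupdf
doc = pymupdf.open("some.file") # open a document
page = doc[0] # get the 1st page of the document
paths = page.get_drawings() # get the vector graphics paths of the page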

This returns a list of dictionaries, one for each vector graphics path found on the page.
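
Each path dictionary has an "items" entry holding its drawing commands, where every command is a tuple that starts with a short code such as "l" (line), "c" (curve), "re" (rectangle) or "qu" (quad). As a rough sketch, the helper below (a hypothetical name, not a PyMuPDF function) counts the commands by type.

```python
from collections import Counter


def count_draw_commands(paths) -> Counter:
    """Count drawing commands by type across the paths from Page.get_drawings()."""
    counts = Counter()
    for path in paths:
        for item in path["items"]:
            counts[item[0]] += 1  # item[0] is the command code, e.g. "l" or "re"
    return counts
```

Calling count_draw_commands(page.get_drawings()) gives a quick impression of what kind of graphics a page contains.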

Note

Taking it further

Please refer to: How to Extract Drawings.

API reference

Page.get_drawings()

Merging PDF files

To merge PDF files, do the following:

import pymupdf
doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.pdf") # open the 2nd document
doc_a.insert_pdf(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename

Merging PDF files with other types of file

With Document.insert_file() you can merge any supported file type into a PDF. For example:

import pymupdf
doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.svg") # open the 2nd document
doc_a.insert_file(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename

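Note

Taking it further

It is easy to join PDF files with Document.insert_pdf() and Document.insert_file(). Given open PDF documents, you can copy page ranges from one to the other. You can choose where the copied pages should be placed, reverse the page order and also change the page rotation. A Wiki article contains a full description.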
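
As a sketch of these options, the hypothetical helper below copies a range of pages, given as 1-based page numbers, to the front of another document. It relies on the from_page, to_page, start_at and rotate parameters of Document.insert_pdf(), which use 0-based indices. Passing a first page larger than the last page copies the range in reverse order.

```python
def prepend_page_range(target, source, first_page, last_page, rotate=-1):
    """Copy pages first_page..last_page (1-based, inclusive) of source to the start of target."""
    target.insert_pdf(
        source,
        from_page=first_page - 1,  # convert to 0-based page indices
        to_page=last_page - 1,
        start_at=0,                # insert before the first page of target
        rotate=rotate,             # -1 (the default) leaves page rotation unchanged
    )
```

For example, prepend_page_range(doc_a, doc_b, 3, 1) would place pages 3, 2 and 1 of doc_b, in that order, in front of the pages of doc_a.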

The GUI script join.py uses this method to join a list of files while also joining the respective table of contents segments.

API reference

Document.insert_pdf()
Document.insert_file()

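Working with coordinates

There is one mathematical term that you should be comfortable with when using PyMuPDF: "coordinates". Please take a quick look at the Coordinates section to understand the coordinate system, which will help you position objects and understand your document space.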
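
One detail worth keeping in mind: PyMuPDF places the origin (0, 0) at the top-left corner of the page with y growing downwards, whereas the PDF format itself measures y upwards from the bottom-left corner. The conversion can be sketched as follows. The function name is an assumption for illustration, and the sketch ignores page rotation and pages whose MediaBox does not start at (0, 0).

```python
def pdf_to_pymupdf_point(x, y_pdf, page_height):
    """Convert a point from PDF space (origin bottom-left) to PyMuPDF space (origin top-left)."""
    # x is unchanged; y is mirrored at the horizontal middle of the page
    return (x, page_height - y_pdf)
```

On a page that is 842 points high, a point 742 points above the bottom edge therefore lies 100 points below the top edge.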

Adding a watermark to a PDF

To add a watermark to a PDF file, do the following:

import pymupdf
doc = pymupdf.open("document.pdf") # open a document
for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    # insert an image watermark from a file name to fit the page bounds
    page.insert_image(page.bound(),filename="watermark.png", overlay=False)
doc.save("watermarked-document.pdf") # save the document with a new filename

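Note

Taking it further

Adding a watermark is essentially as simple as adding an image beneath the existing content of each PDF page, which is what overlay=False does in the example above. Make sure the image has the opacity and aspect ratio it needs to look the way you want.

In the example above a new image is created from the file reference on every page. For better performance (to save memory and file size) the image data should be referenced only once; see the code example and explanation on Page.insert_image() for the implementation.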
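
One way to reference the image data only once is to reuse the xref that Page.insert_image() returns. The sketch below passes xref=0 on the first page, so the file is embedded, and the returned xref on every later page, so the existing image object is reused. The function name is hypothetical.

```python
def watermark_pages(doc, image_path):
    """Place the same image behind the content of every page, embedding it only once."""
    xref = 0  # 0 means "no existing image yet"
    for page in doc:
        xref = page.insert_image(
            page.bound(),
            filename=image_path,  # ignored once a valid xref is passed
            xref=xref,            # reuse the embedded image after the first page
            overlay=False,        # draw behind the existing page content
        )
    return xref
```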

API reference

Page.bound()
Page.insert_image()

Adding an image to a PDF

To add an image to a PDF file, for example a logo, do the following:

import pymupdf
doc = pymupdf.open("document.pdf") # open a document
for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    # insert an image logo from a file name at the top left of the document
    page.insert_image(pymupdf.Rect(0,0,50,50),filename="my-logo.png")
doc.save("logo-document.pdf") # save the document with a new filename

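Note

Taking it further

As with the watermark example, make sure to reference the image only once if possible, to be more performant; see the code example and explanation on Page.insert_image().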
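
The rectangle passed to Page.insert_image() decides where the logo appears. As an illustration, the hypothetical helper below computes a square in the top-right corner of the page instead of the top-left one used above. Its result could be passed on as pymupdf.Rect(*logo_rect(page.rect.width)).

```python
def logo_rect(page_width, size=50, margin=10):
    """Return (x0, y0, x1, y1) for a square of `size` points in the top-right page corner."""
    x1 = page_width - margin  # right edge of the logo
    # y is measured from the top of the page, so the top edge sits at `margin`
    return (x1 - size, margin, x1, margin + size)
```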

API reference

Rect
Page.insert_image()

Rotating a PDF

To add a rotation to a page, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_rotation(90) # rotate the page
doc.save("rotated-page-1.pdf")
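
Page.set_rotation() sets an absolute angle, which must be a multiple of 90 degrees. To rotate relative to the current orientation, the existing value of the Page.rotation property can be added first. A minimal sketch, with a hypothetical function name:

```python
def rotate_by(page, delta):
    """Rotate a page by `delta` degrees relative to its current rotation."""
    if delta % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    new_angle = (page.rotation + delta) % 360  # keep the result within 0..270
    page.set_rotation(new_angle)
    return new_angle
```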

Note

API reference

Page.set_rotation()

Cropping a PDF

To crop a page to a defined rectangle, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_cropbox(pymupdf.Rect(100, 100, 400, 400)) # set a cropbox for the page
doc.save("cropped-page-1.pdf")
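
The crop rectangle has to lie inside the page's MediaBox. One way to stay within bounds is to derive it from the page size, for example by trimming the same margin from every side. The helper name below is an assumption, and the sketch presumes an unrotated page whose MediaBox starts at (0, 0). Its result could be passed on as pymupdf.Rect(*trimmed_box(page.rect.width, page.rect.height, 36)).

```python
def trimmed_box(width, height, margin):
    """Return (x0, y0, x1, y1) of the page area left after trimming `margin` points per side."""
    if margin < 0 or 2 * margin >= min(width, height):
        raise ValueError("margin must be non-negative and leave a non-empty area")
    return (margin, margin, width - margin, height - margin)
```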

Note

API reference

Page.set_cropbox()

Attaching files

To attach another file to a page, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open main document
attachment = pymupdf.open("my-attachment.pdf") # open document you want to attach
page = doc[0] # get the 1st page of the document
point = pymupdf.Point(100, 100) # create the point where you want to add the attachment
attachment_data = attachment.tobytes() # get the document byte data as a buffer
# add the file annotation with the point, data and the file name
file_annotation = page.add_file_annot(point, attachment_data, "attachment.pdf")
doc.save("document-with-attachment.pdf") # save the document

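Note

Taking it further

When adding the file with Page.add_file_annot(), note that the third parameter, filename, should include the actual file extension. Without it, the attachment may not be recognized as something that can be opened. For example, if the filename is just "attachment", you may get an error when viewing the resulting PDF and trying to open the attachment. With "attachment.pdf", however, the PDF viewer can recognize it and open it as a valid file type.

The default icon for the attachment is a "push pin", but you can change this by setting the icon parameter.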
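
Both points can be combined in a small wrapper. The sketch below is only an illustration: the function name and the fallback extension are hypothetical, while the icon keyword and the "Paperclip" value belong to Page.add_file_annot().

```python
from pathlib import Path


def attach_file(page, point, data, filename, icon="Paperclip"):
    """Add a file attachment annotation, making sure the name has an extension."""
    if not Path(filename).suffix:
        filename += ".pdf"  # assumed fallback: the attached data is a PDF
    return page.add_file_annot(point, data, filename, icon=icon)
```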

API reference

Point
Document.tobytes()
Page.add_file_annot()

Embedding files

To embed a file in a document, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open main document
embedded_doc = pymupdf.open("my-embed.pdf") # open document you want to embed
embedded_data = embedded_doc.tobytes() # get the document byte data as a buffer
# embed with the file name and the data
doc.embfile_add("my-embedded_file.pdf", embedded_data)
doc.save("document-with-embed.pdf") # save the document

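Note

Taking it further

As with attaching files, when adding the file with Document.embfile_add(), note that the first parameter, filename, should include the actual file extension.

API reference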

Document.tobytes()
Document.embfile_add()

Deleting pages

To delete a page from a document, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.delete_page(0) # delete the 1st page of the document
doc.save("test-deleted-page-one.pdf") # save the document

To delete multiple pages from a document, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.delete_pages(from_page=9, to_page=14) # delete a page range from the document
doc.save("test-deleted-pages.pdf") # save the document

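What happens if I delete a page referred to by bookmarks or hyperlinks?

Bookmarks (entries in the table of contents) will become inactive and will no longer navigate to any page.
A hyperlink pointing to the deleted page is removed from the page that contains it. The visible content of that page is not otherwise changed.

Note

Taking it further

The page index is zero-based, so to delete page 10 of a document you would do the following: doc.delete_page(9).

Similarly, doc.delete_pages(from_page=9, to_page=14) will delete pages 10 to 15 inclusive.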
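
Because the off-by-one conversion is easy to get wrong, it can be wrapped in a helper that accepts page numbers as a reader would count them. A minimal sketch, with a hypothetical function name:

```python
def delete_page_range(doc, first_page, last_page):
    """Delete pages first_page..last_page, counted from 1 and inclusive."""
    if not 1 <= first_page <= last_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # PyMuPDF page indices are zero-based, so subtract 1 from both ends
    doc.delete_pages(from_page=first_page - 1, to_page=last_page - 1)
```

With this helper, delete_page_range(doc, 10, 15) is equivalent to the doc.delete_pages(from_page=9, to_page=14) call shown above.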

API reference

Document.delete_page()
Document.delete_pages()

Re-arranging pages

To change the sequence of pages, i.e. re-arrange pages, do the following:

import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.move_page(1,0) # move the 2nd page of the document to the start of the document
doc.save("test-page-moved.pdf") # save the document
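
Document.move_page() moves one page at a time. For a larger reordering, Document.select() takes the complete list of page indices in the desired order. As a sketch (the function name is hypothetical), reversing the pages of a PDF could look like this:

```python
def reverse_pages(doc):
    """Reorder the pages of a PDF so that the last page comes first."""
    order = list(range(len(doc) - 1, -1, -1))  # e.g. [2, 1, 0] for three pages
    doc.select(order)  # keep exactly these pages, in this order
    return order
```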

Note

API reference

Document.move_page()

Next part: PyMuPDF 1.24.4 Chinese Documentation (Part 1) (5) https://developer.aliyun.com/article/1559435
