PyMuPDF 1.24.4 Chinese Documentation (Part 4) (1)

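Original: https://pymupdf.readthedocs.io/en/latest/

Stories

Original: pymupdf.readthedocs.io/en/latest/recipes-stories.html

This article shows some typical use cases for stories.

As mentioned in the tutorial, a story can be created from up to three input sources: HTML, CSS and an Archive. All of them are optional, and each can be provided programmatically.

The following examples show combinations of these inputs.

Note

The source code of many of these recipes is included as examples in the docs folder.

  • How to add a line of text with some formatting

Here is the inevitable "Hello World" example. We will show two variants:

  1. Creation from an existing HTML source [1], which may come from anywhere.
  2. Creation using the Python API.

The variant using an existing HTML source [1], which in this case is defined as a constant in the script: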

import pymupdf
HTML = """
<p style="font-family: sans-serif;color: blue">Hello World!</p>
"""
MEDIABOX = pymupdf.paper_rect("letter")  # output page format: Letter
WHERE = MEDIABOX + (36, 36, -36, -36)  # leave borders of 0.5 inches
story = pymupdf.Story(html=HTML)  # create story from HTML
writer = pymupdf.DocumentWriter("output.pdf")  # create the writer
more = 1  # will indicate end of input once it is set to 0
while more:  # loop outputting the story
    device = writer.begin_page(MEDIABOX)  # make new page
    more, _ = story.place(WHERE)  # layout into allowed rectangle
    story.draw(device)  # write on page
    writer.end_page()  # finish page
writer.close()  # close output file 

Note

The above effect (sans-serif and blue text) could also have been achieved with a separate CSS source, like this:

import pymupdf
CSS = """
body {
 font-family: sans-serif;
 color: blue;
}
"""
HTML = """
<p>Hello World!</p>
"""
# the story would then be created like this:
story = pymupdf.Story(html=HTML, user_css=CSS) 

Python API variant, where everything is created programmatically:

import pymupdf
MEDIABOX = pymupdf.paper_rect("letter")
WHERE = MEDIABOX + (36, 36, -36, -36)
story = pymupdf.Story()  # create an empty story
body = story.body  # access the body of its DOM
with body.add_paragraph() as para:  # store desired content
    para.set_font("sans-serif").set_color("blue").add_text("Hello World!")
writer = pymupdf.DocumentWriter("output.pdf")
more = 1
while more:
    device = writer.begin_page(MEDIABOX)
    more, _ = story.place(WHERE)
    story.draw(device)
    writer.end_page()
writer.close() 

Both variants produce the same output PDF.
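Both variants rely on the expression `MEDIABOX + (36, 36, -36, -36)` to leave a border. As a rough, pure-Python illustration of that arithmetic (plain tuples stand in for the rectangle objects here; this is not PyMuPDF code), adding the four offsets moves the top-left corner inward and the bottom-right corner inward by 36 points, which is half an inch at 72 points per inch:

```python
# Pure-Python illustration; tuples (x0, y0, x1, y1) stand in for rectangles.
POINTS_PER_INCH = 72

# Letter paper is 8.5 x 11 inches, i.e. 612 x 792 points.
letter = (0, 0, 612, 792)

# Positive offsets push the top-left corner right/down,
# negative offsets pull the bottom-right corner left/up.
border = (36, 36, -36, -36)

where = tuple(a + b for a, b in zip(letter, border))
usable_width_in = (where[2] - where[0]) / POINTS_PER_INCH
usable_height_in = (where[3] - where[1]) / POINTS_PER_INCH

print(where)                              # (36, 36, 576, 756)
print(usable_width_in, usable_height_in)  # 7.5 10.0
```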

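  • How to use images

Images can be referenced in the provided HTML source, or a reference to the desired image can be stored via the Python API. In either case, this requires an Archive that refers to the location where the image can be found.

Note

Images whose binary content is embedded in the HTML source are not supported by stories.

We extend the "Hello World" example above and display an image of our planet after the text. Assuming the image is named "world.jpg" and resides in the script's folder, this is the modified version of the Python API variant above: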

import pymupdf
MEDIABOX = pymupdf.paper_rect("letter")
WHERE = MEDIABOX + (36, 36, -36, -36)
# create story, let it look at script folder for resources
story = pymupdf.Story(archive=".")
body = story.body  # access the body of its DOM
with body.add_paragraph() as para:
    # store desired content
    para.set_font("sans-serif").set_color("blue").add_text("Hello World!")
# another paragraph for our image:
with body.add_paragraph() as para:
    # store image in another paragraph
    para.add_image("world.jpg")
writer = pymupdf.DocumentWriter("output.pdf")
more = 1
while more:
    device = writer.begin_page(MEDIABOX)
    more, _ = story.place(WHERE)
    story.draw(device)
    writer.end_page()
writer.close() 
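  • How to read external HTML and CSS for a story

These cases are fairly straightforward.

As a general recommendation, HTML and CSS source files should be read as binary files and decoded before being used in a story. Python's pathlib.Path provides convenient ways to do this: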

import pathlib
import pymupdf
htmlpath = pathlib.Path("myhtml.html")
csspath = pathlib.Path("mycss.css")
HTML = htmlpath.read_bytes().decode()
CSS = csspath.read_bytes().decode()
story = pymupdf.Story(html=HTML, user_css=CSS) 
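The read-then-decode advice can be tried without PyMuPDF. The sketch below writes two small files into a temporary folder and reads them back as bytes before decoding. The file contents are invented for illustration, and UTF-8 is an assumption about the encoding of the files; if your sources use another encoding, pass that name to `decode()` instead.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)
    # Hypothetical file contents, for illustration only.
    htmlpath = folder / "myhtml.html"
    csspath = folder / "mycss.css"
    htmlpath.write_bytes("<p>Grüße, World!</p>".encode("utf-8"))
    csspath.write_bytes(b"body { font-family: sans-serif; }")

    # Read the raw bytes, then decode with an explicit encoding so the
    # result does not depend on the platform's default text encoding.
    HTML = htmlpath.read_bytes().decode("utf-8")
    CSS = csspath.read_bytes().decode("utf-8")

print(HTML)
print(CSS)
```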
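  • How to output database content with a story template

This script demonstrates how to report SQL database content using an HTML template.

The example SQL database contains two tables:

  1. Table "films" contains one row per film, with the fields "title", "director" and (release) "year".
  2. Table "actors" contains one row per actor and film (fields (actor) "name" and (film) "title").

The story DOM consists of a template for one film, which reports the film data together with a list of its actors.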
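To get a feel for the data that feeds the template, the two-table structure described above can be rebuilt in an in-memory SQLite database. This is only a sketch: the column names follow the table description above, the film and actor names are invented, and the per-film lookup binds the title as a query parameter.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (title TEXT, director TEXT, year INTEGER)")
db.execute("CREATE TABLE actors (name TEXT, title TEXT)")

# Invented sample rows, for illustration only.
db.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Film B", "Director 2", 2001), ("Film A", "Director 1", 1999)],
)
db.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [("Actor Z", "Film A"), ("Actor Y", "Film A"), ("Actor X", "Film B")],
)

report = []
films = db.execute(
    "SELECT title, director, year FROM films ORDER BY title"
).fetchall()
for title, director, year in films:
    # One lookup per film; the "?" placeholder lets sqlite3 handle quoting.
    rows = db.execute(
        "SELECT name FROM actors WHERE title = ? ORDER BY name", (title,)
    ).fetchall()
    cast = [row[0] for row in rows]
    report.append((title, director, year, cast))
db.close()

for entry in report:
    print(entry)
```

Each tuple in `report` holds exactly the values that one clone of the film template would receive: title, director, year and the list of cast names.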

Files:

  • docs/samples/filmfestival-sql.py
  • docs/samples/filmfestival-sql.db

See recipe

"""
This is a demo script for using PyMuPDF with its "Story" feature.
The following aspects are being covered here:
* The script produces a report of films that are stored in an SQL database
* The report format is provided as a HTML template
The SQL database contains two tables:
1. Table "films" which has the columns "title" (film title, str), "director"
 (str) and "year" (year of release, int).
2. Table "actors" which has the columns "name" (actor name, str) and "title"
 (the film title where the actor had been cast, str).
The script reads all content of the "films" table. For each film title it
reads all rows from table "actors" which took part in that film.
Comment 1
---------
To keep things easy and free from pesky technical detail, the relevant file
names inherit the name of this script:
- the database's filename is the script name with ".py" extension replaced
 by ".db".
- the output PDF similarly has script file name with extension ".pdf".
Comment 2
---------
The SQLITE database has been created using https://sqlitebrowser.org/, a free
multi-platform tool to maintain or manipulate SQLITE databases.
"""
import os
import sqlite3
import pymupdf
# ----------------------------------------------------------------------
# HTML template for the film report
# There are four placeholders coded as "id" attributes.
# One "id" allows locating the template part itself, the other three
# indicate where database text should be inserted.
# ----------------------------------------------------------------------
festival_template = (
    "<html><head><title>Just some arbitrary text</title></head>"
    '<body><h1 style="text-align:center">Hook Norton Film Festival</h1>'
    "<ol>"
    '<li id="filmtemplate">'
    '<b id="filmtitle"></b>'
    "<dl>"
    '<dt>Director<dd id="director">'
    '<dt>Release Year<dd id="filmyear">'
    '<dt>Cast<dd id="cast">'
    "</dl>"
    "</li>"
    "</ol>"
    "</body></html>"
)
# -------------------------------------------------------------------
# define database access
# -------------------------------------------------------------------
dbfilename = __file__.replace(".py", ".db")  # the SQLITE database file name
assert os.path.isfile(dbfilename), f'{dbfilename}'
database = sqlite3.connect(dbfilename)  # open database
cursor_films = database.cursor()  # cursor for selecting the films
cursor_casts = database.cursor()  # cursor for selecting actors per film
# select statement for the films - let SQL also sort it for us
select_films = """SELECT title, director, year FROM films ORDER BY title"""
# select statement for actors, a skeleton: the film title is bound as a parameter
select_casts = """SELECT name FROM actors WHERE film = ? ORDER BY name"""
# -------------------------------------------------------------------
# define the HTML Story and fill it with database data
# -------------------------------------------------------------------
story = pymupdf.Story(festival_template)
body = story.body  # access the HTML body detail
template = body.find(None, "id", "filmtemplate")  # find the template part
# read the films from the database and put them all in one Python list
# NOTE: instead we might fetch rows one by one (advisable for large volumes)
cursor_films.execute(select_films)  # execute cursor, and ...
films = cursor_films.fetchall()  # read out what was found
for title, director, year in films:  # iterate through the films
    film = template.clone()  # clone template to report each film
    film.find(None, "id", "filmtitle").add_text(title)  # put title in templ
    film.find(None, "id", "director").add_text(director)  # put director
    film.find(None, "id", "filmyear").add_text(str(year))  # put year
    # the actors reside in their own table - find the ones for this film title
    cursor_casts.execute(select_casts, (title,))  # execute cursor, binding the title
    casts = cursor_casts.fetchall()  # read actors for the film
    # each actor name appears in its own tuple, so extract it from there
    film.find(None, "id", "cast").add_text("\n".join([c[0] for c in casts]))
    body.append_child(film)
template.remove()  # remove the template
# -------------------------------------------------------------------
# generate the PDF
# -------------------------------------------------------------------
writer = pymupdf.DocumentWriter(__file__.replace(".py", ".pdf"), "compress")
mediabox = pymupdf.paper_rect("a4")  # use pages in ISO-A4 format
where = mediabox + (72, 36, -36, -72)  # leave page borders
more = 1  # end of output indicator
while more:
    dev = writer.begin_page(mediabox)  # make a new page
    more, filled = story.place(where)  # arrange content for this page
    story.draw(dev, None)  # write content to page
    writer.end_page()  # finish the page
writer.close()  # close the PDF 
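  • How to integrate with existing PDFs

Because a DocumentWriter can only write to a new file, a story cannot be placed on existing pages. This script demonstrates a way around this restriction.

The basic idea is to let the DocumentWriter output to a PDF in memory. Once the story has finished, we re-open this memory PDF and put its pages at the desired locations on existing pages via the method Page.show_pdf_page().

Files:

  • docs/samples/showpdf-page.py

See recipe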

"""
Demo of Story class in PyMuPDF
-------------------------------
This script demonstrates how the results of a pymupdf.Story output can be
placed in a rectangle of an existing (!) PDF page.
"""
import io
import os
import pymupdf
def make_pdf(fileptr, text, rect, font="sans-serif", archive=None):
    """Make a memory DocumentWriter from HTML text and a rect.

    Args:
        fileptr: a Python file object. For example an io.BytesIO().
        text: the text to output (HTML format)
        rect: the target rectangle. Will use its width / height as mediabox
        font: (str) font family name, default sans-serif
        archive: pymupdf.Archive parameter. To be used if e.g. images or special
            fonts should be used.
    Returns:
        The matrix to convert page rectangles of the created PDF back
        to rectangle coordinates in the parameter "rect".
        Normal use will expect to fit all the text in the given rect.
        However, if an overflow occurs, this function will output multiple
        pages, and the caller may decide to either accept or retry with
        changed parameters.
    """
    # use input rectangle as the page dimension
    mediabox = pymupdf.Rect(0, 0, rect.width, rect.height)
    # this matrix converts mediabox back to input rect
    matrix = mediabox.torect(rect)
    story = pymupdf.Story(text, archive=archive)
    body = story.body
    body.set_properties(font=font)
    writer = pymupdf.DocumentWriter(fileptr)
    while True:
        device = writer.begin_page(mediabox)
        more, _ = story.place(mediabox)
        story.draw(device)
        writer.end_page()
        if not more:
            break
    writer.close()
    return matrix
# -------------------------------------------------------------
# We want to put this in a given rectangle of an existing page
# -------------------------------------------------------------
HTML = """
<p>PyMuPDF is a great package! And it still improves significantly from one version to the next one!</p>
<p>It is a Python binding for <b>MuPDF</b>, a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit.<br> Both are maintained and developed by Artifex Software, Inc.</p>
<p>Via MuPDF it can access files in PDF, XPS, OpenXPS, CBZ, EPUB, MOBI and FB2 (e-books) formats,<br> and it is known for its top
<b><i>performance</i></b> and <b><i>rendering quality.</i></b></p>"""
# Make a PDF page for demo purposes
root = os.path.abspath( f"{__file__}/..")
doc = pymupdf.open(f"{root}/mupdf-title.pdf")
page = doc[0]
WHERE = pymupdf.Rect(50, 100, 250, 500)  # target rectangle on existing page
fileptr = io.BytesIO()  # let DocumentWriter use this as its file
# -------------------------------------------------------------------
# call DocumentWriter and Story to fill our rectangle
matrix = make_pdf(fileptr, HTML, WHERE)
# -------------------------------------------------------------------
src = pymupdf.open("pdf", fileptr)  # open DocumentWriter output PDF
if src.page_count > 1:  # target rect was too small
    raise ValueError("target WHERE too small")
# its page 0 contains our result
page.show_pdf_page(WHERE, src, 0)
doc.ez_save(f"{root}/mupdf-title-after.pdf") 
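The matrix returned by `make_pdf` above maps coordinates of the temporary page back to the target rectangle. The following pure-Python sketch shows what such a rectangle-to-rectangle transform computes; it uses plain tuples and the usual PDF matrix layout `(a, b, c, d, e, f)`, and is meant as an illustration of the idea rather than PyMuPDF's own implementation. The helper names `rect_to_rect` and `transform` are hypothetical.

```python
def rect_to_rect(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst."""
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    a = (dx1 - dx0) / (sx1 - sx0)  # horizontal scale
    d = (dy1 - dy0) / (sy1 - sy0)  # vertical scale
    e = dx0 - sx0 * a              # horizontal shift
    f = dy0 - sy0 * d              # vertical shift
    return (a, 0.0, 0.0, d, e, f)


def transform(point, m):
    """Apply matrix m to a point (x, y)."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


where = (50, 100, 250, 500)  # target rectangle on the existing page
mediabox = (0, 0, where[2] - where[0], where[3] - where[1])
matrix = rect_to_rect(mediabox, where)

print(matrix)                                            # (1.0, 0.0, 0.0, 1.0, 50.0, 100.0)
print(transform((0, 0), matrix))                         # (50.0, 100.0)
print(transform((mediabox[2], mediabox[3]), matrix))     # (250.0, 500.0)
```

Because the temporary page has the same width and height as the target rectangle, the scale factors are 1 and the transform reduces to a shift by the rectangle's top-left corner.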

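  • How to make a multi-column layout and use fonts from the pymupdf-fonts package

This script outputs an article (taken from Wikipedia) that contains text and multiple images and uses a two-column page layout.

In addition, two "Ubuntu" font families from the pymupdf-fonts package are used instead of the default Base-14 fonts.

Another feature used here is that all data, the images and the article HTML, are stored together in a ZIP file.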
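The ZIP handling itself needs only the standard library. As a minimal sketch, the snippet below builds a small ZIP file in memory and reads an HTML member back as text. The member names and contents are invented for illustration, and UTF-8 is assumed as the encoding.

```python
import io
import zipfile

# Build a small ZIP in memory; member names and contents are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("article.html", "<p>The quick brown fox</p>")
    zf.writestr("fox.png", b"placeholder bytes, not a real image")

# Re-open the same buffer for reading, as a script would open its data file.
with zipfile.ZipFile(buffer) as myzip:
    names = myzip.namelist()
    HTML = myzip.read("article.html").decode("utf-8")  # bytes -> str

print(names)
print(HTML)
```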

Files:

  • docs/samples/quickfox.py
  • docs/samples/quickfox.zip

See recipe

"""
This is a demo script using PyMuPDF's Story class to output text as a PDF with
a two-column page layout.
The script demonstrates the following features:
* How to fill columns or table cells of complex page layouts
* How to embed images
* How to modify existing, given HTML sources for output (text indent, font size)
* How to use fonts defined in package "pymupdf-fonts"
* How to use ZIP files as Archive
--------------
The example is taken from the somewhat modified Wikipedia article
https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog.
--------------
"""
import io
import os
import zipfile
import pymupdf
thisdir = os.path.dirname(os.path.abspath(__file__))
myzip = zipfile.ZipFile(os.path.join(thisdir, "quickfox.zip"))
arch = pymupdf.Archive(myzip)
if pymupdf.fitz_fontdescriptors:
    # we want to use the Ubuntu fonts for sans-serif and for monospace
    CSS = pymupdf.css_for_pymupdf_font("ubuntu", archive=arch, name="sans-serif")
    CSS = pymupdf.css_for_pymupdf_font("ubuntm", CSS=CSS, archive=arch, name="monospace")
else:
    # No pymupdf-fonts available.
    CSS=""
docname = __file__.replace(".py", ".pdf")  # output PDF file name
HTML = myzip.read("quickfox.html").decode()
# make the Story object
story = pymupdf.Story(HTML, user_css=CSS, archive=arch)
# --------------------------------------------------------------
# modify the DOM somewhat
# --------------------------------------------------------------
body = story.body  # access HTML body
body.set_properties(font="sans-serif")  # and give it our font globally
# modify certain nodes
para = body.find("p", None, None)  # find relevant nodes (here: paragraphs)
while para is not None:
    para.set_properties(  # method MUST be used for existing nodes
        indent=15,
        fontsize=13,
    )
    para = para.find_next("p", None, None)
# choose PDF page size
MEDIABOX = pymupdf.paper_rect("letter")
# text appears only within this subrectangle
WHERE = MEDIABOX + (36, 36, -36, -36)
# --------------------------------------------------------------
# define page layout within the WHERE rectangle
# --------------------------------------------------------------
COLS = 2  # layout: 2 cols 1 row
ROWS = 1
TABLE = pymupdf.make_table(WHERE, cols=COLS, rows=ROWS)
# fill the cells of each page in this sequence:
CELLS = [TABLE[i][j] for i in range(ROWS) for j in range(COLS)]
fileobject = io.BytesIO()  # let DocumentWriter write to memory
writer = pymupdf.DocumentWriter(fileobject)  # define the writer
more = 1
while more:  # loop until all input text has been written out
    dev = writer.begin_page(MEDIABOX)  # prepare a new output page
    for cell in CELLS:
        # content may be complete after any cell, ...
        if more:  # so check this status first
            more, _ = story.place(cell)
            story.draw(dev)
    writer.end_page()  # finish the PDF page
writer.close()  # close DocumentWriter output
# for housekeeping work re-open from memory
doc = pymupdf.open("pdf", fileobject)
doc.ez_save(docname) 
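The list of cells that the loop above fills in order can be pictured with a few lines of plain Python. This sketch assumes equal-width columns without any gutter between them and uses tuples in place of rectangle objects; `split_columns` is a hypothetical helper, not a PyMuPDF function.

```python
def split_columns(rect, cols):
    """Split rect (x0, y0, x1, y1) into `cols` equal-width column rectangles."""
    x0, y0, x1, y1 = rect
    width = (x1 - x0) / cols
    return [(x0 + i * width, y0, x0 + (i + 1) * width, y1) for i in range(cols)]


WHERE = (36, 36, 576, 756)  # Letter page minus a 36-point border
CELLS = split_columns(WHERE, 2)

for cell in CELLS:  # left column first, then right column
    print(cell)
```

Filling the cells in list order is what makes text flow down the left column before continuing at the top of the right one.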
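  • How to make a layout that wraps around predefined "no go areas"

This is a demo script that uses PyMuPDF's Story class to output text as a PDF with a two-column page layout.

The script demonstrates the following features:

  • Text is laid out around the images of an existing ("target") PDF.
  • Based on a few global parameters, areas are identified on each page that can receive text laid out by a Story.
  • These global parameters are not stored anywhere in the target PDF and must therefore be provided in some way:
      • The width of the border(s) on each page.
      • The font size to use for text. This value determines whether the provided text fits into the empty spaces of the (fixed) pages of the target PDF. This cannot be predicted. The script ends with an exception if the target PDF does not have enough pages, and prints a warning message if not all pages receive at least some text. In both cases, the FONTSIZE value (a float) can be changed.
      • Use of a two-column page layout for the text.
  • The layout creates a temporary (in-memory) PDF. Its page content (the text) is used to overlay the corresponding target page. If the text requires more pages than are available in the target PDF, an exception is raised. If not all target pages receive at least some text, a warning is printed.
  • The script reads "image-no-go.pdf" in its own folder. This is the "target" PDF. It contains 2 pages with 2 images each (from the original article), positioned so that they create a broad overall test coverage. Otherwise the pages are empty.
  • The script produces "quickfox-image-no-go.pdf", which contains the original pages and image positions, but with the original article text laid out around them.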
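The core of identifying free areas is a one-dimensional problem: within one column, find the vertical stripes that no image covers. The following is a simplified, pure-Python sketch of that idea using only y-coordinates. The function name and the `min_height` parameter are hypothetical, and unlike a full implementation it does not test whether an image actually intersects the column horizontally.

```python
def free_stripes(col_top, col_bottom, image_spans, min_height):
    """Return (y0, y1) stripes of a column that are not covered by images.

    image_spans: (y0, y1) pairs of the images in this column, sorted top to bottom.
    min_height: stripes shorter than this (e.g. one text line) are skipped.
    """
    stripes = []
    top = col_top  # start of the current candidate stripe
    for y0, y1 in image_spans:
        if y0 > top + min_height:   # enough room above this image
            stripes.append((top, y0))
        top = max(top, y1)          # next stripe starts below the image
    if top + min_height < col_bottom:  # room left down to the column bottom
        stripes.append((top, col_bottom))
    return stripes


# Column from y=36 to y=756 with two images; text needs at least 12.5 points.
two_images = free_stripes(36, 756, [(100, 200), (400, 500)], 12.5)
# An image sitting at the very top of the column leaves only the space below it.
image_at_top = free_stripes(36, 756, [(36, 300)], 12.5)

print(two_images)    # [(36, 100), (200, 400), (500, 756)]
print(image_at_top)  # [(300, 756)]
```

Each resulting stripe, combined with the column's left and right edges, gives one rectangle that can be handed to the story for placing text.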

Files:

  • docs/samples/quickfox-image-no-go.py
  • docs/samples/quickfox-image-no-go.pdf
  • docs/samples/quickfox.zip

See recipe

"""
This is a demo script using PyMuPDF's Story class to output text as a PDF with
a two-column page layout.
The script demonstrates the following features:
* Layout text around images of an existing ("target") PDF.
* Based on a few global parameters, areas on each page are identified, that
 can be used to receive text layouted by a Story.
* These global parameters are not stored anywhere in the target PDF and
 must therefore be provided in some way.
 - The width of the border(s) on each page.
 - The fontsize to use for text. This value determines whether the provided
 text will fit in the empty spaces of the (fixed) pages of target PDF. It
 cannot be predicted in any way. The script ends with an exception if
 target PDF has not enough pages, and prints a warning message if not all
 pages receive at least some text. In both cases, the FONTSIZE value
 can be changed (a float value).
 - Use of a 2-column page layout for the text.
* The layout creates a temporary (memory) PDF. Its produced page content
 (the text) is used to overlay the corresponding target page. If text
 requires more pages than are available in target PDF, an exception is raised.
 If not all target pages receive at least some text, a warning is printed.
* The script reads "image-no-go.pdf" in its own folder. This is the "target" PDF.
 It contains 2 pages with each 2 images (from the original article), which are
 positioned at places that create a broad overall test coverage. Otherwise the
 pages are empty.
* The script produces "quickfox-image-no-go.pdf" which contains the original pages
 and image positions, but with the original article text laid out around them.
Note:
--------------
This script version uses just image positions to derive "No-Go areas" for
layouting the text. Other PDF objects types are detectable by PyMuPDF and may
be taken instead or in addition, without influencing the layouting.
The following are candidates for other such "No-Go areas". Each can be detected
and located by PyMuPDF:
* Annotations
* Drawings
* Existing text
--------------
The text and images are taken from the somewhat modified Wikipedia article
https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog.
--------------
"""
import io
import os
import zipfile
import pymupdf
thisdir = os.path.dirname(os.path.abspath(__file__))
myzip = zipfile.ZipFile(os.path.join(thisdir, "quickfox.zip"))
docname = os.path.join(thisdir, "image-no-go.pdf")  # "no go" input PDF file name
outname = os.path.join(thisdir, "quickfox-image-no-go.pdf")  # output PDF file name
BORDER = 36  # global parameter
FONTSIZE = 12.5  # global parameter
COLS = 2  # number of text columns, global parameter
def analyze_page(page):
    """Compute MediaBox and rectangles on page that are free to receive text.

    Notes:
        Assume a BORDER around the page, make 2 columns of the resulting
        sub-rectangle and extract the rectangles of all images on page.
        For demo purposes, the image rectangles are taken as "NO-GO areas"
        on the page when writing text with the Story.
        The function returns free areas for each of the columns.
    Returns:
        (page.number, mediabox, CELLS), where CELLS is a list of free cells.
    """
    prect = page.rect  # page rectangle - will be our MEDIABOX later
    where = prect + (BORDER, BORDER, -BORDER, -BORDER)
    TABLE = pymupdf.make_table(where, rows=1, cols=COLS)
    # extract rectangles covered by images on this page
    IMG_RECTS = sorted(  # image rects on page (sort top-left to bottom-right)
        [pymupdf.Rect(item["bbox"]) for item in page.get_image_info()],
        key=lambda b: (b.y1, b.x0),
    )
    def free_cells(column):
        """Return free areas in this column."""
        free_stripes = []  # y-value pairs wrapping a free area stripe
        # intersecting images: block complete intersecting column stripe
        col_imgs = [(b.y0, b.y1) for b in IMG_RECTS if abs(b & column) > 0]
        s_y0 = column.y0  # top y-value of column
        for y0, y1 in col_imgs:  # an image stripe
            if y0 > s_y0 + FONTSIZE:  # image starts below last free btm value
                free_stripes.append((s_y0, y0))  # store as free stripe
            s_y0 = y1  # start of next free stripe
        if s_y0 + FONTSIZE < column.y1:  # enough room to column bottom
            free_stripes.append((s_y0, column.y1))
        if free_stripes == []:  # covers "no image in this column"
            free_stripes.append((column.y0, column.y1))
        # make available cells of this column
        CELLS = [pymupdf.Rect(column.x0, y0, column.x1, y1) for (y0, y1) in free_stripes]
        return CELLS
    # collection of available Story rectangles on page
    CELLS = []
    for i in range(COLS):
        CELLS.extend(free_cells(TABLE[0][i]))
    return page.number, prect, CELLS
HTML = myzip.read("quickfox.html").decode()
# --------------------------------------------------------------
# Make the Story object
# --------------------------------------------------------------
story = pymupdf.Story(HTML)
# modify the DOM somewhat
body = story.body  # access HTML body
body.set_properties(font="sans-serif")  # and give it our font globally
# modify certain nodes
para = body.find("p", None, None)  # find relevant nodes (here: paragraphs)
while para is not None:
    para.set_properties(  # method MUST be used for existing nodes
        indent=15,
        fontsize=FONTSIZE,
    )
    para = para.find_next("p", None, None)
# we remove all image references, because the target PDF already has them
img = body.find("img", None, None)
while img is not None:
    next_img = img.find_next("img", None, None)
    img.remove()
    img = next_img
page_info = {}  # contains MEDIABOX and free CELLS per page
doc = pymupdf.open(docname)
for page in doc:
    pno, mediabox, cells = analyze_page(page)
    page_info[pno] = (mediabox, cells)
doc.close()  # close target PDF for now - re-open later
fileobject = io.BytesIO()  # let DocumentWriter write to memory
writer = pymupdf.DocumentWriter(fileobject)  # define output writer
more = 1  # stop if this ever becomes zero
pno = 0  # count output pages
while more:  # loop until all HTML text has been written
    try:
        MEDIABOX, CELLS = page_info[pno]
    except KeyError:  # too much text space required: reduce fontsize?
        raise ValueError("text does not fit on target PDF")
    dev = writer.begin_page(MEDIABOX)  # prepare a new output page
    for cell in CELLS:  # iterate over free cells on this page
        if not more:  # need to check this for every cell
            continue
        more, _ = story.place(cell)
        story.draw(dev)
    writer.end_page()  # finish the PDF page
    pno += 1
writer.close()  # close DocumentWriter output
# Re-open writer output, read its pages and overlay target pages with them.
# The generated pages have same dimension as their targets.
src = pymupdf.open("pdf", fileobject)
doc = pymupdf.open(docname)  # re-open the target PDF
for page in doc:  # overlay every target page with the prepared text
    if page.number >= src.page_count:
        print(f"Text only uses {src.page_count} target pages!")
        continue  # story did not need all target pages?
    # overlay target page
    page.show_pdf_page(page.rect, src, page.number)
    # DEBUG start --- draw the text rectangles
    # mb, cells = page_info[page.number]
    # for cell in cells:
    #     page.draw_rect(cell, color=(1, 0, 0))
    # DEBUG stop ---
doc.ez_save(outname) 
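  • How to output an HTML table

Outputting HTML tables is supported as follows:

  • Flat table layouts ("rows x columns") are supported; the "colspan" / "rowspan" attributes are not.
  • The table header tag th supports the attribute "scope" with values "row" or "col". The applicable text is bold by default.
  • Column widths are computed automatically from the column content and cannot be set directly.
  • Table cells may contain images, which are taken into account when the column widths are computed.
  • Row heights are computed automatically from the row content, resulting in multi-line rows where needed.
  • The potentially multiple lines of a table row are always kept together on one page (that is, one "where" rectangle) and are not split.
  • Table header rows are shown only on the first page / "where" rectangle.
  • A "style" attribute given directly in HTML table elements is ignored. Styling for a table and its elements must be defined in a CSS source or in a style tag.
  • Styling for tr elements is not supported and is ignored. Therefore, a table-wide grid or alternating row background colors are not supported. One of the following example scripts shows an easy way to deal with this limitation.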
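Given these constraints, HTML for a flat table can also be assembled as a plain string, with all styling kept in a separate CSS string. The sketch below is one possible way to do this with the standard library only; `build_table_html` is a hypothetical helper, and the sample rows and CSS rules are invented for illustration. Cell text is escaped so that characters such as "<" do not end up as markup.

```python
import html


def build_table_html(header, rows):
    """Return HTML for a flat table; every row needs one cell per column."""
    parts = ["<table>"]
    parts.append(
        "<tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in header) + "</tr>"
    )
    for row in rows:
        if len(row) != len(header):
            # colspan / rowspan are not available, so rows must be complete
            raise ValueError("every row needs exactly one cell per column")
        parts.append(
            "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        )
    parts.append("</table>")
    return "\n".join(parts)


# Styling lives in CSS, not in "style" attributes of the table elements.
CSS = "th { background-color: #aaa; } td { border-top: 1px solid black; }"

table_html = build_table_html(
    ("KEY", "TYPE"),
    [("Length", "integer"), ("Filter", "name <or> array")],
)
print(table_html)
```

The two strings could then be handed to a story in the same way as the HTML and CSS sources in the earlier examples.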

Files:

  • docs/samples/table01.py This script reflects the basic features.

See recipe

"""
Demo script for basic HTML table support in Story objects
Outputs a table with three columns that fits on one Letter page.
The content of each row is filled via the Story's template mechanism.
Column widths and row heights are automatically computed by MuPDF.
Some styling via a CSS source is also demonstrated:
- The table header row has a gray background
- Each cell shows a border at its top
- The Story's body uses the sans-serif font family
- The text of one of the columns is set to blue
Dependencies
-------------
PyMuPDF v1.22.0 or later
"""
import pymupdf
table_text = (  # the content of each table row
    (
        "Length",
        "integer",
  """(Required) The number of bytes from the beginning of the line following the keyword stream to the last byte just before the keyword endstream. (There may be an additional EOL marker, preceding endstream, that is not included in the count and is not logically part of the stream data.) See “Stream Extent,” above, for further discussion.""",
    ),
    (
        "Filter",
        "name or array",
  """(Optional) The name of a filter to be applied in processing the stream data found between the keywords stream and endstream, or an array of such names. Multiple filters should be specified in the order in which they are to be applied.""",
    ),
    (
        "FFilter",
        "name or array",
  """(Optional; PDF 1.2) The name of a filter to be applied in processing the data found in the stream's external file, or an array of such names. The same rules apply as for Filter.""",
    ),
    (
        "FDecodeParms",
        "dictionary or array",
  """(Optional; PDF 1.2) A parameter dictionary, or an array of such dictionaries, used by the filters specified by FFilter. The same rules apply as for DecodeParms.""",
    ),
    (
        "DecodeParms",
        "dictionary or array",
  """(Optional) A parameter dictionary or an array of such dictionaries, used by the filters specified by Filter. If there is only one filter and that filter has parameters, DecodeParms must be set to the filter's parameter dictionary unless all the filter's parameters have their default values, in which case the DecodeParms entry may be omitted. If there are multiple filters and any of the filters has parameters set to nondefault values, DecodeParms must be an array with one entry for each filter: either the parameter dictionary for that filter, or the null object if that filter has no parameters (or if all of its parameters have their default values). If none of the filters have parameters, or if all their parameters have default values, the DecodeParms entry may be omitted. (See implementation note 7 in Appendix H.)""",
    ),
    (
        "DL",
        "integer",
  """(Optional; PDF 1.5) A non-negative integer representing the number of bytes in the decoded (defiltered) stream. It can be used to determine, for example, whether enough disk space is available to write a stream to a file.\nThis value should be considered a hint only; for some stream filters, it may not be possible to determine this value precisely.""",
    ),
    (
        "F",
        "file specification",
  """(Optional; PDF 1.2) The file containing the stream data. If this entry is present, the bytes between stream and endstream are ignored, the filters are specified by FFilter rather than Filter, and the filter parameters are specified by FDecodeParms rather than DecodeParms. However, the Length entry should still specify the number of those bytes. (Usually, there are no bytes and Length is 0.) (See implementation note 46 in Appendix H.)""",
    ),
)
# Only a minimal HTML source is required to provide the Story's working
HTML = """
<html>
<body><h2>TABLE 3.4 Entries common to all stream dictionaries</h2>
<table>
 <tr>
 <th>KEY</th><th>TYPE</th><th>VALUE</th>
 </tr>
 <tr id="row">
 <td id="col0"></td><td id="col1"></td><td id="col2"></td>
 </tr>
"""
"""
---------------------------------------------------------------------
Just for demo purposes, set:
- header cell background to gray
- text color in col1 to blue
- a border line at the top of all table cells
- all text to the sans-serif font
---------------------------------------------------------------------
"""
CSS = """th {
 background-color: #aaa;
}
td[id="col1"] {
 color: blue;
}
td, tr {
 border: 1px solid black;
 border-right-width: 0px;
 border-left-width: 0px;
 border-bottom-width: 0px;
}
body {
 font-family: sans-serif;
}
"""
story = pymupdf.Story(HTML, user_css=CSS)  # define the Story
body = story.body  # access the HTML <body> of it
template = body.find(None, "id", "row")  # find the template with name "row"
parent = template.parent  # access its parent i.e., the <table>
for col0, col1, col2 in table_text:
    row = template.clone()  # make a clone of the row template
    # add text to each cell in the duplicated row
    row.find(None, "id", "col0").add_text(col0)
    row.find(None, "id", "col1").add_text(col1)
    row.find(None, "id", "col2").add_text(col2)
    parent.append_child(row)  # add new row to <table>
template.remove()  # remove the template
# Story is ready - output it via a writer
writer = pymupdf.DocumentWriter(__file__.replace(".py", ".pdf"), "compress")
mediabox = pymupdf.paper_rect("letter")  # size of one output page
where = mediabox + (36, 36, -36, -36)  # use this sub-area for the content
more = True  # detects end of output
while more:
    dev = writer.begin_page(mediabox)  # start a page, returning a device
    more, filled = story.place(where)  # compute content fitting into "where"
    story.draw(dev)  # output it to the page
    writer.end_page()  # finalize the page
writer.close()  # close the output 
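  • docs/samples/national-capitals.py An advanced script that extends the table output options with simple additional code:
      • Multi-page output that simulates repeated header rows
      • Alternating table row background colors
      • Table rows and columns separated by grid lines
      • Table rows generated dynamically and filled with data from an SQL database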


Continued in PyMuPDF 1.24.4 Chinese Documentation (Part 4) (2): https://developer.aliyun.com/article/1559455
