PyMuPDF 1.24.4 Chinese Documentation (Part 4) (3)


Continued from PyMuPDF 1.24.4 Chinese Documentation (Part 4) (2): https://developer.aliyun.com/article/1559455


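How to Use Images

Images can be referenced in the provided HTML source, or a reference to the desired image can be stored via the Python API. In either case, an Archive is required, which tells the Story where the image can be found.

Note

Images with binary content embedded in the HTML source are not supported.

We extend our "Hello World" example from above and display an image of our planet after the text. Assuming the image is named "world.jpg" and is located in the script's folder, the modified version of the Python API variant above looks like this: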

import pymupdf
MEDIABOX = pymupdf.paper_rect("letter")
WHERE = MEDIABOX + (36, 36, -36, -36)
# create story, let it look at script folder for resources
story = pymupdf.Story(archive=".")
body = story.body  # access the body of its DOM
with body.add_paragraph() as para:
    # store desired content
    para.set_font("sans-serif").set_color("blue").add_text("Hello World!")
# another paragraph for our image:
with body.add_paragraph() as para:
    # store image in another paragraph
    para.add_image("world.jpg")
writer = pymupdf.DocumentWriter("output.pdf")
more = 1
while more:
    device = writer.begin_page(MEDIABOX)
    more, _ = story.place(WHERE)
    story.draw(device)
    writer.end_page()
writer.close() 

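How to Read External HTML and CSS for a Story

These cases are fairly straightforward.

As a general recommendation, HTML and CSS source files should be read as binary files and decoded before use. Python's pathlib.Path provides convenient methods for this: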

import pathlib
import pymupdf
htmlpath = pathlib.Path("myhtml.html")
csspath = pathlib.Path("mycss.css")
HTML = htmlpath.read_bytes().decode()
CSS = csspath.read_bytes().decode()
story = pymupdf.Story(html=HTML, user_css=CSS) 
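
The snippet above relies on bytes.decode() defaulting to UTF-8. As a minimal sketch, assuming the source files are UTF-8 encoded, the encoding can be spelled out in a small helper. The helper name read_text_source and the throwaway demo files are illustrative only and are not part of PyMuPDF:

```python
import pathlib
import tempfile


def read_text_source(path, encoding="utf-8"):
    """Read a file as bytes and decode it explicitly (hypothetical helper)."""
    return pathlib.Path(path).read_bytes().decode(encoding)


# Demo with throwaway files; the file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    htmlpath = pathlib.Path(tmp) / "myhtml.html"
    csspath = pathlib.Path(tmp) / "mycss.css"
    htmlpath.write_bytes("<p>Grüße, 世界</p>".encode("utf-8"))
    csspath.write_bytes(b"p { color: blue; }")

    HTML = read_text_source(htmlpath)
    CSS = read_text_source(csspath)

print(HTML)
print(CSS)
# The two strings could then be passed on as in the snippet above:
# story = pymupdf.Story(html=HTML, user_css=CSS)
```

Reading bytes and decoding them explicitly keeps the result independent of the platform's default text encoding.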

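How to Output Database Content with Story Templates

This script demonstrates how to report SQL database content using an HTML template.

The example SQL database contains two tables:

  1. Table "films" contains one row per film, with the fields "title", "director" and (release) "year".
  2. Table "actors" contains one row per actor and film title, with the fields (actor) "name" and (film) "title".

The Story DOM consists of a template for one film, which reports the film data together with a list of its cast.

Files:

  • docs/samples/filmfestival-sql.py
  • docs/samples/filmfestival-sql.db

See recipe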

"""
This is a demo script for using PyMuPDF with its "Story" feature.
The following aspects are being covered here:
* The script produces a report of films that are stored in an SQL database
* The report format is provided as a HTML template
The SQL database contains two tables:
1. Table "films" which has the columns "title" (film title, str), "director"
 (str) and "year" (year of release, int).
2. Table "actors" which has the columns "name" (actor name, str) and "title"
 (the title of the film in which the actor was cast, str).
The script reads all content of the "films" table. For each film title it
reads all rows from table "actors" for the actors who took part in that film.
Comment 1
---------
To keep things easy and free from pesky technical detail, the relevant file
names inherit the name of this script:
- the database's filename is the script name with ".py" extension replaced
 by ".db".
- the output PDF similarly has script file name with extension ".pdf".
Comment 2
---------
The SQLITE database has been created using https://sqlitebrowser.org/, a free
multi-platform tool to maintain or manipulate SQLITE databases.
"""
import os
import sqlite3
import pymupdf
# ----------------------------------------------------------------------
# HTML template for the film report
# There are four placeholders coded as "id" attributes.
# One "id" allows locating the template part itself, the other three
# indicate where database text should be inserted.
# ----------------------------------------------------------------------
festival_template = (
    "<html><head><title>Just some arbitrary text</title></head>"
    '<body><h1 style="text-align:center">Hook Norton Film Festival</h1>'
    "<ol>"
    '<li id="filmtemplate">'
    '<b id="filmtitle"></b>'
    "<dl>"
    '<dt>Director<dd id="director">'
    '<dt>Release Year<dd id="filmyear">'
    '<dt>Cast<dd id="cast">'
    "</dl>"
    "</li>"
    "</ol>"
    "</body></html>"
)
# -------------------------------------------------------------------
# define database access
# -------------------------------------------------------------------
dbfilename = __file__.replace(".py", ".db")  # the SQLITE database file name
assert os.path.isfile(dbfilename), f'{dbfilename}'
database = sqlite3.connect(dbfilename)  # open database
cursor_films = database.cursor()  # cursor for selecting the films
cursor_casts = database.cursor()  # cursor for selecting actors per film
# select statement for the films - let SQL also sort it for us
select_films = """SELECT title, director, year FROM films ORDER BY title"""
# select statement for actors, a skeleton: sub-select by film title
select_casts = """SELECT name FROM actors WHERE film = "%s" ORDER BY name"""
# -------------------------------------------------------------------
# define the HTML Story and fill it with database data
# -------------------------------------------------------------------
story = pymupdf.Story(festival_template)
body = story.body  # access the HTML body detail
template = body.find(None, "id", "filmtemplate")  # find the template part
# read the films from the database and put them all in one Python list
# NOTE: instead we might fetch rows one by one (advisable for large volumes)
cursor_films.execute(select_films)  # execute cursor, and ...
films = cursor_films.fetchall()  # read out what was found
for title, director, year in films:  # iterate through the films
    film = template.clone()  # clone template to report each film
    film.find(None, "id", "filmtitle").add_text(title)  # put title in templ
    film.find(None, "id", "director").add_text(director)  # put director
    film.find(None, "id", "filmyear").add_text(str(year))  # put year
    # the actors reside in their own table - find the ones for this film title
    cursor_casts.execute(select_casts % title)  # execute cursor
    casts = cursor_casts.fetchall()  # read actors for the film
    # each actor name appears in its own tuple, so extract it from there
    film.find(None, "id", "cast").add_text("\n".join([c[0] for c in casts]))
    body.append_child(film)
template.remove()  # remove the template
# -------------------------------------------------------------------
# generate the PDF
# -------------------------------------------------------------------
writer = pymupdf.DocumentWriter(__file__.replace(".py", ".pdf"), "compress")
mediabox = pymupdf.paper_rect("a4")  # use pages in ISO-A4 format
where = mediabox + (72, 36, -36, -72)  # leave page borders
more = 1  # end of output indicator
while more:
    dev = writer.begin_page(mediabox)  # make a new page
    more, filled = story.place(where)  # arrange content for this page
    story.draw(dev, None)  # write content to page
    writer.end_page()  # finish the page
writer.close()  # close the PDF 
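
The script above inserts the film title into the SQL text with the % operator, which fails for titles that contain quote characters. The following minimal sketch shows the same two-table lookup with sqlite3 "?" placeholders instead. It is not the sample script's code: the in-memory database and its rows are made up, and the actors column is named "title" as in the table description (the script's own select statement uses "film").

```python
import sqlite3

# In-memory stand-in for the sample database; all rows are made-up test data.
database = sqlite3.connect(":memory:")
database.execute("CREATE TABLE films (title TEXT, director TEXT, year INTEGER)")
database.execute("CREATE TABLE actors (name TEXT, title TEXT)")
database.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [('The "B" Movie', "A. Director", 2001), ("A Film", "B. Director", 1999)],
)
database.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [("Zoe", 'The "B" Movie'), ("Adam", 'The "B" Movie'), ("Eve", "A Film")],
)

select_films = "SELECT title, director, year FROM films ORDER BY title"
# "?" is a placeholder: sqlite3 binds the value itself, so no quoting is needed
select_casts = "SELECT name FROM actors WHERE title = ? ORDER BY name"

report = {}
for title, director, year in database.execute(select_films).fetchall():
    casts = database.execute(select_casts, (title,)).fetchall()
    report[title] = [c[0] for c in casts]  # each name sits in its own tuple
database.close()
print(report)
```

With placeholders, a title such as 'The "B" Movie' is matched literally, and the query text stays the same for every film.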

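How to Integrate with Existing PDFs

Because a DocumentWriter can only write to a new file, a Story cannot be placed on an existing page. This script demonstrates a way around this restriction.

The basic idea is to let DocumentWriter output to a PDF in memory. Once the Story has finished, we re-open this memory PDF and put its pages at the desired positions on existing pages via the method Page.show_pdf_page().

Files:

  • docs/samples/showpdf-page.py

See recipe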

"""
Demo of Story class in PyMuPDF
-------------------------------
This script demonstrates how the results of a pymupdf.Story output can be
placed in a rectangle of an existing (!) PDF page.
"""
import io
import os
import pymupdf
def make_pdf(fileptr, text, rect, font="sans-serif", archive=None):
    """Make a memory DocumentWriter from HTML text and a rect.
    Args:
        fileptr: a Python file object. For example an io.BytesIO().
        text: the text to output (HTML format)
        rect: the target rectangle. Will use its width / height as mediabox
        font: (str) font family name, default sans-serif
        archive: pymupdf.Archive parameter. To be used if e.g. images or special
            fonts should be used.
    Returns:
        The matrix to convert page rectangles of the created PDF back
        to rectangle coordinates in the parameter "rect".
        Normal use will expect to fit all the text in the given rect.
        However, if an overflow occurs, this function will output multiple
        pages, and the caller may decide to either accept or retry with
        changed parameters.
    """
    # use input rectangle as the page dimension
    mediabox = pymupdf.Rect(0, 0, rect.width, rect.height)
    # this matrix converts mediabox back to input rect
    matrix = mediabox.torect(rect)
    story = pymupdf.Story(text, archive=archive)
    body = story.body
    body.set_properties(font=font)
    writer = pymupdf.DocumentWriter(fileptr)
    while True:
        device = writer.begin_page(mediabox)
        more, _ = story.place(mediabox)
        story.draw(device)
        writer.end_page()
        if not more:
            break
    writer.close()
    return matrix
# -------------------------------------------------------------
# We want to put this in a given rectangle of an existing page
# -------------------------------------------------------------
HTML = """
<p>PyMuPDF is a great package! And it still improves significantly from one version to the next one!</p>
<p>It is a Python binding for <b>MuPDF</b>, a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit.<br> Both are maintained and developed by Artifex Software, Inc.</p>
<p>Via MuPDF it can access files in PDF, XPS, OpenXPS, CBZ, EPUB, MOBI and FB2 (e-books) formats,<br> and it is known for its top
<b><i>performance</i></b> and <b><i>rendering quality</i></b>.</p>"""
# Make a PDF page for demo purposes
root = os.path.abspath( f"{__file__}/..")
doc = pymupdf.open(f"{root}/mupdf-title.pdf")
page = doc[0]
WHERE = pymupdf.Rect(50, 100, 250, 500)  # target rectangle on existing page
fileptr = io.BytesIO()  # let DocumentWriter use this as its file
# -------------------------------------------------------------------
# call DocumentWriter and Story to fill our rectangle
matrix = make_pdf(fileptr, HTML, WHERE)
# -------------------------------------------------------------------
src = pymupdf.open("pdf", fileptr)  # open DocumentWriter output PDF
if src.page_count > 1:  # target rect was too small
    raise ValueError("target WHERE too small")
# its page 0 contains our result
page.show_pdf_page(WHERE, src, 0)
doc.ez_save(f"{root}/mupdf-title-after.pdf") 
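
The matrix returned by make_pdf() maps coordinates of the temporary page back to the target rectangle. The plain-Python sketch below shows the arithmetic behind such a rectangle-to-rectangle mapping: a scale followed by a shift. The helpers rect_to_rect_matrix and transform_point are hypothetical and only illustrate what Rect.torect() computes. They are not PyMuPDF functions and use plain tuples instead of Rect and Matrix objects.

```python
def rect_to_rect_matrix(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst.

    Rectangles are (x0, y0, x1, y1) tuples. A point (x, y) is mapped to
    (a*x + c*y + e, b*x + d*y + f). Hypothetical helper, not a PyMuPDF API.
    """
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    a = (dx1 - dx0) / (sx1 - sx0)  # horizontal scale
    d = (dy1 - dy0) / (sy1 - sy0)  # vertical scale
    e = dx0 - sx0 * a  # horizontal shift
    f = dy0 - sy0 * d  # vertical shift
    return (a, 0.0, 0.0, d, e, f)


def transform_point(point, matrix):
    """Apply the 6-value matrix to a point (x, y)."""
    x, y = point
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


WHERE = (50, 100, 250, 500)  # target rectangle, as in the script above
mediabox = (0, 0, WHERE[2] - WHERE[0], WHERE[3] - WHERE[1])
matrix = rect_to_rect_matrix(mediabox, WHERE)
print(matrix)
print(transform_point((0, 0), matrix))  # top-left corner of the mediabox
print(transform_point((200, 400), matrix))  # bottom-right corner
```

Because the mediabox has the same width and height as the target rectangle, the scale factors are 1 and the matrix reduces to a pure shift by the rectangle's top-left corner.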

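How to Make Multi-Column Layouts and Access Fonts from Package pymupdf-fonts

This script outputs an article (taken from Wikipedia) that contains text and multiple images, using a two-column page layout.

In addition, two "Ubuntu" font families from package pymupdf-fonts are used instead of the default Base-14 fonts.

Another feature used here is that all data, both the images and the article HTML, are stored together in a ZIP file.

Files:

  • docs/samples/quickfox.py
  • docs/samples/quickfox.zip

See recipe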

"""
This is a demo script using PyMuPDF's Story class to output text as a PDF with
a two-column page layout.
The script demonstrates the following features:
* How to fill columns or table cells of complex page layouts
* How to embed images
* How to modify existing, given HTML sources for output (text indent, font size)
* How to use fonts defined in package "pymupdf-fonts"
* How to use ZIP files as Archive
--------------
The example is taken from the somewhat modified Wikipedia article
https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog.
--------------
"""
import io
import os
import zipfile
import pymupdf
thisdir = os.path.dirname(os.path.abspath(__file__))
myzip = zipfile.ZipFile(os.path.join(thisdir, "quickfox.zip"))
arch = pymupdf.Archive(myzip)
if pymupdf.fitz_fontdescriptors:
    # we want to use the Ubuntu fonts for sans-serif and for monospace
    CSS = pymupdf.css_for_pymupdf_font("ubuntu", archive=arch, name="sans-serif")
    CSS = pymupdf.css_for_pymupdf_font("ubuntm", CSS=CSS, archive=arch, name="monospace")
else:
    # No pymupdf-fonts available.
    CSS=""
docname = __file__.replace(".py", ".pdf")  # output PDF file name
HTML = myzip.read("quickfox.html").decode()
# make the Story object
story = pymupdf.Story(HTML, user_css=CSS, archive=arch)
# --------------------------------------------------------------
# modify the DOM somewhat
# --------------------------------------------------------------
body = story.body  # access HTML body
body.set_properties(font="sans-serif")  # and give it our font globally
# modify certain nodes
para = body.find("p", None, None)  # find relevant nodes (here: paragraphs)
while para is not None:
    para.set_properties(  # method MUST be used for existing nodes
        indent=15,
        fontsize=13,
    )
    para = para.find_next("p", None, None)
# choose PDF page size
MEDIABOX = pymupdf.paper_rect("letter")
# text appears only within this subrectangle
WHERE = MEDIABOX + (36, 36, -36, -36)
# --------------------------------------------------------------
# define page layout within the WHERE rectangle
# --------------------------------------------------------------
COLS = 2  # layout: 2 cols 1 row
ROWS = 1
TABLE = pymupdf.make_table(WHERE, cols=COLS, rows=ROWS)
# fill the cells of each page in this sequence:
CELLS = [TABLE[i][j] for i in range(ROWS) for j in range(COLS)]
fileobject = io.BytesIO()  # let DocumentWriter write to memory
writer = pymupdf.DocumentWriter(fileobject)  # define the writer
more = 1
while more:  # loop until all input text has been written out
    dev = writer.begin_page(MEDIABOX)  # prepare a new output page
    for cell in CELLS:
        # content may be complete after any cell, ...
        if more:  # so check this status first
            more, _ = story.place(cell)
            story.draw(dev)
    writer.end_page()  # finish the PDF page
writer.close()  # close DocumentWriter output
# for housekeeping work re-open from memory
doc = pymupdf.open("pdf", fileobject)
doc.ez_save(docname) 
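
The page layout in the script above is a list of cell rectangles that are filled one after the other. As a rough plain-Python sketch of that idea, the hypothetical helper split_rect below divides a rectangle into a grid, assuming cells of equal size. It only stands in for pymupdf.make_table(), which the script itself uses, and works on plain tuples.

```python
def split_rect(rect, cols=1, rows=1):
    """Split rect (x0, y0, x1, y1) into rows x cols cells of equal size.

    Returns a list of rows, each a list of cell tuples. Hypothetical
    stand-in for illustration; the script itself uses pymupdf.make_table().
    """
    x0, y0, x1, y1 = rect
    width = (x1 - x0) / cols
    height = (y1 - y0) / rows
    return [
        [
            (x0 + j * width, y0 + i * height, x0 + (j + 1) * width, y0 + (i + 1) * height)
            for j in range(cols)
        ]
        for i in range(rows)
    ]


MEDIABOX = (0, 0, 612, 792)  # US Letter page in points
BORDER = 36
WHERE = (
    MEDIABOX[0] + BORDER,
    MEDIABOX[1] + BORDER,
    MEDIABOX[2] - BORDER,
    MEDIABOX[3] - BORDER,
)
COLS, ROWS = 2, 1
TABLE = split_rect(WHERE, cols=COLS, rows=ROWS)
# same fill sequence as in the script: row by row, left to right
CELLS = [TABLE[i][j] for i in range(ROWS) for j in range(COLS)]
print(CELLS)
```

Each page is then filled by calling story.place() once per cell in this order, stopping as soon as the Story reports that no content is left.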

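How to Make a Layout which Wraps Around Predefined "No-Go Areas"

This is a demo script using PyMuPDF's Story class to output text as a PDF with a two-column page layout.

The script demonstrates the following features:

  • Layout text around images of an existing ("target") PDF.
  • Based on a few global parameters, areas on each page are identified that can receive text laid out by a Story.
  • These global parameters are not stored anywhere in the target PDF and must therefore be provided in some way:
      • The width of the border(s) on each page.
      • The font size to use for text. This value determines whether the provided text fits in the empty spaces of the (fixed) pages of the target PDF. This cannot be predicted. The script ends with an exception if the target PDF does not have enough pages, and prints a warning message if not all pages receive at least some text. In both cases, the FONTSIZE value (a float) can be changed.
      • Use of a two-column page layout for the text.
  • The layout creates a temporary (memory) PDF. Its page content (the text) is used to overlay the corresponding target page. If the text requires more pages than are available in the target PDF, an exception is raised. If not all target pages receive at least some text, a warning is printed.
  • The script reads "image-no-go.pdf" in its own folder. This is the "target" PDF. It contains 2 pages with 2 images each (from the original article), which are positioned at places that create a broad overall test coverage. Otherwise the pages are empty.
  • The script produces "quickfox-image-no-go.pdf", which contains the original pages and image positions, but with the original article text laid out around them.

Files:

  • docs/samples/quickfox-image-no-go.py
  • docs/samples/quickfox-image-no-go.pdf
  • docs/samples/quickfox.zip

See recipe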

"""
This is a demo script using PyMuPDF's Story class to output text as a PDF with
a two-column page layout.
The script demonstrates the following features:
* Layout text around images of an existing ("target") PDF.
* Based on a few global parameters, areas on each page are identified, that
 can be used to receive text layouted by a Story.
* These global parameters are not stored anywhere in the target PDF and
 must therefore be provided in some way.
 - The width of the border(s) on each page.
 - The fontsize to use for text. This value determines whether the provided
 text will fit in the empty spaces of the (fixed) pages of target PDF. It
 cannot be predicted in any way. The script ends with an exception if
 target PDF has not enough pages, and prints a warning message if not all
 pages receive at least some text. In both cases, the FONTSIZE value
 can be changed (a float value).
 - Use of a 2-column page layout for the text.
* The layout creates a temporary (memory) PDF. Its produced page content
 (the text) is used to overlay the corresponding target page. If text
 requires more pages than are available in target PDF, an exception is raised.
 If not all target pages receive at least some text, a warning is printed.
* The script reads "image-no-go.pdf" in its own folder. This is the "target" PDF.
 It contains 2 pages with each 2 images (from the original article), which are
 positioned at places that create a broad overall test coverage. Otherwise the
 pages are empty.
* The script produces "quickfox-image-no-go.pdf" which contains the original pages
 and image positions, but with the original article text laid out around them.
Note:
--------------
This script version uses just image positions to derive "No-Go areas" for
layouting the text. Other PDF objects types are detectable by PyMuPDF and may
be taken instead or in addition, without influencing the layouting.
The following are candidates for other such "No-Go areas". Each can be detected
and located by PyMuPDF:
* Annotations
* Drawings
* Existing text
--------------
The text and images are taken from the somewhat modified Wikipedia article
https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog.
--------------
"""
import io
import os
import zipfile
import pymupdf
thisdir = os.path.dirname(os.path.abspath(__file__))
myzip = zipfile.ZipFile(os.path.join(thisdir, "quickfox.zip"))
docname = os.path.join(thisdir, "image-no-go.pdf")  # "no go" input PDF file name
outname = os.path.join(thisdir, "quickfox-image-no-go.pdf")  # output PDF file name
BORDER = 36  # global parameter
FONTSIZE = 12.5  # global parameter
COLS = 2  # number of text columns, global parameter
def analyze_page(page):
    """Compute MediaBox and rectangles on page that are free to receive text.
    Notes:
        Assume a BORDER around the page, make 2 columns of the resulting
        sub-rectangle and extract the rectangles of all images on page.
        For demo purposes, the image rectangles are taken as "NO-GO areas"
        on the page when writing text with the Story.
        The function returns free areas for each of the columns.
    Returns:
        (page.number, mediabox, CELLS), where CELLS is a list of free cells.
    """
    prect = page.rect  # page rectangle - will be our MEDIABOX later
    where = prect + (BORDER, BORDER, -BORDER, -BORDER)
    TABLE = pymupdf.make_table(where, rows=1, cols=COLS)
    # extract rectangles covered by images on this page
    IMG_RECTS = sorted(  # image rects on page (sort top-left to bottom-right)
        [pymupdf.Rect(item["bbox"]) for item in page.get_image_info()],
        key=lambda b: (b.y1, b.x0),
    )
    def free_cells(column):
        """Return free areas in this column."""
        free_stripes = []  # y-value pairs wrapping a free area stripe
        # intersecting images: block complete intersecting column stripe
        col_imgs = [(b.y0, b.y1) for b in IMG_RECTS if abs(b & column) > 0]
        s_y0 = column.y0  # top y-value of column
        for y0, y1 in col_imgs:  # an image stripe
            if y0 > s_y0 + FONTSIZE:  # image starts below last free btm value
                free_stripes.append((s_y0, y0))  # store as free stripe
            s_y0 = y1  # start of next free stripe
        if s_y0 + FONTSIZE < column.y1:  # enough room to column bottom
            free_stripes.append((s_y0, column.y1))
        if free_stripes == []:  # covers "no image in this column"
            free_stripes.append((column.y0, column.y1))
        # make available cells of this column
        CELLS = [pymupdf.Rect(column.x0, y0, column.x1, y1) for (y0, y1) in free_stripes]
        return CELLS
    # collection of available Story rectangles on page
    CELLS = []
    for i in range(COLS):
        CELLS.extend(free_cells(TABLE[0][i]))
    return page.number, prect, CELLS
HTML = myzip.read("quickfox.html").decode()
# --------------------------------------------------------------
# Make the Story object
# --------------------------------------------------------------
story = pymupdf.Story(HTML)
# modify the DOM somewhat
body = story.body  # access HTML body
body.set_properties(font="sans-serif")  # and give it our font globally
# modify certain nodes
para = body.find("p", None, None)  # find relevant nodes (here: paragraphs)
while para is not None:
    para.set_properties(  # method MUST be used for existing nodes
        indent=15,
        fontsize=FONTSIZE,
    )
    para = para.find_next("p", None, None)
# we remove all image references, because the target PDF already has them
img = body.find("img", None, None)
while img is not None:
    next_img = img.find_next("img", None, None)
    img.remove()
    img = next_img
page_info = {}  # contains MEDIABOX and free CELLS per page
doc = pymupdf.open(docname)
for page in doc:
    pno, mediabox, cells = analyze_page(page)
    page_info[pno] = (mediabox, cells)
doc.close()  # close target PDF for now - re-open later
fileobject = io.BytesIO()  # let DocumentWriter write to memory
writer = pymupdf.DocumentWriter(fileobject)  # define output writer
more = 1  # stop if this ever becomes zero
pno = 0  # count output pages
while more:  # loop until all HTML text has been written
    try:
        MEDIABOX, CELLS = page_info[pno]
    except KeyError:  # too much text space required: reduce fontsize?
        raise ValueError("text does not fit on target PDF")
    dev = writer.begin_page(MEDIABOX)  # prepare a new output page
    for cell in CELLS:  # iterate over free cells on this page
        if not more:  # need to check this for every cell
            continue
        more, _ = story.place(cell)
        story.draw(dev)
    writer.end_page()  # finish the PDF page
    pno += 1
writer.close()  # close DocumentWriter output
# Re-open writer output, read its pages and overlay target pages with them.
# The generated pages have same dimension as their targets.
src = pymupdf.open("pdf", fileobject)
doc = pymupdf.open(docname)  # re-open the target PDF
for page in doc:  # overlay every target page with the prepared text
    if page.number >= src.page_count:
        print(f"Text only uses {src.page_count} target pages!")
        continue  # story did not need all target pages?
    # overlay target page
    page.show_pdf_page(page.rect, src, page.number)
    # DEBUG start --- draw the text rectangles
    # mb, cells = page_info[page.number]
    # for cell in cells:
    #     page.draw_rect(cell, color=(1, 0, 0))
    # DEBUG stop ---
doc.ez_save(outname) 
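
The core of analyze_page() is the search for vertical stripes in a column that are not blocked by an image. The sketch below isolates that step in plain Python. It is a simplified variant, not the script's code: the helper free_stripes is hypothetical, it works on plain (y0, y1) pairs, sorts blocked intervals by their top edge, and returns an empty list when a column is fully covered. The image positions are made up.

```python
def free_stripes(col_top, col_bottom, blocked, min_height):
    """Return (y0, y1) stripes of a column not covered by blocked intervals.

    blocked: list of (y0, y1) vertical intervals, e.g. the heights of images
    that intersect the column. Stripes not taller than min_height are skipped.
    Hypothetical helper, loosely mirroring free_cells() in the script above.
    """
    stripes = []
    top = col_top  # top of the next candidate free stripe
    for y0, y1 in sorted(blocked):
        if y0 > top + min_height:  # enough room above this blocked interval
            stripes.append((top, y0))
        top = max(top, y1)  # continue below the blocked interval
    if top + min_height < col_bottom:  # enough room down to the column bottom
        stripes.append((top, col_bottom))
    return stripes


FONTSIZE = 12.5
# a column from y=36 to y=756 containing two images (made-up positions)
images = [(400, 500), (100, 250)]
print(free_stripes(36, 756, images, FONTSIZE))
# a column without images is free as a whole
print(free_stripes(36, 756, [], FONTSIZE))
```

Using the font size as the minimum stripe height avoids offering the Story gaps that are too small to hold even a single line of text.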

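How to Output HTML Tables

Outputting HTML tables is supported as follows:

  • Flat table layouts ("rows x columns") are supported; the "colspan" / "rowspan" attributes are not.
  • The table header tag th supports the attribute "scope" with values "row" or "col". The applicable text is set to bold by default.
  • Column widths are computed automatically based on column content. They cannot be set directly.
  • Table cells may contain images, which are taken into account in the column width calculation.
  • Row heights are computed automatically based on row content, leading to multi-line rows where needed.
  • The potentially multiple lines of a table row are always kept together on one page (respectively "where" rectangle) and are not split.
  • Table header rows are shown only on the first page / "where" rectangle.
  • The "style" attribute is ignored when given directly in HTML table elements. Styling for a table and its elements must be done in a separate CSS source or in a style tag.
  • Styling for tr elements is not supported and is ignored. Therefore, a table-wide grid or alternating row background colors are not supported. However, one of the following example scripts shows a simple way to deal with this limitation.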
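
To make the styling rule from the list concrete, here is a small sketch that builds a flat table as an HTML string and keeps all styling in a separate CSS source. It is an illustration under assumptions, not one of the sample scripts: the helper make_table_html and the sample rows are made up, and the recipe in this section fills a row template through the Story DOM instead of building a string.

```python
import html

rows = [  # made-up sample data
    ("Length", "integer", "Number of bytes in the stream."),
    ("Filter", "name or array", "Filter(s) applied to <stream> data."),
]


def make_table_html(header, rows):
    """Build a flat rows x columns HTML table without any style attributes."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>"
        + "".join(
            f'<td id="col{i}">{html.escape(cell)}</td>' for i, cell in enumerate(row)
        )
        + "</tr>"
        for row in rows
    )
    return f"<html><body><table><tr>{head}</tr>{body}</table></body></html>"


# styling lives in a CSS source, not in "style" attributes of table elements
CSS = """
th { background-color: #aaa; }
td[id="col1"] { color: blue; }
body { font-family: sans-serif; }
"""
HTML = make_table_html(("KEY", "TYPE", "VALUE"), rows)
print(HTML)
# Both strings could then be handed to pymupdf.Story(HTML, user_css=CSS).
```

Escaping the cell text with html.escape() keeps characters such as "<" in the data from being parsed as markup.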

Files:

  • docs/samples/table01.py This script reflects basic features.

See recipe

"""
Demo script for basic HTML table support in Story objects
Outputs a table with three columns that fits on one Letter page.
The content of each row is filled via the Story's template mechanism.
Column widths and row heights are automatically computed by MuPDF.
Some styling via a CSS source is also demonstrated:
- The table header row has a gray background
- Each cell shows a border at its top
- The Story's body uses the sans-serif font family
- The text of one of the columns is set to blue
Dependencies
-------------
PyMuPDF v1.22.0 or later
"""
import pymupdf
table_text = (  # the content of each table row
    (
        "Length",
        "integer",
        """(Required) The number of bytes from the beginning of the line following the keyword stream to the last byte just before the keyword endstream. (There may be an additional EOL marker, preceding endstream, that is not included in the count and is not logically part of the stream data.) See “Stream Extent,” above, for further discussion.""",
    ),
    (
        "Filter",
        "name or array",
        """(Optional) The name of a filter to be applied in processing the stream data found between the keywords stream and endstream, or an array of such names. Multiple filters should be specified in the order in which they are to be applied.""",
    ),
    (
        "FFilter",
        "name or array",
        """(Optional; PDF 1.2) The name of a filter to be applied in processing the data found in the stream's external file, or an array of such names. The same rules apply as for Filter.""",
    ),
    (
        "FDecodeParms",
        "dictionary or array",
        """(Optional; PDF 1.2) A parameter dictionary, or an array of such dictionaries, used by the filters specified by FFilter. The same rules apply as for DecodeParms.""",
    ),
    (
        "DecodeParms",
        "dictionary or array",
        """(Optional) A parameter dictionary or an array of such dictionaries, used by the filters specified by Filter. If there is only one filter and that filter has parameters, DecodeParms must be set to the filter's parameter dictionary unless all the filter's parameters have their default values, in which case the DecodeParms entry may be omitted. If there are multiple filters and any of the filters has parameters set to nondefault values, DecodeParms must be an array with one entry for each filter: either the parameter dictionary for that filter, or the null object if that filter has no parameters (or if all of its parameters have their default values). If none of the filters have parameters, or if all their parameters have default values, the DecodeParms entry may be omitted. (See implementation note 7 in Appendix H.)""",
    ),
    (
        "DL",
        "integer",
        """(Optional; PDF 1.5) A non-negative integer representing the number of bytes in the decoded (defiltered) stream. It can be used to determine, for example, whether enough disk space is available to write a stream to a file.\nThis value should be considered a hint only; for some stream filters, it may not be possible to determine this value precisely.""",
    ),
    (
        "F",
        "file specification",
        """(Optional; PDF 1.2) The file containing the stream data. If this entry is present, the bytes between stream and endstream are ignored, the filters are specified by FFilter rather than Filter, and the filter parameters are specified by FDecodeParms rather than DecodeParms. However, the Length entry should still specify the number of those bytes. (Usually, there are no bytes and Length is 0.) (See implementation note 46 in Appendix H.)""",
    ),
)
# Only a minimal HTML source is required for the Story to work
HTML = """
<html>
<body><h2>TABLE 3.4 Entries common to all stream dictionaries</h2>
<table>
 <tr>
   <th>KEY</th><th>TYPE</th><th>VALUE</th>
 </tr>
 <tr id="row">
   <td id="col0"></td><td id="col1"></td><td id="col2"></td>
 </tr>
</table>
</body>
</html>
"""
"""
---------------------------------------------------------------------
Just for demo purposes, set:
- header cell background to gray
- text color in col1 to blue
- a border line at the top of all table cells
- all text to the sans-serif font
---------------------------------------------------------------------
"""
CSS = """th {
 background-color: #aaa;
}
td[id="col1"] {
 color: blue;
}
td, tr {
 border: 1px solid black;
 border-right-width: 0px;
 border-left-width: 0px;
 border-bottom-width: 0px;
}
body {
 font-family: sans-serif;
}
"""
story = pymupdf.Story(HTML, user_css=CSS)  # define the Story
body = story.body  # access the HTML <body> of it
template = body.find(None, "id", "row")  # find the template with name "row"
parent = template.parent  # access its parent i.e., the <table>
for col0, col1, col2 in table_text:
    row = template.clone()  # make a clone of the row template
    # add text to each cell in the duplicated row
    row.find(None, "id", "col0").add_text(col0)
    row.find(None, "id", "col1").add_text(col1)
    row.find(None, "id", "col2").add_text(col2)
    parent.append_child(row)  # add new row to <table>
template.remove()  # remove the template
# Story is ready - output it via a writer
writer = pymupdf.DocumentWriter(__file__.replace(".py", ".pdf"), "compress")
mediabox = pymupdf.paper_rect("letter")  # size of one output page
where = mediabox + (36, 36, -36, -36)  # use this sub-area for the content
more = True  # detects end of output
while more:
    dev = writer.begin_page(mediabox)  # start a page, returning a device
    more, filled = story.place(where)  # compute content fitting into "where"
    story.draw(dev)  # output it to the page
    writer.end_page()  # finalize the page
writer.close()  # close the output 
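  • docs/samples/national-capitals.py An advanced script which extends the table output options with simple additional code:
      • Multi-page output simulating repeating header rows
      • Alternating table row background colors
      • Table rows and columns separated by grid lines
      • Table rows dynamically generated / filled with data from an SQL database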


Continued in PyMuPDF 1.24.4 Chinese Documentation (Part 4) (4): https://developer.aliyun.com/article/1559457
