PyMuPDF 1.24.4 Chinese Documentation (2) (4): https://developer.aliyun.com/article/1559645
### How to specify your own fonts

Use `@font-face` statements in CSS syntax to define your font files. You need a separate `@font-face` for each combination of font weight and font style (e.g. bold or italic) that you want supported. The following example uses the well-known MS Comic Sans font in its four variants: regular, bold, italic and bold-italic.

Because these four font files are located in the system folder `C:/Windows/Fonts`, the method needs an Archive definition that points to that folder:

```python
"""
How to use your own fonts with method Page.insert_htmlbox().
"""
import pymupdf

# Example text
text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed
    eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad
    minim veniam, quis nostrud exercitation <b>ullamco <i>laboris</i></b>
    nisi ut aliquid ex ea commodi consequat. Quis aute iure
    <span style="color: red;">reprehenderit</span>
    in <span style="color: green;font-weight:bold;">voluptate</span> velit
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat
    cupiditat non proident, sunt in culpa qui
    <a href="https://www.artifex.com">officia</a> deserunt mollit anim id
    est laborum."""

"""
We need an Archive object to show where font files are located.
We intend to use the font family "MS Comic Sans".
"""
arch = pymupdf.Archive("C:/Windows/Fonts")

# These statements define which font file to use for regular, bold,
# italic and bold-italic text.
# We assign an arbitrary common font-family for all 4 font files.
# The Story algorithm will select the right file as required.
# We request to use "comic" throughout the text.
css = """
@font-face {font-family: comic; src: url(comic.ttf);}
@font-face {font-family: comic; src: url(comicbd.ttf);font-weight: bold;}
@font-face {font-family: comic; src: url(comicz.ttf);font-weight: bold;font-style: italic;}
@font-face {font-family: comic; src: url(comici.ttf);font-style: italic;}
* {font-family: comic;}
"""

doc = pymupdf.Document()
page = doc.new_page(width=150, height=150)  # make small page

page.insert_htmlbox(page.rect, text, css=css, archive=arch)

doc.subset_fonts(verbose=True)  # build subset fonts to reduce file size
doc.ez_save(__file__.replace(".py", ".pdf"))
```
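The four `@font-face` rules follow a regular pattern, so the CSS string can also be generated rather than typed by hand. The following is a minimal sketch in plain Python and not part of PyMuPDF: the helper name `font_face_css` and the layout of the variant tuples are illustrative assumptions, and only the font file names are taken from the Comic Sans example.

```python
def font_face_css(family, variants):
    """Build one @font-face rule per (filename, weight, style) variant.

    Hypothetical helper for illustration; it only assembles a CSS string.
    """
    rules = []
    for filename, weight, style in variants:
        parts = [f"font-family: {family};", f"src: url({filename});"]
        if weight != "normal":
            parts.append(f"font-weight: {weight};")
        if style != "normal":
            parts.append(f"font-style: {style};")
        rules.append("@font-face {" + " ".join(parts) + "}")
    # Final rule: request the family for every element.
    rules.append("* {font-family: " + family + ";}")
    return "\n".join(rules)


# File names as used in the Comic Sans example.
comic_variants = [
    ("comic.ttf", "normal", "normal"),
    ("comicbd.ttf", "bold", "normal"),
    ("comicz.ttf", "bold", "italic"),
    ("comici.ttf", "normal", "italic"),
]

css = font_face_css("comic", comic_variants)
print(css)
```

The resulting string could then be passed as the `css` argument, together with an Archive that points to the folder containing the font files.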
### How to request text alignment

This example combines multiple requirements:

- Rotate the text by 90 degrees anti-clockwise.
- Use a font from the pymupdf-fonts package. You will see that the respective CSS definitions are much simpler in this case.
- Align the text with the "justify" option.

```python
"""
How to use a pymupdf font with method Page.insert_htmlbox().
"""
import pymupdf

# Example text
text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed
    eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad
    minim veniam, quis nostrud exercitation <b>ullamco <i>laboris</i></b>
    nisi ut aliquid ex ea commodi consequat. Quis aute iure
    <span style="color: red;">reprehenderit</span>
    in <span style="color: green;font-weight:bold;">voluptate</span> velit
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat
    cupiditat non proident, sunt in culpa qui
    <a href="https://www.artifex.com">officia</a> deserunt mollit anim id
    est laborum."""

"""
This is similar to font file support. However, we can use a convenience
function for creating required CSS definitions.
We still need an Archive for finding the font binaries.
"""
arch = pymupdf.Archive()

# We request to use "myfont" throughout the text.
css = pymupdf.css_for_pymupdf_font("ubuntu", archive=arch, name="myfont")
css += "* {font-family: myfont;text-align: justify;}"

doc = pymupdf.Document()
page = doc.new_page(width=150, height=150)

page.insert_htmlbox(page.rect, text, css=css, archive=arch, rotate=90)

doc.subset_fonts(verbose=True)
doc.ez_save(__file__.replace(".py", ".pdf"))
```
### How to write text lines

Output some text lines on a page:
```python
import pymupdf

doc = pymupdf.open(...)  # new or existing PDF
page = doc.new_page()  # new or existing page via doc[n]
p = pymupdf.Point(50, 72)  # start point of 1st line

text = "Some text,\nspread across\nseveral lines."
# the same result is achievable by
# text = ["Some text", "spread across", "several lines."]

rc = page.insert_text(
    p,  # bottom-left of 1st char
    text,  # the text (honors '\n')
    fontname="helv",  # the default font
    fontsize=11,  # the default font size
    rotate=0,  # also available: 90, 180, 270
)
print("%i lines printed on page %i." % (rc, page.number))

doc.save("text.pdf")
```
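With this method, only the number of lines is controlled so that the page height is not exceeded. Surplus lines are not written, and the number of lines actually written is returned. The calculation uses a line height derived from `fontsize` and a bottom margin of 36 points (0.5 inches).

Line width is ignored. The surplus part of a line will simply be invisible.

However, for built-in fonts there are ways to calculate the line width beforehand; see `get_text_length()`.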
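To get a feeling for the line-count limit, the arithmetic can be sketched in plain Python. This is an estimate under stated assumptions and not the library's actual code: the text only says that the line height is derived from `fontsize`, so the factor of 1.2 used here is an assumption, and the function name `max_lines` is hypothetical. The 36-point bottom margin is the value given in the text.

```python
import math


def max_lines(page_height, start_y, fontsize, line_factor=1.2, bottom_margin=36):
    """Estimate how many text baselines fit between start_y and the bottom margin.

    line_factor is an ASSUMED ratio of line height to font size.
    """
    line_height = fontsize * line_factor
    available = page_height - bottom_margin - start_y
    if available < 0:
        return 0  # even the first baseline lies below the margin
    # The first line sits at start_y, each further line one line_height lower.
    return math.floor(available / line_height) + 1


# A4 page height in points, first baseline at y=72, default font size 11.
limit = max_lines(page_height=842, start_y=72, fontsize=11)
lines = [f"line {i}" for i in range(100)]
written, surplus = lines[:limit], lines[limit:]
print(limit, len(written), len(surplus))
```

Splitting the list up front like this makes it possible to continue the surplus lines on a following page instead of losing them.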
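Here is another example. It inserts 4 text strings using the four different rotation options, and thereby shows how the text insertion point must be chosen to achieve the desired result: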
```python
import pymupdf

doc = pymupdf.open()
page = doc.new_page()

# the text strings, each having 3 lines
text1 = "rotate=0\nLine 2\nLine 3"
text2 = "rotate=90\nLine 2\nLine 3"
text3 = "rotate=-90\nLine 2\nLine 3"
text4 = "rotate=180\nLine 2\nLine 3"
red = (1, 0, 0)  # the color for the red dots

# the insertion points, each with a 25 pix distance from the corners
p1 = pymupdf.Point(25, 25)
p2 = pymupdf.Point(page.rect.width - 25, 25)
p3 = pymupdf.Point(25, page.rect.height - 25)
p4 = pymupdf.Point(page.rect.width - 25, page.rect.height - 25)

# create a Shape to draw on
shape = page.new_shape()

# draw the insertion points as red, filled dots
shape.draw_circle(p1, 1)
shape.draw_circle(p2, 1)
shape.draw_circle(p3, 1)
shape.draw_circle(p4, 1)
shape.finish(width=0.3, color=red, fill=red)

# insert the text strings
shape.insert_text(p1, text1)
shape.insert_text(p3, text2, rotate=90)
shape.insert_text(p2, text3, rotate=-90)
shape.insert_text(p4, text4, rotate=180)

# store our work to the page
shape.commit()
doc.save(...)
```
### How to fill a text box

This script fills 4 different rectangles with text, each time using a different rotation value:
```python
import pymupdf

doc = pymupdf.open()  # new or existing PDF
page = doc.new_page()  # new page, or choose doc[n]

# write in this overall area
rect = pymupdf.Rect(100, 100, 300, 150)

# partition the area in 4 equal sub-rectangles
CELLS = pymupdf.make_table(rect, cols=4, rows=1)

t1 = "text with rotate = 0."  # the texts we will write
t2 = "text with rotate = 90."
t3 = "text with rotate = 180."
t4 = "text with rotate = 270."
text = [t1, t2, t3, t4]
red = pymupdf.pdfcolor["red"]  # some colors
gold = pymupdf.pdfcolor["gold"]
blue = pymupdf.pdfcolor["blue"]
"""
We use a Shape object (something like a canvas) to output the text and
the rectangles surrounding it for demonstration.
"""
shape = page.new_shape()  # create Shape
for i in range(len(CELLS[0])):
    shape.draw_rect(CELLS[0][i])  # draw rectangle
    shape.insert_textbox(
        CELLS[0][i], text[i], fontname="hebo", color=blue, rotate=90 * i
    )

shape.finish(width=0.3, color=red, fill=gold)

shape.commit()  # write all stuff to the page
doc.ez_save(__file__.replace(".py", ".pdf"))
```
Some default values were used above: font size 11 and text alignment "left".
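The partitioning step in the script can be illustrated without PyMuPDF. The sketch below shows the kind of arithmetic involved in splitting one rectangle into equal cells; it uses plain `(x0, y0, x1, y1)` tuples instead of `Rect` objects, the name `make_cells` is hypothetical, and it is not the implementation of `make_table`.

```python
def make_cells(rect, cols=1, rows=1):
    """Split rect = (x0, y0, x1, y1) into rows x cols equal cells.

    Returns a list of rows, each a list of (x0, y0, x1, y1) tuples.
    Illustrative only; not PyMuPDF's own code.
    """
    x0, y0, x1, y1 = rect
    width = (x1 - x0) / cols
    height = (y1 - y0) / rows
    table = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(
                (
                    x0 + c * width,
                    y0 + r * height,
                    x0 + (c + 1) * width,
                    y0 + (r + 1) * height,
                )
            )
        table.append(row)
    return table


# Same overall area and layout as in the script: 4 columns, 1 row.
cells = make_cells((100, 100, 300, 150), cols=4, rows=1)
for i, cell in enumerate(cells[0]):
    print(cell, "rotate =", 90 * i)
```

Each cell here is 50 points wide and 50 points high, which is why the four rotated texts in the script have equally sized boxes to fill.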
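### How to fill a box with HTML text

Method `Page.insert_htmlbox()` offers a much more powerful way to insert text in a rectangle.

This method not only accepts HTML tags; the source may also contain styling instructions that influence font, font weight (bold), style (italic), color and much more.

It is also possible to mix multiple fonts and languages, to output HTML tables, and to insert images and URI links.

For even more styling flexibility, an additional CSS source may also be given.

The method is based on the Story class. Therefore, complex script systems like Devanagari, Nepali, Tamil and many more are supported and written correctly thanks to the HarfBuzz library, which provides this so-called **"text shaping"** feature.

Any fonts required to output characters are automatically pulled in from the Google NOTO font library as a fallback (when the optionally supplied user fonts do not contain some glyphs).

As a small glimpse into the features offered here, we will output the following HTML-enriched text: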
```python
import pymupdf

rect = pymupdf.Rect(100, 100, 400, 300)

text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed
    eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad
    minim veniam, quis nostrud exercitation <b>ullamco <i>laboris</i></b>
    nisi ut aliquid ex ea commodi consequat. Quis aute iure
    <span style="color: #f00;">reprehenderit</span>
    in <span style="color: #0f0;font-weight:bold;">voluptate</span> velit
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat
    cupiditat non proident, sunt in culpa qui
    <a href="https://www.artifex.com">officia</a> deserunt mollit anim id
    est laborum."""

doc = pymupdf.Document()

page = doc.new_page()
page.insert_htmlbox(rect, text, css="* {font-family: sans-serif;font-size:14px;}")

doc.ez_save(__file__.replace(".py", ".pdf"))
```
Note how the "css" parameter is used to globally select the default "sans-serif" font and a font size of 14.
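### How to output HTML tables and images

Here is another example that outputs a table with this method. This time, we include all the styling in the HTML source itself. Please also note how an image can be included, even within a table cell: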
```python
import pymupdf
import os

filedir = os.path.dirname(__file__)


text = """
<style>
body {
    font-family: sans-serif;
}

td,
th {
    border: 1px solid blue;
    border-right: none;
    border-bottom: none;
    padding: 5px;
    text-align: center;
}

table {
    border-right: 1px solid blue;
    border-bottom: 1px solid blue;
    border-spacing: 0;
}
</style>

<body>
<p><b>Some Colors</b></p>
<table>
    <tr>
    <th>Lime</th>
    <th>Lemon</th>
    <th>Image</th>
    <th>Mauve</th>
    </tr>
    <tr>
    <td>Green</td>
    <td>Yellow</td>
    <td><img src="img-cake.png" width=50></td>
    <td>Between<br>Gray and Purple</td>
    </tr>
</table>
</body>
"""

doc = pymupdf.Document()

page = doc.new_page()
rect = page.rect + (36, 36, -36, -36)

# we must specify an Archive because of the image
page.insert_htmlbox(rect, text, archive=pymupdf.Archive("."))

doc.ez_save(__file__.replace(".py", ".pdf"))
```
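When the table content comes from data rather than from hand-written markup, the HTML source can be assembled in plain Python before it is passed on. The following is a minimal sketch under stated assumptions: the helper name `html_table` and the sample rows are illustrative and not part of PyMuPDF. Cell text is escaped with the standard library's `html.escape`, so this variant is suitable for plain text cells only; markup such as `<img>` or `<br>` would be shown literally.

```python
from html import escape


def html_table(header, rows):
    """Return an HTML <table> string; all cell text is HTML-escaped.

    Hypothetical helper for illustration.
    """

    def tr(cells, tag):
        inner = "".join(f"<{tag}>{escape(str(c))}</{tag}>" for c in cells)
        return "<tr>" + inner + "</tr>"

    lines = ["<table>", tr(header, "th")]
    lines += [tr(row, "td") for row in rows]
    lines.append("</table>")
    return "\n".join(lines)


table = html_table(
    ["Lime", "Lemon", "Mauve"],
    [["Green", "Yellow", "Between Gray & Purple"]],
)
print(table)
```

The returned string can be embedded in a larger HTML source next to a `<style>` block like the one in the example.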
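### How to output languages of the world

Our third example demonstrates the automatic multi-language support. It includes automatic text shaping for complex script systems like Devanagari and for right-to-left languages: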
```python
import pymupdf

greetings = (
    "Hello, World!",  # english
    "Hallo, Welt!",  # german
    "سلام دنیا!",  # persian
    "வணக்கம், உலகம்!",  # tamil
    "สวัสดีชาวโลก!",  # thai
    "Привіт Світ!",  # ukrainian
    "שלום עולם!",  # hebrew
    "ওহে বিশ্ব!",  # bengali
    "你好世界!",  # chinese
    "こんにちは世界!",  # japanese
    "안녕하세요, 월드!",  # korean
    "नमस्कार, विश्व !",  # sanskrit
    "हैलो वर्ल्ड!",  # hindi
)
doc = pymupdf.open()
page = doc.new_page()
rect = (50, 50, 200, 500)

# join greetings into one text string
text = " ... ".join(greetings)

# the output of the above is simple:
page.insert_htmlbox(rect, text)
doc.save(__file__.replace(".py", ".pdf"))
```
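To see which of these greetings exercise the right-to-left handling, the Unicode bidirectional class of their characters can be inspected with the standard library. This is only a diagnostic sketch and not something the method requires; the function name `is_rtl` and the "first strong character" heuristic are assumptions made for illustration.

```python
import unicodedata


def is_rtl(text):
    """Return True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-like or Arabic-like letters
            return True
        if bidi == "L":  # left-to-right letter
            return False
    return False  # no strong character found, e.g. digits or punctuation only


samples = ("Hello, World!", "سلام دنیا!", "שלום עולם!", "你好世界!")
for s in samples:
    print(repr(s), "RTL" if is_rtl(s) else "LTR")
```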
## How to extract text with color

Iterate through your text blocks and find the spans of text you need for this information.
```python
import pymupdf

doc = pymupdf.open("input.pdf")  # placeholder file name, replace with your PDF

for page in doc:
    text_blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
    for block in text_blocks:
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"]
                color = pymupdf.sRGB_to_rgb(span["color"])
                print(f"Text: {text}, Color: {color}")
```
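The `color` value of a span is an integer in sRGB format, which the loop converts into separate red, green and blue components. The arithmetic behind such a conversion can be sketched in plain Python. The function names below are hypothetical, and the sketch is assumed to mirror what the library helper does rather than being taken from its source.

```python
def srgb_to_rgb(srgb):
    """Split a 24-bit sRGB integer into (red, green, blue), each in 0..255."""
    return (srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF


def rgb_to_unit(rgb):
    """Scale 0..255 components to floats in 0..1, the range used for PDF colors."""
    return tuple(round(c / 255, 3) for c in rgb)


for value in (0xFF0000, 0x008000, 0x0000FF, 0x000000):
    rgb = srgb_to_rgb(value)
    print(f"{value:#08x} -> {rgb} -> {rgb_to_unit(rgb)}")
```

Having the components as a tuple makes it easy to filter spans, for example by keeping only those whose color equals `(255, 0, 0)`.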
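This software is provided AS-IS with no warranty, either express or implied. This software is distributed under license and may not be copied, modified or distributed except as expressly authorized under the terms of that license. Refer to licensing information at artifex.com or contact Artifex Software Inc., 39 Mesa Street, Suite 108A, San Francisco CA 94129, United States for further information.

This documentation covers all versions up to 1.24.4.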