Nanonets-OCR-s Is Open-Sourced: SoTA for Complex-Document-to-Markdown Conversion, Reshaping Complex Document Workflows

Summary: The Nanonets team has open-sourced Nanonets-OCR-s, a model fine-tuned from Qwen2.5-VL-3B that runs in 9 GB of GPU memory.

01. Introduction

The Nanonets team has open-sourced Nanonets-OCR-s, a model fine-tuned from Qwen2.5-VL-3B that runs in 9 GB of GPU memory.

 

Nanonets-OCR-s is a powerful model that uses intelligent content recognition and semantic tagging to turn messy documents into the clean, structured, context-rich Markdown that modern AI applications need. It goes far beyond traditional text extraction and is currently the SoTA model for image-to-Markdown conversion.

Most publicly available image-to-text models focus on extracting plain text from images, yet they usually cannot distinguish regular content from elements such as watermarks, signatures, or page numbers. Visual elements like images are often ignored, and complex structures such as tables, checkboxes, and equations are handled poorly, which makes these models a weak fit for downstream tasks. Unlike traditional OCR systems that only extract plain text, Nanonets-OCR-s understands document structure and content context (tables, equations, images, charts, watermarks, checkboxes, and so on) and produces intelligently formatted markdown output that is ready for downstream processing by large language models.

 

By separating structured data from unstructured formats, Nanonets-OCR-s simplifies complex document workflows across industries.

 

- Academia and research: digitize papers with LaTeX equations and tables.

- Legal and finance: extract data from contracts and financial documents, including signatures and tables.

- Healthcare and pharma: accurately capture text and checkboxes from medical forms.

- Enterprise: turn reports into searchable, image-aware knowledge bases.

 

ModelScope Studio:

https://www.modelscope.cn/studios/nanonets/Nanonets-ocr-s

02. Key Features and Capabilities

  • LaTeX equation recognition
  • Intelligent image description
  • Signature detection and isolation
  • Watermark extraction
  • Intelligent checkbox handling
  • Complex table extraction

1. LaTeX Equation Recognition

Nanonets-OCR-s automatically converts mathematical equations and formulas into correctly formatted LaTeX syntax. Inline math is converted to inline LaTeX, while display equations are converted to LaTeX display math on their own line. The model's output also fully preserves the equation numbers on the right-hand side.
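To make the convention concrete, here is a rough illustration (not an actual model transcript) using formulas that appear in the sample output below: inline math stays inside `$...$` in the running text, while numbered display equations become standalone `$$...$$` blocks with the equation number kept as a `\tag`:

```latex
Inline:  ... where $\beta$ is a parameter controlling the deviation from $\pi_{\text{ref}}$ ...

Display: $$r(x,y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \tag{5}$$
```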

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

Raw model output

where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{\text{ref}}$, namely the initial SFT model $\pi^{\text{SFT}}$. In practice, the language model policy $\pi_{\theta}$ is also initialized to $\pi^{\text{SFT}}$. The added constraint is important, as it prevents the model from deviating too far from the distribution on which the reward model is accurate, as well as maintaining the generation diversity and preventing mode-collapse to single high-reward answers. Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning. The standard approach [51, 40, 1, 28] has been to construct the reward function $r(x,y) = r_{\phi}(x,y) - \beta(\log \pi_{\theta}(y \mid x) - \log \pi_{\text{ref}}(y \mid x))$, and maximize using PPO [39].
### 4 Direct Preference Optimization
Motivated by the challenges of applying reinforcement learning algorithms on large-scale problems such as fine-tuning language models, our goal is to derive a simple approach for policy optimization using preferences directly. Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As we will describe next in detail, our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences, such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward.
**Deriving the DPO objective.** We start with the same RL objective as prior work, Eq. [3], under a general reward function $r$. Following prior work [31, 30, 19, 15], it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective in Eq. [3] takes the form:
$$\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x,y)\right), \tag{4}$$
where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x,y)\right)$ is the partition function. See Appendix A.1 for a complete derivation. Even if we use the MLE estimate $r_{\phi}$ of the ground-truth reward function $r^*$, it is still expensive to estimate the partition function $Z(x)$ [19, 15], which makes this representation hard to utilize in practice. However, we can rearrange Eq. [4] to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_{\text{ref}}$, and the unknown partition function $Z(\cdot)$. Specifically, we first take the logarithm of both sides of Eq. [4] and then with some algebra we obtain:
$$r(x,y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \tag{5}$$
We can apply this reparameterization to the ground-truth reward $r^*$ and corresponding optimal model $\pi^*$. Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions, i.e., $p^*(y_1 \succ y_2 \mid x) = \sigma(r^*(x,y_1) - r^*(x,y_2))$. Substituting the reparameterization in Eq. [5] for $r^*(x,y)$ into the preference model Eq. [1], the partition function cancels, and we can express the human preference probability in terms of only the optimal policy $\pi^*$ and reference policy $\pi_{\text{ref}}$. Thus, the optimal RLHF policy $\pi^*$ under the Bradley-Terry model satisfies the preference model:
$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_{\text{ref}}(y_2|x)} - \beta \log \frac{\pi^*(y_1|x)}{\pi_{\text{ref}}(y_1|x)}\right)}. \tag{6}$$
The derivation is in Appendix A.2. While Eq. [6] uses the Bradley-Terry model, we can similarly derive expressions under the more general Plackett-Luce models [32, 23], shown in Appendix A.3.
Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy $\pi_{\theta}$. Analogous to the reward modeling approach (i.e. Eq. [2]), our policy objective becomes:
$$\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]. \tag{7}$$
This way, we fit an implicit reward using an alternative parameterization, whose optimal policy is simply $\pi_{\theta}$. Moreover, since our procedure is equivalent to fitting a reparametrized Bradley-Terry


 

2. Intelligent Image Description

Nanonets-OCR-s describes images inside documents using structured tags, making them easy for LLMs to process. The model can describe one or more images (logos, charts, graphs, QR codes, and so on) based on their content, style, and context. Predicted image descriptions are stored inside <img> tags and predicted page numbers inside <page_number> tags, which makes the output convenient to use in RAG systems.
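Because the extra semantics live in plain tags, downstream pipelines can pull them out with ordinary string processing. The helper below is a minimal sketch (the function name is made up for illustration; the tag format follows the examples in this article) that separates image descriptions and the page number from the body text so they can be attached to RAG chunks as metadata:

```python
import re

# Tag formats follow the Nanonets-OCR-s output shown in this article.
IMG_RE = re.compile(r"<img>(.*?)</img>", re.DOTALL)
PAGE_RE = re.compile(r"<page_number>(.*?)</page_number>")

def split_ocr_output(markdown: str) -> dict:
    """Split OCR output into body text, image descriptions, and page number."""
    images = [m.strip() for m in IMG_RE.findall(markdown)]
    pages = PAGE_RE.findall(markdown)
    body = PAGE_RE.sub("", IMG_RE.sub("", markdown)).strip()
    return {"text": body, "image_descriptions": images, "page_number": pages[0] if pages else None}
```

The same pattern extends to the <signature> and <watermark> tags introduced in the next sections.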

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

 

Raw model output

```markdown
**Results** We report the FID score [10] on MJHQ-30k [15] for visual aesthetic quality, along with GenEval [8] and DPG-Bench [11] metrics for evaluating prompt alignment. We plot the results for each design choice at approximately every 3,200 training steps. Figure 4 shows that CLIP + Flow Matching achieves the best prompt alignment scores on both GenEval and DPG-Bench, while VAE + Flow Matching produces the lowest (best) FID, indicating superior aesthetic quality. However, FID has inherent limitations: it quantifies stylistic deviation from the target image distribution and often overlooks true generative quality and prompt alignment. In fact, our FID evaluation of GPT-4o on the MJHQ-30k dataset produced a score of around 30.0, underscoring that FID can be misleading in the image generation evaluation. In general, our experiments demonstrate CLIP + Flow Matching as the most effective design choice.
<img>
A line chart showing GenEval Score ↑ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents GenEval Score ↑, ranging from 0 to 65.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 30 by 35K Step. The blue line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 60 by 35K Step. The red line starts at around 10 GenEval Score ↑ at 3K Step and increases to around 60 by 35K Step.
A line chart showing DPG Score ↑ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents DPG Score ↑, ranging from 0 to 80.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 10 DPG Score ↑ at 3K Step and increases to around 50 by 35K Step. The blue line starts at around 10 DPG Score ↑ at 3K Step and increases to around 70 by 35K Step. The red line starts at around 10 DPG Score ↑ at 3K Step and increases to around 70 by 35K Step.
A line chart showing FID Score ↓ over time. The x-axis represents Step, ranging from 3K to 35K. The y-axis represents FID Score ↓, ranging from 0 to 80.
The legend indicates three lines:
- Green line: VAE + Flow Matching
- Blue line: CLIP + MSE
- Red line: CLIP + Flow Matching
The green line starts at around 80 FID Score ↓ at 3K Step and decreases to around 20 by 35K Step. The blue line starts at around 80 FID Score ↓ at 3K Step and decreases to around 40 by 35K Step. The red line starts at around 80 FID Score ↓ at 3K Step and decreases to around 40 by 35K Step.
</img>
**Figure 4:** Comparison of different design choices.
**Discussion** In this section, we present a comprehensive evaluation of various design choices for image generation within a unified multimodal framework. Our results clearly show that CLIP's features produce more compact and semantically rich representations than VAE features, yielding higher training efficiency. Autoregressive models more effectively learn these semantic-level features compared to pixel-level features. Furthermore, flow matching proves to be a more effective training objective for modeling the image distribution, resulting in greater sample diversity and enhanced visual quality.
**Finding 1**
When integrating image generation into a unified model, autoregressive models more effectively learn the semantic-level features (CLIP) compared to pixel-level features (VAE).
**Finding 2**
Adopting flow matching as the training objective better captures the underlying image distribution, resulting in greater sample diversity and enhanced visual quality.
**4 Training Strategies for Unified Multimodal**
Building on our image generation study, the next step is to develop a unified model that can perform both image understanding and image generation. We use CLIP + Flow Matching for the image generation module. Since image understanding also operates in CLIP's embedding space, we align both tasks within the same semantic space, enabling their unification. In this context, we discuss two training strategies to achieve this integration.
**4.1 Joint Training Versus Sequential Training**
**Joint Training** Joint training of image understanding and image generation has become a common practice in recent works such as Metamorph [33], Janus-Pro [4], and Show-o [38]. Although these methods adopt different architectures for image generation, all perform multitask learning by mixing data for image generation and understanding.
<page_number>9</page_number>
```


3. Signature Detection and Isolation

The model intelligently identifies signatures in a document and separates them from the rest of the text, which is essential for processing legal and business documents. Detected signature text is written inside <signature> tags.
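As a rough sketch of how this could feed a legal-review pipeline (a hypothetical helper, assuming the <signature> tag format shown here), flagging the signatures on a page is a one-liner:

```python
import re

def extract_signatures(markdown: str) -> list[str]:
    """Return signature texts found in a page's OCR output, e.g. ['J. Walker']."""
    return [s.strip() for s in re.findall(r"<signature>(.*?)</signature>", markdown, re.DOTALL)]
```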

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

 

Raw model output

Invoice
Tax Invoice
Your Business Name
+61200000000
email@yourbusinessname.com.au
www.yourbusinessname.com.au
| INVOICE NO. | 2022445 |
|---|---|
| REFERENCE | 2022445 |
ISSUE DATE | 19/7/2022 |
DUE DATE | 2/8/2022 |
FROM
Your Business Name
5 Martin Pl
Sydney NSW NSW 2000
Australia
TO
Your Client
100 Harris St
Sydney NSW NSW 2009
Australia
Total due
$30.00
<table>
<thead>
<tr>
<th>DESCRIPTION</th>
<th>QUANTITY</th>
<th>UNIT PRICE ($)</th>
<th>AMOUNT ($)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sample service</td>
<td>1 hour</td>
<td>30.00</td>
<td>30.00</td>
</tr>
</tbody>
</table>
Subtotal:
$30.00
Total (AUD):
$30.00
Issued by, signature:
<signature>J. Walker</signature>


4. Watermark Extraction

Similar to signature detection, the model can detect and extract watermark text from documents. Extracted watermark text is written inside <watermark> tags.

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

 

Raw model output

```
Invoice
Invoice Number: INV-20250609
Date: June 9, 2025
Bill To: Souvik Mandal
123 Business Street
Kolkata, India
<table>
  <tr>
    <th>Item</th>
    <th>Description</th>
    <th>Quantity</th>
    <th>Unit Price</th>
    <th>Total</th>
  </tr>
  <tr>
    <td>001</td>
    <td><watermark>PAID</watermark> Consulting Services</td>
    <td>10</td>
    <td>₹2000</td>
    <td>₹20,000</td>
  </tr>
  <tr>
    <td>002</td>
    <td>Design Work</td>
    <td>5</td>
    <td>₹1500</td>
    <td>₹7,500</td>
  </tr>
  <tr>
    <td colspan="4">Grand Total</td>
    <td>₹27,500</td>
  </tr>
</table>
Thank you for your business!
Payment was received on June 7, 2025
```


5. Intelligent Checkbox Handling

Nanonets-OCR-s recognizes checkbox states. The model converts form checkboxes and radio buttons into standardized Unicode symbols, distinguishing checked from unchecked boxes: checked boxes are rendered as ☑ and unchecked boxes as ☐ (as in the prompt shown in the inference script below).
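Because checked and unchecked boxes come back as the Unicode symbols ☑ and ☐, selected entries can be filtered with plain string matching. A minimal sketch over the HTML table rows in the example below (a hypothetical helper, not part of the model or its tooling):

```python
import re

def checked_rows(html_table: str) -> list[str]:
    """Return the <tr> rows whose checkbox cell is checked (☑)."""
    rows = re.findall(r"<tr>(.*?)</tr>", html_table, re.DOTALL)
    return [row for row in rows if re.search(r"<td>\s*☑\s*</td>", row)]
```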

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

 

Raw model output

```markdown
# Invoice
**Bill To:**
John Doe
123 Main Street
City, Country 456789
john.doe@example.com
<table>
<thead>
<tr>
<th>Select</th>
<th>Description</th>
<th>Quantity</th>
<th>Unit Price</th>
<th>Discount</th>
<th>Line Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>☑</td>
<td>Website Development</td>
<td>1</td>
<td>$800</td>
<td>$0</td>
<td>$800</td>
</tr>
<tr>
<td>☐</td>
<td>Monthly Maintenance</td>
<td>12</td>
<td>$50</td>
<td>$100</td>
<td>$500</td>
</tr>
<tr>
<td>☑</td>
<td>Email Hosting (1 year)</td>
<td>1</td>
<td>$100</td>
<td>$0</td>
<td>$100</td>
</tr>
<tr>
<td>☑</td>
<td>SSL Certificate</td>
<td>1</td>
<td>$75</td>
<td>$0</td>
<td>$75</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Subtotal:</td>
<td>$1,475</td>
</tr>
<tr>
<td>Discounts:</td>
<td>- $100</td>
</tr>
<tr>
<td>Tax (10%):</td>
<td>$137.50</td>
</tr>
<tr>
<td><strong>Grand Total:</strong></td>
<td><strong>$1,512.50</strong></td>
</tr>
</tbody>
</table>
**Select Payment Method:**
☑ Credit Card
☐ PayPal
☐ Bank Transfer
```


6. Complex Table Extraction

The model extracts complex tables from documents and converts them into markdown and HTML tables.
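Since tables are emitted as standard HTML, they drop straight into analysis tooling. For instance, a minimal sketch with pandas (assuming an HTML parser such as lxml is installed; read_html raises ValueError if the output contains no tables):

```python
from io import StringIO
import pandas as pd

def tables_to_dataframes(ocr_markdown: str) -> list[pd.DataFrame]:
    """Parse every <table> element in the OCR output into a pandas DataFrame."""
    return pd.read_html(StringIO(ocr_markdown))
```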

(The original post shows a side-by-side image comparison here: input page, Nanonets-OCR-s output, and Mistral OCR output.)

 

Raw model output

features extracted by the Vision Transformer (ViT), we first group spatially adjacent sets of four patch features. These grouped features are then concatenated and passed through a two-layer multi-layer perceptron (MLP) to project them into a dimension that aligns with the text embeddings used in the LLM. This method not only reduces computational costs but also provides a flexible way to dynamically compress image feature sequences of varying lengths.
In Table 1, the architecture and configuration of Qwen2.5-VL are detailed.
<table>
  <tr>
    <td><strong>Configuration</strong></td>
    <td><strong>Qwen2.5-VL-3B</strong></td>
    <td><strong>Qwen2.5-VL-7B</strong></td>
    <td><strong>Qwen2.5-VL-72B</strong></td>
  </tr>
  <tr>
    <td colspan="4" style="text-align:center;"><strong>Vision Transformer (ViT)</strong></td>
  </tr>
  <tr>
    <td>Hidden Size</td>
    <td>1280</td>
    <td>1280</td>
    <td>1280</td>
  </tr>
  <tr>
    <td># Layers</td>
    <td>32</td>
    <td>32</td>
    <td>32</td>
  </tr>
  <tr>
    <td># Num Heads</td>
    <td>16</td>
    <td>16</td>
    <td>16</td>
  </tr>
  <tr>
    <td>Intermediate Size</td>
    <td>3456</td>
    <td>3456</td>
    <td>3456</td>
  </tr>
  <tr>
    <td>Patch Size</td>
    <td>14</td>
    <td>14</td>
    <td>14</td>
  </tr>
  <tr>
    <td>Window Size</td>
    <td>112</td>
    <td>112</td>
    <td>112</td>
  </tr>
  <tr>
    <td>Full Attention Block Indexes</td>
    <td>{7, 15, 23, 31}</td>
    <td>{7, 15, 23, 31}</td>
    <td>{7, 15, 23, 31}</td>
  </tr>
  <tr>
    <td colspan="4" style="text-align:center;"><strong>Vision-Language Merger</strong></td>
  </tr>
  <tr>
    <td>In Channel</td>
    <td>1280</td>
    <td>1280</td>
    <td>1280</td>
  </tr>
  <tr>
    <td>Out Channel</td>
    <td>2048</td>
    <td>3584</td>
    <td>8192</td>
  </tr>
  <tr>
    <td colspan="4" style="text-align:center;"><strong>Large Language Model (LLM)</strong></td>
  </tr>
  <tr>
    <td>Hidden Size</td>
    <td>2048</td>
    <td>3,584</td>
    <td>8192</td>
  </tr>
  <tr>
    <td># Layers</td>
    <td>36</td>
    <td>28</td>
    <td>80</td>
  </tr>
  <tr>
    <td># KV Heads</td>
    <td>2</td>
    <td>4</td>
    <td>8</td>
  </tr>
  <tr>
    <td>Head Size</td>
    <td>128</td>
    <td>128</td>
    <td>128</td>
  </tr>
  <tr>
    <td>Intermediate Size</td>
    <td>4864</td>
    <td>18944</td>
    <td>29568</td>
  </tr>
  <tr>
    <td>Embedding Tying</td>
    <td>☑</td>
    <td>☒</td>
    <td>☒</td>
  </tr>
  <tr>
    <td>Vocabulary Size</td>
    <td>151646</td>
    <td>151646</td>
    <td>151646</td>
  </tr>
  <tr>
    <td># Trained Tokens</td>
    <td>4.1T</td>
    <td>4.1T</td>
    <td>4.1T</td>
  </tr>
</table>
**Table 1:** Configuration of Qwen2.5-VL.


03. Model Training

To train a new vision-language model (VLM) for accurate optical character recognition (OCR), the team curated a dataset of more than 250,000 pages. It covers the following document types: research papers, financial documents, legal documents, healthcare documents, tax forms, receipts, and invoices. The dataset also includes documents containing images, charts, equations, signatures, watermarks, checkboxes, and complex tables. In addition, both a synthetic dataset and a manually annotated dataset were used: the model was first trained on the synthetic data and then fine-tuned on the manually annotated data.

 

Model architecture:

Qwen2.5-VL-3B was chosen as the base vision-language model (VLM). The team then fine-tuned it on the curated dataset to improve its performance on document-specific OCR tasks.

04. Model Inference

Inference with transformers

Download the model from ModelScope

modelscope download --model nanonets/Nanonets-OCR-s --local_dir nanonets/Nanonets-OCR-s
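The same download can also be done from Python via the ModelScope SDK (a sketch; the cache directory here is just an example path):

```python
from modelscope import snapshot_download

# Downloads the weights from ModelScope and returns the local directory they were saved to.
model_dir = snapshot_download("nanonets/Nanonets-OCR-s", cache_dir="./nanonets")
print(model_dir)
```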


Inference script

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

# Load the model and its processor/tokenizer from the downloaded path.
# flash_attention_2 requires the flash-attn package; remove the argument if it is not installed.
model_path = "nanonets/Nanonets-OCR-s"
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    # The prompt instructs the model to emit HTML tables, LaTeX equations, and the
    # <img>/<watermark>/<page_number> tags described above.
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Build the chat-formatted input and run greedy decoding.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens so only the newly generated text is decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)


Environment requirements

Requires transformers >= 4.52 and torch >= 2.5.

 

Resource requirements

Inference uses about 8 GB of GPU memory. If you do not have local GPU resources, you can use ModelScope's free Notebook GPU resources.

 

ModelScope Notebook:

https://www.modelscope.cn/my/mynotebook

 

Click the link below to try it out:

https://www.modelscope.cn/studios/nanonets/Nanonets-ocr-s
