Improving the quality of the output

简介: There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual fo...

There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language retraining Tesseract is unlikely to help.

Image processing

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by using the configuration variabletessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.

Rescaling

Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.

Binarisation

binarisation.png

This is converting an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.

Noise Removal

noise.png

Noise is random variation of brightness or colour in an image, that can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.

Rotation / Deskewing

skew-linedetection.png

A skewed image is when an page has been scanned when not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this rotating the page image so that the text lines are horizontal.

Border Removal

borders.png

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.

Tools / Libraries

Examples

If you need an example how to improve image quality programmatically, have a look at this examples:

Page segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region try a different segmentation mode, using the -psm argument. Note that adding a white border to text which is too tightly cropped may also help, see issue 398.

To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:

 0   Orientation and script detection (OSD) only.
 1   Automatic page segmentation with OSD.
 2   Automatic page segmentation, but no OSD, or OCR.
 3   Fully automatic page segmentation, but no OSD. (Default)
 4   Assume a single column of text of variable sizes.
 5   Assume a single uniform block of vertically aligned text.
 6   Assume a single uniform block of text.
 7   Treat the image as a single text line.
 8   Treat the image as a single word.
 9   Treat the image as a single word in a circle.
10   Treat the image as a single character.

Dictionaries, word lists, and patterns

By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting the both of the configuration variablesload_system_dawg and load_freq_dawg to false.

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.

Still having problems?

If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.

目录
相关文章
|
1月前
|
算法 数据挖掘
文献解读-Prediction of axillary lymph node metastasis in triple-negative breast cancer by multi-omics analysis and an integrated model
研究旨在为三阴性乳腺癌患者提供更准确的腋窝淋巴结转移风险评估工具。研究者综合分析了临床病理信息、基因组和转录组数据,构建了一个多组学预测模型。
33 4
|
9月前
|
算法 BI 计算机视觉
[Initial Image Segmentation Generator]论文实现:Efficient Graph-Based Image Segmentation
[Initial Image Segmentation Generator]论文实现:Efficient Graph-Based Image Segmentation
87 1
|
9月前
|
算法 光互联 计算机视觉
Locally Adaptive Color Correction for Underwater Image Dehazing and Matching
该文提出了一种新颖的水下图像处理方法,结合颜色转移和局部调整来校正颜色,以应对水下光照和散射造成的图像退化。传统颜色转移方法基于全局参数,不适应水下场景中颜色变化的局部性质。文章中,作者通过融合策略,利用光衰减水平估计来实现局部颜色校正。首先,通过暗通道先验恢复彩色补偿图像,然后估计光衰减图。接着,创建一个合成图像,该图像的统计特性代表高衰减区域,用于颜色转移。最后,通过加权融合初始图像和颜色转移图像,生成最终的颜色校正图像。这种方法旨在提高水下图像的对比度和颜色准确性,特别关注高衰减区域。
105 1
|
自然语言处理 算法
SIFRank New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model
在社交媒体上,面临着大量的知识和信息,一个有效的关键词抽取算法可以广泛地被应用的信息检索和自然语言处理中。传统的关键词抽取算法很难使用外部的知识信息。
181 0
SIFRank New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model
|
自然语言处理 算法 vr&ar
X-GEAR:Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction
我们提出了一项利用多语言预训练生成语言模型进行零样本跨语言事件论元抽取(EAE)的研究。通过将EAE定义为语言生成任务,我们的方法有效地编码事件结构并捕获论元之间的依赖关系。
144 0
|
算法 PyTorch 算法框架/工具
论文解读:LaMa:Resolution-robust Large Mask Inpainting with Fourier Convolutions
论文解读:LaMa:Resolution-robust Large Mask Inpainting with Fourier Convolutions
828 0
paraforme支持speech_noise_threshold吗?
请问:speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch 这个模型支持设置 speech_noise_threshold 这个参数吗 ? vad 本身是支持的,但对这个集成的模型好像不起作用? 如果支持,应该如何正确地设置呢 ? 如果不支持,那该模型有没有什么方法可以过滤掉背景噪声? 经常会有背景噪声被识别出文字
73 0
|
机器学习/深度学习 自然语言处理 算法
Joint Information Extraction with Cross-Task and Cross-Instance High-Order Modeling 论文解读
先前的信息抽取(IE)工作通常独立地预测不同的任务和实例(例如,事件触发词、实体、角色、关系),而忽略了它们的相互作用,导致模型效率低下。
113 0
|
机器学习/深度学习 人工智能 自然语言处理
NAACL2021 AMR-IE: Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint IE
富语义解析的任务,如抽象语义表示(AMR),与信息抽取(IE)具有相似的目标,即将自然语言文本转换为结构化的语义表示。为了利用这种相似性
315 0
|
自然语言处理 算法 知识图谱
DEGREE: A Data-Efficient Generation-Based Event Extraction Model论文解读
事件抽取需要专家进行高质量的人工标注,这通常很昂贵。因此,学习一个仅用少数标记示例就能训练的数据高效事件抽取模型已成为一个至关重要的挑战。
197 0