Exploring How the Tesseract-OCR Library Works

I. Introduction:

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

Project homepage: http://code.google.com/p/tesseract-ocr/

II. Usage:

Follow the instructions on the project homepage's wiki to download and build Tesseract.

Sample code: http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/tesseractmain.cpp

VS2005 project (including third-party libraries): http://pan.baidu.com/s/13ROuA

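III. How it works:

1. Tesseract is an open-source, cross-platform OCR library.

2. Tesseract has two main parts: training and prediction (recognition).

3. Training:

a. Tesseract can be trained to support additional languages or to improve OCR accuracy. Details: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

b. etc.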

4. Prediction:

a. The basic input is a PIX data structure. Video data or data in other formats can be converted to Leptonica's PIX format by code outside the library.

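b. The input PIX goes through ProcessPage() –> Recognize() –> the following steps:

b.1: Find the text blocks;

b.2: Fit the baselines;

b.3: Chop the text, splitting it into individual characters;

b.4: Chop apart characters that are joined together, and repair broken strokes;

b.5: Feature extraction: early versions of Tesseract used topological features of the characters. This kind of matching is insensitive to font changes, but its recognition accuracy is not robust on characters as they appear in real-world images;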

etc.
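To make the idea of splitting a text line into single characters (step b.3) more concrete, here is a minimal sketch. It is not Tesseract's actual algorithm and uses none of its API: it swaps in a much simpler technique, a vertical projection profile, which cuts the line wherever a column contains no ink. The function name SplitByProjection and the text-based image format ('#' for ink) are assumptions made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only (not part of Tesseract or Leptonica).
// Input: the rows of one binarized text line, where '#' is ink and any other
// character is background. Rows may have different lengths.
// Output: half-open column ranges [start, end), one per run of inked columns.
std::vector<std::pair<std::size_t, std::size_t>> SplitByProjection(
    const std::vector<std::string>& rows) {
  std::vector<std::pair<std::size_t, std::size_t>> spans;

  // The image width is the length of the longest row.
  std::size_t width = 0;
  for (const std::string& row : rows) {
    if (row.size() > width) width = row.size();
  }

  // Vertical projection: count the ink pixels in every column.
  std::vector<int> profile(width, 0);
  for (const std::string& row : rows) {
    for (std::size_t x = 0; x < row.size(); ++x) {
      if (row[x] == '#') ++profile[x];
    }
  }

  // Every maximal run of non-empty columns is one character candidate;
  // an empty column ends the current run.
  bool in_run = false;
  std::size_t start = 0;
  for (std::size_t x = 0; x < width; ++x) {
    if (profile[x] > 0 && !in_run) {
      in_run = true;
      start = x;
    } else if (profile[x] == 0 && in_run) {
      in_run = false;
      spans.push_back(std::make_pair(start, x));
    }
  }
  if (in_run) spans.push_back(std::make_pair(start, width));
  return spans;
}
```

For the three rows "#..##..#.#", "#..#...###" and "#..##..#.#", this returns the column ranges [0, 1), [3, 5) and [7, 10). A pure projection split cannot separate characters that touch, and it breaks a character apart when its strokes are disconnected, which is exactly why a step like b.4 is needed on real images.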

To be continued…

Original article: https://www.cnblogs.com/xylc/p/3410492.html