Python Tools: Tesseract

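OCR (optical character recognition) automatically converts handwritten or printed text in an image into machine-encoded text strings that we can store and manipulate.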

1. Installing Tesseract

(1) On Ubuntu 16
```
sudo apt-get install tesseract-ocr
```
Verify that Tesseract was installed successfully:
```
tesseract -v
```
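(2) On Windows

Download an installer from https://digi.bib.uni-mannheim.de/tesseract/

Add the Tesseract installation directory to the PATH environment variable.

Verify that Tesseract was installed successfully: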
```
tesseract -v
```
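The same check can be scripted from Python. The following is a minimal sketch, assuming the executable is named `tesseract` and should be reachable through PATH; the helper names `find_tesseract` and `tesseract_version` are made up for illustration.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `tesseract -v`, or None if it is missing."""
    path = find_tesseract(executable)
    if path is None:
        return None
    # Merge stderr into stdout so the version line is captured whichever
    # stream this Tesseract build writes it to.
    result = subprocess.run(
        [path, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH" on Windows, the PATH entry from the previous step is the first thing to recheck.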
2. Testing from the command line
```
tesseract 123.png result
```
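Notes:

　　The first argument is the name of the image file.

　　The second argument, result, is the base name of the output file; Tesseract appends the .txt extension.

The recognized text is written to result.txt.

Command format when specifying language data (tessdata):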
```
tesseract <image_name> <output_base_name> -l <language>
```
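The command format above can also be driven from a script. This is a rough sketch, assuming `tesseract` is on PATH; `build_tesseract_command` and `run_tesseract` are hypothetical helper names, not part of any library.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for `tesseract <image> <output_base> [-l <lang>]`."""
    command = ["tesseract", image_path, output_base]
    if lang:
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None):
    """Run Tesseract, then return the text read back from <output_base>.txt."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    print(build_tesseract_command("123.png", "result"))
    # → ['tesseract', '123.png', 'result']
```

Passing the arguments as a list avoids shell quoting problems when the image name contains spaces.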
3. Using Python

Install pytesseract:
```
pip install pytesseract
```
Extract the text from an image:
```
from PIL import Image
import pytesseract

image = Image.open('123.png')
content = pytesseract.image_to_string(image)   # run OCR on the image
print(content)
```
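If the Tesseract executable is not on PATH, pytesseract cannot find it. One way around this, shown here as a sketch only, is to set `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string()`. The wrapper name `ocr_image` and its parameters are assumptions made for this example.

```python
import shutil


def ocr_image(image_path, lang=None, tesseract_cmd=None):
    """Return the text that pytesseract recognizes in the image at image_path."""
    # Imported here so this module still loads when the OCR packages are absent.
    from PIL import Image
    import pytesseract

    if tesseract_cmd is not None:
        # Point pytesseract directly at the executable instead of relying on PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    elif shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH; pass tesseract_cmd explicitly")

    # The context manager closes the image file once OCR has finished.
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

A call might look like `ocr_image('123.png', tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe')`, where the Windows path is only an illustrative guess and should be replaced with the actual installation location.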
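4. Recognizing Chinese characters

Download the language data:

https://github.com/tesseract-ocr/tessdata

After downloading, place the file in the tessdata folder of the Tesseract-OCR installation directory.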
```
tesseract hello.png result -l chi_sim
```
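Python implementation

Pass the parameter lang='chi_sim' to pytesseract's image_to_string() method; this selects the corresponding Chinese language pack.

The full file name of the Simplified Chinese language pack is chi_sim.traineddata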
```
from PIL import Image
import pytesseract

image = Image.open('hello.png')
content = pytesseract.image_to_string(image, lang='chi_sim')   # run OCR with the Chinese language pack
print(content)
```
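A common failure at this step is that chi_sim.traineddata was not placed where Tesseract looks for it. The sketch below, which assumes `tesseract` is on PATH, asks Tesseract which languages it can see by running `tesseract --list-langs`. The helper names are hypothetical, and the parsing simply keeps lines that look like language codes.

```python
import re
import shutil
import subprocess

# Language codes look like "eng", "chi_sim" or "script/Latin"; header lines contain spaces.
LANG_CODE = re.compile(r"^[A-Za-z0-9_/]+$")


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = (line.strip() for line in output.splitlines())
    return [line for line in lines if LANG_CODE.match(line)]


def installed_languages(executable="tesseract"):
    """Return the language codes Tesseract reports, or [] if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return []
    result = subprocess.run(
        [path, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture the list whichever stream it is printed on
        text=True,
    )
    return parse_language_list(result.stdout)


if __name__ == "__main__":
    langs = installed_languages()
    print("chi_sim available" if "chi_sim" in langs else "chi_sim not found")
```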
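If the recognition results are unsatisfactory, you can use jTessBoxEditor, a box-file editor and training tool for Tesseract, to improve recognition accuracy.

Download: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/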
Original post: https://www.cnblogs.com/baby123/p/13024803.html