• Notes on using tesseract-ocr, Google's open-source OCR engine


    Official build guide: https://github.com/tesseract-ocr/tesseract/wiki/Compiling

    Version tested:

    root@9a2a063f9534:/tesseract/testing# tesseract -v
    tesseract 4.00.00dev-697-gcdc3533
     leptonica-1.74.4
      libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
    
     Found AVX2
     Found AVX
     Found SSE

    1. Docker + Ubuntu

    git clone git@github.com:tesseract-ocr/tesseract.git
    cd tesseract
    docker pull ubuntu:latest
    docker build -t google-ocr:latest .
    docker run -itd --name ocr google-ocr:latest /bin/bash
    docker exec -it ocr /bin/bash
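
    The test run later in these notes uses an image of my own (chi.jpg), which has to get into the container somehow. A minimal sketch using docker cp, assuming the container is named ocr as above and that the destination directory already exists inside it; copy_images is a hypothetical helper name, not part of Docker or Tesseract:

```shell
# Copy local files into a directory inside a named container.
# usage: copy_images CONTAINER DEST_DIR FILE...
# (copy_images is a hypothetical helper; it only wraps `docker cp`.)
copy_images() {
    container=$1
    dest=$2
    shift 2
    for f in "$@"; do
        # docker cp SRC CONTAINER:DEST copies from the host into the container
        docker cp "$f" "$container:$dest/" || return 1
    done
}
```

    For example, run on the host: copy_images ocr /tesseract/testing chi.jpg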

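    Once inside the container, install the build dependencies. Run the second install command only if you need the training tools.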

    # refresh the package index first (needed in a fresh container)
    apt-get update
    apt-get install -y g++ autoconf automake libtool autoconf-archive pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev git
    # training tools (optional)
    apt-get install -y libicu-dev libpango1.0-dev libcairo2-dev
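
    Before starting the build it can save time to confirm that the tools are actually on PATH. The sketch below is only an illustration: missing_tools is a hypothetical helper, and the command names passed to it are my assumption about what the packages above provide (for instance, make is not in the package list and may need installing separately).

```shell
# Print every command name from the arguments that is not found on PATH.
# (missing_tools is a hypothetical helper name.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands the build steps in these notes call directly.
# The list is an assumption; adjust it to your environment.
missing_tools g++ autoconf automake libtoolize pkg-config git make
```

    If nothing is printed, every listed tool was found.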

    Leptonica

    Tesseract    Leptonica    Ubuntu
    4.00         1.74.2       Must build from source

    The official docs say Leptonica must be built from source for this version, so fetch the source and build it:

    cd /tmp
    git clone https://github.com/DanBloomberg/leptonica.git
    cd leptonica
    autoreconf -vi
    ./autobuild 
    ./configure
    make
    make install

    Build and install Tesseract itself

    cd /tesseract
    ./autogen.sh
    # the variable must be on the same line as ./configure, otherwise configure never sees it
    LIBLEPT_HEADERSDIR=/usr/include ./configure --with-extra-libraries=/usr/local/lib
    make install

    Check that the installation succeeded

    tesseract
    tesseract -v
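
    For scripted setups the same check can be automated. A minimal sketch, assuming the first line of the version output has the form "tesseract <version>" as in the output shown at the top of these notes; both function names are hypothetical:

```shell
# Extract the version string from `tesseract -v` output read on stdin.
# Assumes a line of the form "tesseract <version>".
tesseract_version() {
    sed -n 's/^tesseract \([^ ]*\).*$/\1/p' | head -n 1
}

# Succeed if the version string given as $1 starts with major version 4.
is_version_4() {
    case $1 in
        4.*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

    Usage: ver=$(tesseract -v 2>&1 | tesseract_version) followed by is_version_4 "$ver". The 2>&1 is there because some builds print the version on stderr.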

    Download the language models (traineddata files); pick only the languages you need.

    Language data: https://github.com/tesseract-ocr/tessdata
    Documentation: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

    Put the language data where Tesseract can find it

    export TESSDATA_PREFIX=/tesseract/tessdata
    # xxx is a placeholder for the language code, e.g. eng or chi_sim
    cp xxx.traineddata "$TESSDATA_PREFIX"/
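
    A missing or misplaced language file is the most common reason the test run fails, so it is worth checking before running OCR. This sketch assumes the build looks for <lang>.traineddata directly inside the directory named by TESSDATA_PREFIX (some versions expect a tessdata subdirectory instead); have_lang and missing_langs are hypothetical helper names.

```shell
# Succeed if <lang>.traineddata exists in the directory given as $2
# (defaults to $TESSDATA_PREFIX). Assumes files sit directly in that directory.
have_lang() {
    lang=$1
    dir=${2:-${TESSDATA_PREFIX:-}}
    [ -f "$dir/$lang.traineddata" ]
}

# Print the language codes from the arguments whose model file is missing.
missing_langs() {
    for l in "$@"; do
        have_lang "$l" || printf '%s\n' "$l"
    done
}
```

    For example, missing_langs eng chi_sim prints nothing when both files are in place.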

    Run a test

    cd /tesseract/testing
    # English
    tesseract phototest.tif result -l eng
    # Chinese (Simplified)
    tesseract chi.jpg result1 -l chi_sim

    Check the output

    cat result.txt
    cat result1.txt
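
    With more than a couple of images, the same command can be wrapped in a loop. A minimal sketch; ocr_all is a hypothetical helper that names each output after its input image:

```shell
# Run tesseract on each image, writing <image name without extension>.txt
# next to it. usage: ocr_all LANG IMAGE...
# (ocr_all is a hypothetical helper name.)
ocr_all() {
    ocr_lang=$1
    shift
    for img in "$@"; do
        # tesseract appends .txt to the output base by itself
        tesseract "$img" "${img%.*}" -l "$ocr_lang" || return 1
    done
}
```

    For example, ocr_all eng phototest.tif writes phototest.txt. To print the recognized text to the terminal instead of a file, pass stdout as the output base: tesseract phototest.tif stdout -l eng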

    Accuracy can be improved by training; see the official documentation for the procedure. I have not tried this myself.

    Appendix:

    Calling Tesseract from Python (tutorial): https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/

    Python wrapper library (pytesseract): https://github.com/madmaze/pytesseract

  • Original post: https://www.cnblogs.com/mangoVic/p/8074624.html