• Training Chinese language data for Tesseract 3.02

    Download the chi_sim.traineddata language data file
    Download tesseract-ocr-setup-3.02.02.exe
    Download jTessBoxEditor, which is used to correct box files

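    0. Preparation

    For convenience, name the training image using the pattern [lang].[fontname].exp[num].tif,
    where lang is the language name and fontname is the font name.
    For example, to train a custom language named mjorcen with a font named normal,
    rename the image file to mjorcen.normal.exp0.jpg (this walkthrough uses a JPEG image, so the extension is .jpg rather than .tif).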
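
    The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the original workflow: the function name parse_training_name and the regular expression are illustrative assumptions, and the extension is left open because this walkthrough uses a .jpg image.

```python
import re

# Pattern for the naming convention [lang].[fontname].exp[num].<ext>.
# Assumption: lang and fontname contain no dots; any extension is accepted.
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


def parse_training_name(filename):
    """Split a training image name into (lang, fontname, num, ext)."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(
            "not in [lang].[fontname].exp[num].ext form: %r" % filename
        )
    return m.group("lang"), m.group("font"), int(m.group("num")), m.group("ext")


print(parse_training_name("mjorcen.normal.exp0.jpg"))
# ('mjorcen', 'normal', 0, 'jpg')
```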

    Now start the training:

    1. Generate the .box file

    tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox

    Keep the image file and the box file in the same directory.

    2. Open the image file in jTessBoxEditor.jar and correct the box file as needed
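
    For reference, each line of a Tesseract 3.x box file has the form "<char> <left> <bottom> <right> <top> <page>", with coordinates measured from the bottom-left corner of the image. The sketch below parses and re-serializes such a line, which can help when checking a box file outside jTessBoxEditor. The names Box, parse_box_line and format_box_line are hypothetical helpers, and the coordinates in the sample are made up for illustration.

```python
from collections import namedtuple

# One box-file entry: the character and its bounding box on a given page.
Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse '<char> <left> <bottom> <right> <top> <page>' into a Box."""
    # Split from the right so the character field is kept whole.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def format_box_line(box):
    """Serialize a Box back into box-file syntax."""
    return "{} {} {} {} {} {}".format(*box)


# Made-up sample: fix a misrecognized character while keeping its box.
box = parse_box_line("II 10 20 40 50 0\n")
print(format_box_line(box._replace(char="1")))
```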

    3. Generate the .tr file

    tesseract  mjorcen.normal.exp0.jpg mjorcen.normal.exp0  nobatch box.train

    4. Generate a unicharset file

    unicharset_extractor mjorcen.normal.exp0.box

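    5. Create a font_properties file

    Write the line normal 0 0 0 0 0 into it, which declares normal as a plain default font (the five flags are italic, bold, fixed, serif and fraktur, in that order).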
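
    The font_properties entry follows the format "<fontname> <italic> <bold> <fixed> <serif> <fraktur>", each flag being 0 or 1. As a small illustration, the hypothetical helper below (font_properties_line is not a Tesseract tool) builds such a line; it reproduces both the entry used here and the "ocra 0 0 1 0 0" entry from the Linux script later in this post.

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties entry: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return fontname + " " + " ".join("1" if flag else "0" for flag in flags)


print(font_properties_line("normal"))            # normal 0 0 0 0 0
print(font_properties_line("ocra", fixed=True))  # ocra 0 0 1 0 0
```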

    6. Run the following commands

    shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr

    mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr

    cntraining mjorcen.normal.exp0.tr

    The output is as follows:

    E:dataUsersAdministratorDesktopocrBuider3>shapeclustering -F font_propertie
    s -U unicharset mjorcen.normal.exp0.tr
    Reading mjorcen.normal.exp0.tr ...
    Building master shape table
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 1 2 3 4
    Stopped with 0 merged, min dist 0.365385
    Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
    ichars = 0
    
    E:dataUsersAdministratorDesktopocrBuider3>mftraining -F font_properties -U
    unicharset -O  unicharset mjorcen.normal.exp0.tr
    Read shape table shapetable of 5 shapes
    Reading mjorcen.normal.exp0.tr ...
    Done!
    
    E:dataUsersAdministratorDesktopocrBuider3>cntraining mjorcen.normal.exp0.tr
    
    Reading mjorcen.normal.exp0.tr ...
    Clustering ...
    
    Writing normproto ...

     

    7. Add the prefix normal. to the five files unicharset, inttemp, pffmtable, shapetable and normproto in the directory (for example, unicharset becomes normal.unicharset)
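
    The renaming in step 7 can also be scripted. This is a minimal sketch under the assumption that the five files sit in one working directory; add_prefix and TRAINING_OUTPUTS are illustrative names. Note that os.replace overwrites an existing target, so results of an earlier run are replaced silently.

```python
import os

# The five files produced by unicharset_extractor, mftraining,
# shapeclustering and cntraining that combine_tessdata expects with a prefix.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory, prefix):
    """Rename each training output found in `directory` to '<prefix>.<name>'.

    Returns the list of original names that were renamed.
    """
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = os.path.join(directory, name)
        if os.path.isfile(src):
            os.replace(src, os.path.join(directory, prefix + "." + name))
            renamed.append(name)
    return renamed
```

    Calling add_prefix(".", "normal") in the training directory would turn unicharset into normal.unicharset, and so on for the other files that exist.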

    8. Run combine_tessdata normal. (the trailing dot is part of the argument)

    9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory

    10. Test

    tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
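
    To see the command-line steps above in one place, the sketch below builds the commands for a given image and language name without running them (a dry run). The function training_commands is a hypothetical helper; the manual steps (editing the box file, creating font_properties, renaming the output files and copying the .traineddata file) are deliberately left out.

```python
def training_commands(image, lang, bootstrap_lang="chi_sim"):
    """Return the command lines of the walkthrough as argument lists.

    Nothing is executed here; the manual steps between the commands
    (box editing, font_properties, renaming, copying) are not covered.
    """
    base = image.rsplit(".", 1)[0]  # e.g. mjorcen.normal.exp0
    tr = base + ".tr"
    return [
        # Step 1: generate the .box file using an existing language
        ["tesseract", image, base, "-l", bootstrap_lang, "batch.nochop", "makebox"],
        # Step 3: generate the .tr file from the corrected boxes
        ["tesseract", image, base, "nobatch", "box.train"],
        # Step 4: extract the character set
        ["unicharset_extractor", base + ".box"],
        # Step 6: clustering and training
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset", "-O", "unicharset", tr],
        ["cntraining", tr],
        # Step 8: combine the prefixed files (note the trailing dot)
        ["combine_tessdata", lang + "."],
        # Step 10: test with the new language data
        ["tesseract", image, base, "-l", lang],
    ]


for cmd in training_commands("mjorcen.normal.exp0.jpg", "normal"):
    print(" ".join(cmd))
```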

    Debug output:

    E:dataUsersAdministratorDesktopocrBuider3>tesseract mjorcen.normal.exp0.jpg
     mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
    Too many unichars in ambiguity on line 22358424
    Too many unichars in ambiguity on line 22358424
    Too many unichars in ambiguity on line 14941344
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    
    E:dataUsersAdministratorDesktopocrBuider3>tesseract  mjorcen.normal.exp0.jp
    g mjorcen.normal.exp0  nobatch box.train
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    APPLY_BOXES:
       Boxes read from boxfile:       6
       Found 6 good blobs.
    TRAINING ... Font name = normal
    Generated training data for 2 words
    
    E:dataUsersAdministratorDesktopocrBuider3>unicharset_extractor mjorcen.norm
    al.exp0.box
    Extracting unicharset from mjorcen.normal.exp0.box
    Wrote unicharset file ./unicharset.
    
    E:dataUsersAdministratorDesktopocrBuider3>shapeclustering -F font_propertie
    s -U unicharset mjorcen.normal.exp0.tr
    Reading mjorcen.normal.exp0.tr ...
    Building master shape table
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 1 2 3 4
    Stopped with 0 merged, min dist 0.365385
    Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
    ichars = 0
    
    E:dataUsersAdministratorDesktopocrBuider3>mftraining -F font_properties -U
    unicharset -O  unicharset mjorcen.normal.exp0.tr
    Read shape table shapetable of 5 shapes
    Reading mjorcen.normal.exp0.tr ...
    Done!
    
    E:dataUsersAdministratorDesktopocrBuider3>cntraining mjorcen.normal.exp0.tr
    
    Reading mjorcen.normal.exp0.tr ...
    Clustering ...
    
    Writing normproto ...
    
    E:dataUsersAdministratorDesktopocrBuider3>combine_tessdata normal.
    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 is -1
    Offset for type 1 is 140
    Offset for type 2 is -1
    Offset for type 3 is 489
    Offset for type 4 is 123081
    Offset for type 5 is 123134
    Offset for type 6 is -1
    Offset for type 7 is -1
    Offset for type 8 is -1
    Offset for type 9 is -1
    Offset for type 10 is -1
    Offset for type 11 is -1
    Offset for type 12 is -1
    Offset for type 13 is 123920
    Offset for type 14 is -1
    Offset for type 15 is -1
    Offset for type 16 is -1
    
    E:dataUsersAdministratorDesktopocrBuider3>tesseract mjorcen.normal.exp0.jpg
     mjorcen.normal.exp0 -l normal
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    
    E:dataUsersAdministratorDesktopocrBuider3>tesseract mjorcen.normal.exp0.jpg
     mjorcen.normal.exp1 -l chi_sim
    Too many unichars in ambiguity on line 15280712
    Too many unichars in ambiguity on line 15280712
    Too many unichars in ambiguity on line 4324296
    Tesseract Open Source OCR Engine v3.02 with Leptonica
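
    In the combine_tessdata output above, an offset of -1 appears to mean that no file of that type was included, while the five entries with real offsets (types 1, 3, 4, 5 and 13) match the five files that were renamed. As a rough sketch, the following parses such output and lists the entries that are present; present_types is a hypothetical helper, and the mapping from type numbers to file names is not printed by the tool, so it is not assumed here.

```python
import re

# Matches lines such as "Offset for type 13 is 123920".
OFFSET_RE = re.compile(r"Offset for type\s+(\d+) is (-?\d+)")


def present_types(log_text):
    """Return {type_number: offset} for entries whose offset is not -1."""
    found = {}
    for m in OFFSET_RE.finditer(log_text):
        type_number, offset = int(m.group(1)), int(m.group(2))
        if offset != -1:
            found[type_number] = offset
    return found


# Excerpt of the output shown above.
sample = """Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 13 is 123920
"""
print(sorted(present_types(sample)))  # [1, 3, 4, 5, 13]
```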

    Result with the normal language data:

    应收: 119

    Result with the stock chi_sim language data:

    应收= II苜

    Script (requires a Java runtime):

    The script is as follows:

    Windows

    @echo off 
    
    set "src=%~1"
    set "font_name=%~2"
    set "desc=%~3"
    
    
    if  not  defined src set /p src=" please pass your filename : "
    
    if  not  defined font_name set /p font_name=" please pass your font_name : "
    
    rem Validate the arguments
    
    if  not  defined src echo  IllegalArgumentException arg1 must not be null &  pause>nul & exit /b 1
    
    if  not  defined font_name echo  IllegalArgumentException arg2 must not be null &  pause>nul & exit /b 1
    
    if  not  defined desc set "desc=%src:~0,-4%"  
    
     echo desc %desc%
    
    rem Create font_properties with a default entry if it does not exist yet
    if exist font_properties (
     echo  font_properties exist
    ) else (
     >"font_properties" echo %font_name% 0 0 0 0 0
    )
    
    rem Delete the output files of a previous run
    if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset &   DEL  /Q  %font_name%.unicharset
    if exist %font_name%.inttemp  ECHO DEL %font_name%.inttemp &  DEL  /Q  %font_name%.inttemp
    if exist %font_name%.pffmtable  ECHO DEL %font_name%.pffmtable &  DEL  /Q  %font_name%.pffmtable
    if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL  /Q  %font_name%.shapetable
    if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL  /Q  %font_name%.normproto
    if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL  /Q  %font_name%.font_properties
     
    rem   makebox
    
    tesseract  %src%  %desc%   -l chi_sim  batch.nochop makebox
    
    java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
    
    ECHO Please change your results , and press any key to continue
    
    pause>nul 
      
    tesseract  %src%  %desc%  nobatch box.train
    
    unicharset_extractor %desc%.box
    
    shapeclustering -F font_properties -U unicharset %desc%.tr
    
    mftraining -F font_properties -U unicharset -O  unicharset %desc%.tr
    
    cntraining %desc%.tr
    
    
    rem Rename the new output files with the font name as prefix
    if exist unicharset ECHO rename unicharset %font_name%.unicharset &  rename unicharset %font_name%.unicharset
    if exist inttemp ECHO rename inttemp %font_name%.inttemp &  rename inttemp %font_name%.inttemp
    if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable &  rename pffmtable %font_name%.pffmtable
    if exist shapetable ECHO rename shapetable %font_name%.shapetable &  rename shapetable %font_name%.shapetable
    if exist normproto ECHO rename normproto %font_name%.normproto &  rename normproto %font_name%.normproto
    
    combine_tessdata %font_name%.
    
    if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
    
    ECHO  press any key to continue
    pause>nul 
     

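    Usage:

    Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output base name; if argument 3 is omitted, it defaults to the file name with its last four characters (the extension) removed.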

    E:dataUsersAdministratorDesktopocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
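
    The default for the third argument comes from the batch expression %src:~0,-4%, which simply drops the last four characters of the file name. The sketch below contrasts that behavior with an extension-aware variant; both function names are hypothetical and only illustrate where the two differ, for example with a .tiff extension.

```python
import os


def default_desc_like_script(src):
    """Mimic %src:~0,-4% from the batch script: drop the last four characters."""
    return src[:-4]


def default_desc_by_extension(src):
    """Alternative: strip whatever extension the file has."""
    return os.path.splitext(src)[0]


for name in ("mjorcen.normal.exp0.jpg", "mjorcen.normal.exp0.tiff"):
    print(name, "->", default_desc_like_script(name), "|", default_desc_by_extension(name))
```

    With a three-letter extension such as .jpg or .tif both variants agree, so the script's default is fine for the file used in this post.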

    Example run:

    E:dataUsersAdministratorDesktopocrBuider3>run.bat mjorcen.normal.exp0.jpg n
    ormal
    desc mjorcen.normal.exp0
     font_properties exist
    Too many unichars in ambiguity on line 2188584
    Too many unichars in ambiguity on line 2188584
    Too many unichars in ambiguity on line 2686128
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    Please change your results , and press any key to continue
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    APPLY_BOXES:
       Boxes read from boxfile:       6
       Found 6 good blobs.
    TRAINING ... Font name = normal
    Generated training data for 2 words
    Extracting unicharset from mjorcen.normal.exp0.box
    Wrote unicharset file ./unicharset.
    Reading mjorcen.normal.exp0.tr ...
    Building master shape table
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances...
    Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 1 2 3 4
    Stopped with 0 merged, min dist 0.365385
    Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
    ichars = 0
    Read shape table shapetable of 5 shapes
    Reading mjorcen.normal.exp0.tr ...
    Done!
    Reading mjorcen.normal.exp0.tr ...
    Clustering ...
    
    Writing normproto ...
    rename unicharset normal.unicharset
    rename inttemp normal.inttemp
    rename pffmtable normal.pffmtable
    rename shapetable normal.shapetable
    rename normproto normal.normproto
    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 is -1
    Offset for type 1 is 140
    Offset for type 2 is -1
    Offset for type 3 is 489
    Offset for type 4 is 123081
    Offset for type 5 is 123134
    Offset for type 6 is -1
    Offset for type 7 is -1
    Offset for type 8 is -1
    Offset for type 9 is -1
    Offset for type 10 is -1
    Offset for type 11 is -1
    Offset for type 12 is -1
    Offset for type 13 is 123920
    Offset for type 14 is -1
    Offset for type 15 is -1
    Offset for type 16 is -1
    rename font_properties normal.font_properties
    E:dataUsersAdministratorDesktopocrBuider3>
     

    Linux (taken from the documentation at http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):

    #!/bin/bash 
    tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
    unicharset_extractor zzz.ocra.exp0.box
    echo "ocra 0 0 1 0 0" >font_properties
    shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
    mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
    cntraining zzz.ocra.exp0.tr
    cp normproto zzz.normproto
    cp inttemp zzz.inttemp
    cp pffmtable zzz.pffmtable
    cp shapetable zzz.shapetable
    combine_tessdata zzz.
    cp zzz.traineddata /home/youruserid/tessdata/.
    sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
    tesseract zzz.ocra.exp0.tif output -l zzz
  • Original article: https://www.cnblogs.com/mjorcen/p/3800739.html