• How to generate a new dictionary file of mmseg



    0. Usage of mmseg-node, as described on GitHub:
    var mmseg = require("mmseg");
    var q = mmseg.open('/usr/local/etc/');
    console.log(q.segmentSync("我是中文分词"));

    # "/usr/local/etc" is the directory holding mmseg's dictionary; it contains the file "uni.lib", which is the dictionary file.
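Since the directory passed to mmseg.open has to hold uni.lib, a quick check before starting Node can save some confusion. The following is a minimal sketch; the helper name has_mmseg_dict is hypothetical and not part of mmseg or mmseg-node.

```shell
# Hypothetical helper (not part of mmseg): succeed only if the given
# directory contains mmseg's compiled dictionary file, uni.lib.
has_mmseg_dict() {
    dict_dir=${1%/}              # drop one trailing slash, if present
    [ -f "$dict_dir/uni.lib" ]
}

# Example: check the directory used in the snippet above.
if has_mmseg_dict /usr/local/etc/; then
    echo "uni.lib found"
else
    echo "uni.lib missing"
fi
```

The same check works for any other dictionary directory, for example /usr/local/mmseg3/etc.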

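    1. So we need to generate a dictionary file. Before that, we need to install coreseek; see http://www.coreseek.cn/products-install/install_on_bsd_linux/
    Before installing, it is recommended to read the README in the source package. Versions 4.0/4.1 can be installed by following the 3.2 steps, which are the same. If you run into problems, see the detailed installation instructions.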

    ## Download coreseek (3.2.14, 4.0.1, or 4.1)
    $ wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz
    $ # or: http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.0.1-beta.tar.gz
    $ # or: http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
    $ tar xzvf coreseek-3.2.14.tar.gz # or coreseek-4.0.1-beta.tar.gz, or coreseek-4.1-beta.tar.gz
    $ cd coreseek-3.2.14 # or coreseek-4.0.1-beta, or coreseek-4.1-beta

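    ## Prerequisite: install the operating system's basic development libraries and the MySQL client libraries first, to support the MySQL and XML data sources
    ## Install mmseg
    $ cd mmseg-3.2.14
    $ ./bootstrap # warnings in the output can be ignored; errors must be fixed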
    $ ./configure --prefix=/usr/local/mmseg3
    $ make && make install
    $ cd ..

    ## Install coreseek
    $ cd csft-3.2.14 # or csft-4.0.1, or csft-4.1
    $ sh buildconf.sh # warnings in the output can be ignored; errors must be fixed
    $ ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql ## if configure reports a MySQL problem, see the MySQL data source installation notes
    ## On Debian 5 / Ubuntu 9/10, install the MySQL and XML dependencies:
    $ apt-get install mysql-client libmysqlclient15-dev libxml2-dev libexpat1-dev

    $ make && make install
    $ cd ..

    ## Test mmseg segmentation and coreseek search (set the locale to zh_CN.UTF-8 beforehand so that Chinese displays correctly)
    $ cd testpack
    $ cat var/test/test.xml # Chinese text should display correctly here
    $ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml # the content of test.xml is segmented using the default vocabulary, which comes from the dictionary file "/usr/local/mmseg3/etc/uni.lib"
    $ /usr/local/coreseek/bin/indexer -c etc/csft.conf --all # rebuild the index
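The test step above depends on a UTF-8 locale being active. Here is a small sketch for checking that before running mmseg; the helper names is_utf8_locale and effective_ctype are hypothetical, and the suffix matching is a simplification that ignores locale modifiers such as "@euro".

```shell
# Hypothetical helper: succeed if a locale name ends in a UTF-8 charset suffix.
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF8|*.utf-8) return 0 ;;
        *) return 1 ;;
    esac
}

# POSIX precedence for the character type: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "locale OK: $(effective_ctype)"
else
    echo "warning: non-UTF-8 locale; try: export LANG=zh_CN.UTF-8" >&2
fi
```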

    2. Generate a new dictionary:
    # Write the new vocabulary to word_new_input.txt, one word per line, and cd to the directory where word_new_input.txt is located.
    # For example (without the # at the beginning of each line):
    #雅阁
    #马自达

    # Now cd into your new vocabulary directory:
    $ cd ~/projects/mmseg-3.2.14/new2
    $ awk '{print $1"\t1\nx:1"}' word_new_input.txt > word_new_gen.txt # each entry is two lines: word<TAB>1, then x:1
    $ cat ../data/unigram.txt >> word_new_gen.txt # append the default vocabulary so that it is kept in the new dictionary
    $ /usr/local/mmseg3/bin/mmseg -u word_new_gen.txt # generates the file word_new_gen.txt.lib
    $ mv word_new_gen.txt.lib uni.lib # rename
    #$ cp -r /usr/local/mmseg3/etc ~/ # optional: back up the existing dictionary files
    $ sudo cp uni.lib /usr/local/mmseg3/etc/ # replace the dictionary file with the new one
    ## Now cd into your coreseek-3.2.14/testpack directory
    $ /usr/local/coreseek/bin/indexer -c ~/projects/coreseek-3.2.14/testpack/etc/csft.conf --all # rebuild the index
    # The command above produces output like the following:
    Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)

    using config file 'etc/csft.conf'...
    indexing index 'xml'...
    collected 3 docs, 0.0 MB
    sorted 0.0 Mhits, 100.0% done
    total 3 docs, 7585 bytes
    total 0.010 sec, 746334 bytes/sec, 295.18 docs/sec
    total 2 reads, 0.000 sec, 4.2 kb/call avg, 0.0 msec/call avg
    total 7 writes, 0.000 sec, 3.1 kb/call avg, 0.0 msec/call avg

    # The new dictionary is now stored in /usr/local/mmseg3/etc/
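The word-list conversion in step 2 can be wrapped in a small function that also skips blank lines and repeated words. This is only a sketch: it assumes the two-line entry layout of the bundled data/unigram.txt (word, tab, frequency, then a placeholder line "x:<frequency>"), the helper name words_to_unigram is hypothetical, and the frequency of 1 is an arbitrary choice.

```shell
# Hypothetical helper: read one word per line on stdin and print unigram
# entries in the assumed two-line layout: "<word>\t1" followed by "x:1".
# Blank lines and words already seen in the input are skipped.
words_to_unigram() {
    awk 'NF && !seen[$1]++ { printf "%s\t1\nx:1\n", $1 }'
}

# Example with a duplicate and a blank line in the input.
printf '雅阁\n马自达\n\n雅阁\n' | words_to_unigram
```

With a function like this, the entries file could be produced with `words_to_unigram < word_new_input.txt > word_new_gen.txt` before the default unigram.txt is merged in.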
    3. Test the new dictionary:
    3.1 The file "var/test/newtest.txt" contains sentences that use the new vocabulary:
    $ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/newtest.txt
    雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x
    马自达/x 的/x 重量/x 是/x 多少/x ?/x
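Rather than reading the output by eye, the check can be scripted. The sketch below assumes the output format shown above (tokens written as "word/x" and separated by spaces); the helper name segmented_as_word is hypothetical.

```shell
# Hypothetical helper: read `mmseg -d` output on stdin and succeed only if
# the given word appears as a single token, i.e. as a field equal to "WORD/x".
segmented_as_word() {
    awk -v w="$1/x" '{ for (i = 1; i <= NF; i++) if ($i == w) found = 1 }
                     END { exit !found }'
}

# Example using the first output line shown above.
printf '雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x\n' | segmented_as_word 雅阁 && echo "雅阁: kept whole"
```

In practice the input would come from the real command, for example `/usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/newtest.txt | segmented_as_word 马自达`.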
    3.2 Or you can test it from the CoffeeScript REPL:

    david@Wade:~/node/node$ coffee
    coffee> mmseg=require('mmseg')
    { open: [Function],
    clean: [Function],
    uniq: [Function] }
    coffee> q= mmseg.open( '/usr/local/mmseg3/etc/')
    {}
    coffee> console.log q.segmentSync('我喜欢开雅阁')
    [ '我', '喜欢', '开', '雅阁' ]
    undefined
    coffee> console.log q.segmentSync('我喜欢开丰田') #丰田 is NOT in the new dictionary
    [ '我', '喜欢', '开', '丰', '田' ]
    undefined
    coffee> console.log q.segmentSync '我喜欢开马自达'
    [ '我', '喜欢', '开', '马自达' ]


  • Original post: https://www.cnblogs.com/no7dw/p/3553911.html