POS Tagger in OpenNLP

5 POS Tagger

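Overview: The part-of-speech (POS) tagger marks each token with its word type, based on the token itself and the context in which it appears. A token may have several possible POS tags depending on the token and its context. The OpenNLP POS tagger uses a probability model to predict the correct tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves both the tagging and the runtime performance of the tagger.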
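To make the tag dictionary idea concrete, here is a minimal plain-Java sketch. It is not OpenNLP's implementation: the class `TagDictionarySketch`, the method `bestTag`, and the score values are all hypothetical, and the scores merely stand in for the probabilities a trained model would produce. When a word has a dictionary entry, only its listed tags are considered; otherwise every scored tag is a candidate.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TagDictionarySketch {

    /**
     * Picks the highest-scoring tag for a word. If the dictionary lists the
     * word, only the listed tags are candidates; otherwise all scored tags are.
     * Returns null when no scored tag is allowed.
     */
    public static String bestTag(String word,
                                 Map<String, Double> tagScores,
                                 Map<String, List<String>> tagDictionary) {
        List<String> allowed = tagDictionary.get(word);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : tagScores.entrySet()) {
            if (allowed != null && !allowed.contains(entry.getKey())) {
                continue; // tag ruled out by the dictionary
            }
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores for the token "you"; a real model would compute these.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("JJ", 0.45);
        scores.put("PRP", 0.40);
        scores.put("NN", 0.15);

        // Hypothetical dictionary: "you" may only be a personal pronoun.
        Map<String, List<String>> dictionary = Map.of("you", List.of("PRP"));

        System.out.println(bestTag("you", scores, Map.of()));   // no entry: best overall tag
        System.out.println(bestTag("you", scores, dictionary)); // restricted to PRP
    }
}
```

With these invented scores, the unrestricted call picks JJ and the restricted call picks PRP. Skipping ruled-out tags is also where the runtime gain would come from, since fewer candidates have to be compared.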

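API: The POS tagger training API supports training a new POS model. Three basic steps are needed:

The application must open a sample data stream
Call the POSTaggerME.train method
Save the POSModel to a file or database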
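As a small illustration of the first step, the sketch below parses one line of training data in plain Java. It assumes the word_TAG format that OpenNLP's POS training tools conventionally read: one sentence per line, tokens separated by whitespace, and each token joined to its tag by an underscore. The class `PosTrainingLine` and its `parse` method are hypothetical helpers written for this example and are not part of OpenNLP; in OpenNLP itself the samples would be read by a sample stream and handed to the train method from step two.

```java
import java.util.ArrayList;
import java.util.List;

public class PosTrainingLine {

    /** Splits one "word_TAG word_TAG ..." line into [word, tag] pairs. */
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Split at the LAST underscore so words containing '_' stay intact.
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Not in word_TAG form: " + token);
            }
            pairs.add(new String[] {token.substring(0, split), token.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One training sentence in the assumed word_TAG format.
        for (String[] pair : parse("This_DT is_VBZ Mike_NNP ._.")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Rejecting malformed tokens early is a deliberate choice here: a token without a tag would otherwise slip into the training data unnoticed.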

Create a file named myText.txt on drive E: with the following content:

Hi. How are you? This is Mike.

Implementation 1:

package package01;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;

public class Test05 {

    public static void main(String[] args) {
        try {
            Test05.POSTag();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * 4. POS Tagger
     * Expected output: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
     *
     * See https://stackoverflow.com/questions/50668754/the-constructor-plaintextbylinestreamstringreader-is-undefined
     */
    public static void POSTag() throws IOException {
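        // Backslashes in Java string literals must be escaped
        POSModel model = new POSModelLoader().load(new File("E:\\NLP_Practics\\models\\en-pos-maxent.bin"));
        // Reports throughput; "sent" is the unit label (sentences)
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");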
        POSTaggerME tagger = new POSTaggerME(model);
//        String input = "Hi. How are you? This is Mike.";
//        ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(input));

        Charset charset = Charset.forName("UTF-8");
        InputStreamFactory isf = new MarkableFileInputStreamFactory(new File("E:\\myText.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(isf, charset);

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
        System.out.println("--------------4-------------");
        lineStream.close();
    }
}

Output:

Loading POS Tagger model ... done (0.566s)
Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
--------------4-------------


Average: 125.0 sent/s 
Total: 1 sent
Runtime: 0.008s

Implementation 2:

package package01;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class Test05 {

    public static void main(String[] args) throws IOException {

        Test05.POSMaxent("Hi. How are you? This is Mike.");
    }

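    /**
     * OpenNLP POS tagging example using the maximum-entropy tagger (pos-maxent).
     * JJ: adjective, JJR: comparative adjective, JJS: superlative adjective
     * RB: adverb, RBR: comparative adverb, RBS: superlative adverb
     * DT: determiner
     * NN: noun, NNS: plural noun, NNP: proper noun, NNPS: plural proper noun
     * PRP: personal pronoun, PRP$: possessive pronoun
     * VB: verb, base form; VBD: past tense; VBN: past participle;
     * VBZ: present tense, third person singular; VBP: present tense, non-third person singular;
     * VBG: gerund or present participle
     */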
    public static void POSMaxent(String str) throws InvalidFormatException, IOException {
        // Path to the POS model file (backslashes must be escaped)
        File posModeFile = new File("E:\\NLP_Practics\\models\\en-pos-maxent.bin");
        FileInputStream posModeStream = new FileInputStream(posModeFile);
        POSModel model = new POSModel(posModeStream);
        POSTaggerME tagger = new POSTaggerME(model);
        // Split the sentence into tokens
        String[] words = SimpleTokenizer.INSTANCE.tokenize(str);
        // Pass the tokenized sentence to the tagger
        String[] result = tagger.tag(words);
        for (int i = 0; i < words.length; i++) {
            System.out.print(words[i] + "/" + result[i] + " ");
        }
        System.out.println(); // Output: Hi/UH ./. How/WRB are/VBP you/PRP ?/. This/DT is/VBZ Mike/NNP ./.
        posModeStream.close();
    }

}

Output:

Hi/UH ./. How/WRB are/VBP you/PRP ?/. This/DT is/VBZ Mike/NNP ./.
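The two runs tag the same sentence differently because they tokenize it differently: the first splits on whitespace only, so "you?" reaches the tagger as a single token, while the second separates punctuation from words. The plain-Java sketch below imitates both behaviours so the difference is visible without a model file. The class `TokenizerComparison` and its methods are hypothetical and only approximate what `WhitespaceTokenizer` and `SimpleTokenizer` do; in particular, the character-class rule used here is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerComparison {

    /** Splits on runs of whitespace only; punctuation stays attached to words. */
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Groups letters and digits into tokens; every other visible character stands alone. */
    public static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
                continue;
            }
            // A non-alphanumeric character ends the current word, if any.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Hi. How are you? This is Mike.";
        System.out.println(whitespaceTokens(sentence));
        System.out.println(punctuationAwareTokens(sentence));
    }
}
```

For the sample sentence, the first method yields 7 tokens ("Hi.", "How", "are", "you?", ...) and the second yields 10 ("Hi", ".", "How", "are", "you", "?", ...), matching the token boundaries seen in the two outputs above. This is likely why "you?" is tagged JJ in the first run but "you" is tagged PRP in the second.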

https://github.com/godmaybelieve

Original source: https://www.cnblogs.com/yuyu666/p/15029748.html