• Python Natural Language Processing Study Notes (47): 5.8 Summary


    5.8 Summary

    Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts-of-speech. Parts-of-speech are assigned short labels, or tags, such as NN and VB.
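
    A tagged token is commonly written as a word and its tag joined by a slash (for example fly/NN) and handled in code as a (word, tag) pair. The following is a minimal sketch of that conversion. The helper name parse_tagged_token and the sample sentence are illustrative assumptions, not part of the original notes; NLTK offers a comparable utility, nltk.tag.str2tuple.

```python
def parse_tagged_token(token, sep="/"):
    """Split 'word/TAG' into (word, 'TAG'); the tag is None if sep is absent."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())


# Hypothetical sample sentence in word/TAG notation.
sentence = "The/AT grand/JJ jury/NN commented/VBD"
tagged = [parse_tagged_token(t) for t in sentence.split()]
print(tagged)
```

    Splitting on the last separator keeps tokens that themselves contain a slash intact.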

    The process of automatically assigning parts-of-speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.

    Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.

    Some linguistic corpora, such as the Brown Corpus, have been POS tagged.

    A variety of tagging methods are possible, e.g., default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.
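
    To make the simplest of these methods concrete, here is a rough sketch of a regular expression tagger whose final catch-all pattern plays the role of a default tagger. The pattern list and the function name regexp_tag are assumptions chosen for illustration; this is plain Python, not NLTK's own RegexpTagger.

```python
import re

# Hypothetical patterns, tried in order; the first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r".*", "NN"),                      # default: tag everything else as a noun
]


def regexp_tag(words, patterns=PATTERNS):
    """Tag each word with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for word in words:
        for regex, tag in compiled:
            if regex.match(word):
                tagged.append((word, tag))
                break
    return tagged


print(regexp_tag(["running", "walked", "dogs", "42", "cat"]))
```

    Such rules are cheap to write but only guess from surface form, which is why they usually serve as a fallback for statistical taggers.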

    Taggers can be trained and evaluated using tagged corpora.
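
    As a sketch of what training and evaluation involve, the code below builds a unigram model (each word's most frequent tag) from a tiny hand-made training set and scores it on a held-out sentence. The data and the function names train_unigram, tag_unigram and accuracy are invented for illustration; a real experiment would use a tagged corpus such as Brown and a tagger class such as NLTK's UnigramTagger.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_unigram(model, words):
    """Tag words with the model; unseen words get None."""
    return [(w, model.get(w)) for w in words]


def accuracy(model, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, guess), (_, gold) in zip(tag_unigram(model, words), sent):
            correct += guess == gold
            total += 1
    return correct / total if total else 0.0


# Toy data (assumed for the example).
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
held_out = [[("the", "AT"), ("dog", "NN"), ("sleeps", "VBZ"), ("soundly", "RB")]]

model = train_unigram(train)
score = accuracy(model, held_out)
print(score)  # 0.75: "soundly" never appeared in training
```

    Scoring on sentences that were not used for training is what makes the number meaningful; evaluating on the training data itself would overstate performance.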

    Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
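
    The backoff idea can be sketched as a chain of lookups. In this simplified version the models are plain dictionaries with made-up entries, and the function name backoff_tag is hypothetical; in NLTK the same effect is obtained by passing a backoff tagger to a tagger's constructor.

```python
def backoff_tag(words, bigram_model, unigram_model, default_tag="NN"):
    """Try the bigram model, then the unigram model, then a default tag."""
    tagged = []
    prev_tag = None  # None marks the start of the sentence
    for word in words:
        tag = bigram_model.get((prev_tag, word))   # most specialized
        if tag is None:
            tag = unigram_model.get(word)          # more general
        if tag is None:
            tag = default_tag                      # most general
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Toy models (assumed): the bigram model knows "wind" is a verb after TO.
unigram_model = {"to": "TO", "the": "AT", "wind": "NN", "blows": "VBZ"}
bigram_model = {("TO", "wind"): "VB"}

print(backoff_tag("to wind the clock".split(), bigram_model, unigram_model))
print(backoff_tag("the wind blows".split(), bigram_model, unigram_model))
```

    The specialized model is consulted first because it is more accurate where it has evidence; the general models only fill the gaps it leaves.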

    Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.

    A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
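
    A short sketch of these dictionary operations follows. The pos entries come from the text above; the sample sentence and the inverted index words_by_tag are additions made for illustration.

```python
from collections import defaultdict

# Brace notation, then adding one more entry by key.
pos = {"furiously": "adv", "ideas": "n", "colorless": "adj"}
pos["sleep"] = "v"

# Counting word frequencies; defaultdict(int) starts every count at 0.
freq = defaultdict(int)
for word in "the cat sat on the mat with the other cat".split():
    freq[word] += 1

# Inverting the mapping: from tag to the list of words carrying it.
words_by_tag = defaultdict(list)
for word, tag in pos.items():
    words_by_tag[tag].append(word)

print(freq["cat"], freq["the"])
print(dict(words_by_tag))
```

    Using defaultdict avoids a KeyError on the first occurrence of a key, which keeps counting and grouping loops short.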

    N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
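
    The sparse data problem can be illustrated by counting contexts. Assuming an n-gram tagger's context is the current word plus the n-1 preceding tags, the number of possible contexts grows as (number of tags)^(n-1) × (vocabulary size), while the number actually observed is bounded by the corpus size. The sketch below compares the two on toy data; the function name context_coverage and the data are assumptions, and sentence-initial positions without a full history are simply skipped.

```python
def context_coverage(tagged_sents, n):
    """Return (contexts seen, contexts possible) for an n-gram tagger."""
    tagset = {tag for sent in tagged_sents for _, tag in sent}
    vocab = {word for sent in tagged_sents for word, _ in sent}
    seen = set()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        # Only positions with a full history of n-1 preceding tags are counted.
        for i in range(n - 1, len(sent)):
            history = tuple(tags[i - n + 1:i])
            seen.add((history, sent[i][0]))
    possible = len(tagset) ** (n - 1) * len(vocab)
    return len(seen), possible


# Toy corpus (assumed): 3 tags, 5 distinct words.
sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

for n in (1, 2, 3):
    seen, possible = context_coverage(sents, n)
    print(n, seen, possible)
```

    Even on this tiny corpus the observed share drops from 5 of 5 contexts for n = 1 to 2 of 45 for n = 3; with a realistic tagset and vocabulary the gap becomes far larger.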

    Transformation-based tagging involves learning a series of repair rules of the form “change tag s to tag t in context c,” where each rule fixes mistakes and possibly introduces a (smaller) number of errors.
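
    Applying such repair rules can be sketched as follows. Each rule here is a hypothetical tuple (from_tag, to_tag, offset, context_tag), meaning "change from_tag to to_tag when the tag at the given relative position is context_tag". The rules and the initial tagging are hand-written assumptions; a real transformation-based tagger learns its rules from a tagged corpus, and this sketch covers only the application step.

```python
def apply_rule(tags, from_tag, to_tag, offset, context_tag):
    """Return a copy of tags with from_tag changed to to_tag wherever
    the tag at position i + offset equals context_tag."""
    result = list(tags)
    for i, tag in enumerate(tags):
        j = i + offset
        # Conditions are checked against the tags as they were before this rule.
        if tag == from_tag and 0 <= j < len(tags) and tags[j] == context_tag:
            result[i] = to_tag
    return result


words = "to increase grants to states".split()
tags = ["TO", "NN", "NNS", "TO", "NNS"]   # assumed initial (e.g. unigram) tagging

rules = [
    ("NN", "VB", -1, "TO"),    # NN -> VB when the previous tag is TO
    ("TO", "IN", +1, "NNS"),   # TO -> IN when the next tag is NNS
]
for rule in rules:
    tags = apply_rule(tags, *rule)

print(list(zip(words, tags)))
```

    Rules are applied in order, so a later rule sees the corrections made by earlier ones; this ordering is how a rule list can repair errors introduced by a preceding rule.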

  • Original article: https://www.cnblogs.com/yuxc/p/2160125.html
Copyright © 2020-2023  润新知