• 自然语言15.1_Part of Speech Tagging 词性标注


     python机器学习生物信息学系列课(博主录制):http://dwz.date/b9vw

    项目合作QQ:231469242

    https://en.wikipedia.org/wiki/Part-of-speech_tagging

    In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

    Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

    Contents

    Principle

    Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

    The sailor dogs the hatch.

    Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").

    Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. Linguists distinguish parts of speech to various fine degrees, reflecting a chosen "tagging system".

    In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. For example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus). Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech, and found that about as many words were ambiguous there as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as 'Ncmsan for Category=Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.

    History

    The Brown Corpus

    Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kučera and W. Nelson Francis, in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

    The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

    This corpus has been used for innumerable studies of word-frequency and of part-of-speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS (linguistics) and VOLSUNGA. However, by this time (2005) it has been superseded by larger corpora such as the 100 million word British National Corpus.

    For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

    Use of Hidden Markov Models

    In the mid 1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.

    More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.

    When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.

    It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing (1997) [2], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous.

    CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech (DeRose 1990, p. 82)).

    HMMs underlie the functioning of stochastic taggers and are used in various algorithms one of the most widely used being the bi-directional inference algorithm.[1]

    Dynamic programming methods

    In 1987, Steven DeRose[2] and Ken Church[3] independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

    These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov Models are now the standard method for part-of-speech assignment.

    Unsupervised taggers

    The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.

    These two categories can be further subdivided into rule-based, stochastic, and neural approaches.

    Other taggers and methods

    Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill Tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity. Unlike the Brill tagger where the rules are ordered sequentially, the POS and morphological tagging toolkit RDRPOSTagger stores rules in the form of a Ripple Down Rules tree.

    Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, Maximum entropy classifier, Perceptron, and Nearest-neighbor have all been tried, and most can achieve accuracy above 95%.

    A direct comparison of several methods is reported (with references) at [3]. This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable.

    However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported there are the best that can be achieved with a given approach; nor even the best that have been achieved with a given approach.

    A more recent development is using the structure regularization method for part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.[4]

    Issues

    While there is broad agreement about basic categories, a number of edge cases make it difficult to settle on a single "correct" set of tags, even in a single language such as English. For example, it is hard to say whether "fire" is an adjective or a noun in

     the big green fire truck
    

    A second important example is the use/mention distinction, as in the following example, where "blue" could be replaced by a word from any POS (the Brown Corpus tag set appends the suffix "-NC" in such cases):

     the word "blue" has 4 letters.
    

    Words in a language other than that of the "main" text are commonly tagged as "foreign", usually in addition to a tag for the role the foreign word is actually playing in context.

    There are also many cases where POS categories and "words" do not map one to one, for example:

     David's
     gonna
     don't
     vice versa
     first-cut
     cannot
     pre- and post-secondary
     look (a word) up
    

    In the last example, "look" and "up" arguably function as a single verbal unit, despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems.

    词性标注

    http://www.hankcs.com/nlp/part-of-speech-tagging.html

    词性标注(Part-of-Speech tagging 或POS tagging),又称词类标注或者简称标注,是指为分词结果中的每个单词标注一个正确的词性的程序,也即确定每个词是名词、动词、形容词或其他词性的过 程。在汉语中,词性标注比较简单,因为汉语词汇词性多变的情况比较少见,大多词语只有一个词性,或者出现频次最高的词性远远高于第二位的词性。据说,只需 选取最高频词性,即可实现80%准确率的中文词性标注程序。

    利用HMM即可实现更高准确率的词性标注,本文旨在介绍HanLP中的词性标注模块。

    开源项目

    本文代码已集成到HanLP中开源:http://www.hankcs.com/nlp/hanlp.html

    训练

    HanLP中使用了一阶隐马模型,在这个隐马尔可夫模型中,隐状态是词性,显状态是单词。

    语料库

    训练语料采用了2014人民日报切分语料:

    1. 人民网/nz 11日/讯/ng 据/《/[纽约/nsf 时报/n]/nz 》/报道/,/美国/nsf 华尔街/nsf 股市/在/2013年/的/ude1 最后/一天/mq 继续/上涨/vn ,/和/cc [全球/股市/n]/nz 一样/uyy ,/都/以/[最高/纪录/n]/nz 或/接近/[最高/纪录/n]/nz 结束/本/rz 年/qt 的/ude1 交易/vn 。/
    2. 《/[纽约/nsf 时报/n]/nz 》/报道/说/,/标普/nz 500/指数/今年/上升/vi 29.6%/m ,/为/1997年/以来/的/ude1 最大/gm 涨幅/;/[道琼斯/ntc 工业/平均/指数/n]/nz 上升/vi 26.5%/m ,/为/1996年/以来/的/ude1 最大/gm 涨幅/;/[纳斯/nrf 达/克/q]/nz 上涨/vi 38.3%/m 。/
    3. 就/1231日/来说/uls ,/由于/就业/vn 前景/看好/和/cc [经济/增长/v]/nz 明年/可能/加速/vn ,/消费者/信心/上升/vi 。/工商/协进会/nis (/ConferenceBoard/)/报告/,/12月/消费者/信心/上升/vi 到/78.1/,/明显/高于/11月/的/ude1 72/。/
    4. 另据/nz 《/[华尔街/nsf 日报/n]/nz 》/报道/,/2013年/是/vshi 1995年/以来/[美国/nsf 股市/n]/nz 表现/最好/的/ude1 一年/mq 。/这/rzv 一年/mq 里/,/投资/[美国/nsf 股市/n]/nz 的/ude1 明智/做法/是/vshi 追/着/uzhe “/傻钱/nz ”/跑/。/所谓/的/ude1 “/傻钱/nz ”/策略/,/其实/就是/买入/vn 并/cc 持有/美国/nsf 股票/这样/rzv 的/ude1 普通/组合/vn 。/这个/rz 策略/要/比/[对冲/vn 基金/n]/nz 和/cc 其它/rz 专业/投资者/nnd 使用/的/ude1 更为/复杂/的/ude1 投资/vn 方法/效果/好/得/ude3 多/。/(/老/任/)/w

    单词词性频次词典

    统计所有单词的各个词性的出现频次,得到核心词典:

    1.  v 3622 vn 598
    2. 爱因斯坦 nrf 20
    3. 爱国 a 178
    4. 爱国主义 n 68
    5. 飙升 v 200 vn 8
    6. 顺风 vi 27 vn 2
    7. 顺风吹火 i 1
    8. 顺风球 n 1
    9. 顺风耳 n 4
    10. 顺风车 nz 126
    11.  v 217 vg 151 vn 106
    12. 购书 v 7 vn 5
    13. 购买 v 3875 vn 637
    14. 购买人 n 7
    15. 购买力 n 42
    16. 购买户 n 1
    17. 购买欲 n 1
    18. 购买群 n 1
    19. 购买者 n 93
    20. 购买证 n 1
    21. 购入 v 115 vn 18
    22. ……

    从词典可以看出,汉语词汇的确词性单一,且存在歧义的词性多集中在“动词v”和“名动词vn”上。另外,我拿到的2014人民日报切分语料感觉没有经过严格的人工校对,许多单词词性单一,且存在不少错误。也许等我有机会(经济实力或学术背景),可以拿更高质量的语料来训练。所幸HanLP同时维护了一个通用的语料处理包,暂且埋下伏笔吧。

    转移矩阵

    统计每个标签的转移频次,得到如下转移矩阵:

    事实上,完整的转移矩阵非常大,请下载观看:词性标注 转移矩阵.xls

    标注

    利用上述转移矩阵和核心词典词频可以计算出HMM中的初始概率、转移概率、发射概率,进而完成求解。关于维特比算法和实现请参考《通用维特比算法的Java实现》。

    测试

    以“我的就是自然语言处理”为例:

    1.         String text = "我的爱就是爱自然语言处理";
    2.         Segment segment = new Segment();
    3.  
    4.         System.out.println("未标注:" + segment.seg(text));
    5.         segment.enableSpeechTag(true);
    6.         System.out.println("标注后:" + segment.seg(text));

    输出

    未标注:[我/rr, 的/ude1, 爱/v, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

    标注后:[我/rr, 的/ude1, 爱/vn, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

    前后两个“爱”的词性并不相同,前者是名动词,后者是动词。

    再比如

    未标注:[教授/nnt, 正在/d, 教授/nnt, 自然语言/gm, 处理/vn, 课程/n]

    标注后:[教授/nnt, 正在/d, 教授/v, 自然语言/gm, 处理/vn, 课程/n]

    HanLP的词性标注初见成效。

    HanLP词性标注集

    HanLP使用的HMM词性标注模型训练自2014年人民日报切分语料,随后增加了少量98年人民日报中独有的词语。所以,HanLP词性标注集兼容《ICTPOS3.0汉语词性标记集》,并且兼容《现代汉语语料库加工规范——词语切分与词性标注》。

    It is unclear whether it is best to treat words such as "be", "have", and "do" as categories in their own right (as in the Brown Corpus), or as simply verbs (as in the LOB Corpus and the Penn Treebank). "be" has more forms than other English verbs, and occurs in quite different grammatical contexts, complicating the issue.

    The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use, and include versions for multiple languages.

    POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit may be virtually impossible. At the other extreme, Petrov, D. Das, and R. McDonald ("A Universal Part-of-Speech Tagset" http://arxiv.org/abs/1104.2086) have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition, etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable, depends on the purpose at hand. Automatic tagging is easier on smaller tag-sets.

    A different issue is that some cases are in fact ambiguous. Beatrice Santorini gives examples in "Part-of-speech Tagging Guidelines for the Penn Treebank Project," (3rd rev, June 1990 [4]), including the following (p. 32) case in which entertaining can be either an adjective or a verb, and there is no syntactic way to decide:

     The Duchess was entertaining last night.

     python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)

    https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

  • 相关阅读:
    java虚拟机理解探索1
    Java线程面试题 Top 50(转载)
    (转载)浅谈我对DDD领域驱动设计的理解
    最大堆
    利用筛法求质数
    递归算法及优化
    java 根据传入的时间获取当前月的第一天的0点0分0秒和最后一天的23点59分59秒
    Uncaught Error: Bootstrap's JavaScript requires jQuery version 1.9.1 or higher, but lower than version 3
    mysql 子查询问题
    微信7.0以后更新后H5页面定位不准确
  • 原文地址:https://www.cnblogs.com/webRobot/p/6086551.html
Copyright © 2020-2023  润新知