• Extracting Information from Text with NLTK


    Real-world data is mostly "unstructured" (e.g., plain txt documents) or "semi-structured" (e.g., HTML), and extracting useful information from such data requires special techniques. If all data were "structured" (e.g., XML or a relational database), no extraction step would be needed at all: you could retrieve any information you want directly via the metadata/schema.

    So let's discuss how to implement text information extraction with NLTK.

    First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

    As described here, information extraction involves four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.

    Chunking

    The basic technique we will use for entity recognition is chunking, which segments and labels multi-token sequences.

    The most basic technique for entity recognition is chunking, i.e., segmenting the text into chunks, which you can think of as grouping multiple tokens into phrases.

    Noun Phrase Chunking

    Let's start with noun phrase chunking, i.e. NP-chunking, as an example.

    One of the most useful sources of information for NP-chunking is part-of-speech tags.

    >>> import nltk
    >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    >>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: optional determiner, any number of adjectives, one noun
    >>> cp = nltk.RegexpParser(grammar)
    >>> result = cp.parse(sentence)
    >>> print(result)
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)  # NP chunk: the little yellow dog
      barked/VBD
      at/IN
      (NP the/DT cat/NN))                     # NP chunk: the cat
    The method above uses a regular expression over the tag sequence (a tag pattern) to find the NP chunks.
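Once the parse tree is built, the NP chunks can also be pulled out programmatically by walking its subtrees (a short sketch using the same sentence; `label()` and `leaves()` are the current NLTK `Tree` API):

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# Every NP chunk is a subtree labeled "NP"; its leaves are (word, tag) pairs.
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees()
          if subtree.label() == "NP"]
print(chunks)  # → ['the little yellow dog', 'the cat']
```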

    Another example: a grammar can contain multiple tag patterns and can become more complex.

    >>> grammar = r"""
    ...   NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive, adjectives and noun
    ...       {<NNP>+}               # chunk sequences of proper nouns
    ... """
    >>> cp = nltk.RegexpParser(grammar)
    >>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
    >>> print(cp.parse(sentence))
    (S
      (NP Rapunzel/NNP)                        # NP chunk: Rapunzel
      let/VBD
      down/RP
      (NP her/PP$ long/JJ golden/JJ hair/NN))  # NP chunk: her long golden hair

    The next example shows how to find matching part-of-speech combinations across a corpus.

    >>> cp = nltk.RegexpParser("CHUNK: {<V.*> <TO> <V.*>}")  # find 'verb to verb' combinations
    >>> brown = nltk.corpus.brown
    >>> for sent in brown.tagged_sents():
    ...     tree = cp.parse(sent)
    ...     for subtree in tree.subtrees():
    ...         if subtree.label() == "CHUNK": print(subtree)
    ...
    (CHUNK combined/VBN to/TO achieve/VB)
    (CHUNK continue/VB to/TO place/VB)
    (CHUNK serve/VB to/TO protect/VB)
    (CHUNK wanted/VBD to/TO wait/VB)
    (CHUNK allowed/VBN to/TO place/VB)
    (CHUNK expected/VBN to/TO become/VB)



  • Original article: https://www.cnblogs.com/fxjwind/p/2097734.html