• NLTK stopwords and rare words



    I. Stopwords

    Stopwords: text that is unrelated to the actual topic at hand and carries no real meaning for NLP tasks (information retrieval, classification). Articles and pronouns are typically treated as stopwords; they are generally unambiguous, so removing them has little impact.

    Typically, the stopword list for a given language is hand-curated: a cross-corpus list targeting the most common words. You can use an existing list found online, or generate one automatically from a given corpus.
    One simple way to generate a stopword list is to rank words by how frequently they occur in the documents, as in the sketch below.
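
    A minimal sketch of the frequency-based approach, assuming NLTK's Gutenberg corpus and a top-50 cutoff (both are illustrative choices, not part of the original post):

    import nltk
    from nltk.corpus import gutenberg  # any corpus works; Gutenberg is just an example

    # Rank words by corpus frequency and take the most frequent ones
    # as candidate stopwords
    words = [w.lower() for w in gutenberg.words('austen-emma.txt') if w.isalpha()]
    freq_dist = nltk.FreqDist(words)
    candidate_stopwords = [w for w, _ in freq_dist.most_common(50)]
    candidate_stopwords[:5]  # e.g. ['the', 'to', 'and', 'of', 'i']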

    The NLTK library ships with stopword lists for 23 languages (see the listing below).


    Related module:

    • nltk.corpus.stopwords

    1. Viewing the stopwords

    from nltk.corpus import stopwords  # load the stopwords corpus

    # The README contains many '\n' characters, so replace them with spaces
    # to make it easier to read
    stopwords.readme().replace('\n', ' ')
     
    '''
        'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '
    
    '''
    
    # List the available stopword lists, one per language; note there is no Chinese support
    stopwords.fileids() 
     
    
    '''
        ['arabic',
         'azerbaijani',
         'danish',
         'dutch',
         'english',
         'finnish',
         'french',
         'german',
         'greek',
         'hungarian',
         'indonesian',
         'italian',
         'kazakh',
         'nepali',
         'norwegian',
         'portuguese',
         'romanian',
         'russian',
         'slovene',
         'spanish',
         'swedish',
         'tajik',
         'turkish']
    '''
     
    # View the English stopword list as a single string ('\n' replaced with spaces)
    stopwords.raw('english').replace('\n', ' ')
     
    '''
        "i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
    
    '''
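
    raw('english') returns the file contents as one string; stopwords.words('english'), used in the next step, returns the same list already split into individual words:

    english_stopwords = stopwords.words('english')
    len(english_stopwords)   # e.g. 179 in recent NLTK releases
    english_stopwords[:5]    # ['i', 'me', 'my', 'myself', 'we']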
    
    
    

    2. Filtering out stopwords

    
    
     
    from nltk.tokenize import word_tokenize

    # The original post does not show the source text; this sentence is
    # reconstructed from the token set printed below
    text = 'Browse the latest developer documentation, including API reference, sample code, articles, and tutorials.'
    tokens = word_tokenize(text)

    test_words = [word.lower() for word in tokens]

    # Convert to a set so we can intersect it with the stopword list
    test_words_set = set(test_words)
    
    test_words_set
    '''
        {',',
         '.',
         'and',
         'api',
         'articles',
         'browse',
         'code',
         'developer',
         'documentation',
         'including',
         'latest',
         'reference',
         'sample',
         'the',
         'tutorials'}
    '''
    
    # Check the intersection with the English stopword list
    stopwords_english = set(stopwords.words('english'))
    test_words_set.intersection(stopwords_english)
     
     #   {'and', 'the'}
    
    # Filter the stopwords out
    filtered = [w for w in test_words_set if w not in stopwords_english]
    
    filtered
    '''
        ['documentation',
         'api',
         'tutorials',
         'articles',
         '.',
         'including',
         'latest',
         'code',
         'sample',
         'developer',
         ',',
         'reference',
         'browse']
    '''
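
    Note that filtering the set discards the original token order and any duplicates. To keep the sequence intact, filter the ordered token list instead:

    # Filter the ordered list rather than the set, preserving order and duplicates
    filtered_tokens = [w for w in test_words if w not in stopwords_english]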
    

    II. Rare words

    Remove tokens that are essentially noise.
    Different scenarios call for different rules, e.g. HTML tags, overly long names, and so on; a small rule sketch follows.
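
    A sketch of such rules (the tag pattern and the length cutoff of 20 are illustrative assumptions; tune them per scenario):

    import re

    def is_noise(token, max_len=20):
        # Treat anything that looks like an HTML tag, or any overly long
        # token, as noise
        if re.fullmatch(r'</?\w+[^>]*>', token):
            return True
        return len(token) > max_len

    sample_tokens = ['<br>', 'arXiv', 'supercalifragilisticexpialidocious', 'archive']
    [t for t in sample_tokens if not is_noise(t)]  # ['arXiv', 'archive']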

    import nltk
    from nltk.tokenize import word_tokenize
    
    # Use a name other than str so the built-in type is not shadowed
    text = 'arXiv is a free distribution service and an open-access archive for 1,812,439 scholarly articles. Materials on this site are not peer-reviewed by arXiv.'
    tokens = word_tokenize(text)
    
    # Get the frequency distribution of the tokens in the corpus
    freq_dist = nltk.FreqDist(tokens)
    '''
    FreqDist({'arXiv': 2, '.': 2, 'is': 1, 'a': 1, 'free': 1, 'distribution': 1, 'service': 1, 'and': 1, 'an': 1, 'open-access': 1, ...})
    '''
    
    # Take the least frequent words as a rare-word list, then use it to filter
    # the original corpus. In Python 3, freq_dist.keys() cannot be sliced, so use
    # most_common(), which returns (word, count) pairs sorted by frequency.
    rarewords = [w for w, _ in freq_dist.most_common()[-5:]]
    rarewords

    after_rare_words = [w for w in tokens if w not in rarewords]
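
    For the common special case of words that occur exactly once, FreqDist also provides hapaxes():

    # hapaxes() returns every token whose frequency is 1
    hapax_set = set(freq_dist.hapaxes())
    after_hapax_removal = [w for w in tokens if w not in hapax_set]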
    
