• ahocorasick: from installation to usage


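    Introduction:

    pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.

    A trie is a dictionary indexed by strings; looking up an item takes time proportional to the length of the key.

    An Aho-Corasick (AC) automaton finds every occurrence of the strings in a given set in a single pass over the input. It is essentially KMP implemented on a trie, which lets it match multiple patterns at once.
    (Recommended reading: http://blog.csdn.net/niushuai666/article/details/7002823 and http://www.cnblogs.com/kuangbin/p/3164106.html)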
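To make the "KMP on a trie" idea concrete, here is a minimal pure-Python sketch of the technique. It is not pyahocorasick's implementation (the library is a C extension); the names `build_automaton` and `find_all` are hypothetical and exist only for this illustration. Phase 1 builds an ordinary trie, and phase 2 adds failure links in breadth-first order, which play the role of the KMP fallback table.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links; returns (goto, fail, out) tables."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end at this state

    # Phase 1: plain trie insertion.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the fallback state also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, tables):
    """Scan text once and return (end_index, word) for every match."""
    goto, fail, out = tables
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i, word))
    return matches


tables = build_automaton(["he", "her", "hers", "she"])
print(find_all("_hershe_", tables))
# [(2, 'he'), (3, 'her'), (4, 'hers'), (6, 'she'), (6, 'he')]
```

The text is read exactly once; when a character does not fit the current trie path, the failure link jumps to the longest suffix that is still a prefix of some word, so no input character is re-read.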

    Author

    Wojciech Muła, wojciech_mula@poczta.onet.pl
    Official page:
    https://pypi.python.org/pypi/pyahocorasick/

    Installing with pip may fail with an error such as:

    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"

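    No prebuilt wheel (.whl) was available, so pip has to compile the C extension, and that requires the Build Tools. Install them with the default options and restart; after that the pip install succeeds.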

    Here is a Baidu Netdisk link for it:

    Link: https://pan.baidu.com/s/13Fv3dPYQNq6u_ErO9nqCAw
    Extraction code: nhll
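After installing, a quick way to confirm that the module is importable is sketched below. Note that the package is installed as `pyahocorasick` but imported as `ahocorasick`. The helper name `is_installed` is hypothetical; it only wraps the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ahocorasick"):
    print("pyahocorasick is available")
else:
    print("pyahocorasick is missing; install the Build Tools, then rerun pip")
```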

    The two examples below show how to use it:

    import ahocorasick
    A = ahocorasick.Automaton()
    
    # Add words to the trie
    for index, word in enumerate("he her hers she".split()):
        A.add_word(word, (index, word))
    # Signature: add_word(word, [value]) => bool
    # How value is handled depends on the store argument of the Automaton constructor:
    # 1. STORE_LENGTH: value must not be passed; len(word) is stored
    # 2. STORE_INTS: value is optional but must be an int; the default is len(automaton)
    # 3. STORE_ANY: value is required and may be of any type
    
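    # Check whether a word is in the trie
    if "he" in A:
        print(True)
    else:
        print(False)
    A.get("he")
    # (0, 'he')
    A.get("cat", "<not exists>")
    # '<not exists>'
    try:
        A.get("dog")
    except KeyError:
        print("KeyError")  # no default was given, so a missing key raises KeyError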
    
    # Convert the trie into an Aho-Corasick automaton
    A.make_automaton()
    
    # Find all matches; each item is (end_index, value)
    for item in A.iter("_hershe_"):
        print(item)
    # (2, (0, 'he'))
    # (3, (1, 'her'))
    # (4, (2, 'hers'))
    # (6, (3, 'she'))
    # (6, (0, 'he'))
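The first number in each item above is the index of the last character of the match, not the first ("he" ends at index 2 of "_hershe_"). When the stored value contains the word itself, as in this example, the start index can be recovered as `end_index - len(word) + 1`. The sketch below works on the items copied from the output above, so it runs without the library; `to_spans` is a hypothetical helper name.

```python
def to_spans(matches):
    """Convert (end_index, (index, word)) items to (start, end, word) triples."""
    spans = []
    for end_index, (_, word) in matches:
        start = end_index - len(word) + 1
        spans.append((start, end_index, word))
    return spans


# Items copied from the example output for the text "_hershe_".
text = "_hershe_"
matches = [(2, (0, "he")), (3, (1, "her")), (4, (2, "hers")),
           (6, (3, "she")), (6, (0, "he"))]

for start, end, word in to_spans(matches):
    # Slicing up to end + 1 recovers the matched word from the text.
    assert text[start:end + 1] == word
    print(start, end, word)
```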

    import ahocorasick
    A = ahocorasick.Automaton()

    # Add words
    for index, word in enumerate("cat catastropha rat rate bat".split()):
        A.add_word(word, (index, word))

    # Prefix lookup
    list(A.keys("cat"))
    ## ['cat', 'catastropha']

    list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH))
    ## ['bat', 'cat', 'rat']

    list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
    ## ['bat', 'cat', 'rat', 'rate']

    list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
    ## ['rate']

    ## How keys() works
    ## keys([prefix, [wildcard, [how]]]) => yield strings
    ## If prefix (a string) is given, then only words sharing this prefix are yielded.
    ## If wildcard (single character) is given, then prefix is treated as a simple pattern with selected wildcard.
    ## Optional parameter how controls which strings are matched:
    ## MATCH_EXACT_LENGTH [default]: Only strings with the same length as a pattern's length are yielded. In other words, literally match a pattern.
    ## MATCH_AT_LEAST_PREFIX: Strings that have length greater or equal to a pattern's length are yielded.
    ## MATCH_AT_MOST_PREFIX: Strings that have length less or equal to a pattern's length are yielded.
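The three length rules are easier to compare side by side in plain Python. The sketch below applies the descriptions literally to a list of words; it is only an illustration of the rules as quoted, not how the library does it (pyahocorasick walks its trie instead of scanning a list). The function `match_keys`, the three string constants, and the word list are all hypothetical stand-ins.

```python
# Hypothetical stand-ins for the three ahocorasick.MATCH_* constants.
EXACT, AT_LEAST, AT_MOST = "exact", "at_least", "at_most"


def match_keys(words, pattern, wildcard, how=EXACT):
    """Return the sorted words that match pattern under the given length rule."""
    result = []
    for word in words:
        if how == EXACT and len(word) != len(pattern):
            continue
        if how == AT_LEAST and len(word) < len(pattern):
            continue
        if how == AT_MOST and len(word) > len(pattern):
            continue
        # Compare only the positions that both strings have.
        if all(p == wildcard or p == c for p, c in zip(pattern, word)):
            result.append(word)
    return sorted(result)


words = ["bat", "cat", "rat", "rate"]  # hypothetical word list
print(match_keys(words, "?at", "?", EXACT))      # ['bat', 'cat', 'rat']
print(match_keys(words, "?at?", "?", AT_MOST))   # ['bat', 'cat', 'rat', 'rate']
print(match_keys(words, "?at?", "?", AT_LEAST))  # ['rate']
```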
  • Original article: https://www.cnblogs.com/smartisn/p/14033551.html