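Introduction:
pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.
A trie is a dictionary indexed by strings; looking up an item takes time proportional to the length of the key string.
An Aho-Corasick (AC) automaton finds every occurrence of all strings from a given set in a single pass over the input text. It is essentially KMP implemented on top of a trie, which makes multi-pattern matching possible.
(Recommended reading: http://blog.csdn.net/niushuai666/article/details/7002823; http://www.cnblogs.com/kuangbin/p/3164106.html)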
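To make the "KMP on a trie" idea concrete, here is a minimal pure-Python sketch. It is not how pyahocorasick is implemented (that is a C extension); the function names build_automaton and find_all are made up for this illustration. The failure links play the role of KMP's failure function: on a mismatch, the search falls back to the longest suffix that is still a path in the trie instead of restarting.

```python
from collections import deque


def build_automaton(words):
    """Build a trie plus failure links. Hypothetical helper, not the pyahocorasick API."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state of the longest proper suffix present in the trie
    output = [[]]  # output[state] -> words that end when this state is reached

    # Phase 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Phase 2: breadth-first pass that computes the failure links.
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the suffix state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (end_index, word) for every match, scanning the text once."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back, as KMP does on a mismatch
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i, word))
    return matches


automaton = build_automaton("he her hers she".split())
print(find_all("_hershe_", automaton))
```

Each character of the text is consumed once, and the fallback loop only undoes progress made earlier, so the scan stays linear in the length of the text plus the number of matches reported.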
Author:
Wojciech Muła, wojciech_mula@poczta.onet.pl
Official page:
https://pypi.python.org/pypi/pyahocorasick/
Installing with pip may fail with an error such as:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"
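No ready-to-use whl file was available, so pip has to compile the C extension, and that requires installing the corresponding build tools. Installing them with the default options and then restarting is enough.
Here is a Baidu Netdisk link:
Link: https://pan.baidu.com/s/13Fv3dPYQNq6u_ErO9nqCAw
Extraction code: nhll
The following two examples show in detail how to use the module: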
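After installing the build tools and running pip again, a quick way to confirm that the module is importable is a check like the one below. This is only a sketch; the helper name is_installed is made up here, and the import name ahocorasick is the one used in the examples in this post.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


print("ahocorasick available:", is_installed("ahocorasick"))
```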
import ahocorasick

A = ahocorasick.Automaton()

# Add words to the trie
for index, word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))

# Usage: add_word(word, [value]) => bool
# How value is handled depends on the store argument of the Automaton constructor:
# 1. STORE_LENGTH: value must not be passed; len(word) is stored
# 2. STORE_INTS: value is optional but must be an int; the default is len(automaton)
# 3. STORE_ANY: value is required and may be of any type

# Test whether a word is in the trie
print("he" in A)  # True

A.get("he")  # (0, 'he')
A.get("cat", "<not exists>")  # '<not exists>'
try:
    A.get("dog")  # no default given, so a missing key raises KeyError
except KeyError:
    print("dog is not in the trie")

# Convert the trie into an Aho-Corasick automaton
A.make_automaton()

# Find all matches; each item is (end_index, value)
for item in A.iter("_hershe_"):
    print(item)
# (2, (0, 'he'))
# (3, (1, 'her'))
# (4, (2, 'hers'))
# (6, (3, 'she'))
# (6, (0, 'he'))
import ahocorasick

A = ahocorasick.Automaton()

# Add words
for index, word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word, (index, word))

# prefix
list(A.keys("cat"))
## ["cat", "catastropha"]

list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH))
## ['bat', 'cat', 'rat']

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
## ["bat", "cat", "rat", "rate"]

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
## ['rate']

## Notes on keys
## keys([prefix, [wildcard, [how]]]) => yield strings
## If prefix (a string) is given, then only words sharing this prefix are yielded.
## If wildcard (a single character) is given, then prefix is treated as a simple pattern with the selected wildcard. The optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]: Only strings with the same length as the pattern are yielded. In other words, the pattern is matched literally.
## MATCH_AT_LEAST_PREFIX: Strings whose length is greater than or equal to the pattern's length are yielded.
## MATCH_AT_MOST_PREFIX: Strings whose length is less than or equal to the pattern's length are yielded.