起源
最大匹配法是最简单的分词方法,他完全使用词典进行分词,如果词典好,则分词的效果好
正向最大匹配法
正向,即从左往右进行匹配
#Maximum Match Method 最大匹配法 class MM: def __init__(self): self.window_size = 4 def cut(self,text): result = [] index = 0 text_lenght = len(text) #研究生命的起源 dic = ["研究","研究生","生命"] while text_lenght >index: #range(3,0,-1) for size in range(min(self.window_size+index,text_lenght),index,-1): piece = text[index:size] print("size:", size,piece) if piece in dic: index = size-1 break index = index+1 #第一次结束index = 3 result.append(piece) print(result) return result
逆向最大匹配法
逆向即从右往左进行匹配
#RMM:Reverse Maxmium Match method 逆向最大匹配 class RMM: def __init__(self): self.window_size = 3 def cut(self,text): result = [] index = len(text) #研究生命的起源 dic = ["研究","研究生","生命"] while index>0: for size in range(max((index-self.window_size),0),index): piece = text[size:index] print("size:", size,piece) if piece in dic: index = size+1 print("index:", index) break print("index:",index) index = index - 1 result.append(piece) result.reverse() print(result) return result
双向最大匹配法
同时根据正向和逆向的结果,进行匹配
class MCut(): def __init__(self): self.mm = MM() self.rmm = RMM() def cut(self,sentence): """ 1. 词语数量不相同,选择分词后词语数量少的 2. 如果词语数量相同,返回单字数量少的 """ mm_ret = self.mm.cut(sentence) rmm_ret = self.rmm.cut(sentence) if len(mm_ret)==len(rmm_ret): mm_ret_signle_len = len([i for i in mm_ret if len(i)==1]) rmm_ret_signle_len = len([i for i in rmm_ret if len(i)==1]) return mm_ret if rmm_ret_signle_len>mm_ret_signle_len else rmm_ret else: return mm_ret if len(mm_ret)<len(rmm_ret) else rmm_ret