NLP Study Notes: Implementing a Simple Spelling Corrector

I. Background

1. Edit distance between words

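Edit distance is the number of edits needed to turn one word into another. An edit is one of three operations: insertion, deletion, or replacement. For example, the strings at edit distance 1 from apple include aapple, bapple, pple, aple, and bpple. To get the strings at edit distance 2, call the same function again on each string at edit distance 1. The Python implementation is shown below.

Purpose: generate all words at edit distance 1.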

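def generate_candidates(word):
    """
    :param word: the input word
    :return: all candidate words that are in the vocabulary
    """
    # Generate the strings at edit distance 1: 1) insert  2) delete  3) replace

    # Letters that may be inserted or substituted
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # Split the word at every position so we know where a letter can be inserted
    splits =  [(word[:i],word[i:]) for i in range(len(word)+1)]
    # Insert: put a letter between the left and right parts
    inserts = [L+c+R for L,R in splits   for c in letters]
    # Delete: drop the first letter of the right part
    deletes = [L+R[1:] for L,R in splits if R]
    # Replace: swap the first letter of the right part
    replaces = [L+c+R[1:] for L,R in splits if R for c in letters]
    # All generated strings, de-duplicated
    candidates = set(inserts+deletes+replaces)
    # Drop strings that are not in the vocabulary. vocab is loaded from vocab.txt (see the full code in Part II)
    result = [w for w in candidates if w in vocab]
    return result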
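
The "call the function again" step for edit distance 2 can be sketched as follows. This is a minimal illustration, not the code used later in this post: the helper names `edits1` and `edits2` and the tiny `toy_vocab` are assumptions made for the example, and the vocabulary filter is applied only at the end so that intermediate non-words can still lead to valid words.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return the set of all strings one insert, delete, or replace away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    return set(inserts + deletes + replaces)

def edits2(word):
    """Return every string reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

# Hypothetical mini vocabulary, used only for this example
toy_vocab = {"apple", "ape", "maple"}

# Filter against the vocabulary only after both rounds of edits
candidates = sorted(w for w in edits2("appl") if w in toy_vocab)
print(candidates)
```

Here "apple" is one edit away from "appl" and "ape" is two edits away, while "maple" needs three edits and is therefore excluded. The candidate set grows very quickly with each extra round, which is one reason the rest of this post stays with edit distance 1.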

2. Language model

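The language model I have learned so far is the fairly simple n-gram model, which draws on some probability theory. Take the sentence "I like learning NLP" as an example. We want the probability that like appears right after I. Choosing n = 2 gives the bigram model. By Bayes' rule, P(like | I) = P(I | like) * P(like) / P(I), or in general P(W_i | W_{i-1}) = P(W_{i-1} | W_i) * P(W_i) / P(W_{i-1}). In practice this is estimated from corpus counts as P(W_i | W_{i-1}) = count(W_{i-1} W_i) / count(W_{i-1}), which is why the code in this section collects both term counts and bigram counts.

The resulting value is the probability of seeing that word combination. Likewise, for n = 3 we compute P(W_i | W_{i-2} W_{i-1}). Computing such conditional probabilities is standard probability theory.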
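
To make the count-based estimate concrete, here is a small sketch on a made-up three-sentence corpus. The corpus, the variable names, and the helper `bigram_prob` are all assumptions for illustration; the estimate is plain maximum likelihood with no smoothing.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    ["I", "like", "learning", "NLP"],
    ["I", "like", "Python"],
    ["I", "study", "NLP"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_corpus:
    tokens = ["<s>"] + sentence          # mark the sentence start
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Unsmoothed estimate of P(word | prev) = count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# "I" occurs 3 times and is followed by "like" in 2 of them
print(bigram_prob("I", "like"))
# "I NLP" never occurs, so the unsmoothed estimate is 0
print(bigram_prob("I", "NLP"))
```

The zero for the unseen pair is exactly the problem that smoothing is meant to address.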

Below is the code that builds the language model with the nltk library.

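# Import the Reuters corpus from nltk (it must be downloaded first, e.g. with nltk.download('reuters'))
from nltk.corpus import reuters
# Load the corpus as a list of tokenized sentences
categories = reuters.categories()
corpus = reuters.sents(categories=categories)

# Build the language model using bigrams
term_count = {}
bigram_count = {}
for doc in corpus:
    # Prepend a sentence-start marker
    doc=['<s>']+doc
    for i in range(0,len(doc)-1):
        # Look at the pair of tokens at positions [i, i+1]
        term = doc[i]
        bigram = doc[i:i+2]

        # Count the first token of the pair
        if term in term_count:
            term_count[term]+=1
        else:
            term_count[term]=1
        # Count the pair, keyed by the string "w1 w2"
        bigram = ' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram]+=1
        else:
            bigram_count[bigram]=1
print(term_count)
print(bigram_count )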

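3. Smoothing

The smoothing methods I have learned so far are add-one smoothing, add-n smoothing, and Good-Turing smoothing. Here is a brief introduction to add-one smoothing. When we train a language model, some words or word pairs never appear in the training data, so their probability is 0. That makes parts of the computation come out as 0 and hurts the final result. To avoid this, add 1 to the numerator and add V, the vocabulary size (the number of distinct words), to the denominator when computing the probability: P(W_i | W_{i-1}) = (count(W_{i-1} W_i) + 1) / (count(W_{i-1}) + V). This gives better final results.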
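
A minimal sketch of add-one smoothing on hand-written counts is shown below. The function name `add_one_prob` and the toy counts are assumptions for this example only; the full code in Part II uses its own dictionaries.

```python
def add_one_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """Add-one estimate: (count(prev word) + 1) / (count(prev) + V)."""
    pair_count = bigram_counts.get((prev, word), 0)
    prev_count = unigram_counts.get(prev, 0)
    return (pair_count + 1) / (prev_count + vocab_size)

# Hypothetical counts: "I" was seen 3 times, followed by "like" twice and "study" once
toy_unigrams = {"I": 3, "like": 2, "study": 1, "NLP": 2}
toy_bigrams = {("I", "like"): 2, ("I", "study"): 1}
V = len(toy_unigrams)  # vocabulary size

seen = add_one_prob("I", "like", toy_bigrams, toy_unigrams, V)    # (2 + 1) / (3 + 4)
unseen = add_one_prob("I", "NLP", toy_bigrams, toy_unigrams, V)   # (0 + 1) / (3 + 4)
print(seen, unseen)

# The smoothed probabilities over the whole vocabulary still sum to 1
total = sum(add_one_prob("I", w, toy_bigrams, toy_unigrams, V) for w in toy_unigrams)
print(total)
```

The unseen pair now gets a small non-zero probability instead of 0, at the cost of taking some probability mass away from the pairs that were actually observed.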

II. Full code

The dataset used here is hosted on Baidu Netdisk.

Link: https://pan.baidu.com/s/1pZbTOx9C8Y1xqV9rUWF4Eg
Extraction code: mir6

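import numpy as np
# Load the vocabulary, one word per line
with open("data/vocab.txt") as f:
    vocab = set(line.rstrip() for line in f)
# print(vocab)

# Generate the candidate set
def generate_candidates(word):
    """
    :param word: the input word
    :return: all candidate words that are in the vocabulary
    """
    # Generate the strings at edit distance 1: 1) insert  2) delete  3) replace

    # Letters that may be inserted or substituted
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # Split the word at every position so we know where a letter can be inserted
    splits =  [(word[:i],word[i:]) for i in range(len(word)+1)]
    # Insert: put a letter between the left and right parts
    inserts = [L+c+R for L,R in splits   for c in letters]
    # Delete: drop the first letter of the right part
    deletes = [L+R[1:] for L,R in splits if R]
    # Replace: swap the first letter of the right part
    replaces = [L+c+R[1:] for L,R in splits if R for c in letters]
    # All generated strings, de-duplicated
    candidates = set(inserts+deletes+replaces)
    # Drop strings that are not in the vocabulary
    result = [w for w in candidates if w in vocab]
    return result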

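# Import the Reuters corpus from nltk (it must be downloaded first, e.g. with nltk.download('reuters'))
from nltk.corpus import reuters
# Load the corpus as a list of tokenized sentences
categories = reuters.categories()
corpus = reuters.sents(categories=categories)

# Build the language model using bigrams
term_count = {}
bigram_count = {}
for doc in corpus:
    # Prepend a sentence-start marker
    doc=['<s>']+doc
    for i in range(0,len(doc)-1):
        # Look at the pair of tokens at positions [i, i+1]
        term = doc[i]
        bigram = doc[i:i+2]

        # Count the first token of the pair
        if term in term_count:
            term_count[term]+=1
        else:
            term_count[term]=1
        # Count the pair, keyed by the string "w1 w2"
        bigram = ' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram]+=1
        else:
            bigram_count[bigram]=1
print(term_count)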

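# Probability of the user making a given mistake (channel probability)
channel_pro = {}
# Read the misspelling data; each line looks like "correct: mistake1, mistake2, ..."
for line in open('data/spell-errors.txt'):
    items = line.split(":")
    correct = items[0].strip()
    mistacks = [errorword.strip() for errorword in items[1].strip().split(",")]
    channel_pro[correct] = {}
    # Assume every listed misspelling of a word is equally likely
    for mis in mistacks:
        channel_pro[correct][mis]=1.0/len(mistacks)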

V = len(term_count.keys())

with open("data/testdata.txt",'r') as file:
    for line in file:
        # Fields are tab-separated; the third field is the sentence
        items = line.rstrip().split("\t")
        sentence = items[2].split()
        for idx, word in enumerate(sentence):
            if word not in vocab:
                # Word is not in the vocabulary: generate candidates, then score each one.
                # score = P(correct) * P(mis|correct); in log space: log P(correct) + log P(mis|correct)
                candidates = generate_candidates(word)
                if not candidates:
                    continue  # no in-vocabulary candidate at edit distance 1
                probs = []
                for candi in candidates:
                    prob = 0
                    # Channel probability, with a small default for unseen (correct, mistake) pairs
                    if candi in channel_pro and word in channel_pro[candi]:
                        prob += np.log(channel_pro[candi][word])
                    else:
                        prob += np.log(0.00001)
                    # Language model probability, e.g. in "I like apply computer" score apply given like
                    prev = sentence[idx-1] if idx > 0 else '<s>'
                    bigram = prev + ' ' + candi  # same "w1 w2" key format as bigram_count
                    if bigram in bigram_count:
                        # Add-one smoothing: P(candi | prev) = (count(prev candi) + 1) / (count(prev) + V)
                        prob += np.log((bigram_count[bigram] + 1.0) / (term_count[prev] + V))
                    else:
                        prob += np.log(1.0/V)  # smoothing for unseen bigrams
                    probs.append(prob)
                max_idx = probs.index(max(probs))
                print(word , candidates[max_idx])
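
The scoring step can also be read in isolation. The sketch below is a simplified, self-contained version under stated assumptions: the function name `best_correction`, its parameters, and all toy data are hypothetical, and it applies the add-one formula uniformly instead of falling back to 1/V for unseen bigrams as the listing above does.

```python
import math

def best_correction(word, prev, candidates, channel_prob,
                    bigram_counts, unigram_counts, vocab_size,
                    unseen_channel=1e-5):
    """Pick the candidate maximizing log P(word | cand) + log P(cand | prev)."""
    best, best_score = None, -math.inf
    for cand in candidates:
        # Channel model: how likely is it that `cand` gets misspelled as `word`?
        p_channel = channel_prob.get(cand, {}).get(word, unseen_channel)
        # Language model: add-one smoothed bigram probability of `cand` after `prev`
        p_lm = (bigram_counts.get((prev, cand), 0) + 1) / (
            unigram_counts.get(prev, 0) + vocab_size)
        score = math.log(p_channel) + math.log(p_lm)
        if score > best_score:
            best, best_score = cand, score
    return best  # None when there are no candidates

# Hypothetical toy data for the misspelling "aple" appearing after "an"
toy_channel = {"apple": {"aple": 0.5, "appel": 0.5}}
toy_bigrams = {("an", "apple"): 3}
toy_unigrams = {"an": 5}

choice = best_correction("aple", "an", ["ape", "apple"],
                         toy_channel, toy_bigrams, toy_unigrams, vocab_size=10)
print(choice)
```

Working in log space turns the product of the two probabilities into a sum, which avoids numerical underflow when the individual probabilities are very small.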

Original post: https://www.cnblogs.com/wys-373/p/13456715.html