• How to Write a Spelling Corrector


    Reposted from: http://norvig.com/spell-correct.html

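    The first time I saw this article I was a little intimidated, but after reading it carefully and thinking it through, I found that this feature is actually fairly easy to implement, at least as far as I can tell for now. Thanks to the author for sharing this knowledge.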

    import re
    from collections import Counter

    def words(text): return re.findall(r'\w+', text.lower())

    WORDS = Counter(words(open('big.txt').read()))

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word):
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))

    The function correction(word) returns a likely spelling correction:

    >>> correction('speling')
    'spelling'

    >>> correction('korrectud')
    'corrected'
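    The code above is the author's. Its main function is to read in a dictionary, or any other text file containing a large vocabulary, and use it to detect and correct spelling mistakes automatically.
    It looks sophisticated, but it is not hard to implement. The author uses a number of Python idioms to keep the code short, but the core idea is simple: generate the possible correct words, look each one up in the dictionary, and use a weight (the number of occurrences) to pick the word the writer most likely meant.
    Start with edits1, which implements a fairly simple error correction. Quoting the original article: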
     a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).
    The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
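    The author explains each part in great detail.
    Next, edits2 builds on edits1 to cover two corrections (two mistakes in a single word is probably about the limit for most people). It uses two nested loops: for e1 in edits1(word) yields the results of one edit, and for e2 in edits1(e1) yields the results of a second edit. Note that, unlike edits1, it returns a generator expression rather than a set, so the same string may be produced more than once.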
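To make the behaviour of edits1 and edits2 concrete, here is a minimal sketch that repeats the two functions from the listing so it can run on its own, and then checks a few memberships. The test words are the ones used in the examples above; nothing here depends on big.txt.

```python
def edits1(word):
    "All strings one simple edit away from `word` (same logic as the listing)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two simple edits away from `word`, as a lazy generator."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# 'spelling' is one insertion ('l') away from 'speling'.
print('spelling' in edits1('speling'))      # True

# 'corrected' needs two replacements (k -> c, u -> e), so only edits2 reaches it.
print('corrected' in edits1('korrectud'))   # False
print('corrected' in edits2('korrectud'))   # True

# For a one-letter word: 1 delete + 26 replaces + 51 distinct inserts.
print(len(edits1('a')))                     # 78
```

Most of the strings generated this way are not real words, which is why the results are filtered against the dictionary before a correction is chosen.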
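    The known function picks out the words in a collection that also appear in the dictionary.
    candidates applies known to the original word, to the edits1 results, and to the edits2 results, in that order, and returns the first non-empty set. If none of them contains a known word, it falls back to [word].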
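The filtering step can be tried without big.txt. The sketch below is an illustration only: the tiny corpus is a made-up stand-in for the real word counts, and known is copied from the listing.

```python
from collections import Counter

# Hypothetical toy corpus standing in for big.txt (an assumption for this demo).
WORDS = Counter('the spelling of the word spelling was corrected'.split())

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

# Only strings present in WORDS survive the filter.
print(sorted(known(['speling', 'spelling', 'the'])))   # ['spelling', 'the']

# Nothing known: the result is an empty set, which is falsy in Python.
print(known(['xyzzy']))                                 # set()
```

The fact that an empty set is falsy is what lets candidates chain the three known(...) calls with or.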
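    A side note: for sets, or and and are not the same as | and &.
    A or B returns A if A is non-empty, otherwise B; A | B is the union of the two sets.
    A and B returns B if A is non-empty, otherwise A; A & B is the intersection of the two sets.
    This is why the or chain in candidates picks the first non-empty set rather than merging all of them.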
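The difference between the boolean operators and the set operators is easy to check directly. In this sketch the three sets are hypothetical stand-ins for the results of known([word]), known(edits1(word)) and known(edits2(word)).

```python
# Hypothetical stand-ins for the three known(...) results (assumed values).
exact     = set()                    # the misspelled word itself is not in the dictionary
one_edit  = {'spelling'}
two_edits = {'spewing', 'selling'}

# `or` returns the first truthy operand unchanged; it never merges sets.
chosen = exact or one_edit or two_edits or ['speling']
print(chosen)                        # {'spelling'}

# `|` is union and `&` is intersection.
print(sorted(exact | one_edit | two_edits))   # ['selling', 'spelling', 'spewing']
print(one_edit & two_edits)                   # set()

# If every set is empty, the chain falls through to the final list.
print(set() or set() or set() or ['speling']) # ['speling']
```

Because of this, a known word one edit away always takes priority over any word two edits away, however frequent the latter is.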
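    Finally,
    max(candidates(word), key=P)
    is applied to the resulting set. P is the key function used for the comparison: for each word in the set it computes WORDS[word] / sum(WORDS.values()).
    The result is the word with the highest weight.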
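The selection step can be sketched with a small hand-made Counter. The counts below are invented for illustration and do not come from big.txt; P follows the definition in the listing.

```python
from collections import Counter

# Hypothetical word counts (assumed values, not taken from big.txt).
WORDS = Counter({'the': 500, 'they': 120, 'then': 80})

def P(word, N=sum(WORDS.values())):
    "Probability of `word`; N is computed once, when the function is defined."
    return WORDS[word] / N

candidate_set = {'they', 'then', 'the'}

# max() calls P on every candidate and returns the one with the largest value.
print(max(candidate_set, key=P))   # the

# A Counter returns 0 for missing keys, so unknown words get probability 0.0.
print(P('thw'))                    # 0.0
```

Since N is the same for every candidate, ranking by P gives the same winner as ranking by the raw count WORDS[word].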
  • Original post: https://www.cnblogs.com/silencestorm/p/8557988.html