• Filtering spam with a naive Bayes classifier


    1. Building word vectors from text

    Use Python to split each document into individual words and build it into a word
    vector. This first requires a vocabulary; to keep things simple, we build one
    directly from the set of all words that appear in the given documents.
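
    A minimal self-contained sketch of the idea (toy documents, not the email data
    used later):

        # Toy corpus: two already-tokenized documents.
        docs = [["my", "dog", "has", "flea", "problems"],
                ["stop", "posting", "stupid", "garbage"]]
        # Vocabulary: every distinct word that appears in the corpus.
        vocab = sorted({word for doc in docs for word in doc})
        # Word vector for the first document: one count per vocabulary word.
        vec = [docs[0].count(word) for word in vocab]
        print(vocab)
        print(vec)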

    2. Using the word vectors to compute the probabilities p(x|y)

    When we attempt to classify a document, we multiply a lot of probabilities together to
    get the probability that a document belongs to a given class. This will look something
    like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply
    them together we get 0. To lessen the impact of this, we’ll initialize all of our occurrence
    counts to 1, and we’ll initialize the denominators to 2.
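
    A tiny numeric sketch of this smoothing (made-up counts for a three-word
    vocabulary, not the email data):

        import numpy as np
        raw_counts = np.array([4, 2, 0])  # the third word never appears in class 1
        # Unsmoothed: the zero estimate would wipe out any product it enters.
        print(raw_counts / raw_counts.sum())                # approx. [0.667, 0.333, 0.000]
        # Smoothed: counts start at 1, denominator starts at 2 -- no zeros left.
        print((raw_counts + 1) / (raw_counts.sum() + 2.0))  # [0.625, 0.375, 0.125]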

    Another problem is underflow: doing too many multiplications of small numbers.
    When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many
    of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to
    multiply many small numbers in Python. Eventually it rounds off to 0.) One solution
    to this is to take the natural logarithm of this product. If you recall from algebra,
    ln(a*b) = ln(a)+ln(b). Doing this allows us to avoid the underflow or round-off
    error problem. Do we lose anything by using the natural log of a number rather than
    the number itself? The answer is no: the natural logarithm is monotonically
    increasing, so whenever one probability is larger than another its logarithm is
    larger too, and the comparison between classes comes out the same.
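
    The underflow and the logarithm fix are easy to demonstrate (a standalone sketch):

        import numpy as np
        probs = np.full(1000, 1e-5)    # 1,000 small conditional probabilities
        print(np.prod(probs))          # 0.0 -- the product underflows
        print(np.log(probs).sum())     # about -11512.9, a perfectly usable number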

    3. Using the bag-of-words model

    Up until this point we’ve treated the presence or absence of a word as a feature. This
    could be described as a set-of-words model. If a word appears more than once in a
    document, that might convey some sort of information about the document over just
    the word simply occurring in the document or not. Counting how many times each word
    appears, rather than only whether it appears, is known as a bag-of-words model.
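
    The difference is easy to see on a single document (a toy sketch; in the listing
    below, the commented-out line in setOfWords2Vec toggles between the two models):

        doc = ["buy", "cheap", "cheap", "pills"]
        vocab = ["buy", "cheap", "now", "pills"]
        set_vec = [1 if w in doc else 0 for w in vocab]  # set-of-words: [1, 1, 0, 1]
        bag_vec = [doc.count(w) for w in vocab]          # bag-of-words: [1, 2, 0, 1]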

    4. Code

        # -*- coding: utf-8 -*-
        """
        Created on Tue Mar 28 17:22:48 2017

        @author: MyHome
        """
        '''Split text into individual words with Python, build word vectors,
        and use naive Bayes to classify documents from a probabilistic angle.'''
        import numpy as np
        import re
        from random import shuffle


        def createVocabList(Dataset):
            '''Build the vocabulary: every distinct word seen in the corpus.'''
            vocabSet = set([])
            for document in Dataset:
                vocabSet = vocabSet | set(document)
            return list(vocabSet)


        def setOfWords2Vec(vocabList, inputSet):
            '''Turn a document into a count vector over the vocabulary.'''
            returnVec = [0] * len(vocabList)
            for word in inputSet:
                if word in vocabList:
                    # returnVec[vocabList.index(word)] = 1  # set-of-words model
                    returnVec[vocabList.index(word)] += 1   # bag-of-words model
                else:
                    print("the word: %s is not in VocabList" % word)
            return returnVec


        def trainNB(trainMatrix, trainCategory):
            '''Estimate per-word probabilities for each class and the class-1 prior.'''
            numTrainDocs = len(trainMatrix)
            numWords = len(trainMatrix[0])
            p = sum(trainCategory) / float(numTrainDocs)  # probability of class 1
            # Laplace smoothing: start the counts at 1 and the denominators at 2
            # so that no conditional probability is ever exactly zero.
            p0Num = np.ones(numWords)
            p1Num = np.ones(numWords)
            p0Denom = 2.0
            p1Denom = 2.0
            for i in range(numTrainDocs):
                if trainCategory[i] == 1:
                    p1Num += trainMatrix[i]
                    p1Denom += sum(trainMatrix[i])
                else:
                    p0Num += trainMatrix[i]
                    p0Denom += sum(trainMatrix[i])
            # Log probabilities, to avoid underflow when they are summed later.
            p1_vec = np.log(p1Num / p1Denom)
            p0_vec = np.log(p0Num / p0Denom)

            return p0_vec, p1_vec, p


        def classifyNB(Input, p0_vec, p1_vec, p):
            '''Compare the two log posteriors and return the more probable class.'''
            p1 = sum(Input * p1_vec) + np.log(p)
            p0 = sum(Input * p0_vec) + np.log(1.0 - p)
            if p1 > p0:
                return 1
            else:
                return 0


        def textParse(bigString):
            '''Tokenize: split on non-word characters, keep tokens longer than
            two characters, and lowercase them.'''
            listOfTokens = re.split(r"\W+", bigString)
            return [tok.lower() for tok in listOfTokens if len(tok) > 2]


        def spamTest():
            '''Spam classification: load 25 spam and 25 ham emails, shuffle,
            train on 40 and test on the remaining 10, then report the error rate.'''
            docList = []
            classList = []
            fullText = []  # collected but unused here (kept from the original listing)

            for i in range(1, 26):
                wordList = textParse(open('email/spam/%d.txt' % i).read())
                docList.append(wordList)
                fullText.extend(wordList)
                classList.append(1)
                wordList = textParse(open("email/ham/%d.txt" % i).read())
                docList.append(wordList)
                fullText.extend(wordList)
                classList.append(0)

            vocabList = createVocabList(docList)
            DataSet = list(zip(docList, classList))
            shuffle(DataSet)  # shuffles in place; shuffle() itself returns None
            Data, Y = zip(*DataSet)
            trainMat = []
            trainClass = []
            testData = Data[40:]
            test_label = Y[40:]
            for index in range(40):
                trainMat.append(setOfWords2Vec(vocabList, Data[index]))
                trainClass.append(Y[index])

            p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
            errorCount = 0
            for index in range(len(testData)):
                wordVector = setOfWords2Vec(vocabList, testData[index])
                if classifyNB(np.array(wordVector), p0, p1, p) != test_label[index]:
                    errorCount += 1
            print("the error rate is :", float(errorCount) / len(testData))


        if __name__ == "__main__":
            spamTest()

    5. Summary

      Using probabilities can sometimes be more effective than using hard rules for classification.
    Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities
    from known values.
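      In symbols, Bayes’ rule says p(class|words) = p(words|class) * p(class) / p(words).
    Since p(words) is the same for every class, the classifier can drop it and simply
    compare the numerators, which is exactly what classifyNB does in log form.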
      You can reduce the need for a lot of data by assuming conditional independence
    among the features in your data. The assumption we make is that the probability of
    one word doesn’t depend on any other words in the document. We know this assumption
    is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect
    assumptions, naïve Bayes is effective at classification.
      There are a number of practical considerations when implementing naïve Bayes in
    a modern programming language. Underflow is one problem that can be addressed
    by using the logarithm of probabilities in your calculations. The bag-of-words model is
    an improvement on the set-of-words model when approaching document classification.
    There are a number of other improvements, such as removing stop words, and
    you can spend a long time optimizing a tokenizer.
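
    For example, stop-word removal takes only a couple of lines on top of the tokenizer
    above (a sketch; textParseNoStops is a hypothetical helper, and the stop list here
    is a tiny illustrative subset of the much longer lists real systems use):

        import re

        STOP_WORDS = {"the", "and", "for", "you", "with", "this", "that"}  # illustrative subset

        def textParseNoStops(bigString):
            '''Like textParse above, but also drops common stop words.'''
            tokens = re.split(r"\W+", bigString)
            return [t.lower() for t in tokens
                    if len(t) > 2 and t.lower() not in STOP_WORDS]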

  • Original article: https://www.cnblogs.com/lpworkstudyspace1992/p/6636709.html