• Naive Bayes in practice: spam classification


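    1. Data preparation: collect the data and read it in.

    2. Data preprocessing: clean and normalize the text.

    3. Training and test sets: split the labeled data into a training set and a test set by a fixed ratio.

    4. Feature extraction: parse each text into a word vector.

    5. Train the model: build the model and fit it on the training data. That is, from the training samples, compute the probability of each term given the class, P(xi|y), which yields one vector of word probabilities per class.

    6. Test the model: use the test set to evaluate how accurately the model predicts.

    Confusion matrix

    Accuracy, precision, recall, F-score

    7. Predict the class of a new email.

    8. Consider how to classify Chinese text (one of the final assignments).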
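Step 5 can be sketched without any library. The snippet below is a minimal illustration rather than the code used in this post: it estimates P(xi|y) from word counts with Laplace (add-one) smoothing and classifies by comparing log posteriors. The function names `train_nb` and `predict_nb`, the `alpha` parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(x_i | y) from tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior, log_likelihood = {}, {}
    for label, n_docs in class_counts.items():
        log_prior[label] = math.log(n_docs / len(labels))
        total = sum(word_counts[label].values())
        # Laplace smoothing: add alpha to every count so unseen words get P > 0
        denom = total + alpha * len(vocab)
        log_likelihood[label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][w] for w in doc if w in log_likelihood[label]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus, already tokenized
docs = [
    ["free", "prize", "claim"],
    ["free", "cash", "now"],
    ["lunch", "meeting", "tomorrow"],
    ["see", "you", "tomorrow"],
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb(["free", "cash", "prize"], log_prior, log_likelihood))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which is presumably why library implementations also store log probabilities.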

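    Key points:

    Understand the Naive Bayes algorithm

    Understand the modeling workflow of a machine learning algorithm

    Understand the common text-processing pipeline

    Understand model evaluation methods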
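To make the evaluation metrics concrete, here is a small sketch that derives accuracy, precision, recall and F1 from the four cells of a binary confusion matrix. The helper name `binary_metrics`, the choice of "spam" as the positive class, and the example labels are assumptions made for illustration only.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion-matrix cells and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 spam and 5 ham messages
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For spam filtering, precision on the spam class is often the number to watch, since a false positive means a legitimate message gets filtered out.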

    import csv
    from sklearn.model_selection import train_test_split
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.naive_bayes import MultinomialNB


    # Preprocessing
    def preprocessing(text):
        # Tokenize: split into sentences, then into words
        tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        # Lowercase first so capitalized stop words such as "The" are also removed
        tokens = [token.lower() for token in tokens]
        stops = set(stopwords.words('english'))  # remove stop words
        tokens = [token for token in tokens if token not in stops]

        # Drop very short tokens (fewer than 3 characters)
        tokens = [token for token in tokens if len(token) >= 3]
        lmtzr = WordNetLemmatizer()  # lemmatize: reduce each word to its base form
        tokens = [lmtzr.lemmatize(token) for token in tokens]
        preprocessed_text = ' '.join(tokens)
        return preprocessed_text
    
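    def read_data():
        '''Read the file, preprocess each message, and split into train/test sets.'''
        # NLTK resources used by preprocessing()
        nltk.download('punkt')
        nltk.download('wordnet')
        nltk.download('stopwords')
        sms_data = []
        sms_label = []
        # Path as given in the original post; point it at your local copy of the dataset
        with open(r'G:大三数据挖掘SMSSSMSSpamCollectionjs.txt', 'r', encoding='utf-8') as sms:
            csv_reader = csv.reader(sms, delimiter='\t')
            for line in csv_reader:
                sms_label.append(line[0])  # first column: class label
                sms_data.append(preprocessing(line[1]))  # second column: message text
        # 70/30 split, stratified so both sets keep the same class ratio
        x_train, x_test, y_train, y_test = train_test_split(
            sms_data, sms_label, test_size=0.3, random_state=0, stratify=sms_label)
        print(len(sms_data), len(x_train), len(x_test))
        return sms_data, sms_label, x_train, x_test, y_train, y_test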
    
    
    # Vectorization
    def xiangliang(x_train, x_test):
        from sklearn.feature_extraction.text import TfidfVectorizer
        vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                                     strip_accents='unicode')  # norm='l2' is the default
        x_train = vectorizer.fit_transform(x_train)  # learn vocabulary and IDF weights on training data only
        x_test = vectorizer.transform(x_test)  # reuse the fitted vocabulary
        return x_train, x_test, vectorizer


    # Naive Bayes classifier
    def beiNB(x_train, y_train, x_test):
        clf = MultinomialNB().fit(x_train, y_train)
        y_nb_pred = clf.predict(x_test)
        return y_nb_pred, clf
    
    
    def result(vectorizer, clf, y_test, y_nb_pred):
        # Classification results
        from sklearn.metrics import confusion_matrix
        from sklearn.metrics import classification_report
        print(y_nb_pred.shape, y_nb_pred)
        print('nb_confusion_matrix:')
        cm = confusion_matrix(y_test, y_nb_pred)
        print(cm)
        cr = classification_report(y_test, y_nb_pred)
        print(cr)

        # get_feature_names() and MultinomialNB.coef_ were removed in recent scikit-learn;
        # feature_log_prob_[1] holds the values coef_[0] used to expose for two classes
        feature_names = vectorizer.get_feature_names_out()
        coefs = clf.feature_log_prob_[1]  # log P(x_i | y) for the second label in clf.classes_
        coefs_with_fns = sorted(zip(coefs, feature_names))

        # Print the n lowest- and n highest-weighted features side by side
        n = 10
        top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
        for (coef_1, fn_1), (coef_2, fn_2) in top:
            print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))


    if __name__ == '__main__':
        sms_data, sms_label, x_train, x_test, y_train, y_test = read_data()
        X_train, X_test, vectorizer = xiangliang(x_train, x_test)
        y_nb_pred, clf = beiNB(X_train, y_train, X_test)
        result(vectorizer, clf, y_test, y_nb_pred)
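The script above stops at evaluation and does not show step 7, predicting the class of a new message. With the objects it returns, that would presumably be `clf.predict(vectorizer.transform([preprocessing(text)]))`. The self-contained sketch below shows the same idea on a hypothetical six-message toy corpus, using a scikit-learn `Pipeline` so the vectorizer and classifier are always applied together. The training messages and labels are invented for illustration, and the NLTK preprocessing is left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (not the dataset used in the post)
train_texts = [
    "win a free prize now",
    "free entry claim your prize",
    "urgent claim free cash",
    "are we still meeting for lunch",
    "see you at home tonight",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the TF-IDF vocabulary and the classifier in one step,
# and applies the same fitted vocabulary to anything passed to predict()
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_messages = ["claim your free prize", "see you at lunch"]
predictions = model.predict(new_messages)
for text, label in zip(new_messages, predictions):
    print(f"{label}\t{text}")
```

The important detail is that a new message is only transformed with the already fitted vectorizer, never refitted, so its feature columns line up with the ones the classifier was trained on.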
  • Original post: https://www.cnblogs.com/la-vie/p/10075095.html