• The Road to Machine Learning: Predicting News Categories in Python with the Naive Bayes Classifier MultinomialNB


    Learning the naive Bayes classification API with Python 3.

    This involves extracting feature vectors from strings.

    The source code can be downloaded from my GitHub repository: https://github.com/linyi0604/MachineLearning
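
    To make the step from strings to feature vectors concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is a simplified illustration, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` and the two toy sentences are hypothetical, and the tokenization rule (lowercase, keep tokens of two or more word characters, sorted vocabulary) is assumed here in order to mirror the default behaviour of `CountVectorizer`.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an assumption chosen to mirror CountVectorizer's defaults).
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(docs):
    # Vocabulary: every distinct token in sorted order, mapped to a column index.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        # One row per document: how often each vocabulary word occurs in it.
        row = [0] * len(vocab)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


docs = ["The Pens beat the Devils", "The Devils lost"]  # hypothetical toy corpus
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['beat', 'devils', 'lost', 'pens', 'the']
print(matrix)  # [[1, 1, 0, 1, 2], [0, 1, 1, 0, 1]]
```

    Each document becomes a row of word counts over a shared vocabulary. On the real data set the vocabulary is very large and most counts are zero, which is why `CountVectorizer` returns a sparse matrix rather than nested lists.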

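    from sklearn.datasets import fetch_20newsgroups
    # train_test_split lives in sklearn.model_selection;
    # the old sklearn.cross_validation module was removed in scikit-learn 0.20
    from sklearn.model_selection import train_test_split
    # Module for converting text into feature vectors
    from sklearn.feature_extraction.text import CountVectorizer
    # Naive Bayes model
    from sklearn.naive_bayes import MultinomialNB
    # Model evaluation module
    from sklearn.metrics import classification_report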
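
    '''
    Naive Bayes models are widely used for large-scale Internet text classification.
    Because the features are assumed to be conditionally independent given the class, the number of parameters to estimate drops from exponential to roughly linear in the number of features, which saves memory and computation time.
    However, the model cannot account for relationships between features, so it performs poorly on classification tasks whose features are strongly correlated.
    '''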

    '''
    1 Load the data
    '''
    # This API downloads the data over the network on demand and caches it locally
    news = fetch_20newsgroups(subset="all")
    # Inspect the size of the data set and a sample document
    # print(len(news.data))
    # print(news.data[0])
    '''
    18846

    From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
    Subject: Pens fans reactions
    Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
    Lines: 12
    NNTP-Posting-Host: po4.andrew.cmu.edu

    I am sure some bashers of Pens fans are pretty confused about the lack
    of any kind of posts about the recent Pens massacre of the Devils. Actually,
    I am  bit puzzled too and a bit relieved. However, I am going to put an end
    to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
    are killing those Devils worse than I thought. Jagr just showed you why
    he is much better than his regular season stats. He is also a lot
    fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
    fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
    regular season game.          PENS RULE!!!
    '''

    '''
    2 Split the data
    '''
    # Hold out 25% of the documents as the test set; random_state fixes the shuffle
    x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                        news.target,
                                                        test_size=0.25,
                                                        random_state=33)
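
    '''
    3 Predict news categories with the naive Bayes classifier
    '''
    # Convert the text into feature vectors
    # (fit the vocabulary on the training set only, then reuse it for the test set)
    vec = CountVectorizer()
    x_train = vec.fit_transform(x_train)
    x_test = vec.transform(x_test)
    # Initialize the naive Bayes model
    mnb = MultinomialNB()
    # Train on the training set to estimate the parameters
    mnb.fit(x_train, y_train)
    # Predict on the test set and keep the predictions
    y_predict = mnb.predict(x_test)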

    '''
    4 Evaluate the model
    '''
    print("Accuracy:", mnb.score(x_test, y_test))
    print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
    '''
    Accuracy: 0.8397707979626485
    Other metrics:
                              precision    recall  f1-score   support

                 alt.atheism       0.86      0.86      0.86       201
               comp.graphics       0.59      0.86      0.70       250
     comp.os.ms-windows.misc       0.89      0.10      0.17       248
    comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
       comp.sys.mac.hardware       0.93      0.78      0.85       242
              comp.windows.x       0.82      0.84      0.83       263
                misc.forsale       0.91      0.70      0.79       257
                   rec.autos       0.89      0.89      0.89       238
             rec.motorcycles       0.98      0.92      0.95       276
          rec.sport.baseball       0.98      0.91      0.95       251
            rec.sport.hockey       0.93      0.99      0.96       233
                   sci.crypt       0.86      0.98      0.91       238
             sci.electronics       0.85      0.88      0.86       249
                     sci.med       0.92      0.94      0.93       245
                   sci.space       0.89      0.96      0.92       221
      soc.religion.christian       0.78      0.96      0.86       232
          talk.politics.guns       0.88      0.96      0.92       251
       talk.politics.mideast       0.90      0.98      0.94       231
          talk.politics.misc       0.79      0.89      0.84       188
          talk.religion.misc       0.93      0.44      0.60       158

                 avg / total       0.86      0.84      0.82      4712
    '''
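
    What a multinomial naive Bayes model estimates during fitting can be sketched in pure Python on a toy corpus. This is an illustrative re-implementation under stated assumptions, not scikit-learn's code: the function names `fit_nb` and `predict_nb`, the two labels, and the four training sentences are hypothetical, words are split on whitespace only, and add-one (Laplace) smoothing is assumed, which corresponds to the default `alpha=1.0` of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def fit_nb(docs, labels):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class) plus the sum of log P(word | class), add-one smoothed.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    # The predicted class is the one with the highest log score.
    return max(scores, key=scores.get)


# Hypothetical toy training data
train_docs = ["goal scored in playoff game", "team wins hockey game",
              "rocket launch into orbit", "satellite orbit mission"]
train_labels = ["hockey", "hockey", "space", "space"]
model = fit_nb(train_docs, train_labels)
print(predict_nb(model, "hockey playoff tonight"))  # hockey
print(predict_nb(model, "orbit mission update"))    # space
```

    Because each word contributes its own independent term to the score, the model only needs one count per (class, word) pair. That is the source of the roughly linear parameter count, and also the reason correlations between words are invisible to it.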
  • Original article: https://www.cnblogs.com/Lin-Yi/p/8970522.html