Classification of text documents: using a MLComp dataset

Note: the original code is at http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

The output of the run is:

Loading 20 newsgroups training set... 
20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
13180 documents
20 categories
Extracting features from the dataset using a sparse vectorizer
done in 139.231000s
n_samples: 13180, n_features: 130274
Loading 20 newsgroups test set... 
done in 0.000000s
Predicting the labels of the test set...
5648 documents
20 categories
Extracting features from the dataset using the same vectorizer
done in 7.082000s
n_samples: 5648, n_features: 130274
Testbenching a linear classifier...
parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
done in 22.012000s
Percentage of non zeros coef: 30.074190
Predicting the outcomes of the testing set
done in 0.172000s
Classification report on test set for classifier:
SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

                          precision    recall  f1-score   support

             alt.atheism       0.95      0.93      0.94       245
           comp.graphics       0.85      0.91      0.88       298
 comp.os.ms-windows.misc       0.88      0.88      0.88       292
comp.sys.ibm.pc.hardware       0.82      0.80      0.81       301
   comp.sys.mac.hardware       0.90      0.92      0.91       256
          comp.windows.x       0.92      0.88      0.90       297
            misc.forsale       0.87      0.89      0.88       290
               rec.autos       0.93      0.94      0.94       324
         rec.motorcycles       0.97      0.97      0.97       294
      rec.sport.baseball       0.97      0.97      0.97       315
        rec.sport.hockey       0.98      0.99      0.99       302
               sci.crypt       0.97      0.96      0.96       297
         sci.electronics       0.87      0.89      0.88       313
                 sci.med       0.97      0.97      0.97       277
               sci.space       0.97      0.97      0.97       305
  soc.religion.christian       0.95      0.96      0.95       293
      talk.politics.guns       0.94      0.94      0.94       246
   talk.politics.mideast       0.97      0.99      0.98       296
      talk.politics.misc       0.96      0.92      0.94       236
      talk.religion.misc       0.89      0.84      0.86       171

             avg / total       0.93      0.93      0.93      5648

Confusion matrix:
[[227   0   0   0   0   0   0   0   0   0   0   1   2   1   1   1   0   1
    0  11]
 [  0 271   3   8   2   5   2   0   0   1   0   0   3   1   1   0   0   1
    0   0]
 [  0   7 256  14   5   6   1   0   0   0   0   0   2   0   1   0   0   0
    0   0]
 [  1   8  12 240   9   3  12   2   0   0   0   1  12   0   0   1   0   0
    0   0]
 [  0   1   3   6 235   2   4   0   0   0   0   1   3   0   1   0   0   0
    0   0]
 [  0  17   9   4   0 260   0   0   1   1   0   0   2   0   2   0   1   0
    0   0]
 [  0   1   3   7   3   0 257   7   2   0   0   1   8   0   1   0   0   0
    0   0]
 [  0   0   0   2   1   0   5 305   2   3   0   0   4   1   0   0   1   0
    0   0]
 [  0   0   0   0   1   0   3   3 285   0   0   0   1   0   0   1   0   0
    0   0]
 [  0   0   0   0   0   0   3   2   0 305   2   1   1   0   0   0   0   0
    1   0]
 [  0   0   0   0   0   0   1   0   1   0 300   0   0   0   0   0   0   0
    0   0]
 [  0   0   1   1   0   2   0   1   0   0   0 284   0   1   1   0   2   2
    1   1]
 [  0   2   2  10   2   2   6   5   1   0   1   1 279   1   1   0   0   0
    0   0]
 [  0   3   0   0   1   1   1   0   0   0   0   0   0 269   0   1   1   0
    0   0]
 [  0   5   0   0   1   0   0   0   0   0   2   0   1   0 295   0   0   0
    1   0]
 [  1   1   1   0   0   1   0   1   0   0   0   0   0   1   1 282   1   0
    0   3]
 [  0   0   1   0   0   0   0   0   1   3   0   0   1   0   0   1 232   1
    5   1]
 [  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0 293
    0   0]
 [  0   2   0   0   0   0   2   0   0   1   0   1   0   1   0   0   7   4
  216   2]
 [ 11   0   0   0   0   0   0   0   0   0   0   1   0   2   0   9   2   1
    2 143]]
Testbenching a MultinomialNB classifier...
parameters: {'alpha': 0.01}
done in 0.608000s
Percentage of non zeros coef: 100.000000
Predicting the outcomes of the testing set
done in 0.203000s
Classification report on test set for classifier:
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

                          precision    recall  f1-score   support

             alt.atheism       0.90      0.92      0.91       245
           comp.graphics       0.81      0.89      0.85       298
 comp.os.ms-windows.misc       0.87      0.83      0.85       292
comp.sys.ibm.pc.hardware       0.82      0.83      0.83       301
   comp.sys.mac.hardware       0.90      0.92      0.91       256
          comp.windows.x       0.90      0.89      0.89       297
            misc.forsale       0.90      0.84      0.87       290
               rec.autos       0.93      0.94      0.93       324
         rec.motorcycles       0.98      0.97      0.97       294
      rec.sport.baseball       0.97      0.97      0.97       315
        rec.sport.hockey       0.97      0.99      0.98       302
               sci.crypt       0.95      0.95      0.95       297
         sci.electronics       0.90      0.86      0.88       313
                 sci.med       0.97      0.96      0.97       277
               sci.space       0.95      0.97      0.96       305
  soc.religion.christian       0.91      0.97      0.94       293
      talk.politics.guns       0.89      0.96      0.93       246
   talk.politics.mideast       0.95      0.98      0.97       296
      talk.politics.misc       0.93      0.87      0.90       236
      talk.religion.misc       0.92      0.74      0.82       171

             avg / total       0.92      0.92      0.92      5648

Confusion matrix:
[[226   0   0   0   0   0   0   0   0   1   0   0   0   0   2   7   0   0
    0   9]
 [  1 266   7   4   1   6   2   2   0   0   0   3   4   1   1   0   0   0
    0   0]
 [  0  11 243  22   4   7   1   0   0   0   0   1   2   0   0   0   0   0
    1   0]
 [  0   7  12 250   8   4   9   0   0   1   1   0   9   0   0   0   0   0
    0   0]
 [  0   3   3   5 235   2   3   1   0   0   0   2   1   0   1   0   0   0
    0   0]
 [  0  19   5   3   2 263   0   0   0   0   0   1   0   1   1   0   2   0
    0   0]
 [  0   1   4   9   3   1 243   9   2   3   1   0   8   0   0   0   2   2
    2   0]
 [  0   0   0   1   1   0   5 304   1   2   0   0   3   2   3   1   1   0
    0   0]
 [  0   0   0   0   0   2   2   3 285   0   0   0   1   0   0   0   0   0
    0   1]
 [  0   1   0   0   0   1   1   3   0 304   5   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   1   2 299   0   0   0   0   0   0   0
    0   0]
 [  0   2   2   1   0   1   2   0   0   0   0 283   1   0   0   0   2   1
    2   0]
 [  0  11   1   9   3   1   3   5   1   0   1   4 270   1   3   0   0   0
    0   0]
 [  0   2   0   1   1   1   0   0   0   0   0   1   0 266   2   1   0   0
    2   0]
 [  0   2   0   0   1   0   0   0   0   0   0   2   1   1 296   0   1   1
    0   0]
 [  3   1   0   0   0   0   0   0   0   0   1   0   0   2   0 283   0   1
    2   0]
 [  1   0   1   0   0   0   0   0   1   0   0   1   0   0   0   0 237   1
    3   1]
 [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 291
    0   0]
 [  1   1   0   0   1   1   0   1   0   0   0   0   0   0   1   1  17   6
  206   0]
 [ 18   1   0   0   0   0   0   0   0   1   0   0   0   0   0  14   4   2
    4 127]]
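
The per-class figures in each report can be recomputed from the matching confusion matrix, which is a useful check on how to read it: rows are true classes, columns are predicted classes. The sketch below copies the first row and first column of the SGDClassifier matrix above (the alt.atheism class) and recomputes precision, recall and f1. The variable names are chosen only for this illustration.

```python
# Row 0 of the SGDClassifier confusion matrix: documents whose true class is
# alt.atheism, broken down by predicted class.
row = [227, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 1, 0, 11]
# Column 0 of the same matrix: documents predicted as alt.atheism,
# broken down by true class.
col = [227, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 11]

true_positives = row[0]
support = sum(row)                      # number of true alt.atheism documents
recall = true_positives / support       # share of true documents that were found
precision = true_positives / sum(col)   # share of predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print("support: %d" % support)
print("precision: %.2f  recall: %.2f  f1: %.2f" % (precision, recall, f1))
```

The printed values agree with the alt.atheism line of the first report (0.95, 0.93, 0.94, support 245).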

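The steps are:

I. Preprocessing

1. Load the training set

2. Extract features from the training set with TfidfVectorizer, giving X_train and y_train

3. Load the test set

4. Extract features from the test set with the same TfidfVectorizer (already fitted on the training set), giving X_test and y_test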
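
Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the original example's code: the tiny in-memory corpus and its labels are made up so the snippet runs without downloading the MLComp data, and the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the 20 newsgroups training and test documents.
train_docs = [
    "the rocket reached orbit",
    "the team won the hockey game",
    "nasa launched a new satellite",
    "the goalie saved the puck",
]
y_train = [0, 1, 0, 1]  # 0 = space, 1 = hockey (made-up labels)
test_docs = ["a satellite in orbit", "the hockey team scored"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and idf weights from the training set.
X_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary, so the test matrix has the same columns.
X_test = vectorizer.transform(test_docs)

print("n_samples: %d, n_features: %d" % X_train.shape)
print("n_samples: %d, n_features: %d" % X_test.shape)
```

Calling transform rather than fit_transform on the test set is what keeps n_features identical in both matrices, as in the output above (130274 in both cases).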

II. Define the benchmark for classifiers

5. Train: clf = clf_class(**params).fit(X_train, y_train)

6. Predict: pred = clf.predict(X_test)

7. Print the classification report on the test set: print(classification_report(y_test, pred, target_names=news_test.target_names))

8. Compute the confusion matrix: cm = confusion_matrix(y_test, pred)
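
Steps 5 to 8 fit naturally into one helper. The sketch below is an approximation under stated assumptions: the function name benchmark, its argument list, and the toy corpus are all chosen for illustration, and target_names is passed in explicitly instead of being read from news_test.

```python
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB


def benchmark(clf_class, params, X_train, y_train, X_test, y_test, target_names):
    """Fit one classifier, predict on the test set, print report and matrix."""
    t0 = time()
    clf = clf_class(**params).fit(X_train, y_train)      # step 5: train
    print("done in %fs" % (time() - t0))

    pred = clf.predict(X_test)                           # step 6: predict
    print(classification_report(y_test, pred,            # step 7: report
                                target_names=target_names))
    cm = confusion_matrix(y_test, pred)                  # step 8: matrix
    print("Confusion matrix:")
    print(cm)
    return pred, cm


# Hypothetical toy data so the helper can be run end to end.
train_docs = ["rocket orbit satellite", "hockey puck goalie",
              "nasa launch orbit", "hockey team game"]
y_train = [0, 1, 0, 1]
test_docs = ["satellite launch", "hockey game"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

pred, cm = benchmark(MultinomialNB, {"alpha": 0.01},
                     X_train, y_train, X_test, y_test,
                     target_names=["space", "hockey"])
```

Passing the class and a parameter dict, rather than a ready-made instance, mirrors the clf_class(**params) call in step 5 and lets the same helper be reused for any estimator.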

III. Training

9. Run the benchmark with two classifiers, SGDClassifier and MultinomialNB
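
Step 9 can be sketched as a loop over the two classifiers with the parameters shown in the output above. Several things here are assumptions rather than the original code: the toy corpus is made up, random_state is fixed only for repeatability, and max_iter=50 with tol=None replaces the n_iter=50 seen in the log, because later scikit-learn releases renamed that parameter.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the vectorized newsgroups sets.
train_docs = ["rocket orbit satellite", "hockey puck goalie",
              "nasa launch orbit", "hockey team game"]
y_train = [0, 1, 0, 1]
test_docs = ["satellite launch", "hockey game"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Classifier classes paired with the parameters printed in the log.
# max_iter/tol stand in for the older n_iter=50 (an adaptation, see above).
candidates = [
    (SGDClassifier, {"loss": "hinge", "penalty": "l2", "alpha": 1e-05,
                     "fit_intercept": True, "max_iter": 50, "tol": None,
                     "random_state": 0}),
    (MultinomialNB, {"alpha": 0.01}),
]

results = {}
for clf_class, params in candidates:
    print("Benchmarking %s..." % clf_class.__name__)
    clf = clf_class(**params).fit(X_train, y_train)
    # coef_ is not exposed by every estimator in every release, so guard it.
    if hasattr(clf, "coef_"):
        print("Percentage of non zeros coef: %f"
              % (np.mean(clf.coef_ != 0) * 100))
    results[clf_class.__name__] = clf.predict(X_test)
```

Each entry in results holds that classifier's predictions on the test set, ready to be passed to classification_report and confusion_matrix as in steps 7 and 8.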

Original post: https://www.cnblogs.com/gui0901/p/4456267.html