• Classification of text documents: using a MLComp dataset


    Note: the original code is at http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

    The run produced the following output:

    Loading 20 newsgroups training set... 
    20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
    13180 documents
    20 categories
    Extracting features from the dataset using a sparse vectorizer
    done in 139.231000s
    n_samples: 13180, n_features: 130274
    Loading 20 newsgroups test set... 
    done in 0.000000s
    Predicting the labels of the test set...
    5648 documents
    20 categories
    Extracting features from the dataset using the same vectorizer
    done in 7.082000s
    n_samples: 5648, n_features: 130274
    Testbenching a linear classifier...
    parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
    done in 22.012000s
    Percentage of non zeros coef: 30.074190
    Predicting the outcomes of the testing set
    done in 0.172000s
    Classification report on test set for classifier:
    SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
           eta0=0.0, fit_intercept=True, l1_ratio=0.15,
           learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
           penalty='l2', power_t=0.5, random_state=None, shuffle=True,
           verbose=0, warm_start=False)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.95      0.93      0.94       245
               comp.graphics       0.85      0.91      0.88       298
     comp.os.ms-windows.misc       0.88      0.88      0.88       292
    comp.sys.ibm.pc.hardware       0.82      0.80      0.81       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.92      0.88      0.90       297
                misc.forsale       0.87      0.89      0.88       290
                   rec.autos       0.93      0.94      0.94       324
             rec.motorcycles       0.97      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.98      0.99      0.99       302
                   sci.crypt       0.97      0.96      0.96       297
             sci.electronics       0.87      0.89      0.88       313
                     sci.med       0.97      0.97      0.97       277
                   sci.space       0.97      0.97      0.97       305
      soc.religion.christian       0.95      0.96      0.95       293
          talk.politics.guns       0.94      0.94      0.94       246
       talk.politics.mideast       0.97      0.99      0.98       296
          talk.politics.misc       0.96      0.92      0.94       236
          talk.religion.misc       0.89      0.84      0.86       171
    
                 avg / total       0.93      0.93      0.93      5648
    
    Confusion matrix:
    [[227   0   0   0   0   0   0   0   0   0   0   1   2   1   1   1   0   1
        0  11]
     [  0 271   3   8   2   5   2   0   0   1   0   0   3   1   1   0   0   1
        0   0]
     [  0   7 256  14   5   6   1   0   0   0   0   0   2   0   1   0   0   0
        0   0]
     [  1   8  12 240   9   3  12   2   0   0   0   1  12   0   0   1   0   0
        0   0]
     [  0   1   3   6 235   2   4   0   0   0   0   1   3   0   1   0   0   0
        0   0]
     [  0  17   9   4   0 260   0   0   1   1   0   0   2   0   2   0   1   0
        0   0]
     [  0   1   3   7   3   0 257   7   2   0   0   1   8   0   1   0   0   0
        0   0]
     [  0   0   0   2   1   0   5 305   2   3   0   0   4   1   0   0   1   0
        0   0]
     [  0   0   0   0   1   0   3   3 285   0   0   0   1   0   0   1   0   0
        0   0]
     [  0   0   0   0   0   0   3   2   0 305   2   1   1   0   0   0   0   0
        1   0]
     [  0   0   0   0   0   0   1   0   1   0 300   0   0   0   0   0   0   0
        0   0]
     [  0   0   1   1   0   2   0   1   0   0   0 284   0   1   1   0   2   2
        1   1]
     [  0   2   2  10   2   2   6   5   1   0   1   1 279   1   1   0   0   0
        0   0]
     [  0   3   0   0   1   1   1   0   0   0   0   0   0 269   0   1   1   0
        0   0]
     [  0   5   0   0   1   0   0   0   0   0   2   0   1   0 295   0   0   0
        1   0]
     [  1   1   1   0   0   1   0   1   0   0   0   0   0   1   1 282   1   0
        0   3]
     [  0   0   1   0   0   0   0   0   1   3   0   0   1   0   0   1 232   1
        5   1]
     [  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0 293
        0   0]
     [  0   2   0   0   0   0   2   0   0   1   0   1   0   1   0   0   7   4
      216   2]
     [ 11   0   0   0   0   0   0   0   0   0   0   1   0   2   0   9   2   1
        2 143]]
    Testbenching a MultinomialNB classifier...
    parameters: {'alpha': 0.01}
    done in 0.608000s
    Percentage of non zeros coef: 100.000000
    Predicting the outcomes of the testing set
    done in 0.203000s
    Classification report on test set for classifier:
    MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.90      0.92      0.91       245
               comp.graphics       0.81      0.89      0.85       298
     comp.os.ms-windows.misc       0.87      0.83      0.85       292
    comp.sys.ibm.pc.hardware       0.82      0.83      0.83       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.90      0.89      0.89       297
                misc.forsale       0.90      0.84      0.87       290
                   rec.autos       0.93      0.94      0.93       324
             rec.motorcycles       0.98      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.97      0.99      0.98       302
                   sci.crypt       0.95      0.95      0.95       297
             sci.electronics       0.90      0.86      0.88       313
                     sci.med       0.97      0.96      0.97       277
                   sci.space       0.95      0.97      0.96       305
      soc.religion.christian       0.91      0.97      0.94       293
          talk.politics.guns       0.89      0.96      0.93       246
       talk.politics.mideast       0.95      0.98      0.97       296
          talk.politics.misc       0.93      0.87      0.90       236
          talk.religion.misc       0.92      0.74      0.82       171
    
                 avg / total       0.92      0.92      0.92      5648
    
    Confusion matrix:
    [[226   0   0   0   0   0   0   0   0   1   0   0   0   0   2   7   0   0
        0   9]
     [  1 266   7   4   1   6   2   2   0   0   0   3   4   1   1   0   0   0
        0   0]
     [  0  11 243  22   4   7   1   0   0   0   0   1   2   0   0   0   0   0
        1   0]
     [  0   7  12 250   8   4   9   0   0   1   1   0   9   0   0   0   0   0
        0   0]
     [  0   3   3   5 235   2   3   1   0   0   0   2   1   0   1   0   0   0
        0   0]
     [  0  19   5   3   2 263   0   0   0   0   0   1   0   1   1   0   2   0
        0   0]
     [  0   1   4   9   3   1 243   9   2   3   1   0   8   0   0   0   2   2
        2   0]
     [  0   0   0   1   1   0   5 304   1   2   0   0   3   2   3   1   1   0
        0   0]
     [  0   0   0   0   0   2   2   3 285   0   0   0   1   0   0   0   0   0
        0   1]
     [  0   1   0   0   0   1   1   3   0 304   5   0   0   0   0   0   0   0
        0   0]
     [  0   0   0   0   0   0   0   0   1   2 299   0   0   0   0   0   0   0
        0   0]
     [  0   2   2   1   0   1   2   0   0   0   0 283   1   0   0   0   2   1
        2   0]
     [  0  11   1   9   3   1   3   5   1   0   1   4 270   1   3   0   0   0
        0   0]
     [  0   2   0   1   1   1   0   0   0   0   0   1   0 266   2   1   0   0
        2   0]
     [  0   2   0   0   1   0   0   0   0   0   0   2   1   1 296   0   1   1
        0   0]
     [  3   1   0   0   0   0   0   0   0   0   1   0   0   2   0 283   0   1
        2   0]
     [  1   0   1   0   0   0   0   0   1   0   0   1   0   0   0   0 237   1
        3   1]
     [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 291
        0   0]
     [  1   1   0   0   1   1   0   1   0   0   0   0   0   0   1   1  17   6
      206   0]
     [ 18   1   0   0   0   0   0   0   0   1   0   0   0   0   0  14   4   2
        4 127]]
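As a reading aid for the reports above: in the confusion matrix printed by scikit-learn, rows are true classes and columns are predicted classes, so recall is the diagonal entry divided by its row sum and precision is the diagonal entry divided by its column sum. The sketch below redoes that arithmetic for alt.atheism with the first row and first column of the SGDClassifier matrix, copied from the log; the variable names are chosen here for illustration only.

```python
# First row of the SGDClassifier confusion matrix: true class alt.atheism,
# one entry per predicted class.
row = [227, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 1, 0, 11]
# First column of the same matrix: documents predicted as alt.atheism,
# one entry per true class.
col = [227, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 11]

support = sum(row)                 # number of true alt.atheism documents
recall = row[0] / support          # correctly found / all true members
precision = col[0] / sum(col)      # correctly found / all predicted members
f1 = 2 * precision * recall / (precision + recall)

print("support=%d precision=%.2f recall=%.2f f1=%.2f"
      % (support, precision, recall, f1))
```

The values agree with the alt.atheism line of the SGDClassifier classification report (0.95, 0.93, 0.94, support 245).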

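    The steps are:

    I. Preprocessing

    1. Load the training set.

    2. Extract features from the training set with TfidfVectorizer, giving X_train (with the labels as y_train).

    3. Load the test set.

    4. Extract features from the test set with the same TfidfVectorizer, giving X_test (with the labels as y_test).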
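Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the original script: the tiny in-memory corpus and its labels are hypothetical stand-ins for the 20 newsgroups files, which the original example loads from a local MLComp download. The point it shows is that the vectorizer is fitted on the training documents only and then reused on the test documents, which is why both matrices in the log report the same n_features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the 20 newsgroups documents.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new orbit mission",
    "the hockey team won the game",
    "the goalie saved the game in overtime",
]
y_train = [0, 0, 1, 1]  # 0 = sci.space, 1 = rec.sport.hockey (illustrative)

test_docs = ["rocket mission to orbit", "hockey game tonight"]
y_test = [0, 1]

# Fit the vocabulary and IDF weights on the training set only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer on the test set: transform, never fit again.
X_test = vectorizer.transform(test_docs)

print("n_samples: %d, n_features: %d" % X_train.shape)
print("n_samples: %d, n_features: %d" % X_test.shape)
```

Calling fit_transform on the test set instead would build a different vocabulary, and the classifier trained on X_train could no longer be applied to X_test.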

    II. Define the benchmark for the classifiers

    5. Train: clf = clf_class(**params).fit(X_train, y_train)

    6. Predict: pred = clf.predict(X_test)

    7. Print the classification report on the test set: print(classification_report(y_test, pred, target_names=news_test.target_names))

    8. Compute the confusion matrix: cm = confusion_matrix(y_test, pred)
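Steps 5 to 8 fit naturally into one helper function. The sketch below is an assumption about how such a helper might look, built only from the calls quoted in the steps; the function signature, the toy corpus, and the target_names list are hypothetical and are there only so that the block runs on its own.

```python
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB


def benchmark(clf_class, params, X_train, y_train, X_test, y_test, target_names):
    """Train one classifier, then report its quality on the test set."""
    t0 = time()
    clf = clf_class(**params).fit(X_train, y_train)   # step 5: train
    print("done in %fs" % (time() - t0))

    pred = clf.predict(X_test)                        # step 6: predict

    # Step 7: per-class precision, recall and f1-score.
    print(classification_report(y_test, pred, target_names=target_names))

    cm = confusion_matrix(y_test, pred)               # step 8
    print("Confusion matrix:")
    print(cm)
    return clf, pred, cm


# Hypothetical toy data so the helper can be exercised without the dataset.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new orbit mission",
    "the hockey team won the game",
    "the goalie saved the game in overtime",
]
y_train = [0, 0, 1, 1]
test_docs = ["rocket mission to orbit", "hockey game tonight"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

clf, pred, cm = benchmark(MultinomialNB, {"alpha": 0.01},
                          X_train, y_train, X_test, y_test,
                          target_names=["sci.space", "rec.sport.hockey"])
```

Passing the classifier class and its parameters separately, as in clf_class(**params), lets the same function benchmark any estimator that exposes fit and predict.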

    III. Training

    9. Run the benchmark with two classifiers, SGDClassifier and MultinomialNB.
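Step 9 can be sketched as a loop over the two classifier configurations. The alpha, loss, penalty and fit_intercept values are taken from the "parameters:" lines of the log. Everything else is an assumption made for this sketch: the log's n_iter argument no longer exists in current scikit-learn releases, so max_iter=50 with tol=None is used here as the closest equivalent, random_state=0 is added for repeatability, and the toy corpus is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the 20 newsgroups documents.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new orbit mission",
    "the hockey team won the game",
    "the goalie saved the game in overtime",
]
y_train = [0, 0, 1, 1]
test_docs = ["rocket mission to orbit", "hockey game tonight"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# One (class, parameters) pair per classifier to benchmark.
configs = [
    (SGDClassifier, {"loss": "hinge", "penalty": "l2", "alpha": 1e-5,
                     "fit_intercept": True,
                     "max_iter": 50, "tol": None,  # assumption: replaces n_iter=50
                     "random_state": 0}),          # assumption: for repeatability
    (MultinomialNB, {"alpha": 0.01}),
]

results = {}
for clf_class, params in configs:
    print("Benchmarking %s, parameters: %s" % (clf_class.__name__, params))
    clf = clf_class(**params).fit(X_train, y_train)
    # Share of non-zero weights; newer MultinomialNB versions have no coef_.
    if hasattr(clf, "coef_"):
        print("Percentage of non zeros coef: %f"
              % (np.mean(clf.coef_ != 0) * 100))
    results[clf_class.__name__] = clf.predict(X_test)
    print("predictions:", list(results[clf_class.__name__]))
```

In the log, the linear model takes about 22 s to train against 0.6 s for naive Bayes, for a one-point gain in average f1-score (0.93 versus 0.92).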

     

  • Original article: https://www.cnblogs.com/gui0901/p/4456267.html