Sample pipeline for text feature extraction and evaluation (scikit-learn)


    Sample pipeline for text feature extraction and evaluation

    https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py

        Download the 20 newsgroups data

        Define a pipeline: count vectorization, TF-IDF transformation, a linear classifier (see the sketch after this list)

        Use grid search to find the optimal parameters
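    The first two pipeline steps (count vectorization followed by TF-IDF weighting) can equivalently be collapsed into a single TfidfVectorizer step. A minimal sketch of that variant (not part of the original example):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    # TfidfVectorizer is equivalent to CountVectorizer followed by
    # TfidfTransformer, collapsed into one step
    compact_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', SGDClassifier()),
    ])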

    The dataset used in this example is the 20 newsgroups dataset, which will be automatically downloaded, then cached and reused for the document classification example.

    You can adjust the number of categories by giving their names to the dataset loader, or set them to None to get all 20 of them; both variants are sketched below.
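    A minimal sketch of both loader variants (the two category names chosen here are illustrative, not from the example):

    from sklearn.datasets import fetch_20newsgroups

    # None loads all 20 newsgroups categories
    data_all = fetch_20newsgroups(subset='train', categories=None)

    # A list of names restricts the load to those categories
    data_two = fetch_20newsgroups(subset='train',
                                  categories=['sci.space', 'comp.graphics'])

    print(len(data_all.target_names))   # 20
    print(data_two.target_names)        # ['comp.graphics', 'sci.space']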

    Below is the code, followed by sample output from a run on a quad-core machine. Note that the output corresponds to a run with more of the grid parameters uncommented than the code enables by default, and to an older scikit-learn release (hence the vect__max_n name, later replaced by vect__ngram_range).

    Code

    # Author: Olivier Grisel <olivier.grisel@ensta.org>
    #         Peter Prettenhofer <peter.prettenhofer@gmail.com>
    #         Mathieu Blondel <mathieu@mblondel.org>
    # License: BSD 3 clause
    from pprint import pprint
    from time import time
    import logging
    
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    
    print(__doc__)
    
    # Display progress logs on stdout
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    
    
    # #############################################################################
    # Load some categories from the training set
    categories = [
        'alt.atheism',
        'talk.religion.misc',
    ]
    # Uncomment the following to do the analysis on all the categories
    #categories = None
    
    print("Loading 20 newsgroups dataset for categories:")
    print(categories)
    
    data = fetch_20newsgroups(subset='train', categories=categories)
    print("%d documents" % len(data.filenames))
    print("%d categories" % len(data.target_names))
    print()
    
    # #############################################################################
    # Define a pipeline combining a text feature extractor with a simple
    # classifier
    pipeline = Pipeline([
        ('vect', CountVectorizer()),    # raw text -> matrix of token counts
        ('tfidf', TfidfTransformer()),  # counts -> tf-idf weighted features
        ('clf', SGDClassifier()),       # linear classifier trained with SGD
    ])
    
    # uncommenting more parameters will broaden the search but will
    # increase processing time in a combinatorial way
    parameters = {
        'vect__max_df': (0.5, 0.75, 1.0),
        # 'vect__max_features': (None, 5000, 10000, 50000),
        'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
        # 'tfidf__use_idf': (True, False),
        # 'tfidf__norm': ('l1', 'l2'),
        'clf__max_iter': (20,),
        'clf__alpha': (0.00001, 0.000001),
        'clf__penalty': ('l2', 'elasticnet'),
        # 'clf__max_iter': (10, 50, 80),
    }
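    # With the settings above the grid has 3 (max_df) x 2 (ngram_range)
    # x 1 (max_iter) x 2 (alpha) x 2 (penalty) = 24 candidates; with
    # 5-fold CV (the GridSearchCV default in recent scikit-learn) that
    # means 24 x 5 = 120 fits.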
    
    if __name__ == "__main__":
        # multiprocessing requires the fork to happen in a __main__ protected
        # block
    
        # find the best parameters for both the feature extraction and the
        # classifier
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    
        print("Performing grid search...")
        print("pipeline:", [name for name, _ in pipeline.steps])
        print("parameters:")
        pprint(parameters)
        t0 = time()
        grid_search.fit(data.data, data.target)
        print("done in %0.3fs" % (time() - t0))
        print()
    
        print("Best score: %0.3f" % grid_search.best_score_)
        print("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print("	%s: %r" % (param_name, best_parameters[param_name]))

    Output

    Loading 20 newsgroups dataset for categories:
    ['alt.atheism', 'talk.religion.misc']
    1427 documents
    2 categories
    
    Performing grid search...
    pipeline: ['vect', 'tfidf', 'clf']
    parameters:
    {'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07),
     'clf__max_iter': (10, 50, 80),
     'clf__penalty': ('l2', 'elasticnet'),
     'tfidf__use_idf': (True, False),
     'vect__max_n': (1, 2),
     'vect__max_df': (0.5, 0.75, 1.0),
     'vect__max_features': (None, 5000, 10000, 50000)}
    done in 1737.030s
    
    Best score: 0.940
    Best parameters set:
        clf__alpha: 9.9999999999999995e-07
        clf__max_iter: 50
        clf__penalty: 'elasticnet'
        tfidf__use_idf: True
        vect__max_n: 2
        vect__max_df: 0.75
        vect__max_features: 50000
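
    Beyond the single best combination, grid_search.cv_results_ holds the scores of every candidate. A sketch of inspecting them (pandas is an assumption; the example itself does not import it):

    import pandas as pd

    # One row per parameter combination, with mean and std of CV scores
    results = pd.DataFrame(grid_search.cv_results_)
    cols = ['params', 'mean_test_score', 'std_test_score']
    print(results[cols].sort_values('mean_test_score', ascending=False).head())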