• Hyperparameter tuning


    See the official documentation for details.

    Definition

    Parameters whose values must be set before the model is fit.

    Examples

    • Linear regression: Choosing parameters
    • Ridge/lasso regression: Choosing alpha
    • k-Nearest Neighbors: Choosing n_neighbors
    • Parameters like alpha and k: Hyperparameters
    • Hyperparameters cannot be learned by fitting the model (see the sketch below)
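    A minimal sketch of the distinction, using the models named above (the alpha and n_neighbors values here are arbitrary illustrations):

    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsClassifier

    # Hyperparameters are fixed at construction time, before .fit() is called
    ridge = Ridge(alpha=0.5)                    # alpha: regularization strength
    knn = KNeighborsClassifier(n_neighbors=7)   # n_neighbors: the "k" in k-NN

    # Model parameters (e.g. ridge.coef_) only appear after fitting;
    # alpha and n_neighbors never do, they must be chosen beforehand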

    GridSearchCV

    sklearn.model_selection.GridSearchCV

    • An automatic hyperparameter search module
    • Grid search + cross-validation
    • Within the specified ranges, it steps through the candidate values, trains an estimator with each setting, and keeps the parameters that score highest on the validation set; in essence, a train-and-compare loop
    class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
    

    Parameters

    • estimator: the model object
    • param_grid: dict or list of dictionaries; a dict mapping each parameter name to its list of candidate values
    • scoring: string, callable, list/tuple, dict or None, default: None; the evaluation metric, e.g. RMSE or MSE
    • n_jobs: number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors, so it controls how many CPU cores run in parallel. See Glossary for more details.
    • cv: int, cross-validation generator or an iterable, optional; the number of folds for k-fold cross-validation, default 5
      • Determines the cross-validation splitting strategy. Possible inputs for cv are:
      • None, to use the default 5-fold cross-validation; an integer, to specify the number of folds in a (Stratified)KFold; a CV splitter;
      • An iterable yielding (train, test) splits as arrays of indices.
      • For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
    • verbose: controls how detailed the printed output is; higher values print more
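    A minimal sketch tying these constructor arguments together (the Ridge model and the alpha grid are illustrative, not part of the original exercise):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    param_grid = {'alpha': np.logspace(-3, 3, 7)}   # param_grid: name -> candidates
    grid = GridSearchCV(
        Ridge(),                          # estimator: the model object
        param_grid,
        scoring='neg_mean_squared_error', # scoring: a metric string
        cv=5,                             # 5-fold cross-validation
        n_jobs=-1,                        # use all CPU cores
        verbose=1,                        # print progress messages
    )
    # grid.fit(X, y) would run 7 candidate values x 5 folds = 35 fits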

    Attributes

    Common ones:

    • cv_results_: dict of numpy (masked) ndarrays; the result of every cross-validation round
    • best_estimator_: the best estimator found
    • best_params_: dict
      • Returns the parameters of the best model
      • Parameter setting that gave the best results on the hold out data.
      • For multi-metric evaluation, this is present only if refit is specified.
    • best_score_: float
      • Returns the score of the best parameter setting
      • Mean cross-validated score of the best_estimator
      • For multi-metric evaluation, this is present only if refit is specified.
      • This attribute is not available if refit is a function.
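    A short sketch of reading these attributes, assuming the grid object above has already been fit:

    # After grid.fit(X, y):
    print(grid.best_params_)            # e.g. {'alpha': 0.1} (illustrative)
    print(grid.best_score_)             # mean CV score of that setting
    best_model = grid.best_estimator_   # refit on the full data when refit=True

    # cv_results_ holds one row per setting; pandas makes it readable
    import pandas as pd
    results = pd.DataFrame(grid.cv_results_)
    print(results[['params', 'mean_test_score', 'rank_test_score']])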

    Worked example

    # Import necessary modules
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # Setup the hyperparameter grid
    # Create the set of candidate values for C
    c_space = np.logspace(-5, 8, 15)
    # Store the candidates in a dict keyed by parameter name
    param_grid = {'C': c_space}

    # Instantiate a logistic regression classifier: logreg
    # This is the classifier whose hyperparameter C we are tuning
    logreg = LogisticRegression()

    # Instantiate the GridSearchCV object: logreg_cv
    logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

    # Fit it to the data
    logreg_cv.fit(X, y)

    # Print the tuned parameters and score
    # The best parameter setting found
    print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
    # Score of the best setting
    print("Best score is {}".format(logreg_cv.best_score_))
    
    <script.py> output:
        Tuned Logistic Regression Parameters: {'C': 3.727593720314938}
        Best score is 0.7708333333333334
    
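    Because refit=True by default, the fitted search object itself behaves as the best classifier; a hypothetical follow-up (X_new stands in for unseen data):

    # The search object delegates predict calls to best_estimator_
    y_pred = logreg_cv.predict(X_new)         # X_new: hypothetical unseen samples
    y_prob = logreg_cv.predict_proba(X_new)   # class probabilities from the best model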

    GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
    Grid search is effectively a nested for loop that visits every parameter combination, so when there are many parameters to tune the amount of computation blows up; a randomized search that samples settings at random scales better.
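    The loop intuition can be checked directly with sklearn.model_selection.ParameterGrid, which enumerates every combination; the number of fits is its length times the number of CV folds (the grid below is illustrative):

    from sklearn.model_selection import ParameterGrid

    param_grid = {'max_depth': [3, 5, 7, None],
                  'min_samples_leaf': [1, 2, 4],
                  'criterion': ['gini', 'entropy']}
    print(len(ParameterGrid(param_grid)))   # 4 * 3 * 2 = 24 combinations
    # With cv=5 that is 24 * 5 = 120 model fits, and it grows multiplicatively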

    RandomizedSearchCV is used in exactly the same way as GridSearchCV, but it replaces the exhaustive grid search with random sampling from the parameter space. For a continuous parameter it samples from a distribution, which grid search cannot do. Its search power depends on the n_iter setting. Code is given below.

    RandomizedSearchCV

    • Randomized search
    • Rather than trying every candidate value, it draws a fixed number of settings from the specified probability distributions
      Still not fully clear on the practical difference?
      Comparing the run times of the two makes it concrete; see the timing sketch below.
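    A rough timing sketch, assuming X and y are already loaded; absolute times vary by machine, but the randomized search fits only n_iter settings instead of all 100:

    import time
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    params = {'max_depth': list(range(1, 11)),        # 10 x 10 = 100 settings
              'min_samples_leaf': list(range(1, 11))}

    start = time.perf_counter()
    GridSearchCV(DecisionTreeClassifier(), params, cv=5).fit(X, y)
    grid_time = time.perf_counter() - start

    start = time.perf_counter()
    RandomizedSearchCV(DecisionTreeClassifier(), params, cv=5, n_iter=10).fit(X, y)
    rand_time = time.perf_counter() - start

    print("grid: {:.2f}s, randomized: {:.2f}s".format(grid_time, rand_time))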

    Compared with grid search, the constructor parameters differ slightly.

    Listing just the common differences here; for the rest, consult the official documentation.

    The key additional parameter is n_iter, the number of settings to sample; and like GridSearchCV it exposes

    • .cv_results_: the result of every cross-validation round
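    A sketch of the continuous-distribution case mentioned above; scipy.stats.loguniform is assumed available (scipy >= 1.4), and the C range is illustrative:

    from scipy.stats import loguniform
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    # C is drawn from a continuous log-uniform distribution,
    # something a finite grid can only approximate
    param_dist = {'C': loguniform(1e-5, 1e3)}
    search = RandomizedSearchCV(LogisticRegression(), param_dist,
                                n_iter=20,        # number of sampled settings
                                cv=5, random_state=42)
    # search.fit(X, y)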

    Worked example

    # Import necessary modules
    from scipy.stats import randint
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import RandomizedSearchCV
    
    # Setup the parameters and distributions to sample from: param_dist
    # Using a decision tree as the example; note that the search space is a dict
    param_dist = {"max_depth": [3, None],
                  "max_features": randint(1, 9),
                  "min_samples_leaf": randint(1, 9),
                  "criterion": ["gini", "entropy"]}
    
    # Instantiate a Decision Tree classifier: tree
    tree = DecisionTreeClassifier()
    
    # Instantiate the RandomizedSearchCV object: tree_cv
    tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
    
    # Fit it to the data
    tree_cv.fit(X, y)
    
    # Print the tuned parameters and score
    print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
    print("Best score is {}".format(tree_cv.best_score_))
    
    <script.py> output:
        Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 2}
        Best score is 0.7395833333333334
    

    Limitations of each approach

    • grid: exhaustive, so the cost grows multiplicatively with the number of parameters and candidate values; expensive on large search spaces
    • random: only samples a fixed number of settings, so it may miss the best combination; quality depends on n_iter