• Parameter Optimization - Learning Curves


    Validation curves are for tuning a learner's hyperparameters; learning curves are for tuning the size of the training set.
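
    To make the distinction concrete, here is a minimal sketch of the two scikit-learn calls side by side (the gamma range and CV settings are illustrative choices, not from the original post):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import validation_curve, learning_curve
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # Validation curve: vary a hyperparameter (gamma), training size fixed.
    train_scores, val_scores = validation_curve(
        SVC(), X, y, param_name="gamma",
        param_range=np.logspace(-6, -1, 5), cv=5)

    # Learning curve: vary the training set size, hyperparameters fixed.
    sizes, train_scores, val_scores = learning_curve(
        SVC(gamma=0.001), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)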

    In theory, if the data is "homogeneous", then once the sample size reaches a certain point the learner has learned every "feature" present, and adding more samples contributes nothing.

    So how many samples are enough?

    Let's run an experiment.

    Gradually increase the number of training samples while measuring accuracy on both the training set and the test set, and see what happens:

    1. First take 1 sample from the training set, train the model, then evaluate it on that training subset (1 sample) and on the test set: the training error is 0, while the test error is large.
    2. Then take 10 samples from the training set, train the model, then evaluate it on that training subset (10 samples) and on the test set: the training error grows, while the test error shrinks.
    3. And so on...
    4. Until the whole training set is used: the training error keeps growing and the test error keeps shrinking (see the sketch after this list).
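
    A minimal sketch of this experiment, assuming the digits dataset and GaussianNB classifier used in the full example below (the subset sizes are illustrative):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Train on ever-larger subsets and compare train vs. test error.
    for n in [10, 50, 100, 500, len(X_train)]:
        model = GaussianNB().fit(X_train[:n], y_train[:n])
        train_err = 1 - model.score(X_train[:n], y_train[:n])
        test_err = 1 - model.score(X_test, y_test)
        print(f"n={n:5d}  train error={train_err:.3f}  test error={test_err:.3f}")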

    As a figure, with the training-set size on the x axis and the error on the y axis:

    the training-set error curve rises gradually while the test-set error curve falls.

    The two curves must therefore intersect or reach some minimum gap. At that point, adding more samples is useless; the model can no longer learn anything new from them (a sketch of measuring this gap follows).
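
    A quick sketch, assuming scikit-learn's learning_curve (the CV settings are illustrative), of measuring how the train/validation gap shrinks as the training size grows:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import GaussianNB

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, test_scores = learning_curve(
        GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

    # Mean gap between training score and cross-validation score per size;
    # once it stops shrinking, more data is unlikely to help this model.
    gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
    for n, g in zip(sizes, gap):
        print(f"n={n:5d}  train-CV gap={g:.3f}")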

    Example code

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.datasets import load_digits
    from sklearn.model_selection import learning_curve
    from sklearn.model_selection import ShuffleSplit
    
    
    def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, train_sizes=np.linspace(.1, 1.0, 5)):
        # Plot mean train/CV scores, with ±1 std bands, vs. training set size.
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Training examples")
        plt.ylabel("Score")
        # learning_curve fits the estimator on increasing subsets of the data
        # and cross-validates each fit.
        train_sizes, train_scores, test_scores = learning_curve(
            estimator, X, y, cv=cv, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        plt.grid()
    
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    
        plt.legend(loc="best")
        return plt
    
    
    digits = load_digits()
    X, y = digits.data, digits.target
    
    title = "Learning Curves (Naive Bayes)"
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
    estimator = GaussianNB()
    plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv)
    
    title = "Learning Curves (SVM, RBF kernel, $gamma=0.001$)"
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    estimator = SVC(gamma=0.001)
    plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv)
    
    plt.show()

    Output: two learning-curve figures, one for Naive Bayes and one for the RBF-kernel SVM. Note that these plots show score (accuracy) rather than error, so the curves are mirrored relative to the description above: the training score falls while the cross-validation score rises as the two approach each other.

    In practice, data is rarely truly "homogeneous", so the more data the better.

  • Original article: https://www.cnblogs.com/yanshw/p/10688558.html