• The train_test_split function in sklearn.model_selection and its parameters


    train_test_split is the scikit-learn function for splitting a dataset, that is, dividing the original dataset into a training set and a test set.

    from sklearn.model_selection import train_test_split

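    1. The function's signature and docstring in the source code (quoted from an older release, as the note about version 0.21 shows; the function body is omitted):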

    def train_test_split(*arrays, **options):
        """Split arrays or matrices into random train and test subsets
    
        Quick utility that wraps input validation and
        ``next(ShuffleSplit().split(X, y))`` and application to input data
        into a single call for splitting (and optionally subsampling) data in a
        oneliner.
    
        Read more in the :ref:`User Guide <cross_validation>`.
    
        Parameters
        ----------
        *arrays : sequence of indexables with same length / shape[0]
            Allowed inputs are lists, numpy arrays, scipy-sparse
            matrices or pandas dataframes.
    
        test_size : float, int, None, optional
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples. If None, the value is set to the
            complement of the train size. By default, the value is set to 0.25.
            The default will change in version 0.21. It will remain 0.25 only
            if ``train_size`` is unspecified, otherwise it will complement
            the specified ``train_size``.
    
        train_size : float, int, or None, default None
            If float, should be between 0.0 and 1.0 and represent the
            proportion of the dataset to include in the train split. If
            int, represents the absolute number of train samples. If None,
            the value is automatically set to the complement of the test size.
    
        random_state : int, RandomState instance or None, optional (default=None)
            If int, random_state is the seed used by the random number generator;
            If RandomState instance, random_state is the random number generator;
            If None, the random number generator is the RandomState instance used
            by `np.random`.
    
        shuffle : boolean, optional (default=True)
            Whether or not to shuffle the data before splitting. If shuffle=False
            then stratify must be None.
    
        stratify : array-like or None (default is None)
            If not None, data is split in a stratified fashion, using this as
            the class labels.
    
        Returns
        -------
        splitting : list, length=2 * len(arrays)
            List containing train-test split of inputs.
    
            .. versionadded:: 0.16
                If the input is sparse, the output will be a
                ``scipy.sparse.csr_matrix``. Else, output type is the same as the
                input type.
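
    As the Returns section above says, the result holds two entries per input array, ordered as train then test for each array. A minimal sketch of that behaviour, assuming NumPy is installed; the arrays X, y and w (a hypothetical per-sample weight vector) are illustrative and not part of the original post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Three arrays with the same first dimension (12 samples each).
X = np.arange(24).reshape(12, 2)   # features: row i is [2i, 2i + 1]
y = np.arange(12)                  # labels: y[i] == i
w = np.arange(12) * 10.0           # hypothetical sample weights: w[i] == 10 * i

# Three inputs give 2 * 3 = 6 outputs.
parts = train_test_split(X, y, w, test_size=4, random_state=0)
X_train, X_test, y_train, y_test, w_train, w_test = parts

print(len(parts))  # 6
# The same rows are selected from every array, so the parts stay aligned.
print(np.array_equal(X_train[:, 0], 2 * y_train))  # True
print(np.array_equal(w_test, 10.0 * y_test))       # True
```

    Passing all related arrays in one call is what keeps features, labels and any extra per-sample data aligned after the split.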

    2. Parameters

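    train_size: size of the training set

      float: between 0 and 1, the proportion of the dataset used for training

      int: the absolute number of training samples

      None: set automatically to the complement of the test set, i.e. the original dataset minus the test set

    test_size: size of the test set; 0.25 is used when neither test_size nor train_size is given

      float: between 0 and 1, the proportion of the dataset used for testing

      int: the absolute number of test samples

      None: set automatically to the complement of the training set, i.e. the original dataset minus the training set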
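
    The float, int and None cases above can be checked on a toy dataset. This is only a sketch: the 40-sample arrays and the helper sizes() are hypothetical, introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(80).reshape(40, 2)  # 40 samples, 2 features
y = np.arange(40)


def sizes(**kwargs):
    """Hypothetical helper: return (n_train, n_test) for the toy data."""
    X_train, X_test, _, _ = train_test_split(X, y, random_state=0, **kwargs)
    return len(X_train), len(X_test)


print(sizes())                                # (30, 10): default test share is 0.25
print(sizes(test_size=0.2))                   # (32, 8): float is a proportion
print(sizes(test_size=7))                     # (33, 7): int is a sample count
print(sizes(train_size=0.75))                 # (30, 10): test set is the complement
print(sizes(train_size=0.5, test_size=0.2))   # (20, 8): the other 12 samples are unused
```

    The last call shows that giving both sizes subsamples the data: the two parts no longer have to add up to the full dataset.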

    random_state: can be thought of as the random seed used when shuffling; it is set mainly so that a split can be reproduced
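
    A small sketch of the reproducibility point, using a hypothetical 20-element label array; the seed 14 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(20)

# Two calls with the same integer seed select exactly the same samples.
train_a, test_a = train_test_split(y, random_state=14)
train_b, test_b = train_test_split(y, random_state=14)
print(np.array_equal(train_a, train_b))  # True
print(np.array_equal(test_a, test_b))    # True

# With random_state=None (the default) the split may differ from run to run,
# but every sample still lands in exactly one of the two parts.
train_c, test_c = train_test_split(y)
print(sorted(np.concatenate([train_c, test_c]).tolist()) == y.tolist())  # True
```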

    shuffle: whether to shuffle the data before splitting; True or False, default True. If shuffle=False, stratify must be None.
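
    With shuffle=False the split is a plain cut: the leading samples become the training set and the trailing ones the test set. A sketch on a hypothetical 10-element array, which also shows the error raised when stratify is combined with shuffle=False:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(10)

# No shuffling: the original order is kept and the last 3 samples form the test set.
train, test = train_test_split(y, test_size=3, shuffle=False)
print(train)  # [0 1 2 3 4 5 6]
print(test)   # [7 8 9]

# Stratified splitting needs shuffling, so this combination is rejected.
try:
    train_test_split(y, test_size=3, shuffle=False, stratify=y % 2)
except ValueError as err:
    print("ValueError:", err)
```

    An unshuffled split is mainly useful when the order carries meaning, for example time-ordered data.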

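    stratify: whether to split the dataset so that the class proportions are preserved. For example, if the original dataset has class A : class B = 75% : 25%, then both the training set and the test set keep an A : B ratio of 75% : 25% (as closely as the sample counts allow). This is useful when the classes are heavily imbalanced. The usual form is stratify=y, i.e. the dataset's labels y drive the split.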
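
    The 75% : 25% example can be reproduced with synthetic labels. A minimal sketch, assuming 100 hypothetical samples with class 0 standing in for A and class 1 for B:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 samples: 75 of class 0 ("A") and 25 of class 1 ("B").
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, random_state=14, stratify=y
)

# Both parts keep the 75 : 25 class ratio.
print(np.bincount(y_train))  # [60 20]
print(np.bincount(y_test))   # [15  5]
```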

    3. Typical usage:

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=14, stratify=y)
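
    The call above leaves X and y undefined. One way to make it runnable is to load a dataset bundled with scikit-learn; the Iris dataset is an assumption here, not something the original post uses.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris (an illustrative choice): 150 samples, 4 features, 3 classes of 50 samples each.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=14, stratify=y
)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
# Per-class counts in each part; stratify=y keeps them nearly equal.
print(Counter(y_train.tolist()))
print(Counter(y_test.tolist()))
```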

    References:

    https://blog.csdn.net/liuxiao214/article/details/79019901

    https://blog.csdn.net/qq_38410428/article/details/94054920

  • Original article: https://www.cnblogs.com/qi-yuan-008/p/11997248.html