Dataset Splitting

Hold-out

Split the original data in a ratio n : m (as fractions, n + m = 1), for example train : test = 7 : 3 or train : test = 6.5 : 3.5. This rather primitive approach does not work well. Its drawbacks are:

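Drawback 1: it wastes data, because the held-out samples are never used for training.
Drawback 2: it is easy to overfit, and the problem is awkward to correct.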
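
One way to see why a single hold-out split is fragile is to repeat the split with different random seeds and compare the scores. The sketch below is only an illustration: the 0.3 test fraction, the five seeds, and the name holdout_scores are assumptions chosen for the example, not part of the original text.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat the same hold-out procedure with different seeds (seeds are arbitrary).
holdout_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    clf = svm.SVC(kernel='linear').fit(X_train, y_train)
    holdout_scores.append(clf.score(X_test, y_test))

print(holdout_scores)
# The gap between the best and worst split shows how much one split can mislead.
print("spread:", max(holdout_scores) - min(holdout_scores))
```

If the spread is noticeable, a score from any single split should be read with caution.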

In that case we need another way to split the data: cross-validation, or Leave-P-Out (LPO).

LOO (Leave-One-Out) or LPO (Leave-P-Out)

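LOO: over the whole dataset, each round takes one sample as the validation set and uses the remaining samples as the training set.
LPO: over the whole dataset, each round takes P samples as the validation set and uses the remaining samples as the training set.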

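The advantage of LOO is that it avoids wasting data, but it comes with a higher computational cost.
As an estimator of test error, LOO generally has higher variance than K-Fold. However, when the learning curve is still steep at the training-set size in question, 5- or 10-fold cross-validation can overestimate the generalization error, and LOO may then give the better estimate.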

K-Fold

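KFold divides all samples into k groups, called folds, of equal size where possible (if k = n, this is equivalent to the Leave-One-Out strategy). The prediction function is trained on k - 1 folds, and the one remaining fold is used for testing. The ensemble method Stacking uses this scheme (Bagging uses subsampling instead, which is also an interesting approach and was covered earlier).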
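
The fold construction described above can be sketched in plain Python. This is a minimal illustration, not sklearn's implementation: the helper name kfold_indices is hypothetical, and the rule of giving the first n % k folds one extra sample is an assumption about how to make the sizes "equal where possible".

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample,
    # so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


for train, test in kfold_indices(10, 3):
    print(train, test)
```

With k equal to the number of samples, every test fold holds exactly one sample, which is the Leave-One-Out case.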

Note

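i.i.d. data is a common assumption in machine learning theory, but it rarely holds in practice. If the samples are known to have been generated by a time-dependent process, a time-series aware cross-validation scheme is safer. Likewise, if the generating process is known to have a group structure (samples collected from different subjects, experiments, or measurement devices), group-wise cross-validation is safer.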
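
As a rough illustration of both cases, sklearn ships GroupKFold and TimeSeriesSplit. The toy arrays and the groups labels below are made up for the example (think of them as hypothetical subject ids), and the sketch only prints the index splits rather than training a model.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])  # hypothetical subject ids

# Group-wise: samples from one group never sit on both sides of a split.
group_splits = list(GroupKFold(n_splits=3).split(X, y, groups))
for train, test in group_splits:
    print("group-wise:", train, test)

# Time-aware: every training index comes before every test index.
time_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train, test in time_splits:
    print("time-aware:", train, test)
```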

Repetition and stratification

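Stratified: for K-Fold, each fold keeps roughly the same class proportions as the full dataset.
Repeated: the whole K-Fold procedure is run several times, with a different random split each time. This is distinct from sampling with replacement as in Bagging, where some samples appear more than once in the training set and others never appear.
Repeated stratified: in Sklearn, K-Fold that is both repeated and stratified, so in every repetition the class proportions of each fold stay roughly equal to those of the original dataset.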
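
To make stratification and repetition concrete, here is a small sketch using sklearn's RepeatedStratifiedKFold. The eight-sample toy dataset and the choice of 2 folds with 2 repeats are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.zeros((8, 1))                      # features are irrelevant for splitting
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # two balanced classes

# 2 folds repeated 2 times gives 4 train/test splits in total.
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)
repeated_splits = list(rskf.split(X, y))
for train, test in repeated_splits:
    # Each test fold should hold two samples of each class.
    print(test, y[test])
```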

Cross-validation

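Cross-validation with LOO or LPO uses every sample (or every set of P samples) as the validation set once, then averages the results to obtain the score. K-Fold works the same way, except that the data is divided into K folds.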

Sklearn provides convenient helpers for cross-validation

Quick and easy usage

Load the data

from sklearn.model_selection import train_test_split,LeaveOneOut,LeavePOut
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

iris = datasets.load_iris()
clf_svc = svm.SVC(kernel='linear')
iris.data.shape,iris.target.shape

((150, 4), (150,))

hold out

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0) 

clf_svc.fit(X_train,y_train)
accuracy_score(y_test, clf_svc.predict(X_test))

0.9666666666666667

Leave One Out

loo = LeaveOneOut()
loo.get_n_splits(iris.data)
mean_accuracy_score_list = []
for train_index, test_index in loo.split(iris.data):
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    mean_accuracy_score_list.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(mean_accuracy_score_list))

0.98

Leave P Out

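LeavePOut is very similar to LeaveOneOut: it creates every possible train/test set by removing p samples from the full set. For n samples this yields m train-test pairs, where m is the number of unordered combinations of p samples chosen from n, i.e. C(n, p). Note that this greatly increases the computational cost: the example below fits m models instead of the n fitted above, which for iris with p = 3 means 551300 fits instead of 150.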

lpo = LeavePOut(p=3)
accuracy_scores = []
# Warning: this loop fits C(150, 3) = 551300 models, so it is slow
for train_index, test_index in lpo.split(iris.data):
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    accuracy_scores.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(accuracy_scores))

0.9793627184231215
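
The growth in the number of splits can be checked with a quick count. This is a small arithmetic sketch using the standard library only; the name split_counts is introduced just for the example.

```python
from math import comb

n_samples = 150  # size of the iris dataset

# Number of train/test pairs Leave-P-Out generates for p = 1, 2, 3.
split_counts = {p: comb(n_samples, p) for p in (1, 2, 3)}
for p, count in split_counts.items():
    print("p =", p, "->", count, "splits")
```

Since every split means one model fit, the cost grows combinatorially with p.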

The following example shows the effect more clearly:

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

K-Fold

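Plain K-Fold only divides the data into folds. Stratified K-Fold (StratifiedKFold) additionally stratifies, so that each fold preserves the class proportions.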

from sklearn.model_selection import KFold,StratifiedKFold
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=4)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

X = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44],
              [51, 52, 53, 54],
              [61, 62, 63, 64],
              [71, 72, 73, 74]])

y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# random_state only has an effect when shuffle=True, so it is omitted here
stratified_folder = StratifiedKFold(n_splits=4, shuffle=False)
for train_index, test_index in stratified_folder.split(X, y):
    print("Stratified Train Index:", train_index)
    print("Stratified Test Index:", test_index)
    print("Stratified y_train:", y[train_index])
    print("Stratified y_test:", y[test_index], '\n')

Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0] 

Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0] 

Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0] 

Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

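In practice, though, we more often use cross_val_score, a ready-made cross-validation helper, for model selection. With an integer cv and a classifier, it splits with stratified K-Fold by default (plain K-Fold otherwise). We can also use cross_val_predict to obtain the predictions made on each held-out fold, although a score computed from them is not necessarily the best estimate of performance.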

from sklearn.model_selection import cross_val_score
scores_clf_svc_cv = cross_val_score(clf_svc,iris.data,iris.target,cv=5)
print(scores_clf_svc_cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_clf_svc_cv.mean(), scores_clf_svc_cv.std() * 2))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf_svc, iris.data, iris.target, cv=10)
accuracy_score(iris.target, predicted)

0.9733333333333334
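
The cv argument also accepts a splitter object, so the manual Leave One Out loop shown earlier can likely be condensed into a single call. The sketch below assumes the same linear SVC and iris data. The name loo_scores is introduced for the example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear')

# One score per left-out sample: 1.0 if it was classified correctly, else 0.0.
loo_scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```

Passing a splitter this way keeps the splitting strategy explicit while leaving the fit and score loop to sklearn.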

References

For more, please see my blog, where I will keep posting related content.

Original article: https://www.cnblogs.com/fonttian/p/9162724.html