sklearn.model_selection.train_test_split randomly splits a dataset into a training set and a test set at a given ratio.
1. Usage:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.2, random_state=0)
2. Parameters:
train_data: the sample features (feature matrix)
train_target: the sample labels
test_size: the size of the test set; a float is interpreted as the proportion of the dataset to put in the test split, an integer as the absolute number of test samples
random_state: the seed of the random number generator; on the same dataset, the same seed always produces the same split, while different seeds produce different splits (see the sketch after this list)
X_train, y_train: together form the training set
X_test, y_test: together form the test set
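As a quick check of the test_size and random_state behavior described above, here is a minimal sketch; the ten-sample toy data and variable names are made up purely for illustration:

from sklearn.model_selection import train_test_split

X = list(range(10))          # 10 toy samples
y = [0] * 5 + [1] * 5        # 10 toy labels

# test_size as a fraction: 0.3 of 10 samples -> 3 samples in the test set
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_test_frac))      # 3

# test_size as an integer: exactly 4 samples in the test set
_, X_test_int, _, _ = train_test_split(X, y, test_size=4, random_state=0)
print(len(X_test_int))       # 4

# the same seed on the same data reproduces the same split
a = train_test_split(X, y, test_size=0.3, random_state=42)
b = train_test_split(X, y, test_size=0.3, random_state=42)
print(a[1] == b[1])          # True: identical test samples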
3. Example:
Generate a dataset of 100 samples and randomly split off 20% as the test set:
1 #py36 2 #!/usr/bin/env python 3 # -*- coding: utf-8 -*- 4 5 #from sklearn.cross_validation import train_test_split 6 from sklearn.model_selection import train_test_split 7 8 # 生成100条数据:100个2维的特征向量,对应100个标签 9 X = [["feature ","one "]] * 50 + [["feature ","two "]] * 50 10 y = [1] * 50 + [2] * 50 11 12 # 随机抽取20%的测试集 13 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1) 14 print ("train:",len(X_train), "test:",len(X_test)) 15 16 # 查看被划分出的测试集 17 for i in range(len(X_test)): 18 print ("".join(X_test[i]), y_test[i]) 19 20 ''' 21 train: 80 test: 20 22 feature two 2 23 feature two 2 24 feature one 1 25 feature two 2 26 feature two 2 27 feature one 1 28 feature one 1 29 feature two 2 30 feature two 2 31 feature two 2 32 feature two 2 33 feature one 1 34 feature two 2 35 feature two 2 36 feature two 2 37 feature one 1 38 feature one 1 39 feature one 1 40 feature two 2 41 feature one 1 42 '''
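The same call works unchanged on NumPy arrays. As a follow-up, the sketch below (not part of the original example) applies it to sklearn's bundled iris dataset of 150 samples:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 150 samples, 4 features each, 3 classes
X, y = load_iris(return_X_y=True)

# hold out 20% (30 samples) as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)   # (120,) (30,)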