7. Selecting features using the caret package
Feature selection searches for the subset of features that minimizes predictive error. We can apply it to identify which attributes are required to build an accurate model. The caret package provides a recursive feature elimination function, rfe, which can help select the required features automatically. In the following recipe, we demonstrate how to use the caret package to do so.
How to do it...
Perform the following steps to select features:
1. Transform the feature named international_plan of the training dataset, trainset, to intl_yes and intl_no:
> intl_plan = model.matrix(~ trainset.international_plan - 1, data=data.frame(trainset$international_plan))
> colnames(intl_plan) = c("trainset.international_planno"="intl_no", "trainset.international_planyes"= "intl_yes")
2. Transform the feature named voice_mail_plan of the training dataset, trainset, to voice_yes and voice_no:
> voice_plan = model.matrix(~ trainset.voice_mail_plan - 1, data=data.frame(trainset$voice_mail_plan))
> colnames(voice_plan) = c("trainset.voice_mail_planno"="voice_no", "trainset.voice_mail_planyes"="voice_yes")
3. Remove the international_plan and voice_mail_plan attributes and combine the training dataset, trainset, with the data frames intl_plan and voice_plan:
> trainset$international_plan = NULL
> trainset$voice_mail_plan = NULL
> trainset = cbind(intl_plan, voice_plan, trainset)
4. Transform the feature named international_plan of the testing dataset, testset, to intl_yes and intl_no:
> intl_plan = model.matrix(~ testset.international_plan - 1, data=data.frame(testset$international_plan))
> colnames(intl_plan) = c("testset.international_planno"="intl_no", "testset.international_planyes"="intl_yes")
5. Transform the feature named voice_mail_plan of the testing dataset, testset, to voice_yes and voice_no:
> voice_plan = model.matrix(~ testset.voice_mail_plan - 1, data=data.frame(testset$voice_mail_plan))
> colnames(voice_plan) = c("testset.voice_mail_planno"="voice_no", "testset.voice_mail_planyes"="voice_yes")
6. Remove the international_plan and voice_mail_plan attributes and combine the testing dataset, testset with the data frames, intl_plan and voice_plan:
> testset$international_plan = NULL
> testset$voice_mail_plan = NULL
> testset = cbind(intl_plan, voice_plan, testset)
7. We then create a feature selection algorithm using linear discriminant analysis:
> ldaControl = rfeControl(functions = ldaFuncs, method = "cv")
In this recipe, we perform feature selection using the caret package. Because the dataset contains factor-coded attributes, we first use the model.matrix function to transform each of them into multiple binary attributes. We thus transform the international_plan attribute to intl_yes and intl_no, and the voice_mail_plan attribute to voice_yes and voice_no.
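The transformation described above can be tried on a small example. The following is a minimal sketch in base R; the vector plan is made-up data standing in for trainset$international_plan, not part of the original recipe.

```r
# Toy stand-in for trainset$international_plan (hypothetical data).
plan <- factor(c("no", "yes", "no", "no", "yes"), levels = c("no", "yes"))

# "- 1" drops the intercept, so every factor level gets its own 0/1 column.
intl_plan <- model.matrix(~ plan - 1)

# Rename the generated columns ("planno", "planyes") as in the recipe.
colnames(intl_plan) <- c("intl_no", "intl_yes")

print(intl_plan)

# Each row contains exactly one 1 across the two indicator columns.
stopifnot(all(rowSums(intl_plan) == 1))
```

Because the two indicator columns always sum to 1, they carry the same information; keeping both follows the recipe, although some models may warn about collinear inputs.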
8. Measuring the performance of the regression model
To measure the performance of a regression model, we can use the distance between the predicted output and the actual output as a quantifier. The root mean square error (RMSE), relative square error (RSE), and R-Square are common measurements. In the following recipe, we illustrate how to compute these measurements from a fitted regression model.
The performance of a regression model is measured by the distance between the predicted value and the actual value. Three measurements are commonly used as quantifiers: root mean square error, relative square error, and R-Square. In this recipe, we first load the Quartet data from the car package. We then use the lm function to fit a linear model.
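For reference, the three measurements can be written out as follows. The notation is an assumption introduced here, not taken from the recipe: y_i is the actual value, \hat{y}_i the predicted value, \bar{y} the mean of the actual values, and n the number of observations. The definitions are chosen to match the R expressions used in the steps of this recipe.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
\qquad
\mathrm{RSE} = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(\bar{y} - y_i\right)^2}
\qquad
R^2 = 1 - \mathrm{RSE}
```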
How to do it...
Perform the following steps to measure the performance of the regression model:
1. Load the Quartet dataset from the car package:
> library(car)
> data(Quartet)
2. Plot the attribute, y3, against x using the lm function:
> plot(Quartet$x, Quartet$y3)
> lmfit = lm(Quartet$y3~Quartet$x)
> abline(lmfit, col="red")
Figure 4: The linear regression plot
3. You can retrieve predicted values by using the predict function:
> predicted= predict(lmfit, newdata=Quartet[c("x")])
4. Now, you can calculate the root mean square error:
> actual = Quartet$y3
> rmse = (mean((predicted - actual)^2))^0.5
> rmse
[1] 1.118286
5. You can calculate the relative square error:
> mu = mean(actual)
> rse = mean((predicted - actual)^2) / mean((mu - actual)^2)
> rse
[1] 0.333676
6. Also, you can use R-Square as a measurement:
> rsquare = 1 - rse
> rsquare
[1] 0.666324
7. Then, you can plot attribute, y3, against x using the rlm function from the MASS package:
> library(MASS)
> plot(Quartet$x, Quartet$y3)
> rlmfit = rlm(Quartet$y3~Quartet$x)
> abline(rlmfit, col="red")
Figure 5: The robust linear regression plot on the Quartet dataset
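The three measurements computed step by step above can also be collected in one small helper. This is a sketch under stated assumptions: regression_metrics is a hypothetical name introduced here, and the built-in anscombe dataset (columns x3 and y3) is used as a stand-in for Quartet$x and Quartet$y3 so that the example runs without the car package.

```r
# Hypothetical helper: RMSE, RSE and R-Square computed as in the recipe.
regression_metrics <- function(actual, predicted) {
  mse <- mean((predicted - actual)^2)
  # Relative square error: model error relative to predicting the mean.
  rse <- mse / mean((mean(actual) - actual)^2)
  list(rmse = sqrt(mse), rse = rse, rsquare = 1 - rse)
}

# Assumption: anscombe$x3 / anscombe$y3 stand in for Quartet$x / Quartet$y3.
fit <- lm(y3 ~ x3, data = anscombe)

# predict() without newdata returns the fitted values for the training rows.
m <- regression_metrics(anscombe$y3, predict(fit))
print(round(unlist(m), 6))
```

For an ordinary least-squares fit with an intercept, the rsquare value computed this way should agree with summary(fit)$r.squared.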
9. Measuring prediction performance with a confusion matrix
To measure the performance of a classification model, we can first generate a classification table from the predicted labels and the actual labels. Then, we can use a confusion matrix to obtain performance measures such as precision, recall, specificity, and accuracy. In this recipe, we demonstrate how to retrieve a confusion matrix.
How to do it...
Perform the following steps to generate a classification measurement:
1. Train an SVM model using the training dataset:
> svm.model = train(churn ~ .,
+                   data = trainset,
+                   method = "svmRadial")
2. You can then predict labels using the fitted model, svm.model:
> svm.pred = predict(svm.model, testset[,! names(testset) %in% c("churn")])
3. Next, you can generate a classification table:
> table(svm.pred, testset[,c("churn")])

svm.pred yes  no
     yes  73  16
     no   68 861
4. Lastly, you can generate a confusion matrix using the prediction results and the actual labels from the testing dataset:
> confusionMatrix(svm.pred, testset[,c("churn")])
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes  73  16
       no   68 861

               Accuracy : 0.9175
                 95% CI : (0.8989, 0.9337)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 2.273e-08

                  Kappa : 0.5909
 Mcnemar's Test P-Value : 2.628e-08

            Sensitivity : 0.51773
            Specificity : 0.98176
         Pos Pred Value : 0.82022
         Neg Pred Value : 0.92680
             Prevalence : 0.13851
         Detection Rate : 0.07171
   Detection Prevalence : 0.08743
      Balanced Accuracy : 0.74974

       'Positive' Class : yes
In this recipe, we demonstrate how to obtain a confusion matrix to measure the performance of a classification model. First, we use the train function from the caret package to train an SVM model. Next, we use the predict function to obtain the labels that the SVM model predicts for the testing dataset. Then, we use the table function to build the classification table from the predicted and actual labels.
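The measures reported by confusionMatrix can be recomputed by hand from the four counts in the classification table. The sketch below uses base R only; the counts are copied from the table in step 3, "yes" is treated as the positive class as in the output above, and the variable names are introduced here for illustration.

```r
# Counts from the classification table in step 3
# (rows = predicted label, columns = actual label).
tab <- matrix(c(73, 68, 16, 861), nrow = 2,
              dimnames = list(predicted = c("yes", "no"),
                              actual    = c("yes", "no")))

tp <- tab["yes", "yes"]  # predicted yes, actually yes
fp <- tab["yes", "no"]   # predicted yes, actually no
fn <- tab["no", "yes"]   # predicted no, actually yes
tn <- tab["no", "no"]    # predicted no, actually no

measures <- c(
  accuracy    = (tp + tn) / sum(tab),
  sensitivity = tp / (tp + fn),  # recall for the "yes" class
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp),  # shown as "Pos Pred Value"
  neg_pred    = tn / (tn + fn)   # shown as "Neg Pred Value"
)
print(round(measures, 5))
```

The sensitivity of about 0.52 shows that the model misses nearly half of the actual "yes" cases even though overall accuracy is above 0.9, which is why accuracy alone can be misleading on an imbalanced dataset.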
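Summary: In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (the target) and independent variables (the predictors). It is commonly used for predictive analysis, time series modeling, and discovering causal relationships between variables. It can help data analysts eliminate variables and estimate the best set of variables for building a predictive model. It has limitations, however: it is sensitive to anomalous data points, which affects how well it handles the diversity and unpredictability of the factors.
--------Excerpted from Baidu Translate, Lang Xiaomin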