Classification (II) – Neural Network and SVM
Introduction
Most research has shown that support vector machines (SVM) and neural networks (NN) are
powerful classification tools that can be applied to many different areas. Unlike the tree-based
and probability-based methods mentioned in the previous chapter, the process by which
support vector machines and neural networks transform input into output is less transparent
and can be hard to interpret. As a result, both support vector machines and neural networks
are referred to as black box methods.
The development of neural networks was inspired by the activity of the human brain; this type
of network is a computational model that mimics the pattern of the human mind. In contrast
to this, support vector machines first map the input data into a high-dimensional feature space
defined by the kernel function, and then find the optimal hyperplane that separates the training
data with the maximum margin. In short, we can think of a support vector machine as a linear
algorithm in a high-dimensional space.
Both methods have advantages and disadvantages in solving classification problems.
For example, a support vector machine solution is the global optimum, while a neural network
may suffer from multiple local optima, so choosing between them depends on the
characteristics of the dataset source. In this chapter, we will illustrate the following:
- How to train a support vector machine
- Observing how the choice of cost can affect the SVM classifier
- Visualizing the SVM fit
- Predicting the labels of a testing dataset based on the model trained by SVM
- Tuning the SVM
In the neural network section, we will cover:
- How to train a neural network
- How to visualize a neural network model
- Predicting the labels of a testing dataset based on a model trained by neuralnet
Finally, we will show how to train a neural network with nnet, and how to use it to
predict the labels of a testing dataset.
Classifying data with a support vector machine
The two most well-known and popular support vector machine tools are libsvm and
SVMlight. For R users, the implementation of libsvm can be found in the e1071 package, and
an interface to SVMlight in the klaR package, so you can use the functions from these
two packages to train support vector machines. In this recipe, we will focus on using the svm
function (the libsvm implementation) from the e1071 package to train a support vector
machine on the telecom customer churn training dataset.
Getting ready
In this recipe, we will continue to use the telecom churn dataset as the input data source to
train the support vector machine. For those who have not prepared the dataset, please refer
to Chapter 5, Classification (I) – Tree, Lazy, and Probabilistic, for details.
How to do it...
Perform the following steps to train the SVM:
1. Load the e1071 package:
> library(e1071)
2. Train the support vector machine using the svm function with trainset as the input dataset, and use churn as the classification category:
> model = svm(churn~., data = trainset, kernel="radial", cost=1,
gamma = 1/ncol(trainset))
3. Finally, you can obtain overall information about the built model with summary:
> summary(model)
Call:
svm(formula = churn ~ ., data = trainset, kernel = "radial", cost
= 1, gamma = 1/ncol(trainset))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.05882353
Number of Support Vectors: 691
( 394 297 )
Number of Classes: 2
Levels:
yes no
How it works...
The support vector machine constructs a hyperplane (or a set of hyperplanes) that maximizes the margin between two classes in a high-dimensional space. The cases that define the hyperplane are the support vectors, as shown in the following figure:
Figure 1: Support Vector Machine
A support vector machine starts by constructing a hyperplane that maximizes the margin width. It then extends this definition to nonlinearly separable problems. Lastly, it maps the data into a high-dimensional space where the data can be more easily separated with a linear boundary.
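To make the kernel mapping concrete, here is a minimal sketch that computes the radial (RBF) kernel used by svm by hand, K(x, y) = exp(-gamma * ||x - y||^2); the rbf function and the two sample points are purely illustrative:
> # The mapping into the high-dimensional space is implicit; the SVM
> # only ever needs the pairwise similarities computed by the kernel:
> rbf = function(x, y, gamma) exp(-gamma * sum((x - y)^2))
> rbf(c(1, 2), c(2, 0), gamma = 0.5)
[1] 0.082085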
The advantage of using an SVM is that it builds a highly accurate model through an engineering problem-oriented kernel, and it makes use of the regularization term to avoid over-fitting. It also does not suffer from local optima or multicollinearity. The main limitation of an SVM is its speed and size at training and testing time, so it is not suitable or efficient enough for constructing classification models on large datasets. Also, since an SVM is hard to interpret, determining an appropriate kernel can be difficult, and regularization is another problem that we need to tackle.
In this recipe, we continue to use the telecom churn dataset as our example data source. We begin training a support vector machine using the libsvm implementation provided in the e1071 package. Within the training function, svm, one can specify the kernel function, cost, and gamma. For the kernel argument, the default value is radial, and one can set the kernel to linear, polynomial, radial basis, or sigmoid. As for the gamma argument, the default value is 1/(data dimension), and it controls the shape of the separating hyperplane. Increasing the gamma argument usually increases the number of support vectors.
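To see this effect in practice, one can refit the model with different gamma values and compare the number of support vectors. The following is only a sketch, assuming trainset is prepared as in the recipe; the tot.nSV component of the fitted model holds the total count:
> # A larger gamma typically yields a more flexible decision boundary
> # and, usually, more support vectors:
> model.g1 = svm(churn~., data = trainset, kernel = "radial", cost = 1,
gamma = 0.01)
> model.g2 = svm(churn~., data = trainset, kernel = "radial", cost = 1,
gamma = 1)
> model.g1$tot.nSV
> model.g2$tot.nSV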
As for the cost, the default value is 1, which indicates that the regularization term is constant; the larger the value, the smaller the margin. We will discuss how the cost affects the SVM classifier further in the next recipe. Once the support vector machine is built, the summary function can be used to obtain information such as the call, the parameters, the number of support vectors, and the classes and their labels.
See also
Another popular support vector machine tool is SVMlight. Unlike the e1071 package, which provides the full implementation of libsvm, the klaR package simply provides an interface to SVMlight. To use SVMlight, one can perform the following steps:
1. Install the klaR package:
> install.packages("klaR")
> library(klaR)
2. Download the SVMlight source code and binaries for your platform from http://svmlight.joachims.org/. For example, if your guest OS is Windows 64-bit, you should download the file from http://download.joachims.org/svm_light/current/svm_light_windows64.zip.
3. Then, you should unzip the file and put the executable binary in the working directory; you can check your working directory by using the getwd function:
> getwd()
4. Train the support vector machine using the svmlight function:
> model.light = svmlight(churn~., data = trainset,
kernel="radial", cost=1, gamma = 1/ncol(trainset))
Choosing the cost of a support vector machine
Support vector machines create an optimum hyperplane that separates the training data with the maximum margin. However, sometimes we would like to allow some misclassifications while separating categories. The SVM model has a cost parameter, which controls the trade-off between training errors and margins. For example, a small cost creates a large margin (a soft margin) and allows more misclassifications, while a large cost creates a narrow margin (a hard margin) and permits fewer misclassifications. In this recipe, we will illustrate how a large and a small cost affect the SVM classifier.
Getting ready
In this recipe, we will use the iris dataset as our example data source.
How to do it...
Perform the following steps to generate two classification examples with different costs:
1. Subset the iris dataset to the columns Sepal.Length, Sepal.Width, and Species, keeping only the species setosa and virginica:
> iris.subset = subset(iris, select=c("Sepal.Length", "Sepal.Width",
"Species"), Species %in% c("setosa","virginica"))
2. Then, you can generate a scatter plot with Sepal.Length as the x-axis and Sepal.Width as the y-axis:
> plot(x=iris.subset$Sepal.Length,y=iris.subset$Sepal.Width,
col=iris.subset$Species, pch=19)
Figure 2: Scatter plot of Sepal.Length and Sepal.Width with subset of iris dataset
3. Next, you can train an SVM on iris.subset with the cost set to 1:
> svm.model = svm(Species ~ ., data=iris.subset, kernel='linear',
cost=1, scale=FALSE)
4. Then, we can circle the support vectors with blue circles:
> points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2)
Figure 3: Circling the support vectors with blue rings
5. Lastly, we can add the separation line to the plot:
5.最后,我们可以在图上添加分隔线:
> w = t(svm.model$coefs) %*% svm.model$SV
> b = -svm.model$rho
> abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="red", lty=5)
Figure 4: A classification example with small cost
6. In addition to this, we create another SVM classifier where cost = 10,000:
> plot(x=iris.subset$Sepal.Length,y=iris.subset$Sepal.Width,
col=iris.subset$Species, pch=19)
> svm.model = svm(Species ~ ., data=iris.subset,
type='C-classification', kernel='linear', cost=10000, scale=FALSE)
> points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2)
> w = t(svm.model$coefs) %*% svm.model$SV
> b = -svm.model$rho
> abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="red", lty=5)
Figure 5: A classification example with large cost
How it works...
In this recipe, we demonstrate how different costs affect the SVM classifier. First, we create an iris subset with the columns Sepal.Length, Sepal.Width, and Species, containing only the species setosa and virginica. Then, in order to create a soft margin that allows some misclassification, we train the support vector machine with a small cost (cost = 1). Next, we circle the support vectors with blue circles and add the separation line. As per Figure 4, one of the green points (virginica) falls on the wrong side of the separation line and is classified as setosa, due to the choice of the small cost.
In addition to this, we would like to determine how a large cost affects the SVM classifier, so we choose a large cost (cost = 10,000). From Figure 5, we can see that the margin created is narrow (a hard margin) and no misclassification cases are present. As a result, the two examples show that the choice of cost affects the margin created and, in turn, the possibility of misclassification.
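The margin itself can be quantified from the fitted models. For a linear SVM, the weight vector of the hyperplane is w = t(coefs) %*% SV, exactly as computed in the plotting steps above, and the margin width equals 2/||w||. The following sketch refits the two models from this recipe and compares their margins; the margin.width helper is only illustrative:
> # Margin width of a linear SVM: 2 / ||w||, where w is the weight
> # vector of the separating hyperplane
> margin.width = function(m) {
+   w = t(m$coefs) %*% m$SV
+   2 / sqrt(sum(w^2))
+ }
> svm.soft = svm(Species ~ ., data=iris.subset, kernel='linear',
cost=1, scale=FALSE)
> svm.hard = svm(Species ~ ., data=iris.subset, kernel='linear',
cost=10000, scale=FALSE)
> margin.width(svm.soft)   # larger value: soft margin
> margin.width(svm.hard)   # smaller value: hard margin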
See also
The idea of the soft margin, which allows for misclassified examples, was proposed by Corinna Cortes and Vladimir N. Vapnik in the following paper: Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.