(Original article; please credit the source when reposting!)
Logistic regression solves binary classification problems.
The model outputs a value in (0, 1), and 0.5 is used as the classification threshold.
The model's output is P(y=1 | x; θ): the probability of the event "y = 1" given the input x and the parameter vector θ.
A natural choice is the sigmoid function g(z) = 1/(1 + e^(-z)), z ∈ R, whose range is (0, 1); in logistic regression z = θ^T x. (Other functions with this range could also be chosen.)
That is: P(y=1 | x; θ) = 1/(1 + e^(-z)), where z = θ^T x.
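As a quick sketch of this model (the helper names `sigmoid` and `predict_prob` are mine, not part of the implementation further below), the output probability can be computed in R as:

```r
# Sigmoid (logistic) function: maps any real z to (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# P(y = 1 | x; theta) for one example x and a parameter vector theta
predict_prob <- function(theta, x) sigmoid(sum(theta * x))

# Example: theta = (0, 1, 1), x = (1, 2, -2) gives z = 0, probability 0.5
predict_prob(c(0, 1, 1), c(1, 2, -2))   # 0.5
```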
Choosing the learning objective (cost function)
The objective function is minimized by an optimization algorithm to train the model and find the optimal solution.
For a convex function, a local minimum is also the global minimum, which makes the problem easy to solve; so when picking an objective function, prefer a convex one. Choose the following loss functions:
for y = 1: -log[1/(1 + e^(-z))]; for y = 0: -log[1 - 1/(1 + e^(-z))] (the log loss). Both are monotone, convex functions of z.
Merging the two loss functions into one expression:
L(x, y; θ) = -y log[1/(1 + e^(-z))] - (1 - y) log[1 - 1/(1 + e^(-z))]
Since the training data consist of multiple sample points, the average loss over the training set is:
J(θ) = -(1/m) Σ_{i=1}^m { y^(i) log[1/(1 + e^(-z_i))] + (1 - y^(i)) log[1 - 1/(1 + e^(-z_i))] }, where z_i = θ^T x^(i) and m is the number of sample points.
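The average loss can be sketched in R as follows (the name `logloss` and the vectorized form are my own, not taken from the implementation further below):

```r
# Average log-loss over m samples (unregularized).
# X: m-by-n design matrix, y: vector of 0/1 labels, theta: length-n vector.
logloss <- function(theta, X, y) {
  h <- as.vector(1 / (1 + exp(-(X %*% theta))))   # sigmoid of z_i = theta^T x^(i), per row
  -mean(y * log(h) + (1 - y) * log(1 - h))        # minus the average log-likelihood
}
```

For instance, at θ = 0 every prediction is 0.5, so the loss is -log(0.5) = log 2 regardless of the labels.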
(Estimating the parameters of a logistic regression model is, at bottom, a point-estimation problem, so maximum likelihood estimation (MLE) can be used to estimate the model's parameters.
The likelihood is L(θ) = Π_{i=1}^m [h(x^(i))]^(y^(i)) · [1 - h(x^(i))]^(1 - y^(i)), where h(x^(i)) = 1/(1 + e^(-z_i)) and z_i = θ^T x^(i),
and the log-likelihood is ℓ(θ) = Σ_{i=1}^m { y^(i) log h(x^(i)) + (1 - y^(i)) log[1 - h(x^(i))] }. The average loss above is just -(1/m) times this log-likelihood, so minimizing it is exactly maximizing the likelihood.)
Using the sample average loss in place of the expected loss under the true distribution is justified, by the law of large numbers, when there are sufficiently many sample points.
What if there are not enough sample points? Then replacing the expected loss with the average loss may lead to overfitting.
The remedy is regularization: add a penalty term to the average loss function:
J(θ) = -(1/m) Σ_{i=1}^m { y^(i) log[1/(1 + e^(-z_i))] + (1 - y^(i)) log[1 - 1/(1 + e^(-z_i))] } + (λ/2m) Σ_{j=1}^n θ_j², where z_i = θ^T x^(i) and n is the number of parameters. (By convention the intercept term is usually excluded from the penalty; the R code below follows that convention.)
The penalty term charges the cost function for model complexity (e.g., having many parameters): it shrinks the trained values of the penalized parameters, reducing their influence on the model, and thereby controls model complexity and avoids overfitting. (Besides the 2-norm penalty added to the average loss above, other penalties exist, such as the 1-norm penalty.)
The first term of the cost function minimizes the model's error, and the second controls the effect of model complexity; the coefficient λ ≥ 0 balances the two terms. Note also that adding sample points weakens the penalty (as m increases, λ/2m decreases), so the penalized parameters come out larger; removing the m from the penalty's denominator eliminates this effect.
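A sketch of the regularized cost in R (the name `logloss_reg` is illustrative; as is conventional, and as the longer implementation below does, the intercept is left out of the penalty):

```r
# Regularized average log-loss; the 2-norm penalty skips the intercept
# theta[1] (R vectors are 1-indexed).
# X: m-by-n design matrix, y: 0/1 labels, theta: length-n, lambda >= 0.
logloss_reg <- function(theta, X, y, lambda) {
  m <- nrow(X)
  h <- as.vector(1 / (1 + exp(-(X %*% theta))))   # predicted P(y = 1 | x)
  -mean(y * log(h) + (1 - y) * log(1 - h)) +
    (lambda / (2 * m)) * sum(theta[-1]^2)         # penalty excludes intercept
}
```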
This optimization problem can be solved with gradient descent or with Newton's method. With Newton's method, the parameter update is:
θ := θ - H^(-1) ∇J(θ), where H is the Hessian matrix of the average loss with respect to θ, H^(-1) is its inverse, and ∇J(θ) is the gradient vector of the average loss with respect to θ.
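One Newton step can be sketched as follows (a minimal vectorized version; the function name is mine, and the no-penalty-on-the-intercept handling mirrors the longer implementation below):

```r
# One Newton step for regularized logistic regression.
# X: m-by-n design matrix (first column all ones), y: 0/1 labels, theta: length-n.
newton_step <- function(theta, X, y, lambda) {
  m <- nrow(X); n <- ncol(X)
  h <- as.vector(1 / (1 + exp(-(X %*% theta))))       # sigmoid(X theta)
  pen <- c(0, rep(1, n - 1))                          # intercept is not penalized
  grad <- (t(X) %*% (h - y) + lambda * pen * theta) / m
  H <- (t(X) %*% (X * (h * (1 - h))) + lambda * diag(pen)) / m
  as.vector(theta - solve(H, grad))                   # theta := theta - H^{-1} grad
}
```

Here `t(X) %*% (X * w)` computes Σ_i w_i x^(i) x^(i)T, the unpenalized Hessian, by scaling row i of X by w_i = h_i(1 - h_i).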
Below is an R implementation of this logistic regression algorithm.
##
## Function: logistic_fun_sigmoid
logistic_fun_sigmoid <- function(z)
{
    h_theta <- 1.0 / (1.0 + exp(-z))
    h_theta
}

##
## Function: logistic_fun_map_feature
## Maps two features (u, v) to all monomials u^j * v^(i-j) up to the given
## degree; degree 6 yields 28 features (including the constant term).
logistic_fun_map_feature <- function(u, v, degree)
{
    result <- matrix(1, nrow = length(u), ncol = 1)
    for (i in 1:degree) {
        for (j in i:0) {
            result <- cbind(result, u^j * v^(i - j))
        }
    }
    result
}

##
## load data
##
logisticx <- read.table(file = "x.dat", sep = ",")
attr(logisticx, "names") <- c("u", "v")
logisticy <- read.table(file = "y.dat")

logistic_x <- as.matrix(logisticx); attr(logistic_x, "dimnames")[2] <- NULL
logistic_y <- as.matrix(logisticy); attr(logistic_y, "dimnames")[2] <- NULL
logistic_y_factor <- as.factor(logistic_y)

## plot the samples' data
plot(x = c(floor(min(logistic_x[, 1])), ceiling(max(logistic_x[, 1]))),
     y = c(floor(min(logistic_x[, 2])), ceiling(max(logistic_x[, 2]))),
     xlab = "u", ylab = "v", type = "n")   # set up the coordinate system
points(logistic_x[, 1], logistic_x[, 2],
       col = as.integer(logistic_y) + 1, pch = as.integer(logistic_y) + 1)
legend("topright", legend = c("y=1", "y=0"), col = c(2, 1), pch = c(2, 1))

##
## map original x to 28-feature vector
##
logistic_x_mapped <- logistic_fun_map_feature(logistic_x[, 1], logistic_x[, 2], 6)

logistic_theta_list <- list(NULL, NULL, NULL)   # one fitted theta per lambda
logistic_lamda_vec <- c(0, 1, 10)

for (index_i in 1:length(logistic_theta_list)) {
    logistic_lamda <- logistic_lamda_vec[index_i]   # 0, 1, 10

    logistic_num_of_samples <- dim(logistic_x_mapped)[1]
    logistic_num_of_features <- dim(logistic_x_mapped)[2]
    logistic_theta <- matrix(0, nrow = logistic_num_of_features, ncol = 1)   # 28 rows, 1 column
    logistic_h_theta <- logistic_fun_sigmoid(logistic_x_mapped %*% logistic_theta)
    logistic_gradient_of_J_theta <- numeric(logistic_num_of_features)
    ## diagonal matrix with [1,1] = 0: the intercept is not penalized
    logistic_diag_28by28 <- diag(c(0, rep(1, logistic_num_of_features - 1)))
    logistic_J_theta <- 0
    logistic_J_theta_array <- numeric(1)
    logistic_last_J_theta <- -1
    logistic_num_of_loop <- 0

    ## iterate Newton steps until J(theta) stops changing (small tolerance,
    ## safer than testing floating-point values for exact equality)
    while (abs(logistic_J_theta - logistic_last_J_theta) > 1e-10) {
        logistic_num_of_loop <- logistic_num_of_loop + 1
        cat("---------- loop ", logistic_num_of_loop, " begin\n")
        logistic_last_J_theta <- logistic_J_theta

        ## calculate gradient of J(theta); the intercept gets no penalty term
        logistic_gradient_of_J_theta[1] <-
            crossprod(logistic_h_theta - logistic_y[, 1], logistic_x_mapped[, 1]) /
            logistic_num_of_samples
        for (i in 2:logistic_num_of_features) {
            logistic_gradient_of_J_theta[i] <-
                (crossprod(logistic_h_theta - logistic_y[, 1], logistic_x_mapped[, i])
                 + logistic_theta[i, 1] * logistic_lamda) / logistic_num_of_samples
        }

        ## calculate the Hessian matrix (reset to zero each iteration, so
        ## previous iterations' sums do not accumulate)
        logistic_H <- matrix(0, nrow = logistic_num_of_features,
                             ncol = logistic_num_of_features)
        for (i in 1:logistic_num_of_samples) {
            logistic_H <- logistic_h_theta[i, 1] * (1 - logistic_h_theta[i, 1]) *
                tcrossprod(as.matrix(logistic_x_mapped[i, ])) + logistic_H
        }
        logistic_H <- (logistic_H + logistic_lamda * logistic_diag_28by28) /
            logistic_num_of_samples

        ## update theta: theta := theta - H^{-1} * gradient
        logistic_theta <- logistic_theta -
            solve(logistic_H) %*% as.matrix(logistic_gradient_of_J_theta)

        ## update h(theta) with the latest theta
        logistic_h_theta <- logistic_fun_sigmoid(logistic_x_mapped %*% logistic_theta)

        ## calculate J(theta); note the minus sign on the log-likelihood part,
        ## and the penalty excludes the intercept theta[1]
        logistic_J_theta <- (
            -sum(as.vector(log(logistic_h_theta)) * as.vector(logistic_y)
                 + as.vector(log(1 - logistic_h_theta)) * as.vector(1 - logistic_y))
            + sum(logistic_theta[-1]^2) * logistic_lamda / 2
        ) / logistic_num_of_samples

        logistic_J_theta_array[logistic_num_of_loop] <- logistic_J_theta
        cat("---------- loop ", logistic_num_of_loop, " end\n")
    }
    logistic_theta_list[[index_i]] <- logistic_theta
}

##
## plot graphic
##

## one_theta is a 28-by-1 matrix
logistic_plot_contour <- function(one_theta)
{
    u <- seq(-1, 1.5, by = .05)
    v <- seq(-1, 1.5, by = .05)
    z <- matrix(0, nrow = length(v), ncol = length(u))

    for (i in 1:length(u)) {
        for (j in 1:length(v)) {
            z[j, i] <- logistic_fun_map_feature(u[i], v[j], 6) %*% one_theta
        }
    }

    ## the decision boundary is the level set theta^T x = 0
    contour(u, v, t(z), levels = 0, col = "green", add = TRUE,
            method = "simple", drawlabels = FALSE, lty = "solid")
}

op <- par(mfrow = c(1, 3),   # 1 x 3 pictures on one plot
          pty = "s")
for (index_i in 1:length(logistic_theta_list)) {
    plot(x = c(-1, 1.5), y = c(-1, 1.5),
         xlab = "u", ylab = "v", type = "n")   # set up the coordinate system
    points(logistic_x[, 1], logistic_x[, 2],
           col = as.integer(logistic_y) + 1, pch = as.integer(logistic_y) + 1)
    legend("topright", legend = c("y=1", "y=0", "Decision boundary"),
           col = c(2, 1, "green"), pch = c(2, 1, NA), lty = c(NA, NA, 1))

    logistic_theta <- logistic_theta_list[[index_i]]
    logistic_plot_contour(logistic_theta)
}
par(op)
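As a sanity check on purely synthetic data (entirely illustrative, not the exercise's x.dat/y.dat), the unregularized Newton fit can be compared against base R's glm, whose iteratively reweighted least squares is Newton's method for the logistic model:

```r
## Synthetic data: 200 points whose labels follow a known logistic model.
set.seed(1)
m <- 200
x1 <- rnorm(m); x2 <- rnorm(m)
p <- 1 / (1 + exp(-(0.5 + 1.5 * x1 - 2 * x2)))
y <- rbinom(m, 1, p)
X <- cbind(1, x1, x2)                   # design matrix with intercept column

## Newton's method with lambda = 0, vectorized, 20 iterations from zero.
theta <- c(0, 0, 0)
for (k in 1:20) {
  h <- as.vector(1 / (1 + exp(-(X %*% theta))))
  grad <- t(X) %*% (h - y) / m
  H <- t(X) %*% (X * (h * (1 - h))) / m
  theta <- as.vector(theta - solve(H, grad))
}

fit <- glm(y ~ x1 + x2, family = binomial)
max(abs(theta - coef(fit)))             # should be near zero
```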