Regularization
The Problem of Overfitting
What overfitting is, and how regularization techniques can reduce or mitigate it.
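Example: Linear regression (housing prices)
Fit a linear regression model to a training set of five examples in three different ways, as shown in the figures below.
Fitting a straight line to the training data (Figure 1) leads to underfitting, also called high bias: the model does not fit the training data well.
Fitting a quadratic curve (Figure 2) meets all the requirements. This case is called "just right".
Fitting a fourth-order polynomial (Figure 3) matches the training data very well, but the curve is clearly unrealistic. This case is called overfitting, or high variance.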
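Overfitting: If we have too many features, the learned hypothesis may fit the training set very well (\(\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 \approx 0\)), but fail to generalize to new examples (predict prices on new examples).
In other words, with too many features the learned hypothesis can almost always fit the training data closely (the cost function is very close to \(0\)), but the curve works so hard to fit the training data that it cannot generalize to new examples. Here "generalize" refers to how well a hypothesis applies to new examples.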
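Overfitting in logistic regression
Fit a logistic regression model to the following training set in three different ways, as shown in the figures below.
Figure 1: underfitting.
Figure 2: just right.
Figure 3: overfitting.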
Addressing overfitting
- Reduce number of features.
—— Manually select which features to keep.
—— Model selection algorithm (later in course): automatically chooses which features to keep and which to discard.
- Regularization.
—— Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\).
—— Works well when we have a lot of features, each of which contributes a bit to predicting \(y\).
Cost Function (regularization through the cost function)
Intuition
Suppose we fit a fourth-order polynomial. By adding penalty terms on \(\theta_3\) and \(\theta_4\) to the cost function, we can keep \(\theta_3\) and \(\theta_4\) small.
Suppose we penalize and make \(\theta_3, \theta_4\) really small.
\(\rightarrow\) \(\min_\theta \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\theta_3^2 + 1000\theta_4^2\)
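The penalized objective above can be sketched in a few lines of Python. This is only an illustration: the function name `penalized_cost`, the toy data, and the two parameter vectors are assumptions made up for this sketch, not part of the course material.

```python
def penalized_cost(theta, xs, ys, penalty=1000.0):
    """Squared-error cost of a quartic hypothesis plus a penalty on theta_3, theta_4."""
    m = len(xs)

    def h(x):
        # h(x) = theta_0 + theta_1 x + theta_2 x^2 + theta_3 x^3 + theta_4 x^4
        return sum(t * x ** k for k, t in enumerate(theta))

    fit = sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return fit + penalty * theta[3] ** 2 + penalty * theta[4] ** 2


# Hypothetical toy data (not from the notes).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 2.5, 2.8, 3.0]

wiggly = [1.0, 0.5, 0.1, 0.3, -0.2]  # uses the cubic and quartic terms
smooth = [1.0, 0.5, 0.1, 0.0, 0.0]   # effectively a quadratic

# Part of the cost that comes only from the penalty on theta_3 and theta_4.
penalty_only = penalized_cost(wiggly, xs, ys) - penalized_cost(wiggly, xs, ys, penalty=0.0)
```

For `wiggly`, the penalty alone adds \(1000 \cdot 0.3^2 + 1000 \cdot 0.2^2 = 130\) to the cost, while `smooth` pays no penalty at all, so a minimizer is pushed toward hypotheses that are close to quadratic.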
Regularization
Small values for parameters \(\theta_0, \theta_1, \ldots, \theta_n\) give us:
—— "Simpler" hypothesis
—— Less prone to overfitting (the smaller the \(\theta\) values, the smoother the corresponding curve)
Housing:
—— Features: \(x_1, x_2, \ldots, x_{100}\)
—— Parameters: \(\theta_0, \theta_1, \theta_2, \ldots, \theta_{100}\)
In regularized linear regression, we choose \(\theta\) to minimize
\(J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n \theta_j^2\right]\).
The last term is called the regularization term, and \(\lambda\) is called the regularization parameter.
What if \(\lambda\) is set to an extremely large value (perhaps too large for our problem, say \(\lambda = 10^{10}\))?
If \(\lambda\) is too large, \(\theta_1, \ldots, \theta_n\) will all be close to \(0\), so the hypothesis is close to a horizontal line and underfits the data.
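A minimal sketch of the regularized cost \(J(\theta)\) and of the large-\(\lambda\) effect follows. The name `regularized_cost`, the variable `lam`, and the three-point data set are assumptions for illustration. Each row of `X` starts with \(x_0 = 1\), and \(\theta_0\) is left out of the penalty.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    predictions = [sum(t * xj for t, xj in zip(theta, row)) for row in X]
    squared_error = sum((p - yi) ** 2 for p, yi in zip(predictions, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (squared_error + penalty) / (2 * m)


# Hypothetical data lying exactly on the line y = x; first column is x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]

line = [0.0, 1.0]  # perfect fit to the training data
flat = [2.0, 0.0]  # horizontal line at the mean of y

cost_line_small = regularized_cost(line, X, y, 0.0)
cost_flat_small = regularized_cost(flat, X, y, 0.0)
cost_line_huge = regularized_cost(line, X, y, 1e10)
cost_flat_huge = regularized_cost(flat, X, y, 1e10)
```

With \(\lambda = 0\) the perfect fit `line` has the lower cost. With \(\lambda = 10^{10}\) the penalty on \(\theta_1\) dominates, and the horizontal line `flat` becomes cheaper even though it underfits.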
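Regularized Linear Regression
Gradient descent
The original algorithm changes only slightly: the update for \(\theta_0\) is written separately, and a regularization term is added to the updates for the other parameters:
Repeat {
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\frac{1}{m}\left[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \lambda\theta_j\right]\) \((j = 1, 2, 3, \ldots, n)\)
}
\(\theta_0\) is handled separately because, in regularized linear regression, the penalty does not include \(\theta_0\).
The second update above can also be written as: \(\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)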
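The two ways of writing the update can be checked against each other numerically. The sketch below assumes a linear hypothesis \(h_\theta(x) = \theta^Tx\). The function names, the toy data, and the starting \(\theta\) are hypothetical.

```python
def data_gradient(theta, X, y):
    """(1/m) * sum_i (h(x_i) - y_i) * x_ij for every j, with h(x) = theta^T x."""
    m = len(y)
    errors = [sum(t * xj for t, xj in zip(theta, row)) - yi for row, yi in zip(X, y)]
    return [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(len(theta))]


def step_bracket_form(theta, X, y, alpha, lam):
    """theta_j := theta_j - alpha * (gradient_j + lam/m * theta_j) for j >= 1."""
    m = len(y)
    g = data_gradient(theta, X, y)
    return [t - alpha * (gj + (lam / m * t if j >= 1 else 0.0))
            for j, (t, gj) in enumerate(zip(theta, g))]


def step_shrink_form(theta, X, y, alpha, lam):
    """theta_j := theta_j * (1 - alpha*lam/m) - alpha * gradient_j for j >= 1."""
    m = len(y)
    g = data_gradient(theta, X, y)
    return [t * (1.0 - alpha * lam / m if j >= 1 else 1.0) - alpha * gj
            for j, (t, gj) in enumerate(zip(theta, g))]


# Hypothetical data; the first column is x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.5, -0.3]

a = step_bracket_form(theta, X, y, alpha=0.1, lam=1.0)
b = step_shrink_form(theta, X, y, alpha=0.1, lam=1.0)
same = all(abs(p - q) < 1e-12 for p, q in zip(a, b))
```

The shrink form makes the effect of regularization visible: on every iteration \(\theta_j\) is first multiplied by \(1 - \alpha\frac{\lambda}{m}\), and then the usual unregularized gradient step is applied.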
Normal equation
\(X = \left[\begin{matrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{matrix}\right]\) \(y = \left[\begin{matrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{matrix}\right]\) \(\rightarrow\) \(\min_\theta J(\theta)\)
\(\rightarrow\) \(\theta = (X^TX + \lambda\,\mathrm{diag}(0,1,1,\ldots,1)_{(n+1)})^{-1}X^Ty\)
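Non-invertibility (optional/advanced)
Suppose \(m \leq n\). Then \(X^TX\) is non-invertible (singular), so \(\theta = (X^TX)^{-1}X^Ty\) cannot be computed directly.
If \(\lambda > 0\), the regularized matrix is invertible: \(\theta = (X^TX + \lambda\,\mathrm{diag}(0,1,1,\ldots,1)_{(n+1)})^{-1}X^Ty\)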
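The regularized normal equation and the non-invertible case can be sketched with the standard library alone. The helper `solve` is a plain Gaussian elimination written for this illustration instead of an explicit matrix inverse. All names and data below are assumptions, not part of the course code.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (non-invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def normal_equation(X, y, lam):
    """theta = (X^T X + lam * diag(0, 1, ..., 1))^{-1} X^T y."""
    n1 = len(X[0])  # n + 1 columns, including x_0 = 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(n1)] for i in range(n1)]
    for j in range(1, n1):
        A[j][j] += lam  # the entry for theta_0 is left unregularized
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n1)]
    return solve(A, b)


# Hypothetical case with m = 2 examples and n = 2 features (m <= n).
X_small = [[1.0, 2.0, 3.0], [1.0, 4.0, 5.0]]
y_small = [1.0, 2.0]

try:
    normal_equation(X_small, y_small, lam=0.0)
    singular_without_reg = False
except ValueError:
    singular_without_reg = True

theta_reg = normal_equation(X_small, y_small, lam=1.0)  # solvable once lam > 0
```

In this example the unregularized system cannot be solved because \(X^TX\) is singular, while the same data with \(\lambda = 1\) yields a unique \(\theta\).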
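Regularized Logistic Regression
How to adapt gradient descent and the advanced optimization algorithms so that they work for regularized logistic regression.
Cost function:
\(J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2\), where the penalty covers \(\theta_1, \theta_2, \ldots, \theta_n\).
The implementation is as follows, where \(h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}\).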
Repeat {
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\frac{1}{m}\left[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \lambda\theta_j\right]\) \((j = 1, 2, 3, \ldots, n)\)
}
Advanced optimization
A user-defined function (pseudocode):
function [jVal, gradient] = costFunction(theta)
jVal = [code to compute \(J(\theta)\)]
gradient(1) = [code to compute \(\frac{\partial}{\partial\theta_0}J(\theta)\)]
gradient(2) = [code to compute \(\frac{\partial}{\partial\theta_1}J(\theta)\)]
gradient(3) = [code to compute \(\frac{\partial}{\partial\theta_2}J(\theta)\)]
\(\vdots\)
gradient(n+1) = [code to compute \(\frac{\partial}{\partial\theta_n}J(\theta)\)]
where:
- code to compute \(J(\theta)\): \(J(\theta) = -\left[\frac{1}{m}\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2\)
- code to compute \(\frac{\partial}{\partial\theta_0}J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
- code to compute \(\frac{\partial}{\partial\theta_1}J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_1^{(i)} + \frac{\lambda}{m}\theta_1\)
- code to compute \(\frac{\partial}{\partial\theta_2}J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_2^{(i)} + \frac{\lambda}{m}\theta_2\)
All that remains is to pass this user-defined function to fminunc.
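As a rough Python counterpart to the pseudocode above, the sketch below returns the cost and the gradient together, the way an optimizer such as fminunc expects. The helper `minimize` is only a fixed-step gradient descent used here as a stand-in for fminunc, not an equivalent of it. The function names, the data, the step size, and the iteration count are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_function_reg(theta, X, y, lam):
    """Return (jVal, gradient) for regularized logistic regression."""
    m = len(y)
    h = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) for row in X]
    j_val = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                 for hi, yi in zip(h, y)) / m
    j_val += lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    gradient = []
    for j, t in enumerate(theta):
        g = sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
        if j >= 1:  # theta_0 is not regularized
            g += lam / m * t
        gradient.append(g)
    return j_val, gradient


def minimize(cost_fn, theta, alpha=0.5, iterations=2000):
    """Stand-in for fminunc: plain gradient descent with a fixed step size."""
    for _ in range(iterations):
        _, grad = cost_fn(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical, non-separable toy data; the first column is x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 1, 0, 1, 0, 1]

theta_plain = minimize(lambda t: cost_function_reg(t, X, y, 0.0), [0.0, 0.0])
theta_reg = minimize(lambda t: cost_function_reg(t, X, y, 1.0), [0.0, 0.0])
```

On this toy data the run with \(\lambda = 1\) ends with a smaller \(|\theta_1|\) than the run with \(\lambda = 0\), which is the shrinking effect the regularization term is meant to have.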
Review
Quiz
- You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.
- [x] Adding a new feature to the model always results in equal or better performance on the training set. A new feature makes the model more expressive, so it fits the training set at least as well.
- [ ] Adding a new feature to the model always results in equal or better performance on examples not in the training set. If the model overfits, it will not perform better on other examples.
- [ ] Introducing regularization to the model always results in equal or better performance on the training set. If \(\lambda\) is too large, the model underfits, which hurts performance on both the training set and new examples.
- [ ] Introducing regularization to the model always results in equal or better performance on examples not in the training set.
- [ ] Adding many new features to the model helps prevent overfitting on the training set. More features let the model fit the data more closely, which tends to cause overfitting.
- Suppose you ran logistic regression twice, once with \(\lambda = 0\), and once with \(\lambda = 1\). One of the times, you got parameters \(\theta = \left[\begin{matrix} 74.81 \\ 45.05 \end{matrix}\right]\), and the other time you got \(\theta = \left[\begin{matrix} 1.37 \\ 0.51 \end{matrix}\right]\). However, you forgot which value of \(\lambda\) corresponds to which value of \(\theta\). Which one do you think corresponds to \(\lambda = 1\)?
- [x] \(\theta = \left[\begin{matrix} 1.37 \\ 0.51 \end{matrix}\right]\).
- [ ] \(\theta = \left[\begin{matrix} 74.81 \\ 45.05 \end{matrix}\right]\).
- Which of the following statements about regularization are true? Check all that apply.
- [ ] Because logistic regression outputs values \(0 \leq h_\theta(x) \leq 1\), its range of output values can only be "shrunk" slightly by regularization anyway, so regularization is generally not helpful for it. Regularization addresses overfitting.
- [ ] Using a very large value of \(\lambda\) cannot hurt the performance of your hypothesis; the only reason we do not set \(\lambda\) to be too large is to avoid numerical problems. If \(\lambda\) is too large, \(\theta_1, \ldots, \theta_n\) will all be close to \(0\), so the hypothesis is close to a horizontal line and underfits the data.
- [ ] Using too large a value of \(\lambda\) can cause your hypothesis to overfit the data; this can be avoided by reducing \(\lambda\). It causes underfitting, not overfitting.
- [x] Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when \(\lambda = 0\)). With a poorly chosen \(\lambda\), the result on the training set can be worse than without the regularization term.
- [ ] Because regularization causes \(J(\theta)\) to no longer be convex, gradient descent may not always converge to the global minimum (when \(\lambda > 0\), and when using an appropriate learning rate \(\alpha\)). Regularized logistic regression and regularized linear regression are both convex, so gradient descent still converges to the global minimum.
Programming
- plotData.m
% Find Indices of Positive and Negative Examples
pos = find(y == 1);
neg = find(y == 0);
% Plot Examples
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);
- sigmoid.m
g = 1 ./ (1 + exp(-z));
- costFunction.m
J = 1 / m * (-y' * log(sigmoid(X * theta)) - (1 - y)' * log(1 - sigmoid(X * theta)));
grad = 1 / m * X' * (sigmoid(X * theta) - y);
- predict.m
p = sigmoid(X * theta) >= 0.5;
- costFunctionReg.m
% Regularized cost: theta(1), i.e. theta_0, is excluded from the penalty
J = 1 / m * (-y' * log(sigmoid(X * theta)) - (1 - y)' * log(1 - sigmoid(X * theta))) + lambda / (2 * m) * theta(2:end)' * theta(2:end);
% Gradient with the penalty added to every component, then removed again for theta(1)
grad = 1 / m * X' * (sigmoid(X * theta) - y) + lambda / m * theta;
grad(1) = grad(1) - lambda / m * theta(1);