h(x)
\[
\begin{align*} h_\theta(x) = \begin{bmatrix} \theta_0 \hspace{2em} \theta_1 \hspace{2em} \cdots \hspace{2em} \theta_n \end{bmatrix} \begin{bmatrix} x_0 \newline x_1 \newline \vdots \newline x_n \end{bmatrix} = \theta^T x \end{align*}, \; x_0^{(i)} = 1
\]
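A minimal NumPy sketch of this hypothesis (the function and variable names here are mine, not from the course):

```python
import numpy as np

# X is the (m, n+1) design matrix with a leading column of ones (x_0 = 1);
# theta is the (n+1,) parameter vector.
def h(theta, X):
    """Vectorized hypothesis: h_theta(x) = theta^T x for all m examples at once."""
    return X @ theta

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])   # m = 3 examples: bias column plus one feature
theta = np.array([0.5, 1.5])
print(h(theta, X))           # [3.5 5.  6.5]
```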
Gradient descent equation
\[
\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0...n} \newline \rbrace \end{align*}
\]
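A vectorized sketch of the update above (alpha and the iteration count are placeholder choices, not values from the notes):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, num_iters=1500):
    """Batch gradient descent; all theta_j are updated simultaneously."""
    m = len(y)
    for _ in range(num_iters):
        error = X @ theta - y                        # h_theta(x^(i)) - y^(i), shape (m,)
        theta = theta - (alpha / m) * (X.T @ error)
    return theta
```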
When the values of different features differ too much in scale (\(>10^5\)), feature scaling is needed:
\[
x_i := \frac{x_i - \mu_i}{s_i}
\]
Where \(\mu_i\) is the average of all the values for feature \(i\), and \(s_i\) is the range of values (max - min) or, alternatively, the standard deviation.
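A sketch of this scaling, using the standard deviation as \(s_i\) (feature_scale is a hypothetical helper, not course code):

```python
import numpy as np

def feature_scale(X):
    """Mean-normalize each column: x_i := (x_i - mu_i) / s_i.

    X should exclude the bias column x_0 = 1, whose spread is zero.
    mu and s are returned so new inputs can be scaled the same way.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s
```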
Learning Rate
In the automatic convergence test, declare convergence if \(J(\theta)\) decreases by less than \(10^{-3}\) in one iteration.
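Folding that test into the descent loop might look like this, assuming the usual squared-error cost for \(J(\theta)\):

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

def descend_until_converged(X, y, theta, alpha=0.01, tol=1e-3):
    """Take gradient steps until J(theta) drops by less than tol in one iteration."""
    m = len(y)
    prev = cost(X, y, theta)
    while True:
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        cur = cost(X, y, theta)
        if prev - cur < tol:       # the automatic convergence test
            return theta
        prev = cur
```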
Features and Polynomial Regression
Different features can be combined into new ones to fit the data better. Because such combinations change the scale of the values, feature scaling becomes even more important to speed up convergence and improve accuracy.
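For example, a single feature can be expanded into polynomial terms; note how quickly the column ranges diverge, which is why scaling matters here (a sketch, not course code):

```python
import numpy as np

def poly_features(x, degree=3):
    """Stack x, x^2, ..., x^degree as columns built from one feature."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
print(poly_features(x))
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```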
Normal Equation
There is no need to do feature scaling with the normal equation:
\[
\theta = (X^TX)^{-1}X^Ty
\]
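A one-function sketch in NumPy, using a linear solve rather than forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y directly: no alpha, no iterations."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

np.linalg.solve raises LinAlgError when \(X^TX\) is singular, which connects to the noninvertibility discussion below.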
Comparison

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose \(\alpha\) | No need to choose \(\alpha\) |
| Needs many iterations | Don't need to iterate |
| \(O(kn^2)\) | \(O(n^3)\), need to compute \((X^TX)^{-1}\) |
| Works well even when n is large (\(>10^4\)) | Slow if n is very large |

If \(X^TX\) is noninvertible, the common causes might be:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).
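Besides removing features or regularizing, a pseudoinverse still yields a usable \(\theta\) when \(X^TX\) is singular; a sketch with a deliberately redundant column:

```python
import numpy as np

# The last two columns are identical, so X^T X is singular and
# np.linalg.solve / np.linalg.inv would fail on it.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([3.0, 4.0, 5.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # pseudoinverse still returns a solution
print(theta)
```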