When we use a nonlinear hypothesis for logistic regression — a polynomial of degree k in n input variables — the number of features, and hence the cost of the algorithm, grows as \( O\left( n^{k} \right) \).
When n and k are large, logistic regression therefore becomes very inefficient. Neural networks were introduced precisely to handle this kind of nonlinear classification problem.
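As a rough illustration: with n = 100 raw features, the quadratic terms alone already number about \( n^2/2 = 5000 \), and the cubic terms roughly \( n^3/6 \approx 170{,}000 \), so fitting high-degree polynomial features quickly becomes impractical.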
Neural networks are inspired by the neurons of the human brain: the same kind of neural tissue, wired into different regions of the brain, can learn different abilities such as hearing, touch, vision, and so on.
- A neural network is much like running logistic regression several times in a row, where the output of one logistic regression becomes the input of the next.
- A neural network has three kinds of layers: the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer.
- There can be more than one hidden layer; the number is chosen as needed. More hidden layers give a more expressive fit, but at the cost of more computation and lower efficiency.
- The i-th node of layer j is \(a_{i}^{(j)}\); the parameter matrix \(\Theta^{(j)}\) of layer j holds the weights of the mapping from layer j to layer j+1.
- If layer j has \(s_j\) units and layer j+1 has \(s_{j+1}\) units, then the weight matrix \(\Theta^{(j)}\) has dimension \(s_{j+1} \times (s_{j} + 1)\).
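As a quick check of this dimension rule, take the network used later in this post (2 input units, 3 hidden units, 3 output units): \(\Theta^{(1)}\) maps layer 1 to layer 2 and is \(3 \times (2 + 1) = 3 \times 3\), while \(\Theta^{(2)}\) maps layer 2 to layer 3 and is \(3 \times (3 + 1) = 3 \times 4\); the extra column in each case multiplies the bias unit \(a_0\).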
Neural networks for logical operations
For multi-class classification, the number of nodes in the output layer equals the number of classes.
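For example, when classifying handwritten digits 0–9 there are 10 classes, so the output layer has 10 nodes and each label \(y\) is represented as a one-hot vector such as \((0, 0, 1, 0, \dots, 0)^T\).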
2. How to understand a neural network
\( J(\Theta) = - \frac{1}{m} \sum_{t=1}^m \sum_{k=1}^K \left[ y^{(t)}_k \log (h_\Theta (x^{(t)}))_k + (1 - y^{(t)}_k) \log (1 - h_\Theta(x^{(t)})_k) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2 \)
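To make the cost formula concrete, here is a minimal NumPy sketch of it (the function name `nn_cost` and the choice to pass the network outputs `h` and one-hot labels `Y` directly are my own; it assumes `h` and `Y` are K×m arrays and `Thetas` is a list of the weight matrices):

```python
import numpy as np

def nn_cost(h, Y, Thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    h      : K x m array of network outputs h_Theta(x) for m examples
    Y      : K x m array of one-hot labels
    Thetas : list of weight matrices Theta^(1), ..., Theta^(L-1)
    lam    : regularization strength lambda
    """
    m = Y.shape[1]
    # unregularized part: sum over examples t and classes k
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # regularization: squared weights, excluding the bias column (column 0)
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return cost + lam / (2 * m) * reg
```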
Our goal is to find the \(\Theta\) that minimizes the cost function \( J(\Theta) \).
Since the cost function depends on every \(\theta\), the situation is very similar to logistic regression, so we can apply the same gradient descent method used there:
1. Initialize \(\Theta\).
2. Compute the partial derivative of the cost function \(J(\Theta)\) with respect to each \(\Theta\): \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \).
3. Subtract \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \) from \(\Theta_{ij}^{(l)}\).
4. Repeat steps 2–3 until \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \) is approximately 0.
Suppose we have the following neural network: 2 nodes in the input layer, a single hidden layer with 3 nodes, and 3 nodes in the output layer.
In a neural network, computing \(h_{\Theta}(X)\) is called the forward propagation algorithm, and computing \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \) is called the backpropagation algorithm.
By the chain rule, \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = \frac{\partial J(\Theta)}{\partial h_{\Theta}(X)} \frac{\partial h_{\Theta}(X)}{\partial \Theta_{ij}^{(l)}} \).
For now, ignore the regularization term and consider a single training example. Then
\( \frac{\partial J(\Theta)}{\partial h_{\Theta}(X)} = \frac{\partial \left( - \sum_{k=1}^K \left[ y_k \log (h_\Theta (x))_k + (1 - y_k) \log (1 - h_\Theta(x)_k) \right] \right)}{\partial h_{\Theta}(X)} = -\sum_{k=1}^K \frac{\partial \left[y_k \log (h_\Theta (x)_k) + (1 - y_k) \log (1 - h_\Theta(x)_k)\right]}{\partial h_\Theta (x)_k} = \sum_{k=1}^K \left[\frac{ -y_k }{h_\Theta (x)_k} + \frac{1 - y_k} {1 - h_\Theta(x)_k}\right] = \sum_{k=1}^K \frac{h_\Theta (x)_k - y_k}{h_\Theta (x)_k (1 - h_\Theta (x)_k)} \)
Next, to compute \( \frac{\partial h_{\Theta}(X)}{\partial \Theta_{ij}^{(l)}} \), we must first work out the relationship between \( h_{\Theta}(X) \) and \(\Theta_{ij}^{(l)}\).
First run forward propagation on this network (with a bias unit \(a_0^{(l)} = 1\) prepended to each layer's activations); the computation proceeds as follows:
\( a^{(1)} = X \)
\( z^{(2)} = \Theta^{(1)} a^{(1)} \)
- \( z_{1}^{(2)} = \Theta_{10}^{(1)} a_{0}^{(1)} + \Theta_{11}^{(1)} a_{1}^{(1)} + \Theta_{12}^{(1)} a_{2}^{(1)} \)
- \( z_{2}^{(2)} = \Theta_{20}^{(1)} a_{0}^{(1)} + \Theta_{21}^{(1)} a_{1}^{(1)} + \Theta_{22}^{(1)} a_{2}^{(1)} \)
- \( z_{3}^{(2)} = \Theta_{30}^{(1)} a_{0}^{(1)} + \Theta_{31}^{(1)} a_{1}^{(1)} + \Theta_{32}^{(1)} a_{2}^{(1)} \)
\( a^{(2)} = g(z^{(2)}) \)
\( z^{(3)} = \Theta^{(2)} a^{(2)} \)
- \( z_{1}^{(3)} = \Theta_{10}^{(2)} a_{0}^{(2)} + \Theta_{11}^{(2)} a_{1}^{(2)} + \Theta_{12}^{(2)} a_{2}^{(2)} + \Theta_{13}^{(2)} a_{3}^{(2)} \)
- \( z_{2}^{(3)} = \Theta_{20}^{(2)} a_{0}^{(2)} + \Theta_{21}^{(2)} a_{1}^{(2)} + \Theta_{22}^{(2)} a_{2}^{(2)} + \Theta_{23}^{(2)} a_{3}^{(2)} \)
- \( z_{3}^{(3)} = \Theta_{30}^{(2)} a_{0}^{(2)} + \Theta_{31}^{(2)} a_{1}^{(2)} + \Theta_{32}^{(2)} a_{2}^{(2)} + \Theta_{33}^{(2)} a_{3}^{(2)} \)
\( h_{\Theta}(X) = a^{(3)} = g(z^{(3)}) \)
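A minimal NumPy sketch of these forward-propagation equations for the 2-3-3 network above (the helper names `sigmoid` and `forward` are my own, and the random weights are only there to make the example runnable):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation for the 2-3-3 network in the text.

    x      : input vector with 2 features
    Theta1 : 3 x 3 weight matrix (layer 1 -> layer 2, includes bias column)
    Theta2 : 3 x 4 weight matrix (layer 2 -> layer 3, includes bias column)
    """
    a1 = np.concatenate(([1.0], x))             # a^(1) = X with bias unit a0 = 1
    z2 = Theta1 @ a1                            # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # a^(2) = g(z^(2)) plus bias unit
    z3 = Theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    a3 = sigmoid(z3)                            # h_Theta(X) = a^(3) = g(z^(3))
    return a1, z2, a2, z3, a3

# example usage with random weights
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 3))
Theta2 = rng.normal(size=(3, 4))
_, _, _, _, h = forward(np.array([0.5, -1.2]), Theta1, Theta2)
print(h)  # three outputs, each in (0, 1)
```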
Forward propagation shows that the derivative formula differs for different values of l, so we start with l = 2. Note that for the sigmoid activation \(g(z) = \frac{1}{1 + e^{-z}}\) the derivative is \(g'(z) = g(z)(1 - g(z))\), which is used repeatedly below.
\( \sum_{k = 1}^{K} \frac{\partial h_{\Theta}(X)_k}{\partial \Theta_{ij}^{(2)}} = \sum_{k = 1}^{K} \left[\frac{\partial h_{\Theta}(X)_k}{\partial z_k^{(3)}} \frac{\partial z_k^{(3)}}{\partial \Theta_{kj}^{(2)}}\right] = \sum_{k = 1}^{K} \frac{\partial g(z_k^{(3)})}{\partial z_k^{(3)}} \frac{\partial [\Theta_{k0}^{(2)} a_{0}^{(2)} + \Theta_{k1}^{(2)} a_{1}^{(2)} + \Theta_{k2}^{(2)} a_{2}^{(2)} + \Theta_{k3}^{(2)} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} = (1 - g(z_i^{(3)}))g(z_i^{(3)})a_j^{(2)} \)
When \(k \neq i\), \( \frac{\partial [\Theta_{k0}^{(2)} a_{0}^{(2)} + \Theta_{k1}^{(2)} a_{1}^{(2)} + \Theta_{k2}^{(2)} a_{2}^{(2)} + \Theta_{k3}^{(2)} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} = 0 \), so only the \(k = i\) term survives.
Combining the pieces, \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(2)}} = \sum_{k = 1}^{K} \left[\frac{\partial J(\Theta)}{\partial h_{\Theta}(X)_k} \frac{\partial h_{\Theta}(X)_k}{\partial z_k^{(3)}} \frac{\partial z_k^{(3)}}{\partial \Theta_{kj}^{(2)}}\right] = \sum_{k=1}^K \left[\frac{h_\Theta (x)_k - y_k}{h_\Theta (x)_k (1 - h_\Theta (x)_k)} \frac{\partial g(z_k^{(3)})}{\partial z_k^{(3)}} \frac{\partial [\Theta_{k0}^{(2)} a_{0}^{(2)} + \Theta_{k1}^{(2)} a_{1}^{(2)} + \Theta_{k2}^{(2)} a_{2}^{(2)} + \Theta_{k3}^{(2)} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}}\right] = \sum_{k=1}^K \left[\frac{g(z_k^{(3)}) - y_k}{g(z_k^{(3)}) (1 - g(z_k^{(3)}))}(1 - g(z_k^{(3)}))g(z_k^{(3)}) \frac{\partial [\Theta_{k0}^{(2)} a_{0}^{(2)} + \Theta_{k1}^{(2)} a_{1}^{(2)} + \Theta_{k2}^{(2)} a_{2}^{(2)} + \Theta_{k3}^{(2)} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} \right] = (g(z_i^{(3)}) - y_i)a_j^{(2)} \)
Let \( \Delta^{(l)} = \frac{\partial J(\Theta)}{\partial \Theta^{(l)}} \).
Then \( \Delta_{ij}^{(2)} = \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(2)}} = (g(z_i^{(3)}) - y_i)a_j^{(2)} = (a_i^{(3)} - y_i)a_j^{(2)} \).
The entry of \(\Delta^{(2)}\) in row i, column j, namely \(\Delta_{ij}^{(2)}\), is the product of the i-th element of \(a^{(3)} - y\) and the j-th element of \(a^{(2)}\), so \(\Delta^{(2)} = (a^{(3)} - y) * (a^{(2)})^T\) (where * denotes matrix multiplication).
Next we compute \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(1)}} \).
To avoid repeated computation, define \(\delta^{(3)} = a^{(3)} - y\).
\( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(1)}} = \frac{\partial J(\Theta)}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta_{ij}^{(1)}} = \sum_{k=1}^{K}\left[\frac{\partial J(\Theta)}{\partial a_k^{(3)}}\frac{\partial a_k^{(3)}}{\partial z_k^{(3)}}\frac{\partial z_k^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta_{ij}^{(1)}}\right] = \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\frac{ \partial [\Theta_{k0}^{(2)} a_{0}^{(2)} + \Theta_{k1}^{(2)} a_{1}^{(2)} + \Theta_{k2}^{(2)} a_{2}^{(2)} + \Theta_{k3}^{(2)} a_{3}^{(2)}]}{\partial a_{i}^{(2)}} \frac{\partial a_{i}^{(2)}}{\partial z_{i}^{(2)}} \frac{\partial z_{i}^{(2)}}{\partial \Theta_{ij}^{(1)}}\right] = \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\Theta_{ki}^{(2)} g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)}\right] = g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)} \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\Theta_{ki}^{(2)}\right] = g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)} \left[((\Theta^{(2)})^T)_{i} * \delta^{(3)}\right] = \left[((\Theta^{(2)})^T)_{i} * \delta^{(3)}\right]g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)} \)
\( \Delta_{ij}^{(1)} = (((\Theta^{(2)})^T)_{i} * \delta^{(3)})g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)} = (((\Theta^{(2)})^T)_{i} * \delta^{(3)})a_i^{(2)}(1 - a_i^{(2)})a_j^{(1)} \)
The entry of \(\Delta^{(1)}\) in row i, column j, namely \(\Delta_{ij}^{(1)}\), is the product of the i-th element of \( \left[(\Theta^{(2)})^T * \delta^{(3)}\right] a^{(2)}(1 - a^{(2)}) \) and the j-th element of \(a^{(1)}\).
To avoid repeated computation, define \(\delta^{(2)} = ((\Theta^{(2)})^T * \delta^{(3)})\,a^{(2)}(1 - a^{(2)})\) (the multiplication by \(a^{(2)}(1 - a^{(2)})\) is element-wise).
\( \Delta^{(1)} = \delta^{(2)} * (a^{(1)})^T \)
If there were also a \(\Delta^{(0)}\):
Looking back at the derivation of \(\Delta^{(1)}\), notice that the factor \(\frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta_{ij}^{(1)}}\) does not depend on the value of k.
\( \Delta_{ij}^{(0)} = \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(0)}} = \sum_{k=1}^{K}\left[\frac{\partial J(\Theta)}{\partial a_k^{(3)}}\frac{\partial a_k^{(3)}}{\partial z_k^{(3)}}\frac{\partial z_k^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial \Theta_{ij}^{(0)}}\right] = \delta^{(2)} \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial \Theta_{ij}^{(0)}} = (((\Theta^{(1)})^T)_{i} * \delta^{(2)})a_i^{(1)}(1 - a_i^{(1)})a_j^{(0)} \)
..... (The derivation is similar to that of \(\Delta^{(1)}\) and is omitted here.)
\( \Delta^{(0)} = \delta^{(1)} * (a^{(0)})^T \)
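Continuing the forward-propagation sketch above, here is a minimal backward pass for the same 2-3-3 network that computes \(\delta^{(3)}\), \(\delta^{(2)}\), \(\Delta^{(2)}\) and \(\Delta^{(1)}\) for a single example (the helper name `backward` is my own; it reuses `forward` and `numpy` from the earlier sketch):

```python
def backward(x, y, Theta1, Theta2):
    """Backpropagation for one example (x, y) on the 2-3-3 network.

    Returns (Delta1, Delta2), the gradients of the unregularized cost
    with respect to Theta1 and Theta2.
    """
    a1, z2, a2, z3, a3 = forward(x, Theta1, Theta2)

    delta3 = a3 - y                           # delta^(3) = a^(3) - y
    # delta^(2) = (Theta^(2)^T delta^(3)) .* a^(2) .* (1 - a^(2)),
    # then drop the entry that corresponds to the bias unit
    delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)
    delta2 = delta2[1:]

    Delta2 = np.outer(delta3, a2)             # Delta^(2) = delta^(3) (a^(2))^T
    Delta1 = np.outer(delta2, a1)             # Delta^(1) = delta^(2) (a^(1))^T
    return Delta1, Delta2
```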
We can also use gradient checking to verify that our computation of \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\) is correct.
Gradient checking simply computes the derivative from its definition:
\( \theta^{(i+)} = \theta + \begin{bmatrix} 0\\ 0\\ \vdots\\ \epsilon \\ \vdots\\ 0 \end{bmatrix} \), \( \theta^{(i-)} = \theta - \begin{bmatrix} 0\\ 0\\ \vdots\\ \epsilon \\ \vdots\\ 0 \end{bmatrix} \)
By the definition of the partial derivative, \( f_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon} \), with \(\epsilon \approx 10^{-4}\).
If \( f_i(\theta) \approx \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) \), the implementation is correct.
Usually we take one batch of data and run forward/backpropagation once and the definition-based (numerical) derivative once; if the two results are approximately equal, the backpropagation code is not buggy and we can go ahead and train the network.
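A minimal sketch of this numerical check, assuming some function `cost(theta)` that evaluates \(J(\Theta)\) on an unrolled parameter vector (the names `numerical_gradient` and `cost` are illustrative):

```python
import numpy as np

def numerical_gradient(cost, theta, eps=1e-4):
    """Approximate dJ/dtheta_i as (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        grad[i] = (cost(theta + bump) - cost(theta - bump)) / (2 * eps)
    return grad

# compare against the backprop gradient; the two should agree to several decimal places
# assert np.allclose(numerical_gradient(cost, theta), backprop_gradient, atol=1e-6)
```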
Summary:
First run the gradient checking algorithm and the forward/backpropagation algorithm once, to verify that forward/backpropagation is implemented correctly (a runnable sketch follows the steps below).
for i = 1 to iteration (the number of gradient descent iterations, usually more than 10,000):
1. Set \(\Delta^{(l)} = 0\) for every layer l; then, for each training example t = 1 to m:
1) Set \( a^{(1)} = x^{(t)} \)
2) Use forward propagation to compute \(a^{(l)}\) for \(l = 2, 3, \dots, L\)
3) Set \( \delta^{(L)} = a^{(L)} - y^{(t)} \)
4) Use backpropagation to compute \(\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}\), where \(\delta^{(l)} = ((\Theta^{(l)})^T * \delta^{(l + 1)})\,a^{(l)}(1 - a^{(l)})\)
5) \(\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + \delta_{i}^{(l + 1)} a_j^{(l)} \Rightarrow \Delta^{(l)} := \Delta^{(l)} + \delta^{(l + 1)} * (a^{(l)})^T \)
2. Add regularization:
\( D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right) \) for \( j \neq 0 \)
\( D^{(l)}_{i,j} := \dfrac{1}{m} \Delta^{(l)}_{i,j} \) for \( j = 0 \)
3. \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D^{(l)}_{i,j}\)
4. \(\Theta_{i,j}^{(l)} := \Theta_{i,j}^{(l)} - D^{(l)}_{i,j}\)
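Read as code, one possible (non-authoritative) sketch of this training loop for the 2-3-3 network used throughout, reusing the `forward` and `backward` helpers from the earlier sketches; the default values for `lam` and `iteration` are just placeholders:

```python
def train(X, Y, Theta1, Theta2, lam=1.0, iteration=10000):
    """Batch gradient descent with backpropagation.

    X : m x 2 array of inputs, Y : m x 3 array of one-hot labels.
    As in the text, the update uses D directly (no explicit learning rate).
    """
    m = X.shape[0]
    for _ in range(iteration):
        Delta1 = np.zeros_like(Theta1)   # accumulated gradients, reset each iteration
        Delta2 = np.zeros_like(Theta2)
        for t in range(m):               # accumulate over all training examples
            d1, d2 = backward(X[t], Y[t], Theta1, Theta2)
            Delta1 += d1
            Delta2 += d2
        # average and regularize every column except the bias column (j = 0)
        D1 = Delta1 / m
        D2 = Delta2 / m
        D1[:, 1:] += lam / m * Theta1[:, 1:]
        D2[:, 1:] += lam / m * Theta2[:, 1:]
        # gradient descent step
        Theta1 = Theta1 - D1
        Theta2 = Theta2 - D2
    return Theta1, Theta2
```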