• Sparse Autoencoder (Part 2)


    Gradient checking and advanced optimization

    In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

    Suppose we want to minimize \textstyle J(\theta) as a function of \textstyle \theta. For this example, suppose \textstyle J : \Re \mapsto \Re, so that \textstyle \theta \in \Re. In this 1-dimensional case, one iteration of gradient descent is given by

    \begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}

    Suppose also that we have implemented some function \textstyle g(\theta) that purportedly computes \textstyle \frac{d}{d\theta}J(\theta), so that we implement gradient descent using the update \textstyle \theta := \theta - \alpha g(\theta).

    Recall the mathematical definition of the derivative as

    \begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}.
\end{align}

    Thus, at any specific value of \textstyle \theta, we can numerically approximate the derivative as follows:

    \begin{align}
\frac{J(\theta + {\rm EPSILON}) - J(\theta - {\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}

    Thus, given a function \textstyle g(\theta) that is supposedly computing \textstyle \frac{d}{d\theta}J(\theta), we can now numerically verify its correctness by checking that

    \begin{align}
g(\theta) \approx \frac{J(\theta + {\rm EPSILON}) - J(\theta - {\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}

    The degree to which these two values should approximate each other will depend on the details of \textstyle J. But assuming \textstyle {\rm EPSILON} = 10^{-4}, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
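This two-sided check can be sketched in a few lines of Python (the function name and the toy objective \textstyle J(\theta) = \theta^2 are illustrative, not from the text):

```python
def numerical_derivative(J, theta, eps=1e-4):
    """Two-sided finite-difference approximation of dJ/dtheta."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

# Toy example: J(theta) = theta^2, whose true derivative is g(theta) = 2*theta.
J = lambda t: t ** 2
g = lambda t: 2 * t   # the "analytic" derivative we want to verify

theta = 3.0
approx = numerical_derivative(J, theta)
# approx and g(theta) should agree to many significant digits
```

For a quadratic the central difference is exact up to floating-point error, so the agreement here is essentially perfect; for general \textstyle J the error shrinks like \textstyle {\rm EPSILON}^2.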

    Suppose we have a function \textstyle g_i(\theta) that purportedly computes \textstyle \frac{\partial}{\partial \theta_i} J(\theta); we'd like to check if \textstyle g_i is outputting correct derivative values. Let \textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i, where

    \begin{align}
\vec{e}_i = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
\end{align}

    is the \textstyle i-th basis vector (a vector of the same dimension as \textstyle \theta, with a "1" in the \textstyle i-th position and "0"s everywhere else). So, \textstyle \theta^{(i+)} is the same as \textstyle \theta, except its \textstyle i-th element has been incremented by EPSILON. Similarly, let \textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i be the corresponding vector with the \textstyle i-th element decreased by EPSILON. We can now numerically verify \textstyle g_i(\theta)'s correctness by checking, for each \textstyle i, that:

    \begin{align}
g_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}

    Since the parameter is a vector, we verify each dimension's derivative in turn, perturbing one component while holding the others fixed.
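For vector-valued \textstyle \theta, the same idea applies one coordinate at a time. A minimal sketch (the helper name check_gradient and the toy objective are assumptions, not the tutorial's code):

```python
import numpy as np

def check_gradient(J, grad, theta, eps=1e-4):
    """Compare an analytic gradient grad(theta) against a coordinate-wise
    two-sided numerical approximation; return the relative difference."""
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0   # i-th basis vector
        numgrad[i] = (J(theta + eps * e_i) - J(theta - eps * e_i)) / (2 * eps)
    return (np.linalg.norm(numgrad - grad(theta))
            / np.linalg.norm(numgrad + grad(theta)))

# Toy objective J(theta) = theta_0^2 + 3*theta_0*theta_1
J = lambda t: t[0] ** 2 + 3.0 * t[0] * t[1]
grad = lambda t: np.array([2.0 * t[0] + 3.0 * t[1], 3.0 * t[0]])

diff = check_gradient(J, grad, np.array([4.0, 10.0]))
# diff should be vanishingly small for a correct analytic gradient
```

A correct gradient typically gives a relative difference many orders of magnitude below 1; a bug in any single partial derivative makes it jump to order 1.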

    When implementing backpropagation to train a neural network, in a correct implementation we will have that

    \begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}

    This result shows that the final block of pseudo-code in the Backpropagation Algorithm is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of \textstyle J(W,b), and thereby verify that your computations of \textstyle \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} and \textstyle \frac{1}{m} \Delta b^{(l)} are indeed giving the derivatives you want.
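As a concrete illustration, here is a sketch that numerically checks the backpropagation gradient of a tiny one-layer sigmoid network with a squared-error cost (the model, names, and sizes are all made up for illustration, not the tutorial's exact network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny toy "network": mean squared error of sigmoid(W x + b) against targets Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # 3 inputs, 5 examples
Y = rng.normal(size=(2, 5))   # 2 outputs
W = rng.normal(size=(2, 3))
b = rng.normal(size=(2, 1))

def cost(W, b):
    A = sigmoid(W @ X + b)
    return 0.5 * np.mean(np.sum((A - Y) ** 2, axis=0))

def analytic_grads(W, b):
    m = X.shape[1]
    A = sigmoid(W @ X + b)
    delta = (A - Y) * A * (1 - A)   # backpropagated error term
    return (delta @ X.T) / m, np.sum(delta, axis=1, keepdims=True) / m

gW, gb = analytic_grads(W, b)

# Numerically check every entry of the gradient with respect to W.
eps = 1e-4
numW = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    numW[idx] = (cost(Wp, b) - cost(Wm, b)) / (2 * eps)
# numW and gW should agree to high precision
```

The same loop over np.ndindex(b.shape) verifies the bias gradient; in practice one flattens all parameters into a single vector and checks them together.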

    Autoencoders and Sparsity

    An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses \textstyle y^{(i)} = x^{(i)}.

    Here is an autoencoder:

    [Figure: autoencoder network diagram (Autoencoder636.png)]

    We will write \textstyle a^{(2)}_j(x) to denote the activation of this hidden unit when the network is given a specific input \textstyle x. Further, let

    \begin{align}
\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]
\end{align}

    be the average activation of hidden unit \textstyle j (averaged over the training set). We would like to (approximately) enforce the constraint

    \begin{align}
\hat\rho_j = \rho,
\end{align}

    where \textstyle \rho is a sparsity parameter, typically a small value close to zero (say \textstyle \rho = 0.05). In other words, we would like the average activation of each hidden neuron \textstyle j to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.

    To achieve this, we will add an extra penalty term to our optimization objective that penalizes \textstyle \hat\rho_j deviating significantly from \textstyle \rho. Many choices of the penalty term will give reasonable results. We will choose the following:

    \begin{align}
\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.
\end{align}

    Here, \textstyle s_2 is the number of neurons in the hidden layer, and the index \textstyle j is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

    \begin{align}
\sum_{j=1}^{s_2} {\rm KL}(\rho \,||\, \hat\rho_j).
\end{align}

    Our overall cost function is now

    \begin{align}
J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho \,||\, \hat\rho_j),
\end{align}

    where \textstyle J(W,b) is as defined previously, and \textstyle \beta controls the weight of the sparsity penalty term. The term \textstyle \hat\rho_j (implicitly) depends on \textstyle W,b also, because it is the average activation of hidden unit \textstyle j, and the activation of a hidden unit depends on the parameters \textstyle W,b.
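Numerically, the average activations and the penalty can be computed as follows (a sketch; the activation matrix A2 and the value of \textstyle \beta are made up for illustration):

```python
import numpy as np

# A2 holds hidden-layer activations, one column per training example.
rho = 0.05    # target sparsity
beta = 3.0    # weight of the penalty (illustrative value)
A2 = np.array([[0.04, 0.06, 0.05],    # unit 1: average activation near rho
               [0.90, 0.80, 0.70]])   # unit 2: far too active

rho_hat = A2.mean(axis=1)   # average activation of each hidden unit

# KL-divergence-based penalty, summed over hidden units
kl = np.sum(rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
penalty = beta * kl
# Unit 1 contributes almost nothing; unit 2 dominates the penalty.
```

The penalty is zero exactly when every \textstyle \hat\rho_j = \rho, and grows rapidly as any unit's average activation drifts away from the target.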

    To incorporate this penalty into backpropagation, the error term for the hidden layer picks up an extra sparsity term:

    \begin{align}
\delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_2} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).
\end{align}
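A sketch of this modified delta for a single training example (function and variable names are assumptions; shapes: W2 maps the hidden layer to the output layer, delta3 has one entry per output unit, z2 and rho_hat one per hidden unit):

```python
import numpy as np

def sparse_delta2(W2, delta3, z2, rho_hat, rho=0.05, beta=3.0):
    """Hidden-layer error term with the KL sparsity term added."""
    a2 = 1.0 / (1.0 + np.exp(-z2))   # sigmoid activation
    fprime = a2 * (1.0 - a2)         # f'(z) for the sigmoid
    sparsity = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    return (W2.T @ delta3 + sparsity) * fprime

# When rho_hat already equals rho, the sparsity term vanishes and the
# ordinary backpropagation delta is recovered.
d2 = sparse_delta2(np.eye(2), np.array([0.1, 0.2]),
                   np.zeros(2), np.full(2, 0.05))
```

Note that \textstyle \hat\rho_i depends on all training examples, so in practice a forward pass over the whole training set is needed to compute it before any of these deltas can be evaluated.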

    Visualizing a Trained Autoencoder

    Consider the case of training an autoencoder on \textstyle 10 \times 10 images, so that \textstyle n = 100. Each hidden unit \textstyle i computes a function of the input:

    \begin{align}
a^{(2)}_i = f\left( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).
\end{align}

    We will visualize the function computed by hidden unit \textstyle i, which depends on the parameters \textstyle W^{(1)}_{ij} (ignoring the bias term for now), using a 2D image. In particular, we think of \textstyle a^{(2)}_i as some non-linear feature of the input \textstyle x.

    If we suppose that the input is norm constrained by \textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1, then one can show (try doing this yourself) that the input which maximally activates hidden unit \textstyle i is given by setting pixel \textstyle x_j (for all 100 pixels, \textstyle j = 1, \ldots, 100) to

    \begin{align}
x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.
\end{align}

    By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit 	extstyle i is looking for.
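Producing these visualizations from a trained weight matrix can be sketched as follows (the name W1 and the random weights are placeholders for illustration):

```python
import numpy as np

def maximally_activating_inputs(W1, side=10):
    """Row-normalize W1 and reshape each row into a side x side image,
    following the formula above (one row per hidden unit)."""
    norms = np.sqrt(np.sum(W1 ** 2, axis=1, keepdims=True))
    return (W1 / norms).reshape(-1, side, side)

# Fake weights for 4 hidden units on 10x10 inputs, for illustration only.
W1 = np.random.default_rng(1).normal(size=(4, 100))
imgs = maximally_activating_inputs(W1)
# Each image has unit Euclidean norm by construction.
```

Each of the resulting images is the norm-constrained input that maximally activates its hidden unit, so displaying them side by side shows what each unit has learned to detect.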

    When an autoencoder is run on an image, the earlier hidden units generally capture low-level features such as edges, while hidden units further back capture features with deeper semantics.

  • Original post: https://www.cnblogs.com/sprint1989/p/3979296.html