Deep Learning -- week1~week3



    week1

    For an image of 64*64 pixels with three color channels (red, green, blue), there are three corresponding 64*64 matrices of real numbers.

    To feed these matrices to the algorithm as a vector, unroll their pixel values into a single input vector x.

    Reading the elements into the vector x one by one, row by row, first from the red channel, then the green, then the blue, gives a \((64\times64\times3) \times 1\) column vector x, i.e. a 64*64*3-dimensional vector.

    \(n_x = 64\times64\times3\) denotes the dimension of the feature vector x.

    All training examples are collected into \(X = \begin{bmatrix}\mid & \mid & \mid & & \mid \\ x^{(1)} & x^{(2)} & x^{(3)} & \cdots & x^{(m)} \\ \mid & \mid & \mid & & \mid \end{bmatrix}\) (an \(n_x \times m\) matrix).

    Note that this is not \(X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}\); the column convention above makes the computations simpler.

    \(Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}\)

    The \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \end{bmatrix}\) form from the earlier Machine Learning course is no longer used; instead, write \(b = \theta_0,\; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \end{bmatrix}\) (it will be easier to just keep \(b\) and \(w\) as separate parameters).

    The output is \(\hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b)\), where \(\sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}\).

    Given \(\{(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\}\), we want \(\hat{y}^{(i)} \approx y^{(i)}\).
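
    As an illustration (not from the course), a minimal numpy sketch of turning a batch of images into the input matrix \(X\); the array shapes and values here are placeholders:

    import numpy as np
    
    # Hypothetical batch of m = 10 RGB images of size 64x64; random values as placeholders
    images = np.random.rand(10, 64, 64, 3)
    m = images.shape[0]
    
    # Unroll each image into one column of X; the exact element order does not matter
    # as long as it is used consistently for every example
    X = images.reshape(m, -1).T
    
    print(X.shape)   # (12288, 10), i.e. (n_x, m) with n_x = 64*64*3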


    week2

    Loss Function/Error Function

    Loss Function / Error Function: used to measure how well our algorithm is doing on a single training example

    \[\mathcal{L}(\hat{y},y) = -y\cdot \log(\hat{y})-(1-y)\cdot \log(1-\hat{y})\]

    Cost Function

    \[J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\log\,\hat{y}^{(i)}+(1-y^{(i)})\,\log\,(1-\hat{y}^{(i)})\right]\]
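
    As a small illustration, a numpy sketch of this cost computation; the function name and shapes are my own assumptions (A holds the predictions \(\hat{y}^{(i)}\), Y the labels, both of shape (1, m)):

    import numpy as np
    
    def compute_cost(A, Y):
        # A: predictions of shape (1, m); Y: labels of shape (1, m)
        m = Y.shape[1]
        cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        return cost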

    Gradient Descent

    See the Machine Learning notes; it is essentially the same.

    Vectorization:

    #Non-vectorized
    #slow: explicit for-loop over the n_x features
    z = 0
    for i in range(n_x):
        z += w[i] * x[i]
    z += b
    
    #Vectorized
    import numpy as np
    z = np.dot(w,x) + b
    

    Whenever possible, avoid explicit for-loops (Python is an interpreted language); numpy's built-in functions let you implement the same computation concisely and efficiently.
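
    A rough timing comparison to illustrate the point (a sketch; the exact numbers depend on the machine):

    import time
    import numpy as np
    
    n = 1_000_000
    w = np.random.rand(n)
    x = np.random.rand(n)
    
    t0 = time.time()
    z = 0.0
    for i in range(n):          # explicit for-loop
        z += w[i] * x[i]
    t1 = time.time()
    
    z_vec = np.dot(w, x)        # vectorized
    t2 = time.time()
    
    print("for-loop:  ", t1 - t0, "s")
    print("vectorized:", t2 - t1, "s")   # usually far faster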

    Vectorizing Logistic Regression

    \(X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix} \in \mathbb{R}^{n_x \times m}\)

    \(Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b & b & \cdots & b \end{bmatrix}\)

    \(z^{(i)}\) is the input to the sigmoid function.

    \(A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)\)

    (The elements with different superscripts here are actually in the same layer, which is a bit different from the ML course. In \(a^{[j](i)}\), the superscript in square brackets is the layer number, and the superscript in parentheses refers to the \(i\)-th training example.)

    import numpy as np
    Z = np.dot(w.T, X) + b
    # Python automatically takes the real number b and broadcasts it into a 1*m row vector
    

    Gradient Output

    \({\rm d}z^{(i)} = a^{(i)} - y^{(i)}\)

    \(\begin{aligned}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\ &= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{aligned}\)

    \({\rm d}b = \frac{1}{m}\sum_{i=1}^{m}{\rm d}z^{(i)}\), computed in numpy as 1/m * np.sum(dZ)

    \({\rm d}w = \frac{1}{m}X{\rm d}Z^T\)

    A single iteration without explicit for-loops (vectorized):

    \[\begin{aligned} \downarrow&\begin{cases} Z &= w^TX+b = {\rm np.dot(}w{\rm .T,\ }X{\rm )}+b\\ A &= \sigma(Z)\\ {\rm d}Z &= A-Y \\ {\rm d}w &= \frac{1}{m}X{\rm d}Z^T\\ {\rm d}b &= \frac{1}{m}\text{np.sum(d}Z\text{)} \end{cases}\\ w &:= w - \alpha\,{\rm d}w\\ b &:= b - \alpha\,{\rm d}b \end{aligned}\]

    For multiple iterations, an explicit outer for-loop over the iterations cannot be avoided.
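
    Putting the pieces together, a minimal sketch of the full training loop (assuming X has shape (n_x, m) and Y has shape (1, m); the function name and hyperparameter values are my own):

    import numpy as np
    
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    def train_logistic_regression(X, Y, num_iterations=1000, alpha=0.01):
        n_x, m = X.shape
        w = np.zeros((n_x, 1))
        b = 0.0
        for _ in range(num_iterations):   # the outer loop over iterations stays explicit
            Z = np.dot(w.T, X) + b        # shape (1, m)
            A = sigmoid(Z)
            dZ = A - Y
            dw = np.dot(X, dZ.T) / m      # shape (n_x, 1)
            db = np.sum(dZ) / m
            w = w - alpha * dw
            b = b - alpha * db
        return w, b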

    Broadcasting

    Use reshape() to make sure matrices have the dimensions you expect.

    An example to illustrate numpy's broadcasting mechanism:

    >>> import numpy as np
    >>> a = np.arange(0,6).reshape(6,1)
    >>> a
    array([[0],
           [1],
           [2],
           [3],
           [4],
           [5]])
    >>> b = np.arange(0,5)
    >>> b
    array([0, 1, 2, 3, 4])
    >>> a * b
    array([[ 0,  0,  0,  0,  0],
           [ 0,  1,  2,  3,  4],
           [ 0,  2,  4,  6,  8],
           [ 0,  3,  6,  9, 12],
           [ 0,  4,  8, 12, 16],
           [ 0,  5, 10, 15, 20]])
    >>> a + b
    array([[0, 1, 2, 3, 4],
           [1, 2, 3, 4, 5],
           [2, 3, 4, 5, 6],
           [3, 4, 5, 6, 7],
           [4, 5, 6, 7, 8],
           [5, 6, 7, 8, 9]])
    

    In other words, when a matrix is combined with a number or a vector using +, -, *, or /, numpy expands the number/vector by replicating it into a matrix of compatible shape.

    Note that this can cause strange bugs in places where you expect an exception to be thrown but none is raised:

    For example, sometimes I want adding a row vector to a column vector to throw an exception, but numpy happily computes a result via broadcasting instead...

    Pitfalls of numpy

    >>> import numpy as np
    >>> a = np.random.randn(5)
    >>> a
    array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
    >>> a.shape
    (5,)
    # this is called a "rank 1 array" in Python; it is neither a row vector nor a column vector
    
    >>> a.T
    array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
    # the same as 'a' itself -- transposing a rank 1 array does nothing
    
    >>> np.dot(a,a.T)
    3.2288264718632416
    # a plain number, not the 1x1 matrix (like array([[55]])) you might expect
    

    Do not use "rank 1 arrays" with shapes like (5,) or (n,); instead, state explicitly that something is an \(m \times n\) matrix:

    >>> a = np.random.randn(5,1)
    >>> a
    array([[ 0.7643396 ],
           [-1.66945103],
           [ 1.66235712],
           [-0.06892102],
           [-1.61347409]])
    >>> a.T
    array([[ 0.7643396 , -1.66945103,  1.66235712, -0.06892102, -1.61347409]])
    

    Note the difference between array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889]) and array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]]) (the latter has two pairs of brackets): the former is a rank 1 array, while the latter is a genuine \(1 \times 5\) matrix (just as in C, a matrix is represented by a two-dimensional array). (I also think "rank 1 array" is more accurately read as "one-dimensional array".)

    You can use assert statements to make sure a vector has the dimensions you expect.

    When you get a rank 1 array, you can use a.reshape to transform it into an (n,1) or (1,n) array, as sketched below.
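
    A small sketch of both habits (the shape here is just for illustration):

    import numpy as np
    
    a = np.random.randn(5)      # rank 1 array, shape (5,)
    a = a.reshape(5, 1)         # make it an explicit column vector
    assert a.shape == (5, 1)    # fail fast if the shape is not what we expect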

    Logistic Regression Cost Function

    \[\left. \begin{array}{l} \text{If } y=1:\quad p(y|x)=\hat{y}\\ \text{If } y=0:\quad p(y|x)=1-\hat{y} \end{array} \right\}\; p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\]
    \[\begin{aligned} \therefore \log(p(y|x)) &= y\cdot \log\,\hat{y} + (1-y)\cdot \log\,(1-\hat{y}) \\ &= -\mathcal{L}(\hat{y},y) \end{aligned}\]

    Therefore:

    \[\begin{aligned} \log [p(\text{labels in training set})] &= \log \prod_{i=1}^m p(y^{(i)}|x^{(i)})\\ &=\sum_{i=1}^m \log\,p(y^{(i)}|x^{(i)})\\ &=\sum_{i=1}^m -\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\ &=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) \end{aligned}\]
    \[\text{Cost: } J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})\]

    This is maximum likelihood estimation: assuming the training examples are i.i.d., minimizing the cost \(J(w,b)\) is equivalent to maximizing the log-likelihood of the labels above.


    week3

    \(Z^{[j]} = W^{[j]}A^{[j-1]} + b^{[j]} = W^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}\)

    where the example index \((i)\) ranges over \((1),\dots,(m)\), the layer index \([j]\) ranges over \([1],\dots,[n]\), and \(X = A^{[0]}\).
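
    A minimal numpy sketch of this layer-by-layer recurrence (the names parameters and g are my own; a single activation is used for every layer just to keep the sketch short):

    import numpy as np
    
    def forward_all_layers(X, parameters, g):
        # X: (n_x, m), i.e. A^[0]; parameters: list of (W, b) pairs, one per layer; g: activation
        A = X
        for W, b in parameters:
            Z = np.dot(W, A) + b    # Z^[j] = W^[j] A^[j-1] + b^[j]; b broadcasts over the m columns
            A = g(Z)                # A^[j] = g(Z^[j])
        return A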

    Other Activation Functions

    ①\(\tanh(z)\) function:

    \[a= \tanh(z)=\frac{e^z -e^{-z}}{e^z +e^{-z}}\text{ , where } \tanh(z) \in (-1,1),\ \tanh(0)=0 \]

    \(\tanh(z)\) centers the data around 0 (the sigmoid function centers the data around 0.5).

    From then on, the sigmoid function is only used when \(0 \le \hat{y} \le 1\) is required (i.e. binary classification output), because \(\tanh\) is almost strictly superior to sigmoid elsewhere...

    ②Rectified Linear Unit (ReLU): \(a = \max(0,z)\)

    When you are not sure what to use for your hidden layers, you can use the ReLU function.

    Disadvantage of ReLU: when \(z\) is negative, the output (and hence the slope) is 0.

    A variant called the Leaky ReLU can be used to overcome this disadvantage (see below).

    Leaky ReLU: \(a = \max(0.01z, z)\)

    ReLU keeps the slope constant (for sigmoid and \(\tanh(z)\), the slope tends to 0 as \(z \rightarrow \infty\), which slows learning down).

    It is the most commonly used activation function.

    ③Linear Activation Function

    A linear activation function (\(g(z)=z\)) is used at the output layer only when solving a regression problem. For example, when predicting housing prices, y is not restricted to 0 and 1 (\(y \in \mathbb{R}\)), so the output can use \(g(z)=z\). Hidden units should not use a linear activation function; use tanh/ReLU/Leaky ReLU instead.

    Derivatives of Activation Functions

    • Sigmoid: \(g'(z) = \frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))\)
    • \(\tanh(z)\): \(g'(z) = 1-(\tanh(z))^2\)
    • ReLU: \(g'(z) = \begin{cases}1, & \text{if }z\ge0 \\ 0, & \text{if }z\lt0 \end{cases}\)
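
    These activations and derivatives in numpy (a sketch; at z = 0 the ReLU derivative is taken to be 1, matching the convention above):

    import numpy as np
    
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1 - s)              # g'(z) = g(z)(1 - g(z))
    
    def tanh_derivative(z):
        return 1 - np.tanh(z) ** 2      # g'(z) = 1 - tanh(z)^2
    
    def relu(z):
        return np.maximum(0, z)
    
    def relu_derivative(z):
        return (z >= 0).astype(float)   # 1 if z >= 0, else 0
    
    def leaky_relu(z):
        return np.maximum(0.01 * z, z)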

    Gradient Descent for Neural Networks

    Parameters: \(w^{[1]},b^{[1]},w^{[2]},b^{[2]}\)

    Cost Function: \(J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y},y)\)

    Gradient Function:

    \[\begin{aligned} &\text{Repeat \{}\\ &\quad \text{compute predictions } (\hat{y}^{(i)},\ i = 1,\dots,m) \\ &\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}},\ {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\dots\\ &\quad w^{[1]} = w^{[1]} - \alpha\,{\rm d}w^{[1]}\\ &\quad b^{[1]} = b^{[1]} - \alpha\,{\rm d}b^{[1]}\\ &\quad w^{[2]} = w^{[2]} - \alpha\,{\rm d}w^{[2]}\\ &\quad b^{[2]} = b^{[2]} - \alpha\,{\rm d}b^{[2]}\\ &\text{\}} \end{aligned}\]

    Forward Propagation :

    \[\begin{aligned} Z^{[1]} &= w^{[1]}X + b^{[1]}\\ A^{[1]} &= g^{[1]}(Z^{[1]})\\ Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\ A^{[2]} &= g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]}) \end{aligned}\]

    Backward Propagation :

    \[\begin{aligned} {\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}\\ {\rm d}w^{[2]} &= \frac{1}{m}\,{\rm d}Z^{[2]} A^{[1]T}\\ {\rm d}b^{[2]} &= \frac{1}{m}\,\text{np.sum(d}Z^{[2]}\text{, axis=1, keepdims=True)}\\ {\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\;.*\; g^{[1]\prime}(Z^{[1]})\\ {\rm d}w^{[1]} &= \frac{1}{m}\,{\rm d}Z^{[1]}X^T\\ {\rm d}b^{[1]} &= \frac{1}{m}\,\text{np.sum(d}Z^{[1]}\text{, axis=1, keepdims=True)} \end{aligned}\]

    Note: axis=1 means summing horizontally, and keepdims=True prevents numpy from outputting a rank 1 array. You can also call reshape explicitly instead of relying on these parameters.

    Another note: since \(A^{[1]} = g^{[1]}(Z^{[1]})\) and, with \(g^{[1]} = \tanh\), \(g^{[1]\prime}(z) = 1-a^2\), we have \(g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\), i.e. \({\rm d}Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\;.*\;(1-(A^{[1]})^2)\).
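
    A minimal numpy sketch of one forward/backward pass for this two-layer network, assuming tanh in the hidden layer and sigmoid at the output (variable names follow the formulas above; the function names are mine):

    import numpy as np
    
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    def forward_backward(X, Y, W1, b1, W2, b2):
        m = X.shape[1]
        # Forward propagation
        Z1 = np.dot(W1, X) + b1
        A1 = np.tanh(Z1)                            # g^[1] = tanh
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)                            # g^[2] = sigma
        # Backward propagation
        dZ2 = A2 - Y
        dW2 = np.dot(dZ2, A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)     # elementwise: g^[1]'(Z1) = 1 - A1^2
        dW1 = np.dot(dZ1, X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        return dW1, db1, dW2, db2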

    Random Initialization

    For a neural network, if you initialize all the weights to zero and then apply gradient descent, it won't work: every hidden unit in a layer computes the same function and receives the same update, so the symmetry is never broken.
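
    A sketch of the usual random initialization for one hidden layer (layer sizes are placeholders; the 0.01 scale follows the course):

    import numpy as np
    
    n_x, n_h, n_y = 3, 4, 1                  # input, hidden, output sizes (placeholders)
    
    W1 = np.random.randn(n_h, n_x) * 0.01    # small random values break the symmetry
    b1 = np.zeros((n_h, 1))                  # biases can start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))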
