• [Xavier] Understanding the difficulty of training deep feedforward neural networks


    Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]. International Conference on Artificial Intelligence and Statistics, 2010: 249-256.

    @inproceedings{glorot2010understanding,
      title={Understanding the difficulty of training deep feedforward neural networks},
      author={Glorot, Xavier and Bengio, Yoshua},
      booktitle={Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
      pages={249--256},
      year={2010}}

    This paper proposes the Xavier parameter-initialization method.

    Main content

    At layer \(i = 1, \ldots, d\):

    \[\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i, \\ \mathbf{z}^{i+1} = f(\mathbf{s}^i),\]

    where \(\mathbf{z}^i\) is the input to layer \(i\), \(\mathbf{s}^i\) is the pre-activation vector, and \(f(\cdot)\) is the activation function (assumed symmetric about 0 with \(f'(0)=1\), e.g. tanh).

    \[\mathrm{Var}(z^{i+1}) = n_i \mathrm{Var}(w^i z^i),\]

    which holds approximately near \(0\) (since \(f'(0)=1\)), where \(z^i\) and \(w^i\) denote a single element of \(\mathbf{z}^i\) and \(W^i\) respectively, the \(\{w^i\}\) are assumed i.i.d., and \(w^i\) and \(z^i\) are assumed mutually independent. Assuming further that \(\mathbb{E}(w^i)=0\) and \(\mathbb{E}(x)=0\) (\(x\) being an input sample), we get

    \[\mathrm{Var}(z^{i+1}) = n_i \mathrm{Var}(w^i)\mathrm{Var}(z^i),\]

    again approximately near \(0\). Unrolling this recursion down to the input \(x\) yields

    \[\mathrm{Var}(z^i) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}(w^{i'}),\]

    where \(n_i\) denotes the number of input units of layer \(i\).
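
    This forward recursion is easy to check numerically. Below is a minimal sketch, not code from the paper; the layer width, depth, and input scaling are my own illustrative choices. It pushes small-variance inputs through tanh layers and compares the empirical \(\mathrm{Var}(z^i)\) with the predicted product (inputs are kept small so the linear regime \(f(s) \approx s\) applies):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative setup (not from the paper): equal-width tanh layers.
    n, depth = 400, 5
    var_w = 1.0 / n   # per-element weight variance; n * var_w = 1 preserves variance

    # Small inputs so pre-activations stay in tanh's linear regime f(s) ~= s.
    z = 0.1 * rng.standard_normal((10_000, n))
    predicted = z.var()
    for i in range(depth):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        z = np.tanh(z @ W)        # b^i = 0 for simplicity
        predicted *= n * var_w    # Var(z^{i+1}) ~= n_i Var(w^i) Var(z^i)
        print(f"layer {i+1}: empirical Var(z) = {z.var():.5f}, predicted = {predicted:.5f}")
    ```

    The small downward drift of the empirical values comes from tanh deviating slightly from the identity.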

    From the back-propagation of gradients we have:

    \[\tag{2} \frac{\partial Cost}{\partial s_k^i} = f'(s_k^i) W_{k, \cdot}^{i+1} \frac{\partial Cost}{\partial \mathbf{s}^{i+1}},\]

    \[\tag{3} \frac{\partial Cost}{\partial w_{l,k}^i} = z_l^i \frac{\partial Cost}{\partial s_k^i}.\]

    Hence

    \[\tag{6} \mathrm{Var}\left[\frac{\partial Cost}{\partial s_k^i}\right] = \mathrm{Var}\left[\frac{\partial Cost}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1} \mathrm{Var}[w^{i'}],\]

    \[\mathrm{Var}\left[\frac{\partial Cost}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}[w^{i'}] \prod_{i'=i}^{d} n_{i'+1} \mathrm{Var}[w^{i'}] \times \mathrm{Var}(x) \, \mathrm{Var}\left[\frac{\partial Cost}{\partial s^d}\right],\]
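
    Equation (6) can likewise be checked by propagating a random top-level gradient backwards. The sketch below is again illustrative, not from the paper: equal layer widths, and \(f\) taken as the identity so that \(f' = 1\) exactly, matching the near-0 regime. Each backward step should multiply the gradient variance by \(n\,\mathrm{Var}[w]\):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative check of Eq. (6): equal widths, identity activation (f' = 1).
    n, depth = 400, 5
    var_w = 1.0 / n
    Ws = [rng.normal(0.0, np.sqrt(var_w), size=(n, n)) for _ in range(depth)]

    g = rng.standard_normal((10_000, n))  # dCost/ds at the top layer, 10k samples
    var_top = g.var()
    for step, W in enumerate(reversed(Ws), start=1):
        g = g @ W.T                       # one backward step of Eq. (2) with f' = 1
        print(f"{step} layer(s) down: empirical Var = {g.var():.5f}, "
              f"predicted = {(n * var_w) ** step * var_top:.5f}")
    ```

    With \(n\,\mathrm{Var}[w] = 1\) the gradient variance stays flat, which is exactly condition (11) below.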

    If we require the variance of \(z^i\) to stay constant through the forward pass, then

    \[\tag{10} \forall i, \quad n_i \mathrm{Var}[w^i] = 1.\]

    If we require the variance of the gradients \(\frac{\partial Cost}{\partial s^i}\) to stay constant through the backward pass, then

    \[\tag{11} \forall i, \quad n_{i+1} \mathrm{Var}[w^i] = 1.\]

    Since (10) and (11) cannot both hold when \(n_i \neq n_{i+1}\), the paper settles on a compromise, the harmonic mean of the two constraints:

    \[\mathrm{Var}[w^i] = \frac{2}{n_{i+1} + n_{i}},\]

    and constructs a uniform distribution from which \(w^i\) is sampled:

    \[w^i \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i+1}+n_{i}}}, \frac{\sqrt{6}}{\sqrt{n_{i+1}+n_{i}}}\right].\]
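
    Since a uniform distribution \(U[-a, a]\) has variance \(a^2/3\), choosing \(a = \sqrt{6/(n_i+n_{i+1})}\) gives exactly the variance \(2/(n_i+n_{i+1})\) above. A minimal sketch of the sampler (the function name and the example shapes are my own, not from the paper):

    ```python
    import numpy as np

    def xavier_uniform(fan_in: int, fan_out: int, rng=None):
        """Xavier/Glorot uniform init: U[-a, a] with a = sqrt(6/(fan_in+fan_out)).

        Var(U[-a, a]) = a^2 / 3 = 2 / (fan_in + fan_out), matching the compromise above.
        """
        rng = rng or np.random.default_rng()
        a = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-a, a, size=(fan_in, fan_out))

    W = xavier_uniform(300, 100)
    print(W.var(), 2.0 / (300 + 100))  # both close to 0.005
    ```

    Deep-learning libraries ship this scheme under the names Xavier or Glorot initialization, e.g. `torch.nn.init.xavier_uniform_` in PyTorch.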

    The paper also contains analyses of various activation functions (sigmoid, tanh, softsign, ...); those parts are not the focus here, so I do not record them.
