1. Basic Concepts (Momentum vs SGD)
Momentum is used to accelerate SGD (stochastic gradient descent) along the relevant search direction and to damp oscillations.
GD (gradient descent)

θ_t = θ_{t−1} − η ∇_θ J(θ), i.e. θ ← θ − η ∇_θ J(θ)

for i in range(num_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
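To make the update rule concrete, here is a minimal runnable sketch of plain gradient descent minimizing the quadratic J(θ) = θ² (the objective and all names are illustrative, not from the original):

```python
# Gradient descent on J(theta) = theta^2, whose gradient is 2*theta.
def gd(theta0, learning_rate=0.1, num_epochs=100):
    theta = theta0
    for i in range(num_epochs):
        grad = 2 * theta                    # full-batch gradient of J
        theta = theta - learning_rate * grad
    return theta

print(gd(5.0))  # approaches the minimum at theta = 0
```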
SGD (stochastic gradient descent)

θ_t = θ_{t−1} − η ∇_θ J(θ; x^{(i)}, y^{(i)}), i.e. θ ← θ − η ∇_θ J(θ; x^{(i)}, y^{(i)})

for i in range(num_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
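A runnable sketch of the per-example update: SGD fitting the slope w in y = w·x by least squares, with the data shuffled before each pass (the toy dataset, seed, and function names are illustrative assumptions):

```python
import numpy as np

# SGD: one parameter update per training example (x, y),
# minimizing the squared error (w*x - y)^2.
def sgd(data, w=0.0, learning_rate=0.05, num_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    for i in range(num_epochs):
        rng.shuffle(data)                  # shuffle examples before each pass
        for x, y in data:
            grad = 2 * (w * x - y) * x     # gradient of (w*x - y)^2 w.r.t. w
            w = w - learning_rate * grad
    return w

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
print(sgd(np.array(data)))  # converges to the true slope 3.0
```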
Momentum

v_t = γ v_{t−1} + η ∇_θ J(θ)
θ_t = θ_{t−1} − v_t

for i in range(num_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    v = gamma * v + learning_rate * params_grad
    params = params - v
Here γ is the momentum coefficient; it must satisfy γ < 1, and a typical choice is γ = 0.9 or a smaller value. As Section 2 shows, γ can also be varied over the course of training.
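The momentum update can be sketched end to end on the same quadratic J(θ) = θ² used for plain gradient descent; the velocity v accumulates past gradients, so steps grow along a consistent descent direction (objective and names are illustrative):

```python
# Momentum gradient descent on J(theta) = theta^2.
# gamma is the momentum coefficient (gamma < 1, typically 0.9).
def momentum_gd(theta0, learning_rate=0.1, gamma=0.9, num_epochs=500):
    theta, v = theta0, 0.0
    for i in range(num_epochs):
        grad = 2 * theta                        # gradient of J
        v = gamma * v + learning_rate * grad    # accumulate velocity
        theta = theta - v                       # step along the velocity
    return theta

print(momentum_gd(5.0))  # approaches the minimum at theta = 0
```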
2. Variable Momentum Schedule
maxepoch = 50;
initialmomentum = .5;
finalmomentum = .9;
for i = 1:maxepoch
    ...
    % use a small momentum for the first half of training, then a larger one
    if i < maxepoch/2
        momentum = initialmomentum;
    else
        momentum = finalmomentum;
    end
    ...
end
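The same two-phase schedule in Python, a sketch whose names mirror the MATLAB fragment above (1-indexed epochs, switch at the halfway point):

```python
# Two-phase momentum schedule: small momentum for the first half of
# training, larger momentum for the second half.
MAX_EPOCH = 50
INITIAL_MOMENTUM = 0.5
FINAL_MOMENTUM = 0.9

def momentum_at(epoch, max_epoch=MAX_EPOCH):
    """Momentum coefficient for a 1-indexed epoch, mirroring the MATLAB if/else."""
    return INITIAL_MOMENTUM if epoch < max_epoch / 2 else FINAL_MOMENTUM

for i in range(1, MAX_EPOCH + 1):
    momentum = momentum_at(i)
    # ... use `momentum` in the update: v = momentum*v + learning_rate*params_grad
```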