• PyTorch Adam Optimizer: Principle, Formulas, and Application


Concept: Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure; it iteratively updates neural network weights based on the training data. Adam was first proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in the paper "Adam: A Method for Stochastic Optimization", submitted to ICLR 2015. The name "Adam" is neither an acronym nor a person's name; it comes from adaptive moment estimation.

Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term: it uses first-moment and second-moment estimates of the gradient to dynamically adjust the learning rate of each parameter. Its main advantage is that, after bias correction, the step size at every iteration stays within a well-defined range, which keeps the parameter updates stable. The update formulas are as follows:

(1) m_t = β1 · m_{t-1} + (1 − β1) · g_t
(2) v_t = β2 · v_{t-1} + (1 − β2) · g_t²
(3) m̂_t = m_t / (1 − β1^t)
(4) v̂_t = v_t / (1 − β2^t)
(5) θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε)

The first two formulas are the first-moment and second-moment estimates of the gradient, which can be viewed as estimates of the expectations E[g_t] and E[g_t²]. Formulas (3) and (4) apply bias correction to these moment estimates so that they approximate unbiased estimates of the expectations. Estimating the moments directly from the gradients requires no extra memory and adapts automatically to the gradient. The fraction in the last formula imposes a dynamic, clearly bounded constraint on the learning rate η.
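
To make the equations concrete, here is a minimal plain-Python sketch of a single Adam step on one scalar parameter (this follows formulas (1)-(5) above; it is not PyTorch's actual implementation, and the function and variable names are only illustrative):

import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single scalar parameter, following formulas (1)-(5).
    m = beta1 * m + (1 - beta1) * grad                       # (1) first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2                  # (2) second-moment estimate
    m_hat = m / (1 - beta1 ** t)                             # (3) bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                             # (4) bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)    # (5) parameter update
    return theta, m, v

# Example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
print(theta)  # should end up near 0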

Advantages:

1. Combines Adagrad's strength in handling sparse gradients with RMSprop's strength in handling non-stationary objectives;
2. Low memory requirements;
3. Computes individual adaptive learning rates for different parameters (see the sketch after this list);
4. Also works well for most non-convex problems, large datasets, and high-dimensional parameter spaces.
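
As a quick illustration of point 3, the following sketch (the parameter names and the toy loss are made up for illustration) inspects the optimizer state after one step to show that Adam keeps separate moment estimates, and therefore a separate effective step size, for each parameter:

import torch

w = torch.nn.Parameter(torch.randn(3))
b = torch.nn.Parameter(torch.randn(1))
opt = torch.optim.Adam([w, b], lr=1e-3)

# A toy loss in which w receives much larger gradients than b.
loss = (10 * w.sum() + b.sum()) ** 2
loss.backward()
opt.step()

for name, p in (('w', w), ('b', b)):
    state = opt.state[p]
    denom = state['exp_avg_sq'].sqrt() + 1e-8
    print(name, denom)  # larger for w, so its effective step size is damped more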

Application and source code:

Constructor signature:

    class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

Parameter meanings:

params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups.

lr (float, optional): learning rate (default: 1e-3).

betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).

eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8).

weight_decay (float, optional): weight decay (L2 penalty) (default: 0).
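
For example, here is a minimal sketch of both ways of passing params (the model and the hyperparameter values are just placeholders): a plain iterable of parameters, and a list of dicts defining parameter groups with per-group options such as lr and weight_decay:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 1),
)

# 1) Plain iterable of parameters: one set of options for everything.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

# 2) List of dicts defining parameter groups: the first layer gets its own
#    learning rate and weight decay; the rest fall back to the defaults.
optimizer = torch.optim.Adam([
    {'params': model[0].parameters(), 'lr': 1e-4, 'weight_decay': 1e-5},
    {'params': model[2].parameters()},
], lr=1e-3)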

torch.optim.adam source code:

import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad = grad.add(group['weight_decay'], p.data)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(-step_size, exp_avg, denom)

        return loss
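
As a quick sanity check (a sketch added here, not part of the original source), one step of torch.optim.Adam can be compared against the update computed by hand from formulas (1)-(5); the comparison uses a tolerance, so the tiny difference in where ε enters the denominator does not matter:

import torch

# One-step check: compare torch.optim.Adam against the hand-computed update
# from formulas (1)-(5). Values and shapes here are arbitrary.
p = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
optimizer = torch.optim.Adam([p], lr=lr, betas=(b1, b2), eps=eps)

loss = (p ** 2).sum()
loss.backward()
g = p.grad.detach().clone()
p_before = p.detach().clone()

optimizer.step()

# Hand-computed first step (t = 1).
m_hat = ((1 - b1) * g) / (1 - b1 ** 1)
v_hat = ((1 - b2) * g * g) / (1 - b2 ** 1)
expected = p_before - lr * m_hat / (v_hat.sqrt() + eps)

print(torch.allclose(p.detach(), expected))  # expected: True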

Usage example:

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters.
    optimizer.step()

This should be enough for the vast majority of applications, which is all I set out to cover here; the next step is to deepen the understanding through actual use.

      

References:

1. https://blog.csdn.net/kgzhang/article/details/77479737
2. https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html
