• CS231n Notes, Lecture 7: Training Neural Networks, Part 2


    Review

    Activation Functions: Sigmoid, tanh, ReLU (a good default choice).

    Optimization

    • Optimization algorithms
      • SGD. Problems: jittering along steep directions, getting stuck at saddle points or local minima, and noisy updates from minibatch gradients.
      • SGD + Momentum. Maintain a velocity as a running mean of gradients and step along the velocity instead of the raw minibatch gradient; this smooths out the noise and helps the update roll through local minima and saddle points. Rho gives "friction" (typically 0.9 or 0.99). Key idea: velocity, which makes the trajectory less sensitive to local curvature.
      • Nesterov Momentum. $v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$; substitute variables ($\tilde{x}_t = x_t + \rho v_t$) to rearrange the update so that the loss function and the gradient are evaluated at the same point.
      • AdaGrad. Adds element-wise scaling of the gradient based on the historical sum of squared gradients in each dimension. In the implementation, add eps to the denominator to avoid dividing by zero. Why AdaGrad: dimensions with small gradients get relatively larger steps, while wiggling (large-gradient) dimensions get damped. Good for convex optimization, but bad when we have saddle points because the accumulated sum keeps shrinking the steps. Not commonly used for deep networks.
      • RMSProp. Based on AdaGrad; solves the problem of steps becoming smaller and smaller over time. Key idea: let grad_squared decay (a leaky running average of squared gradients).
      • Adam (almost). Maintains both a momentum term (velocity) and a scaling term (grad_squared), i.e. Momentum + AdaGrad/RMSProp. Problem: the first steps can be very large because both moments are initialized at zero.
      • Adam. Adds bias correction: using the iteration count t, divide each moment by $(1 - \beta^t)$ so the estimates are unbiased and the very first steps are not huge (see the optimizer sketch after this list).
    • pick a learning rate
      • A good first thing to try: Adam with beta1 = 0.9, beta2 = 0.999, and learning rate 1e-3 or 5e-4.
      • Learning rate decay: step decay, exponential decay $\alpha = \alpha_0 e^{-kt}$, and 1/t decay $\alpha = \alpha_0 / (1 + kt)$. Common with SGD + Momentum, less common with Adam.
      • Learning rate decay is a second-order hyperparameter and can be tricky to tune, so start with no decay, look at the loss curve, and then decide whether decay is needed (the optimizer sketch after this list includes both decay schedules).
    • Second-Order Optimization
      • Impractical for deep networks: the Hessian has $O(N^2)$ elements and inverting it costs about $O(N^3)$, where N is the number of parameters.
    • Model Ensembles
      • Train multiple independent models and average their predictions at test time.
      • Polyak averaging: instead of separate models, keep a moving average of the parameter vector during training and use the averaged weights at test time (see the Polyak sketch after this list).
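
    A minimal numpy sketch of several of the update rules above (function and variable names such as `sgd_momentum`, `learning_rate`, `beta1`, `beta2` are my own; `dw` stands for the minibatch gradient of the loss with respect to the weights `w`):

    ```python
    import numpy as np

    def sgd_momentum(w, dw, v, learning_rate=1e-2, rho=0.9):
        # Velocity is a running mean of gradients; rho acts as "friction".
        v = rho * v - learning_rate * dw
        w = w + v
        return w, v

    def rmsprop(w, dw, grad_squared, learning_rate=1e-3, decay=0.99, eps=1e-8):
        # Leaky running average of squared gradients (AdaGrad with decay).
        grad_squared = decay * grad_squared + (1 - decay) * dw * dw
        w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
        return w, grad_squared

    def adam(w, dw, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Momentum + RMSProp, with bias correction using the iteration count t (starting at 1).
        m = beta1 * m + (1 - beta1) * dw          # first moment (velocity)
        v = beta2 * v + (1 - beta2) * dw * dw     # second moment (grad_squared)
        m_hat = m / (1 - beta1 ** t)              # bias correction keeps the early steps sane
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    # Learning rate decay schedules (t = epoch index, k = decay rate):
    def exponential_decay(lr0, k, t):
        return lr0 * np.exp(-k * t)

    def one_over_t_decay(lr0, k, t):
        return lr0 / (1 + k * t)
    ```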
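
    And a tiny sketch of Polyak averaging, assuming the exponential-moving-average-of-weights interpretation (the name `polyak_update` and the value of `beta` are my own):

    ```python
    def polyak_update(w_avg, w, beta=0.995):
        # Keep an exponential moving average of the weights during training;
        # use w_avg instead of w at test time.
        return beta * w_avg + (1 - beta) * w

    # Inside the training loop (w is the current parameter array):
    # w_avg = polyak_update(w_avg, w)
    ```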

    Regularization

    • L2, L1, Elastic (combining L2 and L1)
    • Dropout
      • Why does it make sense?
        • It forces the network to have a redundant representation;
        • It prevents co-adaptation of features;
          or
        • Dropout can be seen as training a large ensemble of models (with shared parameters).
      • Dropout: test time. At test time, to approximate the dropout behavior from training, we multiply each neuron's activation by the keep probability p used during training (the expected value of the masked activation).
      • More common: "inverted dropout". Scale the activations by 1/p during training so that nothing needs to change at test time (see the sketch after this list).
    • Regularization strategy (common pattern): add randomness during training and average out the randomness at test time.
      • Example: Batch Normalization (random minibatch statistics at training time, fixed running statistics at test time).
    • Data Augmentation
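
    A minimal numpy sketch of inverted dropout as described above (here p is the keep probability; the function name `dropout_forward` is my own):

    ```python
    import numpy as np

    def dropout_forward(x, p=0.5, train=True):
        """Inverted dropout: scale by 1/p at training time so test time is a no-op."""
        if train:
            mask = (np.random.rand(*x.shape) < p) / p  # drop units with prob (1 - p), rescale survivors
            return x * mask
        return x  # test time: use the activations unchanged

    # Training: h = dropout_forward(h, p=0.5, train=True)
    # Testing:  h = dropout_forward(h, train=False)
    ```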

    Transfer Learning

    Don't train a neural network from scratch; instead, take a model pretrained on a large dataset (e.g. ImageNet), use it as a feature extractor, and fine-tune the top layer(s) on your own (usually smaller) dataset.
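
    A hedged PyTorch sketch of that recipe (the choice of resnet18, the `weights` string, and `num_classes` are assumptions on my part; older torchvision versions use `pretrained=True` instead of `weights=`):

    ```python
    import torch.nn as nn
    import torchvision.models as models

    num_classes = 10  # hypothetical number of classes in the target dataset

    # Load a model pretrained on ImageNet and freeze it as a feature extractor.
    model = models.resnet18(weights="IMAGENET1K_V1")
    for param in model.parameters():
        param.requires_grad = False

    # Replace the last layer and train only it (or unfreeze a few top layers to fine-tune).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    ```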
