BigNAS
2020-ECCV-BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models
Source: ChenBong, 博客园 (cnblogs)
- Institute:Google Brain
- Author:Jiahui Yu
- GitHub:/
- Citation: 20+
Introduction
Train a supernet once; child networks sampled from it can be deployed directly, without retraining or fine-tuning.
Motivation
- 2019-ICLR-Slimmable Neural Networks
  - scale dimensions:
    - channel num (network-wise, top-n index)
  - per batch:
    - channel num (4 fixed widths: 0.25x, 0.5x, 0.75x, 1.0x)
- 2019-ICCV-US-Net
  - scale dimensions:
    - channel num (network-wise, top-n index)
  - per batch:
    - channel num (4 random widths: min, random ×2, max)
    - inplace distillation: $CE(\text{max}, \hat{y}),\ CE(\text{min}, y_{\text{max}}),\ CE(\text{random}_{1,2}, y_{\text{max}})$
- 2020-ECCV-MutualNet
  - scale dimensions:
    - input resolution (network-wise)
    - channel num (network-wise, top-n index)
  - per batch:
    - input resolution (one shared crop, resized to 4 resolutions)
    - channel num (4 random widths: min, random ×2, max)
    - inplace distillation: $CE(\text{max}, \hat{y}),\ \mathrm{KLDiv}(\text{random}_{1,2}, y_{\text{max}}),\ \mathrm{KLDiv}(\text{min}, y_{\text{max}})$
- 2020-ECCV-RS-Net
  - scale dimensions:
    - input resolution (network-wise)
  - per batch:
    - input resolution (one shared crop, resized to S resolutions: $S_1 > S_2 > \dots > S_N$)
    - inplace distillation: $CE(\text{ensemble}, \hat{y}),\ CE(S_1, \text{ensemble}),\ CE(S_2, S_1),\ \dots$
- 2020-ICLR-Once for All
  - scale dimensions:
    - input resolution (network-wise)
    - channel num (layer-wise, top-L1 select)
    - layer num (stage-wise, top-n select)
    - kernel size (layer-wise, center crop + per-layer transform (fc))
  - train the full network first
  - KD (progressive shrinking): $CE(ps_1, \text{full}),\ CE(ps_2, ps_1),\ \dots$
- 2020-ECCV-BigNAS
  - scale dimensions:
    - input resolution (network-wise)
    - channel num (layer-wise, top-n select)
    - layer num (stage-wise, top-n select)
    - kernel size (layer-wise, center)
  - per batch:
    - input resolution (one shared crop, resized to 4 resolutions)
    - sample 3 child models per batch (full, rand1, rand2)
    - inplace distillation: $CE(\text{full}, \hat{y}),\ \mathrm{KLDiv}(\text{random}_{1,2}, y_{\text{full}})$ (see the training-loop sketch after this list)
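A minimal PyTorch-style sketch of this per-batch scheme (sandwich rule + inplace distillation, as described in the Method section below), assuming a hypothetical single-stage `SuperNet` that exposes `set_active(cfg)`, `max_config`, `min_config`, and `sample_config()`:

```python
import torch.nn.functional as F

def train_step(supernet, images, labels, optimizer, num_random=2):
    """One sandwich-rule step: the biggest child learns from the labels,
    all other children learn from the biggest child's soft labels."""
    optimizer.zero_grad()

    # 1) Full (biggest) child model: supervised by the ground-truth labels.
    supernet.set_active(supernet.max_config)
    logits_full = supernet(images)
    F.cross_entropy(logits_full, labels).backward()

    # Soft labels from the full model (no gradient through the teacher).
    soft = F.softmax(logits_full.detach(), dim=-1)

    # 2) Smallest + randomly sampled children: inplace distillation only.
    configs = [supernet.min_config] + [supernet.sample_config()
                                       for _ in range(num_random)]
    for cfg in configs:
        supernet.set_active(cfg)
        logits = supernet(images)
        F.kl_div(F.log_softmax(logits, dim=-1), soft,
                 reduction='batchmean').backward()  # grads accumulate

    # Single update on the shared supernet weights.
    optimizer.step()
```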
Contribution
Method
Training a High-Quality Single-Stage Model
Sandwich Rule (previous work)
Inplace Distillation (previous work)
Batch Norm Calibration (previous work)
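Batch-norm calibration re-estimates the BN running statistics of each sampled child before evaluation, since the statistics accumulated during supernet training do not match any single child. A minimal PyTorch sketch, assuming the same hypothetical `set_active(cfg)` API as in the earlier sketch; the number of calibration batches is an illustrative choice:

```python
import torch

@torch.no_grad()
def calibrate_bn(supernet, child_config, loader, num_batches=50):
    """Recompute BN running stats for one child by forwarding a few
    training batches; no gradient updates are performed."""
    supernet.set_active(child_config)
    for m in supernet.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None          # use cumulative moving average
    supernet.train()                   # BN updates running stats in train mode
    for i, (images, _) in enumerate(loader):
        if i >= num_batches:
            break
        supernet(images)
    supernet.eval()
```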
Initialization
- With standard He initialization, the training accuracy of both the small (left) and big (right) child models drops to zero after a few thousand training steps during learning-rate warm-up.
- The single-stage model is only able to converge when the learning rate is reduced to 30% of its default value.
- If the initialization is modified according to Section 3.1, the model learns much faster at the beginning of training (shown in Figure 4) and reaches better performance at the end of training (shown in Figure 5).
Section 3.1:
we initialize the output of each residual block (before skip connection) to an all-zeros tensor by setting the learnable scaling coefficient γ = 0 in the last Batch Normalization [20] layer of each residual block.
We also add a skip connection in each stage transition when either resolutions or channels differ (using 2 × 2 average pooling and/or 1 × 1 convolution if necessary) to explicitly construct an identity mapping.
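A minimal PyTorch sketch of the two tricks quoted above (γ = 0 on the last BN of each residual block, plus an explicit identity-style skip across stage transitions); the module layout assumed here is for illustration only:

```python
import torch.nn as nn

def zero_init_last_bn(block: nn.Module) -> None:
    """Set gamma = 0 on the last BatchNorm of a residual block so the
    residual branch initially outputs zeros (identity mapping at init)."""
    last_bn = None
    for m in block.modules():
        if isinstance(m, nn.BatchNorm2d):
            last_bn = m  # keep the last BN (assumes registration order == forward order)
    if last_bn is not None:
        nn.init.zeros_(last_bn.weight)  # gamma = 0

class TransitionSkip(nn.Module):
    """Skip path across a stage transition: 2x2 average pooling when the
    resolution drops, plus a 1x1 conv when the channel count changes."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        layers = []
        if stride > 1:
            layers.append(nn.AvgPool2d(kernel_size=2, stride=stride))
        if in_ch != out_ch:
            layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        self.skip = nn.Sequential(*layers)

    def forward(self, x):
        return self.skip(x)
```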
Convergence Behavior
Small child models converge more slowly and need more training.
==> Use a learning-rate schedule that decays exponentially but ends at a constant value (the LR stops decaying once it reaches a small floor), so small child models keep learning late in training; see the sketch below.
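A sketch of such a schedule; the warm-up, decay rate, decay interval, and 5% floor used here are illustrative assumptions, not values taken from the paper or these notes:

```python
def lr_schedule(step: int, base_lr: float, warmup_steps: int,
                decay_rate: float = 0.97, decay_every: int = 3000,
                floor_ratio: float = 0.05) -> float:
    """Exponentially decaying learning rate with a constant ending:
    the LR stops decaying once it reaches floor_ratio * base_lr."""
    if step < warmup_steps:                              # linear warm-up
        return base_lr * (step + 1) / warmup_steps
    decayed = base_lr * decay_rate ** ((step - warmup_steps) / decay_every)
    return max(decayed, base_lr * floor_ratio)           # constant ending
```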
Regularization
The paper compares the effect of regularization (weight decay and dropout) under two rules:
- applying regularization to all child models
- applying regularization only to the full (biggest) network (see the sketch below)
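The rule the paper ends up adopting is the second one: regularization is active only while the biggest child is being trained in a batch. A sketch under the same hypothetical supernet API as above; the dropout rate and weight-decay coefficient are illustrative:

```python
import torch
import torch.nn.functional as F

def child_loss(supernet, images, labels, soft_labels, is_full,
               dropout_p=0.2, weight_decay=1e-5):
    """Apply dropout and weight decay only when training the full model."""
    # Enable dropout only for the biggest child.
    for m in supernet.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = dropout_p if is_full else 0.0

    logits = supernet(images)
    if is_full:
        loss = F.cross_entropy(logits, labels)
        # Explicit L2 penalty so weight decay only affects this branch.
        loss = loss + weight_decay * sum(p.pow(2).sum()
                                         for p in supernet.parameters())
    else:
        loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_labels,
                        reduction='batchmean')
    return loss
```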
Coarse-to-fine Architecture Selection
scale dimensions:
- input resolution (network-wise)
- channel num (layer-wise, top-n select)
- layer num (stage-wise, top-n select)
- kernel size (layer-wise, center)
We pre-define a coarse search grid (a selection sketch follows the list):
- five input resolutions (network-wise, {192, 224, 256, 288, 320})
- four depth configurations (stage-wise)
- two channel configurations (stage-wise)
- four kernel size configurations (stage-wise)
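One plausible reading of coarse-to-fine selection as a sketch: first grid-search the coarse, pre-defined configurations above under a cost budget, then locally mutate the best coarse skeleton for a fine-grained search. The `evaluate`, `cost`, and `mutate` helpers, and the exact form of the mutation step, are assumptions not spelled out in these notes:

```python
import itertools

# Coarse grid from the list above; the depth / channel / kernel entries are
# stand-in indices for the stage-wise configurations.
RESOLUTIONS = [192, 224, 256, 288, 320]
DEPTHS, CHANNELS, KERNELS = range(4), range(2), range(4)

def coarse_to_fine_select(evaluate, cost, mutate, budget, num_mutations=100):
    """evaluate(cfg) -> accuracy (after BN calibration), cost(cfg) -> FLOPs,
    mutate(cfg) -> slightly perturbed cfg; all three are caller-supplied."""
    best, best_acc = None, float("-inf")

    # Coarse stage: enumerate the pre-defined grid under the budget.
    for cfg in itertools.product(RESOLUTIONS, DEPTHS, CHANNELS, KERNELS):
        if cost(cfg) <= budget:
            acc = evaluate(cfg)
            if acc > best_acc:
                best, best_acc = cfg, acc

    # Fine stage: local random mutations around the best coarse skeleton.
    for _ in range(num_mutations):
        cand = mutate(best)
        if cost(cand) <= budget:
            acc = evaluate(cand)
            if acc > best_acc:
                best, best_acc = cand, acc
    return best, best_acc
```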
Experiments
Cost
8×8 TPUv3
Training one single-stage model takes roughly 36 hours.
ImageNet
ablation study
Finetuning child models
Training child from scratch
Conclusion
Summary
- This work brings one-shot supernet training to a fairly complete form; by comparison, OFA still needs multiple stages of progressive distillation / fine-tuning, so it is less "one-shot".
- It carries the "no fine-tuning required" motivation all the way through.
- The main reason the whole pipeline works is probably the top-n select + self-KD (inplace distillation) supernet training scheme; the other components are small tweaks worth one or two points each.
- The training cost is also acceptable (compared with Once-for-All).
- Neither fine-tuning nor training child models from scratch improves accuracy further, which is somewhat counter-intuitive; it suggests this supernet training pipeline is indeed very beneficial for the child models?
- The search space appears to be carefully hand-designed.
- Not open-sourced.