• 【BigNAS】2020-ECCV-BigNAS Scaling Up Neural Architecture Search with Big Single-Stage Models - Paper Reading


    BigNAS

    2020-ECCV-BigNAS Scaling Up Neural Architecture Search with Big Single-Stage Models

    Source: ChenBong, 博客园 (cnblogs)

    • Institute:Google Brain
    • Author:Jiahui Yu
    • GitHub:/
    • Citation: 20+

    Introduction

    Train a single supernet; child networks sampled from it can be deployed directly, without retraining or fine-tuning.

    [figure]


    Motivation

    • 2019-ICLR-Slimmable Neural Networks
      • scale dimensions:
        • channel num (network-wise, top-n index)
      • per batch:
        • channel num (4 fixed widths: 0.25x, 0.5x, 0.75x, 1.0x)

    • 2019-ICCV-US-Net
      • scale dimensions:
        • channel num (network-wise, top-n index)
      • per batch:
        • channel num (4 random widths: min, random×2, max)
      • inplace distillation: $CE(\text{max}, \hat{y}),\ CE(\text{min}, y_{\text{max}}),\ CE(\text{random}_{1,2}, y_{\text{max}})$

    • 2020-ECCV-MutualNet
      • scale dimensions:
        • input resolution (network-wise)
        • channel num (network-wise, top-n index)
      • per batch:
        • input resolution (the same single crop, resized to 4 resolutions)
        • channel num (4 random widths: min, random×2, max)
      • inplace distillation: $CE(\text{max}, \hat{y}),\ KLDiv(\text{random}_{1,2}, y_{\text{max}}),\ KLDiv(\text{min}, y_{\text{max}})$

    • 2020-ECCV-RS-Net
      • scale dimensions:
        • input resolution (network-wise)
      • per batch:
        • input resolution (the same single crop, resized to N resolutions: $S_1 > S_2 > \dots > S_N$)
      • inplace distillation: $CE(\text{ensemble}, \hat{y}),\ CE(S_1, \text{ensemble}),\ CE(S_2, S_1), \dots$

    • 2020-ICLR-Once for All:
      • scale dimensions:
        • input resolution (network-wise)
        • channel num (layer-wise, top channels selected by L1 norm)
        • layer num (stage-wise, top-n select)
        • kernel size (layer-wise, center + per-layer transformation)
      • train the full network first, then progressively shrink
      • KD: $CE(ps_1, \text{full}),\ CE(ps_2, ps_1), \dots$ (each progressive-shrinking stage is distilled from the previous one)

    • 2020-ECCV-BigNAS
      • scale dimensions:
        • input resolution (network-wise)
        • channel num (layer-wise, top-n select)
        • layer num (stage-wise, top-n select)
        • kernel size (layer-wise, center)
      • per batch:
        • input resolution (the same single crop, resized to 4 resolutions)
        • sample 3 child models per batch (full, rand1, rand2)
      • inplace distillation (sketched after this list): $CE(\text{full}, \hat{y}),\ KLDiv(\text{random}_{1,2}, y_{\text{full}})$
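    A minimal PyTorch-style sketch of the per-batch scheme summarized above (one full child trained on the labels, two random children distilled from it). The `supernet.sample_max()` / `supernet.sample_random()` interface is a hypothetical placeholder, not the paper's actual API, and per-child input resizing is omitted for brevity.

    ```python
    import torch.nn.functional as F

    def train_step(supernet, optimizer, images, labels, num_random=2):
        """One sandwich-rule training step with inplace distillation.

        `supernet.sample_max()` / `supernet.sample_random()` are assumed to
        return child models that share the supernet's weights (hypothetical API).
        """
        optimizer.zero_grad()

        # 1) Biggest child: trained against the ground-truth labels, CE(full, y).
        logits_full = supernet.sample_max()(images)
        F.cross_entropy(logits_full, labels).backward()

        # Soft targets for inplace distillation come from the biggest child.
        soft_targets = F.softmax(logits_full.detach(), dim=1)

        # 2) Random children: trained to match the biggest child, KLDiv(rand, y_full).
        for _ in range(num_random):
            logits_rand = supernet.sample_random()(images)
            F.kl_div(F.log_softmax(logits_rand, dim=1),
                     soft_targets, reduction="batchmean").backward()

        # Gradients from all sampled children are accumulated, then applied once.
        optimizer.step()
    ```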

    [figure]


    Contribution


    Method

    Training a High-Quality Single-Stage Model

    Sandwich Rule (previous work)

    Inplace Distillation (previous work)

    Batch Norm Calibration (previous work)
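    The sandwich rule and inplace distillation correspond to the per-batch scheme sketched in the Motivation section. Batch-norm calibration means re-estimating the BN running statistics for each sampled child before evaluation, since the statistics accumulated during supernet training do not match any single child. A minimal sketch, assuming a sampled `child` module and a training-data `loader` (both placeholders):

    ```python
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def calibrate_batch_norm(child: nn.Module, loader, num_batches=100, device="cuda"):
        """Re-estimate BN running mean/var of a sampled child on training data."""
        for m in child.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.reset_running_stats()
                m.momentum = None  # None => cumulative moving average over seen batches

        child.train()  # BN layers update running stats only in train mode
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            child(images.to(device))
        child.eval()
    ```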


    Initialization

    1. With He initialization, the training accuracy of both the small (left) and the big (right) child models drops to zero after a few thousand training steps during learning-rate warm-up.
    2. The single-stage model is able to converge when the learning rate is reduced to 30% of its default value.
    3. With the initialization modified as in Section 3.1, the model learns much faster at the beginning of training (shown in Figure 4) and reaches better performance at the end of training (shown in Figure 5).

    Section 3.1:

    we initialize the output of each residual block (before the skip connection) to an all-zeros tensor by setting the learnable scaling coefficient γ = 0 in the last Batch Normalization [20] layer of each residual block.

    We also add a skip connection in each stage transition when either resolutions or channels differ (using 2 × 2 average pooling and/or 1 × 1 convolution if necessary) to explicitly construct an identity mapping.
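    A minimal sketch of the γ = 0 trick quoted above, assuming each residual block exposes its last BatchNorm layer under the attribute name `bn_last` (a hypothetical convention, not the paper's code):

    ```python
    import torch.nn as nn

    def zero_init_residual_blocks(model: nn.Module):
        """Set gamma = 0 in the last BN of every residual block, so each block
        initially outputs zeros and the network starts close to an identity mapping."""
        for m in model.modules():
            # `bn_last` is an assumed attribute marking the block's final BN layer.
            if hasattr(m, "bn_last") and isinstance(m.bn_last, nn.BatchNorm2d):
                nn.init.zeros_(m.bn_last.weight)  # gamma
    ```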

    [figure]


    [figure]


    Convergence Behavior

    Small child models converge more slowly and need more training.

    ==> use a learning-rate schedule that decays exponentially but ends at a constant value instead of decaying to zero.
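    A minimal sketch of such a schedule (exponential decay with a constant ending); the decay factor, interval, and the 5% floor are illustrative values, not the paper's exact numbers, and warm-up is omitted:

    ```python
    def lr_at_step(step, base_lr=0.1, gamma=0.97, decay_interval=1000, min_ratio=0.05):
        """Exponential decay with a constant ending: decay by `gamma` every
        `decay_interval` steps, but never go below `min_ratio * base_lr`."""
        lr = base_lr * (gamma ** (step // decay_interval))
        return max(lr, min_ratio * base_lr)

    # Usage with PyTorch: LambdaLR expects a multiplier of the optimizer's base LR,
    # so pass base_lr=1.0 here and let the optimizer hold the real base LR.
    # scheduler = torch.optim.lr_scheduler.LambdaLR(
    #     optimizer, lambda step: lr_at_step(step, base_lr=1.0))
    ```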

    [figure]


    [figure]


    Regularization

    We compare the effects of regularization (weight decay and dropout) under two rules:

    1. applying regularization on all child models
    2. applying regularization only on the full network
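    A sketch of rule 2 (regularization applied only to the full network), folded into the per-batch step sketched earlier. `supernet.sample_max()` and the dropout helper are hypothetical, and the optimizer's built-in weight decay is assumed to be turned off so the L2 term is applied only here:

    ```python
    import torch
    import torch.nn.functional as F

    def set_dropout(model, p):
        """Enable or disable dropout by setting its probability (p = 0 disables it)."""
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                m.p = p

    def full_child_loss(supernet, images, labels, weight_decay=1e-5, dropout_p=0.2):
        """CE + explicit L2 weight decay for the biggest child only, with dropout on.
        Random children are then trained without dropout or weight decay (rule 2)."""
        full = supernet.sample_max()          # hypothetical API, as in the earlier sketch
        set_dropout(full, dropout_p)
        logits = full(images)
        set_dropout(full, 0.0)                # keep dropout off for the other children

        l2 = sum(p.pow(2).sum() for p in full.parameters() if p.requires_grad)
        return F.cross_entropy(logits, labels) + weight_decay * l2
    ```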

    [figure]


    Coarse-to-fine Architecture Selection

    Scale dimensions:

    • input resolution(network-wise)
    • channel num(layer-wise,top n select)
    • layer num(stage-wise,top n select)
    • kernel size(layer-wise,center)

    We pre-define the following coarse grid (a coarse-search sketch follows this list):

    five input resolutions (network-wise, {192, 224, 256, 288, 320}),

    four depth configurations (stage-wise),

    two channel configurations (stage-wise),

    four kernel size configurations (stage-wise)
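    A sketch of the coarse stage under this pre-defined grid: enumerate all coarse candidates, keep those under a resource budget, and pick the most accurate one; the fine stage (not shown) then mutates channels/depths around the best coarse candidate. `evaluate_accuracy` and `estimate_flops` are hypothetical callbacks (the former would sample the child from the supernet, calibrate its BN statistics, and measure validation accuracy).

    ```python
    import itertools

    # Pre-defined coarse grid from the note above; depth/channel/kernel settings are
    # stage-wise configurations, shown here only as opaque labels.
    RESOLUTIONS     = [192, 224, 256, 288, 320]
    DEPTH_CONFIGS   = ["d0", "d1", "d2", "d3"]
    CHANNEL_CONFIGS = ["c0", "c1"]
    KERNEL_CONFIGS  = ["k0", "k1", "k2", "k3"]

    def coarse_search(evaluate_accuracy, estimate_flops, flops_budget):
        """Return the most accurate coarse candidate under the FLOPs budget."""
        best_arch, best_acc = None, -1.0
        for arch in itertools.product(RESOLUTIONS, DEPTH_CONFIGS,
                                      CHANNEL_CONFIGS, KERNEL_CONFIGS):
            if estimate_flops(arch) > flops_budget:
                continue
            acc = evaluate_accuracy(arch)
            if acc > best_acc:
                best_arch, best_acc = arch, acc
        return best_arch, best_acc
    ```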

    [figure]


    Experiments

    Cost

    8×8 TPUv3

    Training a single-stage model takes roughly 36 hours.


    ImageNet

    [figure]


    Ablation Study

    Finetuning child models

    [figure]


    Training child models from scratch

    [figure]


    Conclusion

    Summary

    • This work makes one-shot supernet training fairly complete; by comparison, OFA still needs multiple stages of progressive distillation / fine-tuning, so it is not truly one-shot.

    • It carries the "no fine-tuning needed" motivation through to the end.

    • The main reason the whole pipeline works is probably the supernet training scheme (top-n selection + self-distillation); the remaining parts are small tweaks each worth only a point or two.

    • The training cost is also acceptable (compared with Once-for-All).

    • Neither fine-tuning nor training child models from scratch improves performance further, which is somewhat counter-intuitive; it suggests the supernet training pipeline really is highly beneficial to the child models?

    • The search space appears to be carefully hand-designed.

    • No open-source code.


    To Read

    Reference

  • Original post: https://www.cnblogs.com/chenbong/p/14353937.html