【NeurIPS】ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias



    Paper: https://openreview.net/forum?id=_WnAQKse_uK

    Code: https://github.com/Annbless/ViTAE

    1、Motivation

    The core idea of the paper is very simple: combine CNNs with ViT, using convolutions in the shallow layers and transformer blocks in the deep layers. In addition, a convolution branch is added in parallel with the attention branch.

    2、Method

    The overall network architecture is shown in the figure below. It consists of three Reduction Cells (RC) and several Normal Cells (NC).

    [Figure: overall ViTAE architecture, with three Reduction Cells followed by several Normal Cells]

    RC Module

    Compared with the transformer block in ViT, the RC adds a pyramid reduction step: several dilated convolutions with different dilation rates run in parallel, and their outputs are concatenated into a single feature. In addition, the shortcut path gains three extra convolution layers. Finally, a seq2img step converts the token sequence back into a feature map.
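    To make this concrete, below is a minimal PyTorch sketch of the RC idea. It is not the authors' implementation: the dilation rates, the design of the three shortcut convolutions, the activations, and the omission of positional embeddings and the feed-forward sub-layer are all simplifying assumptions for brevity; see the official repo for the real code.

```python
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    """Parallel dilated convolutions whose outputs are concatenated,
    approximating the multi-scale pyramid reduction described above.
    `dim` is assumed divisible by len(dilations)."""
    def __init__(self, in_ch, dim, dilations=(1, 2, 3, 4), stride=2):
        super().__init__()
        branch_ch = dim // len(dilations)
        # padding == dilation keeps every branch at the same output size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, 3, stride=stride, padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):                        # x: (B, C, H, W)
        return torch.cat([b(x) for b in self.branches], dim=1)

class ReductionCellSketch(nn.Module):
    """RC sketch: pyramid reduction -> self-attention over tokens,
    plus a three-conv branch on the shortcut; the attention output is
    reshaped back into a feature map (seq2img)."""
    def __init__(self, in_ch, dim, num_heads=4):
        super().__init__()
        self.prm = PyramidReduction(in_ch, dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "three convs in the shortcut" per the post; kernel sizes are assumed
        self.pcm = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.prm(x)                          # (B, dim, H', W')
        b, d, h, w = y.shape
        seq = y.flatten(2).transpose(1, 2)       # img2seq: (B, H'*W', dim)
        n = self.norm(seq)
        attn_out, _ = self.attn(n, n, n)
        out = attn_out.transpose(1, 2).reshape(b, d, h, w)  # seq2img
        return out + self.pcm(x)                 # fuse attention and conv paths
```

    For example, `ReductionCellSketch(3, 64)` maps a `(B, 3, 224, 224)` image to a `(B, 64, 112, 112)` feature map, halving the spatial resolution as a reduction stage should.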

    NC Module

    The only difference from ViT's transformer block is that a parallel convolution branch is added where the attention is computed.
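    Along the same lines, a hedged sketch of an NC: a standard pre-norm transformer block whose attention output is summed with a parallel convolution branch operating on the 2-D layout of the same tokens. The depthwise 3x3 convolutions, the branch depth, and the omission of the class token are assumptions, not the paper's exact design.

```python
import torch.nn as nn

class NormalCellSketch(nn.Module):
    """ViT-style block where the attention result is augmented with a
    parallel convolution branch (the one change highlighted above)."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # conv branch; depthwise convs are an assumption made for brevity
        self.conv_branch = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, hw):               # x: (B, N, dim), hw = (H, W), N == H*W
        h, w = hw
        b, n, d = x.shape
        y = self.norm1(x)
        a, _ = self.attn(y, y, y)
        # the conv branch sees the same tokens laid out as a feature map
        c = x.transpose(1, 2).reshape(b, d, h, w)
        c = self.conv_branch(c).flatten(2).transpose(1, 2)
        x = x + a + c                       # sum attention and conv branches
        return x + self.mlp(self.norm2(x))
```

    A full ViTAE-style stage would then stack one RC (to downsample and embed) followed by several such NCs, mirroring the three-RC plus multiple-NC layout in the architecture figure above.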

    3、Interesting Points

    Judging from the OpenReview comments, the strong points the reviewers acknowledged:

    • The idea of injecting multi-scale features is interesting and promising.
    • The paper is well written and easy to follow.

    At the same time, the paper also has some weak points:

    • The paper uses an additional conv branch together with the self-attention branch to construct the new network architecture; it is obvious that the extra conv layers will help improve the performance of the network. The proposed network modification looks a little incremental and not very interesting to me.
    • There are no results on downstream object detection and segmentation tasks, which is surprising since this paper aims to introduce inductive bias about visual structure.
    • The proposed method is mainly verified on small input images, so I am a little concerned about its memory consumption and running speed when applied to large images (as segmentation and detection typically use large image resolutions).
  • Original post: https://www.cnblogs.com/gaopursuit/p/16065130.html