Applications of BERT in Multimodal Domains

    Since its release, BERT (Bidirectional Encoder Representations from Transformers) has substantially raised the state of the art across NLP benchmarks, thanks to the Transformer's powerful feature-learning capacity and the bidirectional encoding enabled by its masked language modeling objective. Given this strong learning ability, it began to be adopted in the multimodal domain from 2019 onward. Its multimodal applications fall into two main camps: single-stream models, in which textual and visual information are fused from the very start, and two-stream models, in which textual and visual information first pass through two independent encoder modules and are then fused through mutual (cross-modal) attention. This article introduces and compares five BERT models for image-text tasks: VisualBERT, Unicoder-VL, VL-BERT, ViLBERT, and LXMERT.
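    The single-stream vs. two-stream distinction can be sketched in a few lines of numpy. This is a minimal illustration, not any paper's actual implementation: the dimensions are toy values, random vectors stand in for real token and region embeddings, and the projections, multi-head splitting, residual connections, and layer normalization of a real Transformer are all omitted.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v):
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        d = q.shape[-1]
        return softmax(q @ k.T / np.sqrt(d)) @ v

    rng = np.random.default_rng(0)
    d = 16                                # toy hidden size
    text = rng.normal(size=(5, d))        # 5 text-token embeddings
    visual = rng.normal(size=(3, d))      # 3 image-region embeddings

    # Single-stream style (VisualBERT / Unicoder-VL / VL-BERT):
    # concatenate both modalities into one sequence and run ordinary
    # self-attention over it, so fusion happens from the first layer.
    joint = np.concatenate([text, visual], axis=0)    # (8, d)
    single_stream_out = attention(joint, joint, joint)

    # Two-stream style (ViLBERT / LXMERT): each modality first encodes
    # itself independently; a cross-attention step then lets text query
    # the visual stream and vice versa, so fusion happens only in
    # dedicated later layers.
    text_enc = attention(text, text, text)
    vis_enc = attention(visual, visual, visual)
    text_fused = attention(text_enc, vis_enc, vis_enc)  # text attends to vision
    vis_fused = attention(vis_enc, text_enc, text_enc)  # vision attends to text

    print(single_stream_out.shape)            # (8, 16)
    print(text_fused.shape, vis_fused.shape)  # (5, 16) (3, 16)
    ```

    Note the structural consequence: in the single-stream case every token can attend to every other token (text or visual) immediately, while the two-stream design keeps the modalities separate until the explicit cross-attention step.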

    Single-Stream Models

    1. VisualBERT 

    Paper: VisualBERT: A Simple and Performant Baseline for Vision and Language

    Paper link: https://arxiv.org/abs/1908.03557

    Code: https://github.com/uclanlp/visualbert

    2. Unicoder-VL

    Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

    Paper link: https://arxiv.org/abs/1908.06066

    3. VL-BERT 

    Paper: VL-BERT: Pre-training of Generic Visual-Linguistic Representations

    Paper link: https://arxiv.org/abs/1908.08530

    Code: https://github.com/jackroos/VL-BERT

    Two-Stream Models

    1. ViLBERT 

    Paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

    Paper link: https://arxiv.org/abs/1908.02265

    Code: https://github.com/facebookresearch/vilbert-multi-task

    2. LXMERT 

    Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers

    Paper link: https://arxiv.org/abs/1908.07490

    Code: https://github.com/airsplay/lxmert

    Video-Based BERT

    1. VideoBERT

    Paper: VideoBERT: A Joint Model for Video and Language Representation Learning

    Venue: ICCV 2019


    Original article: https://www.cnblogs.com/wangxiaocvpr/p/12402975.html