Applications of BERT in Multimodal Learning
Since its introduction, BERT (Bidirectional Encoder Representations from Transformers) has substantially raised the benchmark performance of a wide range of NLP tasks, thanks to the Transformer's strong feature-learning capacity and the bidirectional encoding enabled by its masked language modeling objective. Given this strong learning ability, BERT began to be adopted in multimodal settings starting in 2019. Its multimodal applications fall into two main families. In single-stream models, textual and visual information are fused from the very beginning within a single encoder. In two-stream models, textual and visual information first pass through two independent encoding modules, and the two modalities are then fused through mutual (cross-modal) attention. This article introduces and compares five BERT models applied to image-text tasks: VisualBERT, Unicoder-VL, VL-BERT, ViLBERT, and LXMERT.
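The architectural difference between the two families can be sketched with toy scaled dot-product attention over dummy embeddings. This is a minimal illustration, not code from any of the papers; the shapes, the `attention` helper, and the variable names are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

text = rng.normal(size=(5, d))   # 5 text token embeddings (dummy)
image = rng.normal(size=(3, d))  # 3 image region features (dummy)

# Single-stream (VisualBERT/Unicoder-VL/VL-BERT style): concatenate both
# modalities into one sequence and run a shared self-attention stack.
joint = np.concatenate([text, image], axis=0)        # (8, d)
single_stream_out = attention(joint, joint, joint)   # (8, d)

# Two-stream (ViLBERT/LXMERT style): encode each modality separately,
# then fuse via cross-attention (queries from one modality, keys/values
# from the other).
text_enc = attention(text, text, text)
image_enc = attention(image, image, image)
text_attends_image = attention(text_enc, image_enc, image_enc)  # (5, d)
image_attends_text = attention(image_enc, text_enc, text_enc)   # (3, d)
```

In the single-stream case every token can attend to every other token of either modality from layer one; in the two-stream case cross-modal interaction happens only at the dedicated co-attention step.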
Single-Stream Models
1. VisualBERT
Paper: VisualBERT: A Simple and Performant Baseline for Vision and Language
Link: https://arxiv.org/abs/1908.03557
Code: https://github.com/uclanlp/visualbert
2. Unicoder-VL
Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Link: https://arxiv.org/abs/1908.06066
3. VL-BERT
Paper: VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Link: https://arxiv.org/abs/1908.08530
Code: https://github.com/jackroos/VL-BERT
Two-Stream Models
1. ViLBERT
Paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Link: https://arxiv.org/abs/1908.02265
Code: https://github.com/facebookresearch/vilbert-multi-task
2. LXMERT
Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Link: https://arxiv.org/abs/1908.07490
Code: https://github.com/airsplay/lxmert
Video-Based BERT
1. VideoBERT
Paper: VideoBERT: A Joint Model for Video and Language Representation Learning
Venue: ICCV 2019