Paper notes: WWW 2021: A Linguistic Study on Relevance Modeling in Information Retrieval

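Reading notes: A Linguistic Study on Relevance Modeling in Information Retrieval
Authors: Institute of Computing Technology (Chinese Academy of Sciences), Xueqi Cheng & Jiafeng Guo
WWW2021
- Contributions:
1. The first experimental analysis, from a linguistic perspective, of how relevance modeling differs across three IR tasks: query-document retrieval, answer selection, and response retrieval.
2. It answers two questions:
  1. Which linguistic properties does the relevance model for each of the three tasks focus on?
  2. How can the differences in linguistic focus among the three tasks be used to improve existing relevance models?
- Paper link
- Code:
- Note: the paper is worth reading several times; both the writing and the experimental setup are worth learning from.
1. Systematically summarize the similarities and differences among the three IR tasks, and which features the relevance models for each task focus on.
2. How to carry each task's focus on particular linguistic properties over into the model.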
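Preface: this is an analysis paper, but its experimental setup and its analytical angle are both novel. It is worth thinking carefully about whether the experiments are sound and how the conclusions are derived from the results. The paper revolves around a few core questions; understanding and answering them is the main goal.
Summary:
The paper centers on relevance, covering the relevance-related IR tasks and the granularity of text understanding in NLP.
It starts from one question: are different IR tasks inherently heterogeneous? If so, how can the differences be identified, and how can they be exploited?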

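Three IR tasks: document retrieval (search engines), answer retrieval (question answering systems), and response retrieval (chatbots).
Core question: if the three IR tasks are learned with one unified relevance model, does their inherent heterogeneity change what the relevance model focuses on during modeling?
To answer this, the paper works through the following sub-questions:
Q1: Do the unified relevance models for the three IR tasks really differ in terms of natural language understanding? In other words, which levels of NLU information (lexical, syntactic, semantic) does the same model architecture focus on for each IR task?
W1: The paper uses BERT as the unified relevance model for all three tasks and explores what BERT focuses on when modeling each of them. Concretely, it uses 16 NLU tasks (covering the lexical, syntactic, and semantic levels) to examine which NLU features each model attends to. This part is somewhat roundabout, and the setup rests on an assumption: BERT is fine-tuned on each of the three tasks, the best-performing models (BERT_doc, BERT_ans, BERT_res) are evaluated on the 16 predefined tasks, and the results are compared with those of the original BERT on the same tasks. A large improvement is taken to mean that fine-tuning on the IR task gave the model the corresponding NLU ability, and therefore that solving this kind of IR task requires that ability from the relevance model. PS: as a forward inference this barely holds.
a. Select the best fine-tuned BERT_ft model on each of the 3 IR tasks.
b. Probe on the 16 NLU tasks and measure the performance gap.
c. Infer backwards that solving the IR task requires the corresponding NLU ability.
However, a negative performance gap does not show that solving the IR task does not need the corresponding NLU ability.
A1: Document retrieval focuses more on semantic tasks; answer retrieval focuses on both syntactic and semantic tasks; response retrieval pays little attention to any of the lexical, syntactic, or semantic levels. In addition, synonym understanding matters for all three IR tasks.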

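Q2: How sensitive is each of the three IR tasks to its two inputs?
W2: Keep only one of the two inputs and measure performance on the single-sentence NLU probing tasks, i.e. examine which of the two inputs has more influence on the result.
A2: Answer retrieval is more heterogeneous than the other two tasks: its two inputs behave differently.
Q3: How can the inherent heterogeneity of the three IR tasks be applied to relevance models to improve their performance?
W3: Experiments combine intervention factors with intervention methods to find which factors and which methods improve the relevance model on each of the three IR tasks.
A3: Parameter intervention is the most useful method.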

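Comparison of the three IR tasks: the experiments provide additional results... (many earlier papers have already compared the lengths of the two inputs)
1. doc retrieval: search engines.
  1.1 A relevant document is topically related to the search query.
  1.2 Queries are short and their intent is often unclear; in most cases they are just a few keywords, typically about 2.35 terms. Documents are long and vary in length. This length mismatch makes relevance matching harder.
2. answer retrieval: question answering systems
  2.1 A relevant answer correctly solves the user's question.
  2.2 Questions are usually natural language with clear intent and complete form; answers are usually short, mostly text spans.
  2.3 The answer must not only be topically related to the question but also actually solve the problem.
3. response retrieval: chatbots
  3.1 A relevant response to the input utterance.
  3.2 The conversation history needs to be considered.
  3.3 Both the input utterance and the candidate responses are usually short sentences.
  3.4 The response should correspond semantically to the utterance or be coherent with it, stay consistent with it, and avoid trivial replies.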
Chapter 3: Probing analysis for questions 1 and 2
- Question: whether the relevance modeling in different IR tasks really shows differences in terms of natural language understanding
3.1 Probing method

Core of the probing analysis: learn a retrieval model of one unified form on each of the three retrieval tasks, then take the learned models (BERT-doc, BERT-ans, BERT-res) to the NLU tasks and compare their performance with that of the initial model.
The core idea of the probing analysis is to learn a unified representative retrieval model over the three IR tasks, and probe the learned model to compare the focuses between different relevance modeling tasks.
Finally, we analyze the performance gap of each probing task between the original and finetuned Bert over each IR task.
Taking the original Bert as a baseline, the relative performance gap of the finetuned Bert over the baseline on a probing task could reflect the importance of the specific linguistic property for the corresponding retrieval task.
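
A minimal sketch of the comparison described above, assuming the relative gap is the percentage change of the fine-tuned score over the original BERT score on each probing task. The exact formula is not spelled out in these notes, and the function names and all scores below are made up for illustration.

```python
def relative_gap(base_score, finetuned_score):
    """Percentage change of the fine-tuned model over the baseline (assumed definition)."""
    if base_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (finetuned_score - base_score) / base_score


def probing_gaps(base_scores, finetuned_scores):
    """Relative gap per probing task; both dicts map task name -> score."""
    return {
        task: relative_gap(base_scores[task], finetuned_scores[task])
        for task in base_scores
    }


# Hypothetical probing scores, not taken from the paper.
base = {"NER": 80.0, "Synonym": 50.0, "Chunk": 90.0}
finetuned = {"NER": 84.0, "Synonym": 60.0, "Chunk": 81.0}

gaps = probing_gaps(base, finetuned)
print(gaps)  # {'NER': 5.0, 'Synonym': 20.0, 'Chunk': -10.0}
```

A negative gap only says that the probing score dropped after fine-tuning; it does not by itself show that the retrieval task has no use for that property.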

3.2 Probing tasks

Lexical Tasks:
Chunk: divide a complicated text into smaller parts, CoNLL 2000 dataset
POS: UD-EWT dataset
NER: CoNLL 2003 dataset

Syntactic Tasks:
GED: Grammatical Error Detection, detect grammatical errors in a sentence, First Certificate in English dataset
SynArcPred: syntactic arc dependency prediction, UD-EWT dataset
SynArcCls: syntactic arc dependency classification, UD-EWT dataset
Word Scramble: word ordering, PAWS-wiki dataset

Semantic Tasks:
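PS-fxn: Preposition Supersense Disambiguation, function of the preposition, STREUSLE 4.0
PS-role: role of the preposition, STREUSLE 4.0
CorefArcPred: coreference arc prediction for pronouns and entities, CoNLL dataset
SemArcPred: semantic arc dependency prediction, whether two tokens are semantically dependent
SemArcCls: semantic arc dependency classification, classify the semantic relation between two tokens, SemEval 2015 dataset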
Polysemy: whether the same word has the same meaning in two sentences, crawled 10k sentences from an online website
Synonym: whether two words in two sentences have the same meaning
Keyword: keyword identification, Inspec dataset
Topic: document topic classification, Yahoo! Answers dataset

3.3 Probing experiments

3.3.1 Retrieval Model: BERT-Base vs BERT-Base_FineTune
[SEP] separates the two inputs, and [S] and [T] mark the start and end of the whole input
Detailed experimental settings: Appendix A
doc-retrieval: dataset: Robust04; loss: pairwise ranking loss; metric: NDCG@20

answer-retrieval: dataset: MsMarco; metric: MRR@10

response-retrieval: dataset: Ubuntu; loss: cross entropy loss; metric: recall@1
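
The metrics and the pairwise loss named above can be sketched in plain Python. This is an illustrative sketch under several assumptions: the pairwise ranking loss is written as a hinge loss with a margin, NDCG uses the exponential gain variant and computes the ideal ranking from the labels in the list it is given, and recall@1 assumes each candidate list is ranked best-first. The paper's exact variants are not given in these notes, and all function names are hypothetical.

```python
import math


def mrr_at_k(ranked_labels, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def recall_at_1(ranked_labels):
    """1.0 if the top-ranked candidate is relevant, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] > 0 else 0.0


def dcg_at_k(labels, k):
    """Discounted cumulative gain with exponential gain (assumed variant)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(labels[:k]))


def ndcg_at_k(ranked_labels, k=20):
    """DCG normalized by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Zero once the relevant item outscores the irrelevant one by the margin."""
    return max(0.0, margin - (pos_score - neg_score))
```

Each function takes the relevance labels of one ranked list, best-ranked first, so a corpus-level score is the mean over queries.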

3.3.2 Retrieval Datasets:
doc-retrieval: Robust04; answer-retrieval: MsMarco; response-retrieval: Ubuntu
Inputs are padded with [PAD], and long texts are truncated to 512 tokens
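
The input layout described in this section can be sketched as a small helper. It is only an illustration under stated assumptions: the start and end markers default to the [S] and [T] names used in these notes (standard BERT tokenizers use [CLS] and a trailing [SEP]), the left input is kept whole where possible and the right input absorbs the truncation, and `build_input` is a hypothetical name rather than anything from the paper's code.

```python
def build_input(left_tokens, right_tokens, max_len=512,
                start="[S]", sep="[SEP]", end="[T]", pad="[PAD]"):
    """Join two token lists into one fixed-length sequence.

    Returns (tokens, segment_ids, attention_mask), each of length max_len.
    """
    budget = max_len - 3                      # room for start, sep, end
    left = left_tokens[:budget]
    right = right_tokens[:budget - len(left)]  # truncate the right side

    tokens = [start] + left + [sep] + right + [end]
    segment_ids = [0] * (len(left) + 2) + [1] * (len(right) + 1)
    attention_mask = [1] * len(tokens)

    n_pad = max_len - len(tokens)
    return (tokens + [pad] * n_pad,
            segment_ids + [0] * n_pad,
            attention_mask + [0] * n_pad)


tokens, segments, mask = build_input(["cheap", "flights"],
                                     ["book", "cheap", "flights", "online"],
                                     max_len=8)
print(tokens)
# ['[S]', 'cheap', 'flights', '[SEP]', 'book', 'cheap', 'flights', '[T]']
```

Truncating the right side first matches the document retrieval case, where the query is short and the document is long; other tasks may prefer a different split.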

Statistics of the three datasets (related to the choice of datasets)
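3.4 Experimental results: the analysis of the results is crucial ★
3.4.1 How does the unified retrieval model perform on each retrieval task? Can BERT serve as the unified model for the three IR tasks?
BERT is fine-tuned to its best performance on doc-retrieval, ans-retrieval, and res-retrieval, yielding the BERT-doc, BERT-ans, and BERT-res models. Figure 1:

Performance of fine-tuned BERT and the original BERT on the three IR tasks (compared per BERT layer; the best-performing layer is selected at the end)
Conclusion: after fine-tuning, BERT improves substantially on all three datasets, which suggests that BERT learns task-specific features for each task.
3.4.2 Do different IR tasks show different modeling focuses in terms of natural language understanding? How do the relevance models for the three IR tasks differ on the NLU tasks? Each of the 16 probing tasks is run on the BERT models fine-tuned on the three IR tasks to see what information the BERT-ft models have captured (which aspects matter for solving each IR task). The NLU tasks serve as probes of BERT-ft: the difference between BERT-base and BERT-ft on the probing tasks gives a quantitative analysis of the NLU knowledge contained in the three BERT-ft models.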

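Results of the 16 NLU probing tasks for BERT_base vs. BERT_ft on the three IR models (relative values)
Analysis of the results:
From the IR-task perspective:
1). document retrieval: focuses more on semantic tasks. BERT_doc improves over BERT_base on the semantic tasks (highlighted in yellow at the bottom left). Polysemy and Synonym improve by 4.68 and 18.35 respectively, which suggests that doc-retrieval needs the model to better understand the semantic meaning of a word pair. Supporting evidence: extracting document topics with topic models such as LDA can improve doc-retrieval.
2). answer retrieval: 11 of 16 tasks improve, 8 of them clearly. This suggests that answer retrieval is harder and requires a comprehensive understanding of the input. BERT_ans improves on all syntactic tasks, which suggests that syntactic information such as word order and sentence structure matters for answer retrieval.
3). response retrieval: BERT_res drops on 12 of 16 probing tasks, 10 of them clearly. I did not fully follow the paper's analysis here: It suggests that most linguistic properties encoded by the original Bert has already been sufficient for the relevance modeling in response retrieval. Does BERT-base already encode most of the knowledge needed for response retrieval? At best the drop shows that BERT-res lost some NLU ability relative to BERT-base; one could equally say that fine-tuning adapted BERT-res to the response task at the cost of the NLU tasks. I still do not fully follow the paper's conclusion. In addition, BERT-res improves on the Synonym task but drops on Polysemy, which suggests that response retrieval needs the model to better understand synonyms in different contexts. Personally I think there are too few datasets to support the conclusion firmly.
From the probing-task perspective:
1). The CorefArcPred and Keyword tasks have only been significantly improved by BERT_doc among the three finetuned models but decreased by the rest. Meanwhile, the NER and GED tasks have only been significantly improved by BERT_ans but drop on the other two. The results indicate that relevance modeling in document retrieval pays more attention to similar keywords while the relevance modeling in answer retrieval pays more attention to identifying targeted entities in questions and answers.
2). doc-retrieval and ans-retrieval place more emphasis on understanding synonyms and topic words.
3). Synonym and PS-role improve in all three IR task models, and Synonym improves greatly, which suggests that capturing and understanding synonyms is crucial for all three IR tasks.
My impression: all six points above match intuition.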
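
The two kinds of statement made from the probing-task perspective (a task improved by every fine-tuned model, or improved by exactly one) can be read off a table of gaps mechanically. A small sketch, assuming a plain threshold on the gap stands in for the paper's significance criterion, which these notes do not specify; the helper names and the toy gaps are hypothetical and are not the paper's numbers.

```python
def improved_by_all(gaps_by_model, threshold=0.0):
    """Tasks whose gap exceeds the threshold for every model."""
    per_model = list(gaps_by_model.values())
    return sorted(
        task for task in per_model[0]
        if all(gaps[task] > threshold for gaps in per_model)
    )


def improved_only_by(gaps_by_model, model, threshold=0.0):
    """Tasks improved by `model` and by none of the other models."""
    others = [g for name, g in gaps_by_model.items() if name != model]
    return sorted(
        task for task, gap in gaps_by_model[model].items()
        if gap > threshold and all(g[task] <= threshold for g in others)
    )


# Toy relative gaps, made up for illustration.
toy_gaps = {
    "BERT_doc": {"Synonym": 18.0, "Keyword": 3.0, "NER": -1.0},
    "BERT_ans": {"Synonym": 9.0, "Keyword": -2.0, "NER": 4.0},
    "BERT_res": {"Synonym": 6.0, "Keyword": -1.5, "NER": -3.0},
}

print(improved_by_all(toy_gaps))               # ['Synonym']
print(improved_only_by(toy_gaps, "BERT_doc"))  # ['Keyword']
print(improved_only_by(toy_gaps, "BERT_ans"))  # ['NER']
```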

3.4.3 Do relevance modeling treat their inputs differently in terms of natural language understanding? Relevance models usually take a pair of texts as input; the goal here is to examine the performance gap of the left and right inputs separately.
We only keep the probing tasks whose input is a single sentence, and ignore the tasks that require a pair of inputs

Blue and orange bar represent the performance gap between the BERT-base model and the finetuned Bert models on the left and right input, respectively.
1. doc retrieval: the left (query) side and the right (document) side show similar trends on both lexical and syntactic probing tasks.
  The query side cares about the coarse-level functions of the prepositions (i.e., PS-fxn) while the document side pays attention to both the coarse-level functions and the fine-grained roles (i.e., PS-fxn and PS-role) in terms of the prepositions. Query side: coarse-grained; document side: coarse + fine-grained.
  The query side improves the performance on Keyword but drops on Topic, while the document side does the opposite. This is reasonable since queries are usually short keywords while documents are typically long articles. Keyword: query up, document down; Topic: query down, document up.
2. answer retrieval: the question side and the answer side show opposite results; the question side improves on 5 of 12 tasks, ......
3. Response Retrieval: the left (utterance) side and the right (response) side show similar trends on most of the probing tasks (i.e. 10 out of 12), with Chunk and PS-role as the exceptions.
Further conclusion on heterogeneity: in answer retrieval the left and right inputs are more heterogeneous.
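
Counts such as "10 out of 12 tasks show similar trends" come from comparing the sign of the gap on the left input with the sign on the right input, task by task. A sketch of that comparison, assuming "similar trend" simply means both sides improve or both sides do not; `split_by_trend` is a hypothetical helper and the gaps are made up.

```python
def split_by_trend(left_gaps, right_gaps):
    """Split tasks into those where both sides move the same way and the rest."""
    same, opposite = [], []
    for task in sorted(left_gaps):
        agrees = (left_gaps[task] > 0) == (right_gaps[task] > 0)
        (same if agrees else opposite).append(task)
    return same, opposite


# Toy gaps for a query side and a document side, made up for illustration.
query_side = {"Keyword": 2.0, "Topic": -1.0, "POS": 0.5}
document_side = {"Keyword": -3.0, "Topic": 4.0, "POS": 1.5}

same, opposite = split_by_trend(query_side, document_side)
print(same)      # ['POS']
print(opposite)  # ['Keyword', 'Topic']
```

The share of tasks in `opposite` gives one rough number for how heterogeneous the two inputs of a task are.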
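4. Intervention analysis
Intervention methods are introduced to study whether learning NLU tasks can help relevance models. On the IR tasks, model performance before and after introducing an NLU task is compared to see which intervention methods help. The three models fine-tuned on the three IR tasks serve as the baselines, and results are compared before and after NLU task learning is added.
4.1 Intervention setup: 4 NLU probing tasks are selected as intervention factors, with 3 intervention methods
Keyword, NER, Synonym, and SemArcCls are chosen as the 4 intervention factors, because results improved on these four.
Experiments with the intervention methods and factors need labels, so BERT-Large is used to generate pseudo-labels for the intervention factors (the pseudo-labels are produced on the IR datasets only; the NLU tasks serve merely as factors, and their own datasets are not used). The three intervention methods are:
1. Feature Intervention: add a feature embedding (from the generated pseudo-labels) on top of BERT's position, segment, and token embeddings, then run the experiments (my feeling is that this disturbs the structure BERT was pre-trained with, so performance may not be good).
2. Parameter Intervention: first fine-tune BERT on the NLU task, then fine-tune it further on each of the three IR tasks. An extra MLP layer is added on top of BERT to fit each NLU task.
3. Objective Intervention: multi-task learning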
  For objective intervention, we jointly learn the intervention factor as well as the retrieval task. For this purpose, we add a task-specific layer on top of the BERT model
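
Two of the three methods reduce to simple arithmetic and can be sketched without a model. The sketch below assumes that feature intervention sums a pseudo-label embedding into the usual three input embeddings, and that objective intervention adds the factor loss to the retrieval loss with a weight; the weight, the function names, and the toy vectors are assumptions for illustration, not details from the paper. Parameter intervention only changes the order of training stages (NLU task first, IR task second), so it has no formula to show.

```python
def feature_intervention_input(token_emb, position_emb, segment_emb, feature_emb):
    """Element-wise sum of the three BERT input embeddings and a feature embedding."""
    return [sum(values) for values in
            zip(token_emb, position_emb, segment_emb, feature_emb)]


def objective_intervention_loss(retrieval_loss, factor_loss, factor_weight=1.0):
    """Joint multi-task loss: retrieval loss plus a weighted factor loss."""
    return retrieval_loss + factor_weight * factor_loss


# Toy 2-dimensional embeddings for one token, made up for illustration.
x = feature_intervention_input(
    token_emb=[1.0, 2.0],
    position_emb=[0.5, 0.5],
    segment_emb=[0.0, 1.0],
    feature_emb=[0.25, 0.25],   # e.g. embedding of a pseudo NER tag
)
print(x)  # [1.75, 3.75]

print(objective_intervention_loss(0.5, 0.25, factor_weight=2.0))  # 1.0
```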
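4.2 Results and analysis
4.2.1 Comparison of intervention methods:

4.2.2 Analysis of intervention factors:
5. Related work
  5.1 Relevance models
  5.2 Probing analysis
6. Summary:
1. Relevance models are studied and analyzed on three IR tasks: document retrieval, answer retrieval, and response retrieval. Using BERT as a unified model, 16 NLU tasks probe the inherent heterogeneity of the three tasks; the results suggest that different IR tasks focus on different levels of NLU features.
2. Intervention methods and intervention factors are designed so that the factors found in the probing results can be used to further improve relevance models.
7. Takeaways from this paper:
1. To study the inherent differences among IR tasks, NLU tasks are used to quantify the performance change before and after fine-tuning, characterizing the features that the models for different relevance tasks focus on.
2. The same approach could be used to probe other IR tasks, with more and better-designed NLU probing tasks and probing methods.
3. The probing results can in turn help improve model performance on IR tasks.
4. Are synonyms really that crucial for IR matching tasks? Can plain BERT encoding capture the synonym information between two sentences?
5. From the experiments on the two inputs: the inputs carry different sensitivity in many tasks; how can this be exploited?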
Original post: https://www.cnblogs.com/zhangtao-0001/p/14797141.html
