RoBERTa & BERT embedding space bias

RoBERTa

Link to the original paper

https://arxiv.org/abs/1907.11692

https://zhuanlan.zhihu.com/p/149249619

Based on the Zhihu article above, RoBERTa's main changes can be summarized as:

  1. Train for longer, with larger batch sizes, on more data
  2. Remove BERT's NSP (next sentence prediction) task
  3. Train on longer sequences
  4. Dynamically change the masking pattern applied to the training data

Why these ideas came up

From BERT's perspective, masking for the Masked Language Model is applied during preprocessing. In random masking, BERT selects 15% of the tokens in each sentence; of those, 80% are replaced by [MASK], 10% are replaced by a random token, and 10% are kept unchanged. (Recall the COMP 5046 unit: the assignment used the same scheme.) The model's objective is then to predict the original tokens at the masked positions.
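As a concrete illustration, here is a minimal sketch of that 80/10/10 selection rule (not BERT's actual implementation; `MASK_TOKEN` and the toy vocabulary are placeholders):

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; real BERT uses the tokenizer's mask token id

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply the 80/10/10 masking rule described above (illustrative sketch)."""
    rng = rng or random.Random()
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:        # select ~15% of positions
            labels[i] = tok                 # the model must predict the original token here
            r = rng.random()
            if r < 0.8:                     # 80%: replace with [MASK]
                masked[i] = MASK_TOKEN
            elif r < 0.9:                   # 10%: replace with a random vocabulary token
                masked[i] = rng.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mlm_mask(["the", "cat", "sat", "on", "the", "mat"], vocab, rng=random.Random(0)))
```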

BERT also predicts whether two sentences appear next to each other in the original text; this task is called Next Sentence Prediction (NSP).
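For reference, a small sketch of how NSP training pairs are typically built (50% genuine next sentence, 50% random sentence from the corpus); the helper name and toy documents here are made up:

```python
import random

def make_nsp_pairs(documents, rng=None):
    """Build NSP examples: 50% true next sentence (IsNext), 50% random (NotNext)."""
    rng = rng or random.Random(0)
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                pairs.append((doc[i], rng.choice(all_sentences), "NotNext"))
    return pairs

docs = [["Sentence a1.", "Sentence a2.", "Sentence a3."],
        ["Sentence b1.", "Sentence b2."]]
for pair in make_nsp_pairs(docs):
    print(pair)
```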

Differences:

The biggest change is in how masking is handled.

BERT handles masking during preprocessing, so the masking is static: every epoch sees exactly the same masked result for a given sentence. To mitigate this, the training data is duplicated and masked in several different static ways (in the original setup, 10 copies over 40 epochs), but each masking pattern is still repeated across epochs.

RoBERTa instead uses dynamic masking: the mask is generated only when a sequence is about to be fed to the model, so each pass over the data can see a different pattern.
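Continuing the `mlm_mask` sketch above (and reusing its toy `vocab`), the contrast between the two regimes looks roughly like this:

```python
import random

dataset = [["the", "cat", "sat", "on", "the", "mat"],
           ["the", "dog", "ran"]]

# Static masking (BERT): mask once in preprocessing and reuse the same copies.
static_copies = [mlm_mask(sent, vocab, rng=random.Random(42)) for sent in dataset]
for epoch in range(3):
    for masked, labels in static_copies:
        pass  # every epoch trains on identical masked inputs

# Dynamic masking (RoBERTa): generate a fresh mask each time a sequence is used,
# so every epoch sees a different masking pattern of the same sentence.
for epoch in range(3):
    for sent in dataset:
        masked, labels = mlm_mask(sent, vocab, rng=random.Random())
        pass  # feed (masked, labels) to the model
```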

RoBERTa removes the NSP task, since the authors found that it did not improve performance as expected.

Reflection: dig further into the information in BERT's output to reach better semantic understanding

On the Sentence Embeddings from Pre-trained Language Models

Link to the original paper

https://arxiv.org/pdf/2011.05864.pdf

The paper mainly discusses two questions:

  1. Why does BERT perform poorly on unsupervised text-matching tasks? Is it because the embeddings lack semantic information, or because we extract that information in the wrong way?
  2. If the information is there but under-exploited, how can we make better use of it?

The paper's conclusion is that we do not exploit the information in BERT's output strongly enough.

The paper proposes a method, called BERT-flow, that maps BERT sentence embeddings into a different space.

The idea is motivated by two observations:

  1. Frequency bias: high-frequency words can shift the embedding space, so the relationships between vectors do not behave the way we expect.
  2. Low-frequency (rare) words are sparsely distributed in the BERT space and leave "holes"; these holes make semantic understanding harder.
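A toy numerical illustration of the first point (my own example, not from the paper): when every vector carries the same large offset, standing in for a dominant high-frequency direction, cosine similarity is governed by that offset rather than by the content.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two unrelated random vectors: cosine similarity is near 0, as expected.
a, b = rng.normal(size=64), rng.normal(size=64)
print("without shared offset:", round(cosine(a, b), 3))

# Add the same large offset to both: cosine similarity jumps close to 1,
# hiding the real (lack of) relationship between the two vectors.
offset = 10.0 * np.ones(64)
print("with shared offset:   ", round(cosine(a + offset, b + offset), 3))
```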

Based on these two problems, the paper proposes a transformation:

from the BERT embedding space to a standard Gaussian latent space.

This flow-based calibration helps produce better-behaved BERT sentence embeddings.
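To make the flow idea concrete, below is a minimal, hypothetical sketch: a tiny RealNVP-style coupling flow trained by maximum likelihood on frozen sentence embeddings, after which sentences are compared in the Gaussian latent space. The class names, layer sizes, and random stand-in data are assumptions for illustration only; this is not the paper's implementation.

```python
import math
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """One affine coupling layer: transform half the dims conditioned on the other half."""
    def __init__(self, dim, hidden=128, flip=False):
        super().__init__()
        assert dim % 2 == 0
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        if self.flip:
            x1, x2 = x2, x1
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        out = torch.cat([y2, x1] if self.flip else [x1, y2], dim=-1)
        return out, s.sum(dim=-1)               # output and log|det Jacobian|

class ToyFlow(nn.Module):
    """Stack of coupling layers mapping embeddings x to latents z ~ N(0, I)."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([Coupling(dim, flip=i % 2 == 1) for i in range(n_layers)])

    def forward(self, x):
        z, log_det = x, torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            z, ld = layer(z)
            log_det = log_det + ld
        return z, log_det

    def nll(self, x):
        # Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|
        z, log_det = self(x)
        log_pz = -0.5 * (z ** 2).sum(-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
        return -(log_pz + log_det).mean()

# Usage sketch: `embeddings` stands in for mean-pooled BERT sentence vectors;
# in the paper only the flow is optimized while BERT's parameters stay frozen.
dim = 768
embeddings = torch.randn(256, dim)
flow = ToyFlow(dim)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(100):
    opt.zero_grad()
    loss = flow.nll(embeddings)
    loss.backward()
    opt.step()

with torch.no_grad():
    z, _ = flow(embeddings)   # compare sentences by cosine similarity in z space
```

After training, similarity is computed on the latent vectors z rather than the raw BERT embeddings, which is where the "calibration" effect comes from.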