XLNet

Link to the original paper

https://arxiv.org/abs/1906.08237

Why the idea came up

XLNet implements its language model with the AR approach.

AR has a unidirectional limitation, while AE has the [MASK] issue: in the fine-tuning phase, BERT's input contains no [MASK] tokens, which creates a pretrain-finetune discrepancy. The authors also argue that BERT's assumption that the predicted (masked) tokens are independent of each other oversimplifies natural language. XLNet was therefore created to leverage the best of AR and AE while avoiding their limitations.

Main ideas of the method

First, XLNet starts from the conventional AR formulation.

Permutation Language Modeling

Instead of factorizing the sequence only forward or backward, XLNet considers all possible permutations of the factorization order.

This operation aims to capture bidirectional context, like BERT.

In detail, XLNet samples factorization orders from the set of all permutations and predicts the tokens that fall at the end of the sampled order from the tokens that precede them in that order. E.g., for the sequence 1,2,3,4,5,6,7, if 3 ends up last in the sampled order, then 1,2,4,5,6,7 are used to predict 3, regardless of their original positions.
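As I read it, the corresponding objective in the paper is the expected AR log-likelihood over sampled factorization orders $\mathbf{z}$, where $\mathcal{Z}_T$ is the set of all permutations of a length-$T$ index sequence:

$$
\max_{\theta}\;\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\!\left[\sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
$$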

Meanwhile, how is the number of predicted tokens decided? There is a hyperparameter K: K equals the sequence length divided by the number of predicted tokens, so only the last 1/K of the permuted order is actually predicted (a quick worked example follows).
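A quick worked example of that relationship (the values below, including K = 6, are only illustrative, not necessarily the paper's settings):

```python
seq_len = 512                    # length of the input sequence
K = 6                            # hyperparameter: K = seq_len / num_predicted
num_predicted = seq_len // K     # only the last ~1/K tokens of the order are predicted
print(num_predicted)             # -> 85
```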

The permutation is implemented not by actually reordering the input, but through the attention masks in the Transformer (a minimal sketch follows).
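A minimal sketch of that idea (my own illustration in NumPy, not XLNet's actual code; the name `permutation_attention_mask` is made up): sample one factorization order and derive an attention mask from it, while the tokens stay in their original input positions.

```python
import numpy as np

def permutation_attention_mask(seq_len, rng=np.random.default_rng()):
    """Sample one factorization order and build a content-stream attention mask from it.

    mask[i, j] == True means position i may attend to position j.
    The tokens keep their original input order; only this mask changes,
    which is how the "permutation" is realized inside the Transformer.
    """
    order = rng.permutation(seq_len)      # sampled factorization order z
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)      # rank[i] = where position i appears in z

    # i may attend to j if j comes no later than i in the sampled order
    mask = rank[None, :] <= rank[:, None]
    return order, mask

order, mask = permutation_attention_mask(5)
print("factorization order:", order)
print(mask.astype(int))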

Two-Stream Self-Attention

Consider the factorization orders 1→3→4→2 and 1→3→2→4. When predicting the token in the third place of the order (target 4 in one case, 2 in the other), the visible context is the same (tokens 1 and 3), so the model cannot tell which target position it is predicting, which is not right. Therefore, Two-Stream Self-Attention is introduced, with two representations per position:

  1. h, the content representation: encodes the context and the token itself, like a standard Transformer hidden state
  2. g, the query representation: has access to the contextual information and the target position, but not to the content of the target token (a minimal sketch follows this list)
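A minimal single-head sketch of how the two streams could differ (my own simplified reading in NumPy, not the paper's actual implementation; the name `two_stream_attention` is made up). The content stream h sees the token itself plus earlier tokens in the factorization order; the query stream g is built from the position embedding only and is masked so it never sees the content of the token it is trying to predict.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(tok_emb, pos_emb, order):
    """One simplified two-stream attention step for a single factorization order.

    tok_emb: (T, d) token (content) embeddings
    pos_emb: (T, d) positional embeddings
    order:   factorization order z, a permutation of range(T)
    Returns the content stream h and the query stream g, both (T, d).
    """
    T, d = tok_emb.shape
    rank = np.empty(T, dtype=int)
    rank[order] = np.arange(T)          # rank[i] = position of token i in the order

    h_in = tok_emb + pos_emb            # content stream input: includes the token itself
    g_in = pos_emb.copy()               # query stream input: position only, no content

    # masks derived from the sampled factorization order
    content_mask = rank[None, :] <= rank[:, None]  # h_i sees itself and earlier tokens
    query_mask   = rank[None, :] <  rank[:, None]  # g_i must never see its own content

    def attend(q, kv, mask):
        scores = q @ kv.T / np.sqrt(d)
        scores = np.where(mask, scores, -1e9)       # block masked positions
        return softmax(scores) @ kv

    h = attend(h_in, h_in, content_mask)   # content representation
    g = attend(g_in, h_in, query_mask)     # query representation, used to predict targets
    # note: the first token in the order has no visible context in this toy version
    return h, g

# toy usage: 5 tokens, embedding size 8, one random factorization order
rng = np.random.default_rng(0)
T, d = 5, 8
h, g = two_stream_attention(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                            rng.permutation(T))
```

In this reading, g at a target position would be fed to the output softmax to predict the token, while h is what later layers consume.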

Thinking

The paper also incorporates ideas from Transformer-XL (which seems aimed at long-text tasks): relative positional encoding and segment-level recurrence(?).

The details of Transformer-XL are still not clear to me; after studying it I may get a deeper understanding of XLNet.

Terms appearing in the paper

AR: Autoregressive Language model

Predicts the next word from the preceding words along a single direction (left-to-right or right-to-left), e.g., GPT-2.
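For reference, the standard forward AR objective is:

$$
\max_{\theta}\;\; \log p_{\theta}(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid \mathbf{x}_{<t}\right)
$$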

AE: Autoencoding Language model

Utilises context on both sides, e.g., BERT. BERT's pretraining includes MLM (masked language modeling) and NSP (next sentence prediction). NSP predicts whether one sentence actually follows the other; MLM replaces some words with [MASK] and predicts the original tokens at the masked positions.
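For reference, the MLM objective (in the form the XLNet paper writes it) reconstructs the masked tokens $\bar{\mathbf{x}}$ from the corrupted input $\hat{\mathbf{x}}$, with $m_t = 1$ marking a masked position; the $\approx$ is exactly the independence assumption criticized above:

$$
\max_{\theta}\;\; \log p_{\theta}(\bar{\mathbf{x}}\mid\hat{\mathbf{x}}) \;\approx\; \sum_{t=1}^{T} m_t\,\log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)
$$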