Transformer Positional Embeddings and Encodings

How do transformers encode information about token positions?

Learned Positional Embeddings

• In BERT, the learned positional embeddings use the first few tens of dimensions of the token embeddings to encode relative positional closeness within the input sequence.
• In Perceiver IO, positional embeddings are concatenated to the input embedding sequence instead.
• In SRU++, the positional embeddings are a learned feature of the RNN.
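
The difference between adding and concatenating a learned positional table can be sketched as follows. This is a minimal illustration, not any model's actual code: the sizes, the random initialization, the helper `init_table`, and the choice to concatenate along the feature axis are all assumptions made for the example.

```python
import random

random.seed(0)
max_len, d_tok, d_pos = 16, 8, 4  # hypothetical sizes, chosen for illustration


def init_table(rows, cols):
    """Randomly initialized table; in a real model these would be trained parameters."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


pos_add = init_table(max_len, d_tok)  # same width as the token embeddings
pos_cat = init_table(max_len, d_pos)  # separate width, appended to each token

tokens = init_table(5, d_tok)  # stand-in for the embeddings of a 5-token input

# Additive variant: position t shifts the token vector, width is unchanged.
added = [[x + p for x, p in zip(tok, pos_add[t])] for t, tok in enumerate(tokens)]

# Concatenating variant (assumed here to act on the feature axis): width grows.
concatenated = [tok + pos_cat[t] for t, tok in enumerate(tokens)]

print(len(added[0]), len(concatenated[0]))
```

In the additive variant, position and content share the same dimensions; in the concatenating variant, they occupy separate ones.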

Fourier (Sinusoidal) Positional Encodings in the original Transformer

• Positional embeddings are added to the word embeddings once before the first layer.
• Each position $$t$$ within the sequence gets a different embedding vector of dimension $$d$$, with components indexed by $$j$$
• if $$j = 2i$$ is even then $$P_{t, j} := \sin (t / 10^{\frac{8i}{d}})$$
• if $$j = 2i + 1$$ is odd then $$P_{t, j} := \cos (t / 10^{\frac{8i}{d}})$$
• This is similar to the Fourier expansion of Dirac's delta
• the dot product of any two positional encodings decays quickly beyond the first 2 nearby words
• an average sentence has around 15 words, thus only the first dimensions carry information
• the rest of the embedding dimensions can thus function as word embeddings
• not translation invariant; only the self-attention key-query comparison is
• impractical for high-resolution inputs
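
The sinusoidal table can be sketched in a few lines of plain Python. This is a minimal sketch, assuming an even dimension $$d$$ and the base $$10^4$$ (so that $$10^{\frac{8i}{d}} = 10000^{\frac{2i}{d}}$$). The names `sinusoidal_encoding` and `dot` are illustrative helpers, not a library API.

```python
import math


def sinusoidal_encoding(num_positions, d, base=10000.0):
    """Build P with P[t][2i] = sin(t / base**(2i/d)) and P[t][2i+1] = cos(t / base**(2i/d))."""
    table = []
    for t in range(num_positions):
        row = []
        for j in range(d):
            i = j // 2  # index of the sin/cos pair that dimension j belongs to
            angle = t / base ** (2 * i / d)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


P = sinusoidal_encoding(num_positions=64, d=32)

# Because sin(a)sin(b) + cos(a)cos(b) = cos(a - b), the dot product of two
# encodings depends only on the offset between their positions.
print(dot(P[3], P[5]), dot(P[10], P[12]))

# Similarity of position 0 to positions at growing offsets.
print([round(dot(P[0], P[k]), 2) for k in (0, 1, 2, 4, 8, 16)])
```

The second printout gives a quick way to inspect how the similarity between encodings changes with distance for a chosen $$d$$.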

Rotary Position Embedding (RoPE)

• introduced in the RoFormer paper
• goal: relative position information in the query-key dot product
• uses a multiplicative rotation matrix that mixes pairs of neighboring dimensions
• improves accuracy on long sequences?
• poor results also reported: tweet 1, tweet 2
• used in Google’s \$10M model PaLM
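
The rotation idea can be illustrated with a small plain-Python sketch. It assumes an even vector dimension and the same frequency schedule as the sinusoidal encodings (base $$10^4$$). The function `rope` and the example vectors are hypothetical and not taken from any particular implementation.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each neighboring pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Arbitrary example query and key vectors (made-up numbers).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.1, 0.7]

# Same offset (3) at two different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_shifted = dot(rope(q, 105), rope(k, 102))
print(score_near_start, score_shifted)
```

The two scores agree up to floating-point error: rotating the query by position $$m$$ and the key by position $$n$$ leaves a dot product that depends only on $$m - n$$, which is the relative position information the method aims for.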

Created on 05 Jun 2022. Updated on: 11 Jun 2022.