- self-attention and feed-forward layers are permutation-equivariant, i.e. symmetric with respect to the order of the input tokens
- so we have to provide positional information about each input token
- for that, positional encodings or embeddings are added to the token embeddings in the transformer
- encodings are manually (human) designed, while embeddings are learned (trained)
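A minimal PyTorch sketch of the first two bullets (sizes and tensors are made up, not tied to any specific model): a self-attention layer only permutes its output rows when the input tokens are permuted, and adding position-dependent vectors breaks that symmetry.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 5
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)

x = torch.randn(1, seq_len, d_model)        # token embeddings without positions
perm = torch.randperm(seq_len)

out, _ = attn(x, x, x)                                  # original order
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])  # permuted order

# Permuting the input only permutes the output rows (permutation equivariance):
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))     # True

# Adding a position-dependent vector to each slot breaks the symmetry:
pos = torch.randn(1, seq_len, d_model)      # positional encodings/embeddings
xp, xq = x + pos, x[:, perm] + pos
out_pos, _ = attn(xp, xp, xp)
out_pos_perm, _ = attn(xq, xq, xq)
print(torch.allclose(out_pos[:, perm], out_pos_perm, atol=1e-6))  # False
```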
Learned Positional Embeddings
- Hierarchical Perceiver for high resolution inputs
- learns low-dimensional positional embeddings
- objective function is masked token prediction
- embeddings are concatenated to the input and used as a query for masked prediction
- What Do Position Embeddings Learn?
- the sinusoidal encodings (see the section below) are not learned
- GPT-2 learned positional embeddings, like those in GPT-1, have a very symmetrical structure
- RoBERTa embeddings are mildly similar to the sinusoidal ones
- BERT trained embeddings are very similar to the sinusoidal ones up to position 128, but not beyond - likely a training artefact
- sinusoidal and GPT-2 embeddings performed best for classification
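A minimal sketch of learned (GPT-1/GPT-2/BERT-style) positional embeddings, assuming PyTorch and hypothetical sizes: a trainable table indexed by absolute position, updated by backpropagation like any other weight.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30_000, 512, 768    # hypothetical sizes

token_emb = nn.Embedding(vocab_size, d_model)      # one row per vocabulary item
pos_emb = nn.Embedding(max_len, d_model)           # one trainable row per position

token_ids = torch.randint(0, vocab_size, (2, 10))  # (batch, seq)
positions = torch.arange(token_ids.size(1))        # 0 .. seq_len - 1

# Both tables receive gradients during training, so the position rows are
# learned jointly with the rest of the model.
x = token_emb(token_ids) + pos_emb(positions)      # input to the first layer
```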
Positional Embeddings in Popular Models
- In BERT, positional embeddings give the first few tens of dimensions of the token embeddings the meaning of relative positional closeness within the input sequence.
- In Perceiver IO, positional embeddings are instead concatenated to the input embedding sequence (see the sketch after this list).
- In SRU++, positional information is a learned feature of the recurrent (RNN) component.
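A rough PyTorch sketch contrasting the two combination schemes mentioned above, addition (BERT-style) versus concatenation (Perceiver IO / Hierarchical Perceiver-style); shapes are illustrative, not taken from the papers.

```python
import torch
import torch.nn as nn

batch, seq, d_token, d_pos = 2, 10, 64, 32   # illustrative sizes

tokens = torch.randn(batch, seq, d_token)    # content embeddings
pos_add = nn.Embedding(seq, d_token)         # same width, so it can be added
pos_cat = nn.Embedding(seq, d_pos)           # lower-dimensional, gets concatenated
positions = torch.arange(seq)

x_added = tokens + pos_add(positions)        # (batch, seq, d_token), addition
x_concat = torch.cat(                        # (batch, seq, d_token + d_pos), concatenation
    [tokens, pos_cat(positions).unsqueeze(0).expand(batch, -1, -1)], dim=-1
)
```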
Fourier (Sinusoidal) Positional Encodings in the Original Transformer
- Positional encodings are added to the word embeddings once, before the first layer.
- Each position \( t \) within the sequence gets a different encoding vector \( P_t \) (a code sketch follows after this list)
- if the dimension index \( j = 2i \) is even, then \( P_{t, j} := \sin \left( t / 10000^{\frac{2i}{d}} \right) \)
- if the dimension index \( j = 2i + 1 \) is odd, then \( P_{t, j} := \cos \left( t / 10000^{\frac{2i}{d}} \right) \)
- This is similar to the Fourier expansion of the Dirac delta function
- the dot product of any two positional encodings decays fast beyond the first ~2 nearby words
- an average sentence has around 15 words, thus only the first dimensions carry positional information
- the remaining dimensions can thus function as word embeddings
- the encodings are not translation-invariant; only the self-attention key-query comparison is
- impractical for high-resolution inputs
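A NumPy sketch of the encodings defined above (dimensions and lengths are arbitrary), including a quick look at how the dot product between positions falls off with distance.

```python
import numpy as np

def sinusoidal_encodings(max_len: int, d: int) -> np.ndarray:
    t = np.arange(max_len)[:, None]      # positions t = 0 .. max_len - 1
    i = np.arange(d // 2)[None, :]       # dimension-pair index i
    angles = t / 10000 ** (2 * i / d)    # (max_len, d // 2)
    P = np.zeros((max_len, d))
    P[:, 0::2] = np.sin(angles)          # even dimensions j = 2i
    P[:, 1::2] = np.cos(angles)          # odd dimensions j = 2i + 1
    return P

P = sinusoidal_encodings(max_len=128, d=64)
# Dot product of position 10 with positions 10..17: largest for itself,
# then falling off quickly for more distant positions.
print(np.round(P[10] @ P[10:18].T, 1))
```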
Rotary Position Embedding (RoPE)
- introduced in the RoFormer paper (RoPE embeddings)
- want relative position info in the query-key dot product
- uses a multiplicative rotation matrix that mixes pairs of neighboring dimensions (see the sketch at the end of this section)
- improves accuracy on long sequences?
- poor results also reported: tweet 1, tweet 2
- used in Google’s PaLM, a model reported to cost around $10M to train
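A simplified sketch of the rotary transform (PyTorch; not the RoFormer reference implementation, and only a single head with no projections): each pair of neighboring query/key dimensions is rotated by an angle proportional to the absolute position, so that the query-key dot product depends only on the relative offset.

```python
import torch

def rope(x: torch.Tensor) -> torch.Tensor:
    """Apply the rotary transform to x of shape (seq, d), with d even."""
    seq, d = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]               # (seq, 1)
    freqs = 10000 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos * freqs                                                # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # neighboring dimension pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                                  # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(8, 16), torch.randn(8, 16)   # per-position query/key vectors
scores = rope(q) @ rope(k).T                    # attention logits carrying relative-position info
```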