BERT's trained positional embeddings are very similar to sinusoidal ones up to position 128, but not for later positions, which is likely an artefact of pre-training mostly on sequences of length 128.
Sinusoidal and GPT-2 positional embeddings performed the best for classification.
Positional Embeddings in Popular Models
In BERT, positional embeddings give the first few tens of dimensions of the token embeddings a meaning of relative positional closeness within the input sequence.
In Perceiver IO, positional embeddings are instead concatenated to the input embedding sequence.
In SRU++, positional information is a learned feature of the recurrent (RNN) part of the model rather than a separate embedding.
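A minimal sketch (NumPy, with made-up shapes) contrasting the two ways of injecting positional information mentioned above: element-wise addition as in BERT versus concatenation along the feature axis as in Perceiver IO:

```python
import numpy as np

batch, seq_len, dim, pos_dim = 2, 16, 768, 64
token_emb = np.random.randn(batch, seq_len, dim)

# BERT-style: positional embeddings share the token embedding dimension and
# are summed element-wise, so the shape stays (batch, seq_len, dim).
pos_add = np.random.randn(seq_len, dim)
bert_style = token_emb + pos_add            # broadcasts over the batch axis

# Perceiver IO-style: positional features are concatenated along the feature
# axis, giving shape (batch, seq_len, dim + pos_dim).
pos_cat = np.broadcast_to(np.random.randn(seq_len, pos_dim), (batch, seq_len, pos_dim))
perceiver_style = np.concatenate([token_emb, pos_cat], axis=-1)

print(bert_style.shape, perceiver_style.shape)   # (2, 16, 768) (2, 16, 832)
```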
Fourier (Sinusoid) Positional Encodings in BERT
Positional embeddings are added to the word embeddings once before the first layer.
Each position \( t \) within the sequence gets a different embedding vector, whose dimensions \( j \) are defined as follows (with \( d \) the embedding dimension):
if \( j = 2i \) is even, then \( P_{t, j} := \sin ( t / 10000^{\frac{2i}{d}} ) \)
if \( j = 2i + 1 \) is odd, then \( P_{t, j} := \cos ( t / 10000^{\frac{2i}{d}} ) \)
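A minimal sketch of the formula above in NumPy (the function name is illustrative), building the full positional encoding matrix for a sequence of length seq_len and an even embedding dimension d:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Return the (seq_len, d) matrix P with P[t, 2i] = sin(t / 10000^(2i/d))
    and P[t, 2i+1] = cos(t / 10000^(2i/d)); assumes d is even."""
    t = np.arange(seq_len)[:, None]             # positions, shape (seq_len, 1)
    i = np.arange(d // 2)[None, :]              # dimension pairs, shape (1, d/2)
    angles = t / np.power(10000.0, 2 * i / d)   # shape (seq_len, d/2)
    P = np.empty((seq_len, d))
    P[:, 0::2] = np.sin(angles)                 # even dimension indices j = 2i
    P[:, 1::2] = np.cos(angles)                 # odd dimension indices j = 2i + 1
    return P

# Example: encodings for a 128-token sequence with BERT-base dimensionality.
P = sinusoidal_positional_encoding(128, 768)
print(P.shape)   # (128, 768)
```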
This is similar to the Fourier expansion of the Dirac delta function.
The dot product of any two positional encodings decays quickly once the positions are more than about two words apart.
The average sentence has around 15 words, so over such short distances only the first (highest-frequency) dimensions vary enough to carry positional information.
The remaining dimensions are thus free to function as word embeddings.
The encodings themselves are not translation invariant; only the self-attention key-query comparison (their dot product) is.
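A minimal sketch (NumPy, reusing the formula from above) of the last two observations: the dot product between encodings falls off quickly with distance, and it depends only on the offset between positions, not on their absolute values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    t = np.arange(seq_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = t / np.power(10000.0, 2 * i / d)
    P = np.empty((seq_len, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_positional_encoding(64, 768)

# The dot product with position 0 decays quickly over the first few positions:
print(np.round(P[0] @ P[:8].T, 1))

# The dot product depends only on the offset between positions, so the
# key-query style comparison is translation invariant even though the
# encoding vectors themselves are not:
print(np.allclose(P[0] @ P[3], P[20] @ P[23]))   # True
```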