Vaclav Kosar
Software, Machine Learning, & Business

Transformer Positional Embeddings and Encodings

How do transformers encode information about token positions?

positional embeddings in BERT architecture

Learned Positional Embeddings

Visualization of position-wise cosine similarity of different position embeddings

  • In BERT, the learned positional embeddings make the first few tens of dimensions of the token embeddings encode relative positional closeness within the input sequence.
  • In Perceiver IO, positional embeddings are concatenated to the input embedding sequence instead.
  • In SRU++, the positional embeddings are a learned feature of the RNN.
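
The difference between adding and concatenating positional embeddings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes `seq_len`, `d_model`, `d_pos` and the helper `random_table` are hypothetical, the tables are randomly initialised stand-ins for trained parameters, and the concatenation is assumed to be along the feature axis.

```python
import random

random.seed(0)

seq_len, d_model, d_pos = 4, 6, 2  # hypothetical toy sizes


def random_table(rows, cols):
    """Stand-in for a trainable embedding table (randomly initialised)."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


token_emb = random_table(seq_len, d_model)      # one row per input token
pos_table_add = random_table(seq_len, d_model)  # additive: same width as the tokens
pos_table_cat = random_table(seq_len, d_pos)    # concatenated: its own width

# Additive style: element-wise sum, the width stays d_model.
added = [[t + p for t, p in zip(tok, pos)]
         for tok, pos in zip(token_emb, pos_table_add)]

# Concatenation style: position features are appended, the width grows to d_model + d_pos.
concatenated = [tok + pos for tok, pos in zip(token_emb, pos_table_cat)]

print(len(added[0]), len(concatenated[0]))
```

With addition, position and content share the same dimensions, so the model has to learn which dimensions to reserve for position. With concatenation the two stay separate, at the cost of a wider input.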

Fourier (Sinusoid) Positional Encodings in BERT

  • Positional embeddings are added to the word embeddings once before the first layer.
  • Each position \( t \) within the sequence gets a different embedding \( P_t \) of dimension \( d \); its \( j \)-th component is:
    • if \( j = 2i \) is even then \( P_{t, j} := \sin (t / 10^{\frac{8i}{d}}) \)
    • if \( j = 2i + 1 \) is odd then \( P_{t, j} := \cos (t / 10^{\frac{8i}{d}}) \)
  • This is similar to the Fourier expansion of Dirac's delta
  • the dot product of two positional encodings decays quickly once the positions are more than about 2 words apart
  • an average sentence has around 15 words, thus only the first dimensions carry positional information
  • the rest of the embedding dimensions can thus function as word embeddings
  • not translation invariant; only the self-attention key-query comparison is
  • impractical for high-resolution inputs
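
A minimal sketch of the sinusoidal table, assuming the sine/cosine parity runs over the dimension index \( j \) and \( t \) is the position. The function names `sinusoidal_encoding` and `dot` and the sizes used are illustrative choices, not taken from any library. The loop at the end prints the dot product between position 0 and a few later positions so the decay with distance can be inspected.

```python
import math


def sinusoidal_encoding(seq_len, d):
    """Return a seq_len x d table P with P[t][j] = sin or cos of t / 10^(8i/d)."""
    table = []
    for t in range(seq_len):
        row = []
        for j in range(d):
            i = j // 2                     # index of the sin/cos pair
            angle = t / 10 ** (8 * i / d)  # same as t / 10000 ** (2 * i / d)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


P = sinusoidal_encoding(seq_len=32, d=64)

# Dot product between position 0 and positions at growing offsets.
for offset in (0, 1, 2, 4, 8, 16):
    print(offset, round(dot(P[0], P[offset]), 2))
```

Because \( \sin a \sin b + \cos a \cos b = \cos(a - b) \), the dot product of two encodings depends only on the offset between the positions, even though the encodings themselves are absolute.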

Rotary Position Embedding (RoPE)

  • introduced as RoPE embeddings in RoFormer
  • goal: relative position information in the query-key dot product
  • uses a multiplicative rotation matrix that mixes pairs of neighboring dimensions
  • improves accuracy on long sequences?
  • poor results also reported: tweet 1, tweet 2
  • used in Google’s $10M model PaLM
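
The rotation idea can be sketched as follows. This is a toy illustration rather than a reference implementation: the function name `rope`, the example vectors `q` and `k`, and the positions are made up, and the frequency schedule `base ** (-2 * i / d)` with `base = 10000` is assumed to follow the RoFormer formulation.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each neighboring pair (vec[2i], vec[2i+1]) by the angle position * theta_i."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even to form pairs"
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # assumed per-pair frequency
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.8, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.7, 0.2]   # hypothetical key vector

score_a = dot(rope(q, 5), rope(k, 2))    # positions 5 and 2, offset 3
score_b = dot(rope(q, 13), rope(k, 10))  # positions 13 and 10, offset 3
print(abs(score_a - score_b) < 1e-9)
```

Rotating the query by its position and the key by its position leaves a score that depends only on the difference of the two positions, which is the relative-position property the bullets describe.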

Created on 05 Jun 2022. Updated on: 11 Jun 2022.
