Cross-Attention in Transformer Architecture

Merge two embedding sequences regardless of modality, e.g., image with text.

Cross-attention is:

  • an attention mechanism in the Transformer architecture that mixes two different embedding sequences
  • the two sequences must have the same embedding dimension
  • the two sequences can be of different modalities (e.g. text, image, sound)
  • one of the sequences defines the output length, because it plays the role of the query input
  • the other sequence produces the key and value inputs

Cross-attention Applications

Cross-attention is applied, for example, in the Transformer decoder, in Perceiver IO, and in SelfDoc, all discussed below.

Cross-attention vs Self-attention

Apart from its inputs, the cross-attention calculation is the same as self-attention. Cross-attention asymmetrically combines two separate embedding sequences of the same dimension, whereas self-attention operates on a single embedding sequence. One of the sequences serves as the query input, while the other provides the key and value inputs. An alternative form of cross-attention, used in SelfDoc, takes the query and value from one sequence and the key from the other.
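
A minimal sketch of the difference, assuming PyTorch's nn.MultiheadAttention and arbitrary example dimensions: the same attention module performs either self-attention or cross-attention depending only on which sequences supply the query, key, and value.

```python
# Sketch: self-attention vs cross-attention with the same module (PyTorch).
import torch
import torch.nn as nn

d_model = 64  # both sequences must share this embedding dimension
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

s1 = torch.randn(1, 10, d_model)  # e.g. image patch embeddings, length 10
s2 = torch.randn(1, 7, d_model)   # e.g. text token embeddings, length 7

# Self-attention: query, key and value all come from the same sequence.
self_out, _ = attn(query=s2, key=s2, value=s2)    # shape (1, 7, 64)

# Cross-attention: query from s2, key and value from s1.
cross_out, _ = attn(query=s2, key=s1, value=s1)   # shape (1, 7, 64) = length of s2
```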

The feed-forward layer is related to cross-attention, except that the feed-forward layer does not use softmax and one of its input sequences is static. The Augmenting Self-attention with Persistent Memory paper shows that the feed-forward layer calculation can be expressed in the same form as self-attention.
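
As a rough sketch of that relation (notation simplified here, not quoted from the paper): a feed-forward layer computes \( \mathrm{FFN}(x) = W_2\, \mathrm{ReLU}(W_1 x) \), while attention over a static, learned set of persistent key and value vectors \( K, V \) computes \( \operatorname{softmax}(x K^\intercal)\, V \). The feed-forward layer thus looks like attention in which the softmax is replaced by ReLU and the keys and values are parameters that do not depend on the input sequence.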

Cross-attention detail in Perceiver IO (figure)

Cross-attention Algorithm

  • Let there be two embedding (token) sequences S1 and S2
  • Calculate the keys and values from sequence S1
  • Calculate the queries from sequence S2
  • Calculate the attention matrix from the keys and queries
  • Apply the values to the attention matrix
  • The output sequence has the dimension and length of sequence S2

In an equation, treating each sequence as a matrix with one token embedding per row: \( \operatorname{softmax}\big( (S_2 W_Q)(S_1 W_K)^\intercal \big)\, S_1 W_V \)
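
A from-scratch sketch of these steps in NumPy (single head, no scaling factor or masking; all dimensions are illustrative):

```python
# Cross-attention following the steps and equation above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(s1, s2, w_q, w_k, w_v):
    """s1: (n1, d), s2: (n2, d); returns (n2, d_v)."""
    q = s2 @ w_q                 # queries from S2        -> (n2, d_k)
    k = s1 @ w_k                 # keys from S1           -> (n1, d_k)
    v = s1 @ w_v                 # values from S1         -> (n1, d_v)
    attn = softmax(q @ k.T)      # attention matrix       -> (n2, n1)
    return attn @ v              # values applied to attn -> (n2, d_v)

d, d_k, d_v, n1, n2 = 64, 32, 64, 10, 7
rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(n1, d)), rng.normal(size=(n2, d)),
                      rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)),
                      rng.normal(size=(d, d_v)))
print(out.shape)  # (7, 64): length of S2, value dimension
```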

Cross-Attention in Transformer Decoder

Cross-attention was already used in the original Transformer paper, but it was not yet given this name. Transformer decoding starts with the full input sequence but an empty output sequence. Cross-attention introduces information from the input sequence into the layers of the decoder, so that the decoder can predict the next output token. The decoder then appends that token to the output sequence and repeats this autoregressive process until the EOS token is generated.

Cross-Attention in the Transformer decoder of Attention is All You Need paper
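
A minimal sketch of this decoding loop, using PyTorch's TransformerDecoder, whose layers cross-attend to the encoder output (called memory). The vocabulary size, BOS/EOS ids, and untrained random weights are placeholders for illustration only.

```python
# Sketch: autoregressive decoding with cross-attention to the encoder output.
import torch
import torch.nn as nn

d_model, vocab, bos_id, eos_id = 64, 100, 1, 2          # placeholder sizes and ids
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)     # each layer cross-attends to memory
to_logits = nn.Linear(d_model, vocab)

memory = torch.randn(1, 10, d_model)   # encoder output for the full input sequence
tokens = [bos_id]                      # decoding starts with an (almost) empty sequence
for _ in range(20):
    tgt = embed(torch.tensor([tokens]))                               # (1, t, d_model)
    causal = nn.Transformer.generate_square_subsequent_mask(len(tokens))
    h = decoder(tgt, memory, tgt_mask=causal)            # cross-attention to memory here
    next_id = to_logits(h[:, -1]).argmax(-1).item()      # predict the next token
    tokens.append(next_id)                               # append and repeat
    if next_id == eos_id:                                # stop at EOS
        break
```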

Cross-Attention in Perceiver IO

Perceiver IO is a general-purpose, multi-modal architecture that can handle a wide variety of inputs as well as outputs. Perceiver can be applied, for example, to image-text classification. Perceiver IO uses cross-attention for:

  • merging multimodal input sequences (e.g. image, text, audio) into a low-dimensional latent sequence
  • decoding the output value with an “output query” or “command”, e.g. “predict this masked word”

Perceiver IO architecture
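
A sketch of these two Perceiver IO cross-attention steps, assuming PyTorch, untrained weights, and illustrative sizes: a short learned latent sequence cross-attends to a long multimodal input, and an output query then cross-attends to the latent.

```python
# Sketch: Perceiver-style encode and decode via cross-attention.
import torch
import torch.nn as nn

d_model, latent_len, input_len, query_len = 64, 32, 10_000, 5
latent = nn.Parameter(torch.randn(1, latent_len, d_model))   # learned latent queries
encode = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
decode = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

inputs = torch.randn(1, input_len, d_model)        # e.g. image + text embeddings
output_query = torch.randn(1, query_len, d_model)  # e.g. "predict this masked word"

latents, _ = encode(query=latent, key=inputs, value=inputs)           # (1, 32, 64)
outputs, _ = decode(query=output_query, key=latents, value=latents)   # (1, 5, 64)
```

The expensive attention over the 10,000-token input is computed only against the 32-token latent sequence, which is what makes very large inputs tractable.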

An advantage of the Perceiver architecture is that it can, in general, work with very large inputs. The Hierarchical Perceiver architecture can process even longer input sequences by splitting them into subsequences and then merging them. Hierarchical Perceiver also learns the positional encodings in a separate training step with a reconstruction loss.

Cross-Attention in SelfDoc

SelfDoc cross-attention (figure)

In SelfDoc, cross-attention is integrated in a special way. The first step of its Cross-Modality Encoder instead uses the value and query from sequence A and the key from sequence B.
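
A sketch of this variant in NumPy (single head, no scaling; the function and weight names are mine, not from the paper). Note that the shapes only work out when the two sequences have the same length.

```python
# Sketch: query and value from sequence A, key from sequence B.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qv_from_a_k_from_b(a, b, w_q, w_k, w_v):
    """a: (n, d), b: (n, d); returns (n, d_v)."""
    q = a @ w_q                   # queries from A -> (n, d_k)
    k = b @ w_k                   # keys from B    -> (n, d_k)
    v = a @ w_v                   # values from A  -> (n, d_v)
    return softmax(q @ k.T) @ v   # (n, n) @ (n, d_v) -> (n, d_v)
```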

Other Cross-Attention Examples

Created on 28 Dec 2021. Updated on: 08 Sep 2022.