# Cross-Attention in Transformer Architecture

Cross-attention is a way to merge two embedding sequences, e.g. an image sequence with a text sequence.

Cross-attention:

• is an attention mechanism in the Transformer architecture that mixes two different embedding sequences
• requires the two sequences to have the same embedding dimension
• allows the two sequences to be of different modalities (e.g. text, image, sound)
• uses one sequence as the query input, which defines the output length
• uses the other sequence to produce the key and value inputs

## Cross-attention vs Self-attention

Apart from its inputs, the cross-attention calculation is the same as for self-attention. Cross-attention asymmetrically combines two separate embedding sequences of the same dimension, whereas the self-attention input is a single embedding sequence. One sequence serves as the query input, while the other provides the key and value inputs. An alternative form of cross-attention, used in SelfDoc, takes the query and value from one sequence and the key from the other.

The feed-forward layer is related to cross-attention, except that the feed-forward layer does not use softmax and one of its input sequences is static (learned weights). The paper Augmenting Self-attention with Persistent Memory shows that the feed-forward layer calculation can be made the same as self-attention.
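To make the analogy concrete, here is a minimal sketch (names and sizes are illustrative, not from the paper) of a standard feed-forward layer rewritten in attention form: the input tokens act as queries, while the static "key" and "value" sequences are the learned weight matrices, with ReLU in place of softmax.

```python
import numpy as np

def ffn_as_attention(x, persistent_k, persistent_v):
    """Feed-forward layer written in attention form: input tokens are the
    queries; the static keys (rows of W1^T) and values (rows of W2) are
    learned parameters, and ReLU replaces softmax."""
    scores = x @ persistent_k.T        # queries x static keys -> (len, n_mem)
    weights = np.maximum(scores, 0.0)  # ReLU instead of softmax
    return weights @ persistent_v      # mix the static values

rng = np.random.default_rng(1)
d, hidden = 4, 16
x = rng.normal(size=(3, d))
w1 = rng.normal(size=(d, hidden))
w2 = rng.normal(size=(hidden, d))

standard = np.maximum(x @ w1, 0.0) @ w2          # FFN(x) = ReLU(x W1) W2
attention_form = ffn_as_attention(x, w1.T, w2)
print(np.allclose(standard, attention_form))     # True
```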

## Cross-attention Algorithm

• Let us have two embedding (token) sequences, S1 and S2
• Calculate the keys and values from sequence S1
• Calculate the queries from sequence S2
• Calculate the attention matrix from the keys and queries
• Apply the attention matrix to the values
• The output sequence has the length of sequence S2

In an equation: $$\mathbf{softmax}((W_Q S_2) (W_K S_1)^\intercal) W_V S_1$$
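The steps and the equation above can be sketched as a single-head computation in NumPy (scaling and multiple heads are omitted for brevity; the weight shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(s1, s2, w_q, w_k, w_v):
    """Single-head cross-attention: queries from S2, keys/values from S1."""
    q = s2 @ w_q              # (len2, d) queries
    k = s1 @ w_k              # (len1, d) keys
    v = s1 @ w_v              # (len1, d) values
    attn = softmax(q @ k.T)   # (len2, len1) attention matrix
    return attn @ v           # (len2, d): output length follows S2

rng = np.random.default_rng(0)
d = 8
s1 = rng.normal(size=(5, d))   # e.g. image tokens
s2 = rng.normal(size=(3, d))   # e.g. text tokens
w = [rng.normal(size=(d, d)) for _ in range(3)]
out = cross_attention(s1, s2, *w)
print(out.shape)  # (3, 8) — length of S2, as stated above
```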

## Cross-Attention in Transformer Decoder

Cross-attention was already described in the Attention Is All You Need paper as part of the Transformer decoder, though it was not named there yet. Transformer decoding starts with a full-sized input sequence but an empty decoding sequence. Cross-attention introduces information from the input sequence into the decoder layers, so that the decoder can predict the next output token. The predicted token is appended to the output sequence, and the decoding step is repeated.
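The decoding loop can be sketched as follows; `decoder`, `bos_id`, and `eos_id` are hypothetical stand-ins for a real trained model and its special tokens:

```python
def greedy_decode(decoder, encoder_states, bos_id, eos_id, max_len=50):
    """Greedy decoding sketch. `encoder_states` holds the full input
    sequence embeddings, which supply keys/values for cross-attention."""
    output = [bos_id]  # decoding sequence starts (nearly) empty
    for _ in range(max_len):
        # the decoder self-attends over `output` and cross-attends to
        # `encoder_states` to pick the most likely next token
        next_id = decoder(output, encoder_states)
        output.append(next_id)
        if next_id == eos_id:
            break
    return output
```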

## Cross-Attention Examples

### Cross-Attention in Perceiver IO

Perceiver IO is a general-purpose, cross-domain architecture that can handle a variety of inputs and outputs. It uses cross-attention extensively for:

• merging very long input sequences (e.g. images, audio) into a low-dimensional latent embedding sequence
• merging an “output query” or “command” to decode the output value, e.g. we can ask the model about a masked word

The advantage of this is that, in general, you can work with very long sequences. The Hierarchical Perceiver architecture can process even longer sequences by splitting them into subsequences and then merging the results. Hierarchical Perceiver also learns the positional encodings in a separate training step with a reconstruction loss.
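A shape-level sketch of the Perceiver-style encode step (projection matrices omitted; sizes are illustrative, not from the paper): a small fixed-size latent sequence queries a very long input, so the attention cost grows linearly with input length rather than quadratically.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16
inputs = rng.normal(size=(10_000, d))  # very long input (e.g. image pixels)
latents = rng.normal(size=(32, d))     # fixed-size learned latent sequence

attn = softmax(latents @ inputs.T)     # (32, 10000): latents query the input
encoded = attn @ inputs                # (32, 16): compressed representation
print(encoded.shape)                   # output length follows the latents
```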

### Cross-Attention in SelfDoc

In SelfDoc, cross-attention is integrated in a special way: the first step of its Cross-Modality Encoder instead uses the value and query from sequence A and the key from sequence B.
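A minimal sketch of this variant (projections omitted; it assumes the two sequences have equal length so that the attention matrix can be applied to A's values, which holds when both modalities describe the same set of document regions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selfdoc_style_attention(a, b):
    """Queries and values from sequence A, keys from sequence B.
    Assumes len(a) == len(b), so the (len_a, len_b) attention matrix
    can be applied to A's own values."""
    attn = softmax(a @ b.T)  # A's queries scored against B's keys
    return attn @ a          # values also taken from A

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 8))  # e.g. text features per document region
b = rng.normal(size=(6, 8))  # e.g. visual features per document region
out = selfdoc_style_attention(a, b)
print(out.shape)  # (6, 8)
```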

### Other Cross-Attention Examples

Created on 28 Dec 2021.