Encoder-Only vs Decoder-Only vs Encoder-Decoder Transformer

Wrap your head around the main Transformer variants in 5 minutes.
Transformer encoder-decoder model diagram (Attention is all you need)

People keep asking me what the difference is between the encoder-only, the decoder-only, and the original full encoder-decoder Transformer, all of which use self-attention. It is a simple distinction that you can master quickly.

Encoder-only (BERT model)

BERT has an encoder-only architecture. The input is text and the output is a sequence of embeddings, one per token. Use cases are sequence classification (via the class token) and token classification. It uses bidirectional attention, so the model can see both forwards and backwards.
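To make this concrete, here is a minimal sketch using the Hugging Face transformers library (the checkpoint name bert-base-uncased is just a common example choice, not something this article prescribes): the encoder maps text to one embedding per token, and the [CLS] vector is what a sequence-classification head would typically consume.

```python
# Minimal encoder-only sketch with Hugging Face transformers.
# "bert-base-uncased" is an example checkpoint choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are neat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden): one vector per token
cls_embedding = token_embeddings[:, 0]        # [CLS] vector, commonly fed to a classification head
print(token_embeddings.shape, cls_embedding.shape)
```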

bidirectional attention in BERT vs unidirectional (causal) attention in GPT

Another example of an encoder-only model is ViT (Vision Transformer), used for image classification.

Decoder-only (GPT2 model)

GPT-2 has a decoder-only architecture. The input is text and the output is the next word (token), which is then appended to the input. Use cases are mostly text generation (autoregressive), but with prompting we can do many things, including sequence classification. The attention is almost always causal (unidirectional), so the model can see only the previous tokens (the prefix).
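As a sketch (again with Hugging Face transformers, using the gpt2 checkpoint as an example choice), the explicit loop below mirrors the "predict the next token, append it, repeat" behavior; in practice you would usually call model.generate instead.

```python
# Minimal decoder-only (autoregressive) generation sketch.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
for _ in range(10):  # greedily generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
    input_ids = torch.cat([input_ids, next_token], dim=-1)   # append it to the input

print(tokenizer.decode(input_ids[0]))
```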

Encoder-Decoder (T5 model)

T5 encoder-decoder multi-task visualization

T5 has an encoder-decoder (full-Transformer) architecture. The input is text and the output is the next word (token), which is then appended to the decoder input. The decoder uses cross-attention to bring information from the encoder into the decoder.
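A minimal sketch, assuming the t5-small checkpoint and Hugging Face transformers: the encoder reads the whole input bidirectionally, and the decoder generates output tokens autoregressively while attending to the encoder states through cross-attention.

```python
# Minimal encoder-decoder sketch: encode the input once, decode autoregressively.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 was pretrained with task prefixes such as this translation prefix.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```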

Decoder-Only vs Encoder-Decoder

The intuition is that the decoder-only model just appends text, so if there is a significant distribution difference between the input and the output, for example a completely different set of tokens, we can expect the encoder-decoder to work better. Also, the decoder sees only the past tokens (its prefix), so any task that involves seeing the entire text context and addressing specific tokens is a bit harder for it. However, decoder-only is a simpler architecture than encoder-decoder, it is already Turing-complete, and model size and training are likely the biggest factors in most cases (The Bitter Lesson).

To make a relevant apples-to-apples comparison, we can compare these architectures in a latency-matched, compute-matched, or parameter-matched way, but it is hard to get rid of major differences in training objectives, which likely play the decisive role.

In the Flan-UL2 paper, the authors attempted to reduce the training differences by reformulating the fill-in-the-blank (denoising) task into a generative (autoregressive or prefix language modeling) setting - this is called the Mixture of Denoisers. Furthermore, they seem to use the same encoder-decoder model both in a decoder-only way and in an encoder-decoder way. Also in the Flan-UL2 paper, their best model was a 20B-parameter encoder-decoder.
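For intuition only (my own simplification, not the paper's code), here is a sketch of how a fill-in-the-blank span-corruption objective can be rewritten as a plain text-to-text pair with T5-style sentinel tokens, which is what makes it usable in an autoregressive or prefix-LM setting.

```python
# Rough sketch (my simplification, not the UL2 implementation): rewrite span
# corruption as a text-to-text pair using T5-style sentinel tokens.
def span_corruption_example(words, spans):
    """words: list of tokens; spans: list of (start, end) index pairs to mask."""
    source, target, sentinel = [], [], 0
    i = 0
    while i < len(words):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            source.append(f"<extra_id_{sentinel}>")   # blank in the input
            target.append(f"<extra_id_{sentinel}>")   # sentinel in the target...
            target.extend(words[span[0]:span[1]])     # ...followed by the masked words
            sentinel += 1
            i = span[1]
        else:
            source.append(words[i])
            i += 1
    return " ".join(source), " ".join(target)

src, tgt = span_corruption_example("the cat sat on the mat".split(), spans=[(1, 2), (5, 6)])
print(src)  # the <extra_id_0> sat on the <extra_id_1>
print(tgt)  # <extra_id_0> cat <extra_id_1> mat
```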

Furthermore, compute-matched encoder-decoder models in the UL2 paper have approximately twice as many parameters as the decoder-only models but similar speed and accuracy. This indicates that encoder-decoder models may contain more sparsity, which could be removed with pruning or distillation techniques so that they eventually outperform.

UL2 formulation of masking tasks in an autoregressive way

In this older, pre-RLHF paper, encoder-decoder models trained with masked language modeling achieve the best zero-shot performance after multitask finetuning.

As a detail, there is a difference between a causal decoder-only LM and a prefix LM. A prefix LM has a section (the prefix) with non-causal (bidirectional) attention between tokens, like BERT:

Prefix-LM attention pattern diagram
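To make the difference concrete, here is a small illustration (my own, not tied to any particular library's API) of the two attention masks: a causal mask lets position i attend only to positions up to i, while a prefix-LM mask additionally allows bidirectional attention within the prefix.

```python
# Illustration of causal vs prefix-LM attention masks.
# Rows are query positions, columns are key positions; 1 = may attend, 0 = masked.
import torch

def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))

def prefix_lm_mask(seq_len, prefix_len):
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = 1  # the prefix attends bidirectionally, like BERT
    return mask

print(causal_mask(5))
print(prefix_lm_mask(5, prefix_len=3))
```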

Which to Choose: Encoder, Decoder, or Encoder-Decoder Transformer?

Personally, I would choose based on which pretrained models are available and how easy it is to adapt them to the task at hand. It is unclear from the start which architecture would be best. Perhaps minor considerations could be the following:

  • encoder-only: vector embeddings for classification, clustering, search
  • decoder-only: strong at text generation tasks (models for prompting, chatting)
  • encoder-decoder: strong for natural language understanding (NLU), for example translation, question answering, and summarization.

Created on 29 Oct 2023. Updated on 29 Oct 2023.
Thank you









