People keep asking me about the difference between encoder-only, decoder-only, and full (encoder-decoder) transformers. It is a simple distinction that you can master quickly.
Encoder-only (BERT model)
BERT has an encoder-only architecture. The input is text and the output is a sequence of embeddings, one per input token. Use cases include sequence classification (via the class token) and token classification. It uses bidirectional attention, so every token can attend to tokens both before and after it.
Another encoder-only model example is ViT (Vision Transformer) for image classification.
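To make the two use cases concrete, here is a minimal sketch of how a classifier head might read the encoder output. The embeddings, the head weights, and the helper names (`linear`, `argmax`) are all made up for illustration; a real BERT produces much larger vectors and the heads are learned.

```python
def linear(vec, weights, bias):
    """One dense layer: returns one score per class."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up encoder output: one 3-dimensional embedding per input token,
# with the class token ([CLS]) in position 0.
embeddings = [
    [0.9, 0.1, 0.0],  # [CLS]
    [0.2, 0.7, 0.1],  # "Paris"
    [0.1, 0.1, 0.8],  # "is"
    [0.3, 0.6, 0.1],  # "nice"
]

# Hypothetical classifier heads with hand-picked weights (2 classes each).
seq_w, seq_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]
tok_w, tok_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]], [0.0, 0.0]

# Sequence classification: one label for the whole input, read from [CLS].
sequence_label = argmax(linear(embeddings[0], seq_w, seq_b))

# Token classification: one label per real token.
token_labels = [argmax(linear(e, tok_w, tok_b)) for e in embeddings[1:]]

print(sequence_label, token_labels)  # 0 [0, 1, 0]
```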
Decoder-only (GPT-2 model)
GPT-2 has a decoder-only architecture. The input is text and the output is the next token (roughly, the next word), which is then appended to the input. The main use case is autoregressive text generation, but with prompting the model can do many other things, including sequence classification. The attention is almost always causal (unidirectional), so each token can see only the tokens before it (the prefix).
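The append-and-repeat loop can be sketched in a few lines. This is only an illustration: `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a trained model, and real systems usually sample from a probability distribution instead of doing a table lookup.

```python
def generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Greedy autoregressive decoding: predict a token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # the model sees only the tokens so far
        if token == eos:
            break
        tokens.append(token)        # the output becomes part of the next input
    return tokens

# Hypothetical stand-in for a trained model: a lookup on the last token only.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(prefix):
    return TOY_TABLE.get(prefix[-1], "<eos>")

print(generate(toy_next_token, ["the"], max_new_tokens=5))  # ['the', 'cat', 'sat']
```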
Encoder-Decoder (T5 model)
T5 has an encoder-decoder architecture, also called the full transformer. The input text goes to the encoder, and the decoder outputs the next token, which is then appended to the decoder input. The decoder uses cross-attention to bring information from the encoder into the decoder.
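Cross-attention can be sketched as ordinary attention in which the queries come from the decoder while the keys and values come from the encoder. The sketch below is a simplification under stated assumptions: a single head, no learned projection matrices, and made-up 2-dimensional states.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Each decoder position builds a weighted average of encoder states.

    Queries are the decoder states; keys and values are the encoder states.
    Learned projections are omitted to keep the sketch short.
    """
    d = len(encoder_states[0])
    out = []
    for q in decoder_states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in encoder_states]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, encoder_states))
                    for j in range(d)])
    return out

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input tokens
decoder_states = [[0.5, 0.5], [2.0, -1.0]]             # 2 output tokens so far

mixed = cross_attention(decoder_states, encoder_states)
print(len(mixed), len(mixed[0]))  # 2 2
```

Note that the output has one vector per decoder position, not per encoder position: the decoder decides what to look up, the encoder supplies what can be found.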
Decoder-Only vs Encoder-Decoder
The intuition is that a decoder-only model just appends text, so if there is a significant distribution difference between the input and the output, for example a completely different set of tokens, we can expect an encoder-decoder to work better. The decoder also sees only the past (the prefix), so any task that involves seeing the entire text context and addressing specific tokens is a bit harder for it. However, decoder-only is a simpler architecture than encoder-decoder, it is already Turing-complete, and model size and training are likely the biggest factors in most cases (The Bitter Lesson).
To make a relevant apples-to-apples comparison, we can compare these architectures in a latency-matched, compute-matched, or parameter-matched way, but it is hard to get rid of major differences in training objectives, which likely play the decisive role.
In the Flan-UL2 paper, the authors attempted to reduce training differences by reformulating the fill-in-the-blank task (denoising) into a generative one (an autoregressive or prefix language modelling setting). This is called Mixture of Denoisers. Furthermore, they seem to use the same encoder-decoder model in both a decoder-only way and an encoder-decoder way. Their best model in that paper was a 20B-parameter encoder-decoder.
Furthermore, compute-matched encoder-decoder models in the UL2 paper have approximately twice as many parameters as the decoder-only models, but similar speed and accuracy. This suggests that the encoder-decoder may have more sparsity, which could be removed with pruning or distillation techniques so that it eventually outperforms.
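One back-of-the-envelope way to see why twice the parameters can cost about the same compute is sketched below. This is my own rough reasoning, not a calculation taken from the UL2 paper. It assumes the common rule of thumb of about 2 FLOPs per parameter per token in a forward pass, and it ignores cross-attention, embeddings, and the quadratic attention cost. All numbers are illustrative.

```python
def forward_flops(params_in_stack, tokens_through_stack):
    # Rule of thumb (an assumption): ~2 FLOPs per parameter per token.
    return 2 * params_in_stack * tokens_through_stack

P = 1_000_000_000        # parameters in one stack (illustrative)
n_in, n_out = 512, 128   # input and output lengths (illustrative)

# Decoder-only: a single stack of P parameters processes every token.
decoder_only_params = P
decoder_only_flops = forward_flops(P, n_in + n_out)

# Encoder-decoder: two stacks of P parameters each. Input tokens pass through
# the encoder only, output tokens through the decoder only.
enc_dec_params = 2 * P
enc_dec_flops = forward_flops(P, n_in) + forward_flops(P, n_out)

print(enc_dec_params // decoder_only_params)  # 2
print(enc_dec_flops == decoder_only_flops)    # True
```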
In this older pre-RLHF paper, encoder-decoder models trained with masked language modeling achieve the best zero-shot performance after multitask finetuning.
As a detail, there is a difference between a decoder-only causal LM and a prefix LM. A prefix LM has a section (the prefix) with non-causal, bidirectional token dependencies, like BERT.
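The difference between the two is easiest to see in the attention masks. Below is a minimal sketch in which `1` at row `i`, column `j` means "position `i` may attend to position `j`". The function names and the `prefix_len` parameter are my own choices for illustration.

```python
def causal_mask(n):
    """Each position attends only to itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def prefix_lm_mask(n, prefix_len):
    """The first prefix_len positions attend to each other bidirectionally;
    the remaining positions stay causal."""
    return [[1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
            for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

for row in prefix_lm_mask(4, prefix_len=2):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The only change is in the top-left block: with a prefix of length 2, position 0 can also see position 1, which a causal mask forbids.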
Which To Choose From Encoder, Decoder, or Encoder-Decoder Transformer?
Personally, I would choose based on which pretrained model is available and how easy it is to adapt it to the task at hand. It is unclear from the start which architecture will be the best. Some minor considerations could be the following:
- encoder-only: vector embeddings for classification, clustering, search
- decoder-only: strong at text generation tasks (models for prompting, chatting)
- encoder-decoder: strong for natural language understanding (NLU), for example translation, question answering, summarization