
Transformer Embeddings and Tokenization

How transformers convert words and other objects to vectors and back.
  • A transformer (e.g. BERT) is a sequence-to-sequence neural network architecture.
  • Input text is encoded with a tokenizer into a sequence of integer token IDs.
  • Input tokens are mapped to a sequence of embedding vectors via the embedding layer.
  • Output embeddings can be classified back into a sequence of tokens.
  • Output tokens can then be converted back to text.

Figure: embeddings in the transformer architecture


  • Input text is split into chunks of characters called tokens, which have to be present in the tokenizer's dictionary.
  • The token vocabulary contains around 100k of the most common character sequences from the training text.
  • Tokens often correspond to words or word chunks around 4 characters long, with whitespace or special characters prepended.
  • The embedding layer maps each token to a vector, in other words to a sequence of numbers (see the sketch below).
  • The input and output embedding layers often share the same token-to-vector mapping.
  • Common tokenization algorithms are BPE, WordPiece, Unigram, and SentencePiece, compared below.

Figure: tokenization and the embedding layer of a transformer
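
The embedding lookup can be pictured as a table indexed by token ID. Below is a minimal PyTorch sketch with toy sizes (real models use vocabularies of roughly 30k-100k tokens and hundreds of dimensions); the weight tying at the end illustrates the shared input-output mapping mentioned above.

import torch
import torch.nn as nn

vocab_size, dim = 100, 8                   # toy sizes for illustration
embedding = nn.Embedding(vocab_size, dim)  # the token-to-vector lookup table
token_ids = torch.tensor([[5, 42, 7]])     # a tokenizer's integer output
vectors = embedding(token_ids)             # shape (1, 3, 8): sequence of embeddings
# weight tying: the output layer can reuse the same table to score tokens
logits = vectors @ embedding.weight.T      # shape (1, 3, vocab_size)
print(vectors.shape, logits.shape)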

BPE Tokenizer

Byte-Pair-Encoding (BPE) algorithm (a toy implementation follows the steps below):

  1. BPE pre-tokenizes text by splitting on spaces.
  2. Start with only single characters as tokens.
  3. Merge the highest-frequency token pair found in the text.
  4. Stop if the maximum vocabulary size is reached; otherwise loop to the previous step.
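
A minimal sketch of these steps in plain Python (a toy illustration, not the optimized merge-rule implementation that real tokenizer libraries use):

from collections import Counter

def bpe_train(corpus, max_vocab_size):
    # step 1: pre-tokenize by splitting on spaces; words become character tuples
    words = Counter(tuple(word) for text in corpus for word in text.split())
    # step 2: start with only single characters as tokens
    vocab = {ch for word in words for ch in word}
    while len(vocab) < max_vocab_size:  # step 4: stop at max vocabulary size
        # step 3: count adjacent token pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]  # highest-frequency pair
        vocab.add(a + b)
        merged = Counter()
        for word, freq in words.items():  # apply the merge inside every word
            out, i = [], 0
            while i < len(word):
                if word[i:i + 2] == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

print(sorted(bpe_train(["low lower lowest new newer"], 15)))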

WordPiece vs BPE Tokenizer

  • WordPiece merges the token pair with the highest score count(ab) / (count(a) × count(b)), i.e., pair frequency normalized by the frequencies of its parts (toy scoring example below).
  • Used for BERT, DistilBERT, and Electra.
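
In code, the difference from BPE is only the merge score. Here is a hedged sketch with made-up corpus counts (the "##" prefix marks word-internal tokens, as in BERT's vocabulary):

# hypothetical corpus statistics, for illustration only
count = {"un": 50, "##able": 30, "un##able": 10, "th": 800, "##e": 900, "th##e": 700}

def wordpiece_score(a, b):
    # count(ab) / (count(a) * count(b)): frequent parts lower the score
    return count[a + b] / (count[a] * count[b])

# the rare pair wins despite a lower raw pair count:
print(wordpiece_score("un", "##able"))  # 10 / (50 * 30)   ~ 0.0067
print(wordpiece_score("th", "##e"))     # 700 / (800 * 900) ~ 0.001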

Unigram Tokenizer

  • Unigram, instead of merging and adding tokens like BPE, removes them.
  • It starts with a very large vocabulary and removes a fixed number of symbols such that the loss over the training corpus increases minimally.
  • Stop if the target vocabulary size is reached; otherwise loop to the previous step.
  • To disambiguate between possible tokenizations, a probability of each token's occurrence is used, and packaged with the tokenizer (see the sketch below).
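
A hedged sketch of that disambiguation: given made-up token probabilities, pick the segmentation whose log-probabilities sum highest (real implementations run the Viterbi algorithm over a large vocabulary):

import math
from functools import lru_cache

# toy token probabilities; a real Unigram model ships these with the tokenizer
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.15,
         "ug": 0.20, "hug": 0.30, "s": 0.10, "hugs": 0.01}

def best_segmentation(text):
    @lru_cache(None)
    def best(i):
        # best (log-probability, tokens) for the suffix text[i:]
        if i == len(text):
            return 0.0, ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in probs:
                log_prob, rest = best(j)
                candidates.append((math.log(probs[piece]) + log_prob, (piece,) + rest))
        return max(candidates)
    return best(0)[1]

print(best_segmentation("hugs"))  # ('hug', 's') beats the rarer single token 'hugs'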

SentencePiece vs WordPiece Tokenizer

  • Japanese, Korean, and Chinese don't separate words with spaces.
  • SentencePiece removes pre-tokenization (splitting on spaces).
  • Instead it tokenizes the raw text stream, usually with Unigram or alternatively with BPE.
  • T5, ALBERT, XLNet, and MarianMT use SentencePiece with Unigram (example below).
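
You can see the missing pre-tokenization directly: SentencePiece keeps whitespace as the "▁" symbol inside the tokens, so tokenization is reversible. A small sketch with the Transformers package (assumes the "t5-small" checkpoint; needs pip install transformers sentencepiece):

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
# whitespace survives as the "▁" symbol inside the tokens themselves
print(tokenizer.tokenize("Hello world."))  # e.g. ['▁Hello', '▁world', '.']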

Tokenizers vs Encoders

  • Tokenizers are not suitable for modalities like images or speech.
  • Architectures like Vision Transformer (ViT) or MMBT encode input without a tokenizer.
  • Inputs to a transformer can instead be encoded with another neural network.
  • The output of the encoding layer has to be a sequence of embeddings for the transformer (see the ViT-style sketch below).
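
For images, the tokenizer is replaced by a patch encoder. Below is a hedged ViT-style sketch in PyTorch (sizes are illustrative, not any specific model's configuration):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size, dim = 16, 768
# a convolution with stride == kernel size cuts the image into 16x16
# patches and linearly projects each patch to a dim-sized embedding
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
embeddings = to_patches(image).flatten(2).transpose(1, 2)
print(embeddings.shape)  # torch.Size([1, 196, 768]): a sequence for the transformer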

Positional Encodings add Token Order Information
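
Self-attention is permutation-invariant, so information about token order is injected by adding positional encodings to the input embeddings. A minimal sketch of the sinusoidal variant from the original transformer paper (assumes an even embedding dimension):

import torch

def sinusoidal_positions(seq_len, dim):
    positions = torch.arange(seq_len).unsqueeze(1).float()
    frequencies = torch.pow(10000, torch.arange(0, dim, 2).float() / dim)
    angles = positions / frequencies        # (seq_len, dim / 2)
    encoding = torch.zeros(seq_len, dim)
    encoding[:, 0::2] = torch.sin(angles)   # even dimensions get sine
    encoding[:, 1::2] = torch.cos(angles)   # odd dimensions get cosine
    return encoding

# added element-wise to the (seq_len, dim) token embeddings
print(sinusoidal_positions(seq_len=4, dim=8))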

Explore Yourself

Try out the DistilBERT WordPiece tokenizer and its embeddings using the Transformers package.

# pip install transformers && pip install torch

from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is a input.', return_tensors='pt')
for token in tokens[0]:
    # print each token id and the text it decodes back to
    print(token.item(), tokenizer.decode([token.item()]))

model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for e in model.embeddings.word_embeddings(tokens)[0]:
    # print the embedding vector for each input token
    print(e)

Created on 05 Jun 2022. Updated on 18 Jun 2022.
