
Transformer Embeddings and Tokenization

How transformers convert words and other objects to vectors and back.
  • A transformer (e.g. BERT) is a sequence-to-sequence neural network architecture.
  • Input text is encoded with a tokenizer into a sequence of integer token IDs.
  • Input tokens are mapped to a sequence of embedding vectors via the embedding layer.
  • Output embeddings can be classified back into a sequence of tokens.
  • Output tokens can then be converted back to text.

Figure: embeddings in the transformer architecture


  • Input text is split into chunks of characters called tokens, which have to be present in the tokenizer's dictionary.
  • The token vocabulary contains around 100k of the most common character sequences from the training text.
  • Tokens often correspond to words or word chunks around 4 characters long, with whitespace or special characters prepended.
  • The embedding layer maps each token to a vector, in other words to a sequence of numbers (see the sketch below).
  • The input and output embedding layers often share the same token-to-vector mapping.
  • Common tokenization algorithms are BPE, WordPiece, Unigram, and SentencePiece, compared below.

Figure: tokenization and the embedding layer of a transformer
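
The embedding lookup can be pictured as a table indexed by token ID. Below is a minimal PyTorch sketch with toy sizes (real models use vocabularies of roughly 30k-100k tokens and hundreds of dimensions); the weight tying at the end illustrates the shared input-output mapping mentioned above.

import torch
import torch.nn as nn

vocab_size, dim = 100, 8                   # toy sizes for illustration
embedding = nn.Embedding(vocab_size, dim)  # the token-to-vector lookup table
token_ids = torch.tensor([[5, 42, 7]])     # a tokenizer's integer output
vectors = embedding(token_ids)             # shape (1, 3, 8): sequence of embeddings
# weight tying: the output layer can reuse the same table to score tokens
logits = vectors @ embedding.weight.T      # shape (1, 3, vocab_size)
print(vectors.shape, logits.shape)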

BPE Tokenizer

Byte-Pair-Encoding (BPE) algorithm (a toy implementation follows the steps below):

  1. BPE pre-tokenizes text by splitting on spaces.
  2. Start with only single characters as tokens.
  3. Merge the highest-frequency token pair found in the text.
  4. Stop if the maximum vocabulary size is reached; otherwise loop to the previous step.
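
A minimal sketch of these steps in plain Python (a toy illustration, not the optimized merge-rule implementation that real tokenizer libraries use):

from collections import Counter

def bpe_train(corpus, max_vocab_size):
    # step 1: pre-tokenize by splitting on spaces; words become character tuples
    words = Counter(tuple(word) for text in corpus for word in text.split())
    # step 2: start with only single characters as tokens
    vocab = {ch for word in words for ch in word}
    while len(vocab) < max_vocab_size:  # step 4: stop at max vocabulary size
        # step 3: count adjacent token pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]  # highest-frequency pair
        vocab.add(a + b)
        merged = Counter()
        for word, freq in words.items():  # apply the merge inside every word
            out, i = [], 0
            while i < len(word):
                if word[i:i + 2] == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

print(sorted(bpe_train(["low lower lowest new newer"], 15)))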

WordPiece vs BPE Tokenizer

  • WordPiece merges the token pair with the highest score count(ab) / (count(a) × count(b)), i.e., pair frequency normalized by the frequencies of its parts (toy scoring example below).
  • Used for BERT, DistilBERT, and Electra.
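
In code, the difference from BPE is only the merge score. Here is a hedged sketch with made-up corpus counts (the "##" prefix marks word-internal tokens, as in BERT's vocabulary):

# hypothetical corpus statistics, for illustration only
count = {"un": 50, "##able": 30, "un##able": 10, "th": 800, "##e": 900, "th##e": 700}

def wordpiece_score(a, b):
    # count(ab) / (count(a) * count(b)): frequent parts lower the score
    return count[a + b] / (count[a] * count[b])

# the rare pair wins despite a lower raw pair count:
print(wordpiece_score("un", "##able"))  # 10 / (50 * 30)   ~ 0.0067
print(wordpiece_score("th", "##e"))     # 700 / (800 * 900) ~ 0.001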

Unigram Tokenizer

  • Unigram, instead of merging and adding tokens like BPE, removes them.
  • It starts with a very large vocabulary and removes a fixed number of symbols such that the loss over the training corpus increases minimally.
  • Stop if the target vocabulary size is reached; otherwise loop to the previous step.
  • To disambiguate between possible tokenizations, a probability of each token's occurrence is used, and packaged with the tokenizer (see the sketch below).
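
A hedged sketch of that disambiguation: given made-up token probabilities, pick the segmentation whose log-probabilities sum highest (real implementations run the Viterbi algorithm over a large vocabulary):

import math
from functools import lru_cache

# toy token probabilities; a real Unigram model ships these with the tokenizer
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.15,
         "ug": 0.20, "hug": 0.30, "s": 0.10, "hugs": 0.01}

def best_segmentation(text):
    @lru_cache(None)
    def best(i):
        # best (log-probability, tokens) for the suffix text[i:]
        if i == len(text):
            return 0.0, ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in probs:
                log_prob, rest = best(j)
                candidates.append((math.log(probs[piece]) + log_prob, (piece,) + rest))
        return max(candidates)
    return best(0)[1]

print(best_segmentation("hugs"))  # ('hug', 's') beats the rarer single token 'hugs'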

SentencePiece vs WordPiece Tokenizer

  • Japanese, Korean, and Chinese don't separate words with spaces.
  • SentencePiece removes pre-tokenization (splitting on spaces).
  • Instead it tokenizes the raw text stream, usually with Unigram or alternatively with BPE.
  • T5, ALBERT, XLNet, and MarianMT use SentencePiece with Unigram (example below).
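
You can see the missing pre-tokenization directly: SentencePiece keeps whitespace as the "▁" symbol inside the tokens, so tokenization is reversible. A small sketch with the Transformers package (assumes the "t5-small" checkpoint; needs pip install transformers sentencepiece):

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
# whitespace survives as the "▁" symbol inside the tokens themselves
print(tokenizer.tokenize("Hello world."))  # e.g. ['▁Hello', '▁world', '.']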

Tokenizers vs Encoders

  • Tokenizers are not suitable for modalities like images or speech.
  • Architectures like Vision Transformer (ViT) or MMBT encode input without a tokenizer.
  • Inputs to a transformer can instead be encoded with another neural network.
  • The output of the encoding layer has to be a sequence of embeddings for the transformer (see the ViT-style sketch below).
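
For images, the tokenizer is replaced by a patch encoder. Below is a hedged ViT-style sketch in PyTorch (sizes are illustrative, not any specific model's configuration):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size, dim = 16, 768
# a convolution with stride == kernel size cuts the image into 16x16
# patches and linearly projects each patch to a dim-sized embedding
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
embeddings = to_patches(image).flatten(2).transpose(1, 2)
print(embeddings.shape)  # torch.Size([1, 196, 768]): a sequence for the transformer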

Positional Encodings add Token Order Information
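
Self-attention is permutation-invariant, so information about token order is injected by adding positional encodings to the input embeddings. A minimal sketch of the sinusoidal variant from the original transformer paper (assumes an even embedding dimension):

import torch

def sinusoidal_positions(seq_len, dim):
    positions = torch.arange(seq_len).unsqueeze(1).float()
    frequencies = torch.pow(10000, torch.arange(0, dim, 2).float() / dim)
    angles = positions / frequencies        # (seq_len, dim / 2)
    encoding = torch.zeros(seq_len, dim)
    encoding[:, 0::2] = torch.sin(angles)   # even dimensions get sine
    encoding[:, 1::2] = torch.cos(angles)   # odd dimensions get cosine
    return encoding

# added element-wise to the (seq_len, dim) token embeddings
print(sinusoidal_positions(seq_len=4, dim=8))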

Explore Yourself

Try out the DistilBERT WordPiece tokenizer and its embeddings using the Transformers package.

# pip install transformers && pip install torch

from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is a input.', return_tensors='pt')
for token in tokens[0]:
    # print each token id and the text it decodes back to
    print(token.item(), tokenizer.decode([token.item()]))

model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for e in model.embeddings.word_embeddings(tokens)[0]:
    # print the embedding vector for each input token
    print(e)

Created on 05 Jun 2022. Updated on 18 Jun 2022.
