Tokenization in Machine Learning Explained

Tokenization splits the input data into a sequence of meaningful parts, e.g. a word, an image patch, or a sentence of a document.

Tokenization in NLP

  • Input text is split into character chunks called tokens using a dictionary (vocabulary).
  • The vocabulary typically contains around 100k of the most common sequences from the training text.
  • Tokens often correspond to words, roughly 4 characters long, with prepended whitespace or special characters.
  • Common tokenization algorithms are BPE, WordPiece, and SentencePiece.
  • Text tokens can be converted back to text, but sometimes with a loss of information (e.g. normalized whitespace or casing).
  • Tokenization in NLP is a form of compression: dictionary coding.

tokenization and embedding layer for transformer

Tokenization in Continuous Modalities: Vision and Speech

  • Tokenizers in the NLP sense are not used for continuous modalities like images or speech.
  • Instead, the image or audio is split into a grid of patches, without a dictionary equivalent as in the case of text.
  • Image architectures split the image into patches and then encode them: Vision Transformer (ViT) uses non-overlapping fixed-size patches, while convolutional networks (ResNets) effectively process overlapping patches.
  • The resulting embeddings can then be passed to, e.g., a transformer (CMA-CLIP or MMBT).

tokenization and embedding in Vision Transformer ViT
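The ViT-style patch splitting can be sketched with NumPy. The image size, patch size, and embedding dimension below are the standard ViT-Base values; the random projection matrix is only a stand-in for ViT's learned linear embedding:

```python
import numpy as np

# Sketch of ViT-style patch "tokenization": a 224x224 RGB image becomes a
# sequence of 16x16 non-overlapping patches, each flattened into a vector.
image = np.random.rand(224, 224, 3)
P = 16                                       # patch size
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)
# patches.shape == (196, 768): 196 patch "tokens" of 16*16*3 values each

W = np.random.rand(P * P * 3, 512)           # stand-in for the learned projection
embeddings = patches @ W                     # (196, 512), fed to the transformer
```

Unlike text tokens, these patches are continuous vectors with no index into a finite dictionary.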

Quantization

  • related to tokenization in that it outputs a finite number of items from a dictionary
  • used in Wav2vec 2.0, DALL-E 1, and VQ-VAE
  • replaces the input vector with the closest vector from a finite dictionary of vectors called a codebook
  • during training, the backward pass propagates gradients through the non-differentiable lookup, e.g. with a Gumbel softmax over the codebook (Wav2vec 2.0) or a straight-through estimator (VQ-VAE)
  • product quantization: concatenation of several quantizations, then a linear transformation
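The codebook lookup can be sketched in a few lines of NumPy. The codebook here is random, standing in for one learned during training; a real VQ-VAE would also add the commitment loss and gradient estimator:

```python
import numpy as np

# Minimal sketch of vector quantization: replace each input vector with its
# nearest codebook entry (Euclidean distance). Sizes are illustrative.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))    # 512 code vectors of dimension 64
inputs = rng.normal(size=(10, 64))       # e.g. 10 encoder output vectors

# Squared distance between every input and every code vector.
d = ((inputs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = d.argmin(axis=1)               # discrete "tokens" into the codebook
quantized = codebook[indices]            # continuous vectors passed downstream
```

The `indices` play the role of token ids, which is what makes quantization a bridge between continuous inputs and dictionary-based models.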

The Most Common Tokenizers in NLP

A list of commonly used tokenizers sorted by their date of introduction.

FastText Tokenizer

  • Older models like Word2vec or FastText used simple tokenizers that, after some preprocessing, simply split the text on whitespace characters. The resulting chunks are usually words of a natural language.
  • Then, if a character chunk is present in a dictionary of the most common chunks, the tokenizer returns its index in the dictionary.
  • If not found, most tokenizers before FastText returned a special token called the unknown token. FastText solved this problem by additionally splitting words into fixed-size “subwords” (character n-grams); to find out more details about FastText, read this post.
  • Other tokenizers continued to return the unknown token until SentencePiece, which includes all single characters and so almost never returns the unknown token.
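The subword splitting can be sketched as character n-gram extraction. Following the FastText paper, the word gets boundary markers `<` and `>`, and the word itself is kept alongside its n-grams:

```python
# Sketch of FastText-style subwords: a word is represented by its character
# n-grams (here n=3) plus the whole word, with boundary markers < and >.
def subwords(word: str, n: int = 3) -> list[str]:
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

subwords("where")
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

An out-of-vocabulary word can then be embedded by summing the vectors of its n-grams, avoiding the unknown token.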

BPE Tokenizer

Byte-Pair-Encoding (BPE) algorithm:

  1. pre-tokenize the text by splitting on spaces
  2. start with only single characters as tokens
  3. merge the most frequent token pair found in the text and add it to the vocabulary
  4. stop if the maximum vocabulary size is reached, otherwise loop to the previous step
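The steps above can be sketched on a toy corpus. This is a minimal illustration of the merge loop, without the pair-count bookkeeping optimizations of real BPE implementations:

```python
from collections import Counter

# Minimal sketch of BPE vocabulary construction on a toy corpus: repeatedly
# merge the most frequent adjacent token pair until the vocabulary is full.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
words = [list(w) for w in corpus]              # step 2: characters as tokens
vocab = set(c for w in words for c in w)       # 10 distinct characters
max_vocab_size = 12

while len(vocab) < max_vocab_size:
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    a, b = pairs.most_common(1)[0][0]          # step 3: most frequent pair
    vocab.add(a + b)
    # Apply the merge: replace every occurrence of the pair with one token.
    for w in words:
        i = 0
        while i < len(w) - 1:
            if w[i] == a and w[i + 1] == b:
                w[i:i + 2] = [a + b]
            else:
                i += 1
```

With `max_vocab_size = 12`, two merges happen and multi-character tokens such as "lo" enter the vocabulary alongside the single characters.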

WordPiece vs BPE Tokenizer

  • WordPiece merges the token pair with the highest score count(ab) / (count(a) · count(b))
  • Used in BERT, DistilBERT, Electra
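The difference from BPE can be seen on toy counts (the numbers below are hypothetical): normalizing by the parts' counts lets a rarer pair win when its parts almost always co-occur.

```python
# Hypothetical token counts to contrast the WordPiece score with raw BPE counts.
count = {"a": 100, "b": 50, "ab": 40, "x": 5, "y": 5, "xy": 5}

def score(a: str, b: str) -> float:
    # WordPiece merge score: count(ab) / (count(a) * count(b))
    return count[a + b] / (count[a] * count[b])

# BPE would merge ("a", "b") first (40 > 5 occurrences), but WordPiece
# prefers ("x", "y"): 5 / (5 * 5) = 0.2 beats 40 / (100 * 50) = 0.008.
assert score("x", "y") > score("a", "b")
```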

Unigram Tokenizer

  • Unigram constructs its vocabulary in the opposite direction to BPE: instead of merging tokens and adding them, it removes them
  • A vocabulary loss is constructed with expectation maximization, summing over all tokenizations of all the corpus’s subsequences
    • The probability of each token is approximated as independent of the other tokens
  • Start with a very large vocabulary and remove a fixed number of symbols such that the vocabulary loss increases minimally
  • Stop if the target vocabulary size is reached, otherwise loop to the previous step
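The independence assumption above means a tokenization is scored as a product of per-token probabilities. A tiny sketch with hypothetical token probabilities (a real tokenizer fits these with EM over the corpus):

```python
import math

# Hypothetical unigram token probabilities; a real Unigram tokenizer
# estimates these with expectation maximization over the training corpus.
prob = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1, "hug": 0.1, "hugs": 0.15}

def log_prob(tokens: list[str]) -> float:
    # Independence assumption: sum of per-token log probabilities.
    return sum(math.log(prob[t]) for t in tokens)

# The model prefers the tokenization with the highest total probability:
assert log_prob(["hugs"]) > log_prob(["hug", "s"]) > log_prob(["h", "u", "g", "s"])
```

Removing a token from the vocabulary forces words through the remaining tokenizations; the tokens whose removal raises this loss the least are dropped first.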

SentencePiece vs WordPiece Tokenizer

  • Japanese, Korean, and Chinese don’t separate words with a space
  • SentencePiece removes pre-tokenization (splitting on spaces)
  • instead it tokenizes the raw text stream, usually with Unigram or alternatively with BPE
  • T5, ALBERT, XLNet, and MarianMT use SentencePiece with Unigram

Created on 16 Sep 2022. Updated on: 16 Sep 2022.
Thank you
