Transformer Embeddings and Tokenization

How transformers convert text and other data to vectors and back using tokenization, positional encoding, and embedding layers.
  • A transformer is a sequence-to-sequence neural network architecture.
  • Input text is encoded by a tokenizer into a sequence of integers called input tokens.
  • Input tokens are mapped to a sequence of vectors (word embeddings) by an embedding layer.
  • Output vectors (embeddings) can be classified into a sequence of output tokens.
  • Output tokens can then be decoded back into text.
embeddings in transformer architecture
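
The round trip from text to tokens to vectors and back can be sketched with a toy example. This is only an illustration under stated assumptions: the five-word vocabulary, the whitespace tokenizer, the random unit-length embedding table, and the helper names (tokenize, embed, classify, decode) are all hypothetical, and the transformer layers that would normally transform the vectors between embed and classify are skipped.

```python
import math
import random

# Hypothetical toy vocabulary; real tokenizers learn subword vocabularies from data.
vocab = ["[UNK]", "this", "is", "an", "input"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def tokenize(text):
    """Text -> list of integer token ids (whitespace split, toy only)."""
    return [token_to_id.get(word, 0) for word in text.lower().split()]

def decode(ids):
    """List of token ids -> text."""
    return " ".join(vocab[i] for i in ids)

def unit(vec):
    """Scale a vector to length 1 so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Embedding table: one vector per token id (random here, learned in a real model).
random.seed(0)
DIM = 8
embedding_table = [unit([random.gauss(0, 1) for _ in range(DIM)]) for _ in vocab]

def embed(ids):
    """List of token ids -> list of vectors."""
    return [embedding_table[i] for i in ids]

def classify(vectors):
    """Each vector -> id of the token whose embedding has the highest dot product."""
    def score(vec, token_id):
        return sum(a * b for a, b in zip(vec, embedding_table[token_id]))
    return [max(range(len(vocab)), key=lambda i: score(vec, i)) for vec in vectors]

ids = tokenize("This is an input")
vectors = embed(ids)            # a real model would transform these vectors here
output_ids = classify(vectors)
print(ids)                      # [1, 2, 3, 4]
print(decode(output_ids))       # this is an input
```

Because nothing changes the vectors in this sketch, classification simply recovers the input tokens. In a trained model the output vectors differ from the input embeddings, so the classified tokens differ too.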

Tokenization vs Embedding

  • The input is tokenized, and the tokens are then embedded.
  • Output embeddings are classified back into tokens, which can then be decoded into text.
  • Tokenization converts text into a list of integers.
  • Embedding converts the list of integers into a list of vectors (a list of embeddings).
  • Positional information about each token is added to the embeddings using positional encodings or positional embeddings.
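
As a rough sketch of the last point, the snippet below adds fixed sinusoidal positional encodings to token embeddings. It assumes the sinusoidal scheme from the original Transformer paper; many models, BERT included, learn positional embeddings instead. The function names and the 4-dimensional example vector are hypothetical.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal encoding of one position: sine on even indices, cosine on odd ones."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings):
    """Add the encoding of position p element-wise to the p-th embedding."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]

# The same token at positions 0 and 1: same embedding, different final vectors.
token_vec = [0.1, -0.2, 0.3, 0.4]
with_positions = add_positions([token_vec, token_vec])
print(with_positions[0])
print(with_positions[1])
```

The two printed vectors differ even though the token is the same, which is how the model can tell identical tokens at different positions apart.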

Tokenization

Positional Encodings add Token Order Information

Word Embeddings

  • Embedding layers map tokens to word vectors (sequences of numbers) called word embeddings.
  • Input and output embedding layers often share the same token-vector mapping.
  • Embeddings contain semantic information about the word.
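
Sharing the token-vector mapping is often called weight tying. The sketch below illustrates it with a single hypothetical table used both for the input lookup and for output scoring. The tiny table sizes and function names are made up for illustration; the parameter count at the end assumes a vocabulary of 30522 tokens and 768-dimensional vectors, the sizes used by distilbert-base-uncased.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 5, 4

# One shared table: row i is the vector for token id i.
shared_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

def input_embedding(token_id):
    """Input side: look up the row for a token id."""
    return shared_table[token_id]

def output_logits(hidden):
    """Output side: score every token by a dot product with the same rows."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared_table]

logits = output_logits(input_embedding(2))
print(len(logits))  # 5, one score per token in the vocabulary

# Parameters saved by sharing one table instead of keeping two separate ones.
vocab_size, dim = 30522, 768
print(2 * vocab_size * dim - vocab_size * dim)  # 23440896
```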

Explore Yourself

Try out the DistilBERT tokenizer (BERT's WordPiece subword tokenizer, not BPE) and its embeddings using the Transformers package.

# pip install transformers torch

from transformers import DistilBertTokenizerFast, DistilBertModel

# Tokenization: text -> tensor of integer token ids (shape: 1 x sequence length)
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is an input.', return_tensors='pt')
print("These are tokens!", tokens)
# Decode each id on its own to see the token it stands for,
# including the special [CLS] and [SEP] tokens added by the tokenizer
for token in tokens[0]:
    print("This is a decoded token!", tokenizer.decode([token]))

# Embedding: token ids -> one vector per token (before positional embeddings are added)
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
embeddings = model.embeddings.word_embeddings(tokens)
print(embeddings)
for e in embeddings[0]:
    print("This is an embedding!", e)

Created on 05 Jun 2022. Updated on: 18 Jun 2022.
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.