Transformer Embeddings and Tokenization

A transformer is a sequence-to-sequence neural network architecture.
A tokenizer encodes the input text into a sequence of integers called input tokens.
An embedding layer maps the input tokens to a sequence of vectors (word embeddings).
The output vectors (embeddings) can be classified into a sequence of output tokens.
The output tokens can then be decoded back into text.

Figure: embeddings in the transformer architecture

Tokenization vs Embedding

The input is tokenized first, and the tokens are then embedded.
On the output side, the output embeddings are classified back into tokens, which can then be decoded into text.
Tokenization converts a text into a list of integers.
Embedding converts the list of integers into a list of vectors (a list of embeddings).
Positional information about each token is added to the embeddings using positional encodings or positional embeddings.
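
The two steps above can be sketched with plain Python lists. This is a toy illustration only: the vocabulary, the whitespace splitting rule, and the 3-dimensional vectors are made up for the example, while a real tokenizer and embedding layer are learned from data.

```python
# Toy vocabulary (hypothetical): a real tokenizer has tens of thousands of entries.
vocab = {"[UNK]": 0, "this": 1, "is": 2, "an": 3, "input": 4, ".": 5}
id_to_token = {i: t for t, i in vocab.items()}

# Toy embedding table: one 3-dimensional vector per token id (values are made up).
embedding_table = [
    [0.0, 0.0, 0.0],    # [UNK]
    [0.1, 0.3, -0.2],   # this
    [0.4, -0.1, 0.0],   # is
    [0.2, 0.2, 0.5],    # an
    [-0.3, 0.6, 0.1],   # input
    [0.0, -0.5, 0.3],   # .
]


def tokenize(text):
    """Text -> list of integer token ids (lowercase, split on spaces and periods)."""
    words = text.lower().replace(".", " .").split()
    return [vocab.get(word, vocab["[UNK]"]) for word in words]


def embed(token_ids):
    """List of token ids -> list of vectors, by table lookup."""
    return [embedding_table[i] for i in token_ids]


def decode(token_ids):
    """List of token ids -> text."""
    return " ".join(id_to_token[i] for i in token_ids)


token_ids = tokenize("This is an input.")
vectors = embed(token_ids)
print(token_ids)                      # [1, 2, 3, 4, 5]
print(len(vectors), len(vectors[0]))  # 5 3
print(decode(token_ids))              # this is an input .
```

Tokenization is a lookup from text pieces to integers, and embedding is a second lookup from integers to vectors. Only the second table is a trainable part of the network.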

Tokenization

Tokenization cuts the input data into parts (symbols) that can be mapped (embedded) into a vector space.
For example, transformer tokenizers split the input text into frequent words or subwords.
Sometimes special tokens are added to the sequence, e.g. the class token ([CLS]), whose output embedding is used for classification in the BERT transformer.
Tokens are mapped to vectors (embedded, represented), which are passed into neural networks.
The position of each token in the sequence is often vectorized as well and added to the word embeddings (positional encodings).
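
Splitting into frequent parts and adding special tokens can be sketched as a greedy longest-match-first search, in the style of WordPiece. This is a simplified illustration, not the library implementation: the tiny vocabulary is hypothetical, and the "##" continuation prefix and the closing [SEP] token follow the BERT convention, which the text above does not cover.

```python
def split_word(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on spaces, split words into pieces, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary of frequent pieces.
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "transform", "##er", "token", "##ization"}
print(encode("Transformer tokenization", toy_vocab))
# ['[CLS]', 'transform', '##er', 'token', '##ization', '[SEP]']
```

Rare words fall apart into several frequent pieces, so the vocabulary stays small while almost any text can still be represented.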

Positional Encodings add Token Order Information

Self-attention and feed-forward layers are symmetrical with respect to the input: on their own, reordering the input tokens only reorders the outputs in the same way.
So positional information about each input token has to be provided explicitly.
For that reason, positional encodings or positional embeddings are added to the token embeddings in a transformer.
Encodings are fixed functions selected manually (by a human), while embeddings are learned (trained).
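
As an example of a manually selected encoding, here is a minimal sketch of the sinusoidal positional encoding from the original transformer paper, written with plain Python lists. The function names and the 4-dimensional toy embeddings are assumptions made for the illustration.

```python
import math


def sinusoidal_encoding(position, dim):
    """Fixed (not learned) vector of length dim that encodes one position."""
    encoding = []
    for i in range(dim):
        # Each pair of dimensions (2k, 2k+1) shares one wavelength.
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(token_embeddings):
    """Add the encoding of each position to the token embedding at that position."""
    dim = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(vector, sinusoidal_encoding(position, dim))]
        for position, vector in enumerate(token_embeddings)
    ]


# The same token twice: identical embeddings before positions are added.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(same_token_twice)
print(with_positions[0])                       # [0.5, 1.5, 0.5, 1.5]
print(with_positions[0] != with_positions[1])  # True
```

After the addition, two occurrences of the same token no longer have the same vector, so the following layers can tell them apart by position. A learned positional embedding would replace `sinusoidal_encoding` with a trainable lookup table indexed by position.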

Word Embeddings

Embedding layers map tokens to word vectors (sequences of numbers) called word embeddings.
The input and output embedding layers often share the same token-vector mapping.
Embeddings contain semantic information about the word.
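
A shared token-vector mapping can be sketched as follows: the same table is used to look up input vectors and to score output vectors against every token. The three-word vocabulary and the 2-dimensional unit-length vectors are made up for the illustration, and the dot-product scoring is one common choice assumed here, not something the text above specifies.

```python
# Hypothetical embedding table shared by the input and output layers.
vocab = ["cat", "dog", "car"]
embedding_table = [
    [1.0, 0.0],  # cat
    [0.8, 0.6],  # dog (points in a similar direction to cat)
    [0.0, 1.0],  # car
]


def embed(token_id):
    """Input side: token id -> vector."""
    return embedding_table[token_id]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(output_vector):
    """Output side: score the vector against every row of the same table."""
    logits = [dot(output_vector, row) for row in embedding_table]
    return max(range(len(logits)), key=logits.__getitem__)


output_vector = [0.1, 0.8]  # pretend this came out of the last transformer layer
print(vocab[classify(output_vector)])  # car

# Semantically related words have more similar vectors in this toy table.
print(dot(embed(0), embed(1)) > dot(embed(0), embed(2)))  # True
```

Sharing the table means the model stores only one vector per token, and a token's vector plays the same role when it is read as input and when it is predicted as output.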

Explore Yourself

Try out the BERT WordPiece tokenizer and its embeddings (here via DistilBERT, which uses the same tokenizer) using the Transformers package.

# pip install transformers torch

from transformers import DistilBertTokenizerFast, DistilBertModel

# Tokenize: text -> tensor of token ids with shape (1, sequence_length).
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is an input.', return_tensors='pt')
print("These are tokens!", tokens)
for token in tokens[0]:
    print("This is a decoded token!", tokenizer.decode([token]))

# Embed: token ids -> one word-embedding vector per token.
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
embeddings = model.embeddings.word_embeddings(tokens)
print(embeddings)
for e in embeddings[0]:
    print("This is an embedding!", e)

Vaclav Kosar
