- The Transformer is a sequence-to-sequence neural network architecture
- input text is encoded by a tokenizer into a sequence of integers called input tokens
- input tokens are mapped to a sequence of vectors (word embeddings) via an embedding layer
- output vectors (embeddings) can be classified into a sequence of output tokens
- output tokens can then be decoded back into text
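A minimal sketch of this tokenize, embed, classify, decode loop using the Transformers package. The model choice (distilgpt2, a small decoder-only language model) and the sample text are only illustrative assumptions, not something prescribed above:

    # pip install transformers torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    input_ids = tokenizer.encode("Transformers map text to", return_tensors="pt")  # text -> token ids
    outputs = model(input_ids)                 # token ids -> embeddings -> output vectors (logits)
    next_ids = outputs.logits.argmax(dim=-1)   # classify each output vector into a token id
    print(tokenizer.decode(next_ids[0]))       # token ids -> text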
Tokenization vs Embedding
- the input is tokenized, then the tokens are embedded
- output embeddings are classified back into tokens, which can then be decoded into text
- tokenization converts text into a list of integers (token IDs)
- embedding converts the list of integers into a list of vectors (a list of embeddings)
- positional information about each token is added to the embeddings using positional encodings or positional embeddings
- Tokenization cuts input data into meaningful parts that can be embedded into a vector space.
- An image is split into patches, text is split into tokens (frequent words and subword pieces), e.g. the BPE or WordPiece tokenization used in transformers.
- Token positions can be added to their embeddings.
- Special tokens can be added for pooling purposes, e.g. the class token ([CLS]) used for text classification in the BERT transformer; see the tokenizer example after this list.
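A quick illustration of subword splitting and special tokens, using the same DistilBERT tokenizer as the code at the end of this section (the sample sentence is arbitrary):

    from transformers import DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

    # rare words are cut into subword pieces, frequent words stay whole
    print(tokenizer.tokenize("Tokenization of untokenizable words"))

    # encoding also adds special tokens such as [CLS] and [SEP]
    ids = tokenizer.encode("Tokenization of untokenizable words")
    print(tokenizer.convert_ids_to_tokens(ids))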
Positional Encodings add Token Order Information
- self-attention and feed-forward layers are symmetrical with respect to the input order, i.e. on their own they have no notion of where each token sits in the sequence
- so we have to provide positional information about each input token
- so positional encodings or embeddings are added to the token embeddings in the transformer
- encodings are manually (human) designed, while embeddings are learned (trained); a small sketch of both follows after this list
- Embedding layers map tokens to word vectors (sequences of numbers) called word embeddings.
- Input and output embedding layers often share the same token-vector mapping (weight tying); see the sketch after this list.
- Embeddings contain semantic information about the word.
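A minimal sketch of the fixed sinusoidal positional encodings from the original Transformer paper, next to a learned positional embedding table; the sequence length and embedding size below are made up for illustration:

    import torch

    def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
        # fixed (not learned) encodings:
        # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
        # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
        positions = torch.arange(seq_len).unsqueeze(1)                  # (seq_len, 1)
        div_terms = 10000 ** (torch.arange(0, d_model, 2) / d_model)    # (d_model/2,)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(positions / div_terms)
        pe[:, 1::2] = torch.cos(positions / div_terms)
        return pe

    # added to the token embeddings so the model can tell token positions apart
    token_embeddings = torch.randn(10, 16)               # 10 tokens, embedding size 16
    inputs = token_embeddings + sinusoidal_positional_encoding(10, 16)

    # learned alternative: a positional embedding table indexed by position
    learned_pos = torch.nn.Embedding(512, 16)             # up to 512 positions
    inputs_learned = token_embeddings + learned_pos(torch.arange(10))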
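And a toy sketch of sharing the token-vector mapping between the input embedding layer and the output classification layer (weight tying); the vocabulary and model sizes here are made up:

    import torch
    import torch.nn as nn

    vocab_size, d_model = 1000, 64                              # made-up sizes
    input_embedding = nn.Embedding(vocab_size, d_model)         # token id -> vector
    output_projection = nn.Linear(d_model, vocab_size, bias=False)  # vector -> token logits

    # tie the weights: both layers now use the same token-vector mapping
    output_projection.weight = input_embedding.weight

    hidden = torch.randn(2, d_model)
    logits = output_projection(hidden)    # (2, vocab_size) scores over the vocabulary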
Try out the BERT WordPiece tokenizer and its embeddings using the Transformers package.
    # pip install transformers && pip install torch
    from transformers import DistilBertTokenizerFast, DistilBertModel

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    tokens = tokenizer.encode("This is an input.", return_tensors="pt")  # shape (1, seq_len)
    print("These are tokens!", tokens)
    for token in tokens[0]:
        print("This is a decoded token!", tokenizer.decode([token]))

    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    embeddings = model.embeddings.word_embeddings(tokens)  # shape (1, seq_len, 768)
    print(embeddings)
    for e in embeddings[0]:
        print("This is an embedding!", e)