Embeddings in Machine Learning Explained

An embedding is a task-specific, lower-dimensional vector representation of data such as a word, an image, a document, or a user.
  • We want to represent data as numbers so that we can compute our tasks.
  • We start with simple high-dimensional feature vectors created from the input data, e.g. a vocabulary word index.
  • Then we find lower-dimensional vectors optimized for our task, called embeddings.
  • Embeddings can be trained with both unsupervised and supervised tasks:
    • How similar are these two product images? (similarity, e.g. student-teacher)
    • How similar is this image to this abstract image class? (classification)
  • Before representing the full data, we often split it into meaningful parts called tokens.
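
The step from a vocabulary word index to an embedding can be sketched as follows. This is a minimal illustration, not a trained model: the four-word vocabulary, the 2-dimensional embedding size, and the names `one_hot` and `embed` are assumptions made for the example, and the matrix is random rather than optimized for any task.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would have thousands of entries.
vocab = ["apple", "banana", "fruit", "tree"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """High-dimensional sparse feature vector: 1 at the word's vocabulary index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector


# Embedding matrix: one row per vocabulary word, here with 2 dimensions.
# Randomly initialized (assumption); training on a task would adjust the values.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 2))


def embed(word):
    """Multiplying a one-hot vector by the matrix selects that word's row."""
    return one_hot(word) @ embedding_matrix


fruit_vector = embed("fruit")
```

In practice the one-hot multiplication is replaced by a direct row lookup, which gives the same result without building the sparse vector.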

Input Tokenization

Embedding Tokens

  • Map tokens to their representations, e.g. word (token) embeddings or image patch (token) embeddings.
  • Step by step, pool the sequences of embeddings into shorter sequences until we get a single contextual representation of the full data for the output.
  • Pooling can be done via averaging, summation, segmentation, or by just taking the output embedding at a single sequence position (class token).
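
The pooling options above can be sketched with NumPy. The token embeddings here are made-up values, and `pool_pairs` is a hypothetical helper that shows one possible "step by step" scheme (averaging neighboring pairs); real models use learned layers for this.

```python
import numpy as np

# Hypothetical sequence of 4 token embeddings with 2 dimensions each.
# Position 0 plays the role of a class token in this sketch.
token_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [2.0, 4.0],
    [1.0, 2.0],
])

mean_pooled = token_embeddings.mean(axis=0)  # average over sequence positions
sum_pooled = token_embeddings.sum(axis=0)    # sum over sequence positions
class_token = token_embeddings[0]            # output at a single position


def pool_pairs(sequence):
    """One pooling step: average neighboring pairs, halving the sequence length.

    Assumes an even number of positions.
    """
    dim = sequence.shape[1]
    return sequence.reshape(-1, 2, dim).mean(axis=1)


shorter = pool_pairs(token_embeddings)  # 4 positions -> 2 positions
single = pool_pairs(shorter)[0]         # 2 positions -> 1 vector
```

With equal-sized pairs, repeated pairwise averaging ends at the same vector as averaging the whole sequence at once.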

Simple Document Representations

Latent semantic analysis (LSA) - CC BY-SA 4.0 Christoph Carl Kling

Non-Contextual Words Vectors

  • the document is split into sentence-sized running windows of 10 words
  • each of the 10k sparsely coded vocabulary words is mapped (embedded) to a vector in a 300-dimensional space
  • the embeddings are compressed: 300 dimensions is far fewer than the 10k dimensions of the vocabulary feature vectors
  • the embeddings are dense, as the vector norm is not allowed to grow too large
  • these word vectors are non-contextual (global), so we cannot disambiguate fruit (flowering) from fruit (food)
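
The running window can be sketched in plain Python. The example uses a window of 3 words instead of 10 to keep the output short, and the pairing of the middle word with its neighbors is an assumption in the style of skip-gram training; the function names are hypothetical.

```python
def running_windows(words, window_size):
    """Yield every run of `window_size` consecutive words."""
    for start in range(len(words) - window_size + 1):
        yield words[start:start + window_size]


def center_context_pairs(window):
    """Pair the middle word of a window with each of the other words in it."""
    center_index = len(window) // 2
    center = window[center_index]
    return [(center, word) for i, word in enumerate(window) if i != center_index]


document = "the fruit fell from the tree".split()
windows = list(running_windows(document, window_size=3))
pairs = [pair for window in windows for pair in center_context_pairs(window)]
```

Each pair is one training sample in which the center word is used to predict a context word. Because the word "fruit" always maps to the same vector regardless of its window, the resulting embeddings are non-contextual.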

Word2vec Method for Non-contextual Word Vectors

word2vec operation

Knowledge Graph’s Nodes Are Disambiguated

  • knowledge graph (KG), e.g. Wikidata: each node is a specific meaning, e.g. fruit (flowering) vs fruit (food)
  • a KG is a tradeoff between a database and training data samples
  • Wikipedia and the internet are something between a knowledge graph and a set of documents
  • random walks over a KG are valid “sentences”, which can be used to train node embeddings, e.g. with Word2vec (see “link prediction”)
knowledge graph visualization from wikipedia
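
Generating "sentences" by random walks over a knowledge graph can be sketched as below. The graph is a made-up toy adjacency list, not Wikidata content, and `random_walk` is a hypothetical helper; the walks would then be passed to a Word2vec-style trainer, which is not shown here.

```python
import random

# Hypothetical toy knowledge graph as an adjacency list (undirected edges).
graph = {
    "fruit (food)": ["apple", "banana"],
    "apple": ["fruit (food)", "apple tree"],
    "banana": ["fruit (food)"],
    "apple tree": ["apple", "fruit (flowering)"],
    "fruit (flowering)": ["apple tree"],
}


def random_walk(graph, start, length, rng):
    """Follow randomly chosen edges from `start` until the walk has `length` nodes."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk


rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [random_walk(graph, node, length=4, rng=rng) for node in graph]
```

Each walk is a sequence of disambiguated node names, so nodes that are close in the graph tend to appear in the same "sentence" and end up with similar embeddings.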

Contextual Word Vectors with Transformer

  • imagine there is a node for each specific meaning of each word in a hypothetical knowledge graph
  • given a word in a text of hundreds of words, the specific surrounding words locate our position within the knowledge graph and identify the word’s meaning
  • two popular model architectures incorporate context:
transformer from word2vec
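
How surrounding words can change a word's vector is illustrated by the sketch below. It is a heavily simplified self-attention step under stated assumptions: there are no learned query, key, or value projections, no positional encoding, and only one layer, and the 2-dimensional word vectors are made-up values. A real transformer has all of those parts.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(embeddings):
    """Each output row is a similarity-weighted average of all input rows."""
    dim = embeddings.shape[-1]
    scores = embeddings @ embeddings.T / np.sqrt(dim)
    weights = softmax(scores)
    return weights @ embeddings


# Hypothetical non-contextual word vectors (toy values).
word_vectors = {
    "fruit": np.array([1.0, 1.0]),
    "eat": np.array([2.0, 0.0]),
    "blossom": np.array([0.0, 2.0]),
}


def contextual(sentence, target):
    """Return the attention output at the position of `target` in `sentence`."""
    embeddings = np.stack([word_vectors[word] for word in sentence])
    return self_attention(embeddings)[sentence.index(target)]


fruit_as_food = contextual(["eat", "fruit"], "fruit")
fruit_as_flowering = contextual(["blossom", "fruit"], "fruit")
```

The input vector for "fruit" is identical in both sentences, but the output vectors differ because each one mixes in the neighboring word.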

Image Embeddings

  • instead of tokens (words), we embed image patches
  • convolutional networks embed overlapping patches and progressively pool them into a single image embedding
  • Vision Transformer (ViT) uses the transformer architecture, and the output class token embedding is used as the image embedding
vision transformer (ViT) architecture
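
The patch embedding step that feeds a ViT-style model can be sketched as follows. The 4x4 image, the patch size of 2, the embedding size of 8, and the random projection are all assumptions for illustration; in a trained model the projection and the class token are learned, and the transformer layers that follow are omitted here.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    height, width, channels = image.shape
    rows, cols = height // patch_size, width // patch_size
    cropped = image[:rows * patch_size, :cols * patch_size]
    patches = cropped.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group pixels by patch
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Hypothetical 4x4 RGB image filled with increasing values.
image = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
patches = image_to_patches(image, patch_size=2)  # 4 patches, 12 values each

# Linear projection of each flattened patch to an 8-dimensional embedding.
rng = np.random.default_rng(0)
projection = rng.normal(size=(patches.shape[1], 8))
patch_embeddings = patches @ projection

# Prepend a class token; its output after the transformer layers would
# serve as the image embedding.
class_token = np.zeros((1, 8))
sequence = np.concatenate([class_token, patch_embeddings])
```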

Reusing Embeddings

  • Embeddings are trained to represent data in a way that makes the training task easy
  • Embeddings often perform better than the input feature vectors, at least on related tasks
  • some tasks are more related than others: multi-task learning
  • speculation: because of high number precision, smoothness of the neural network layers, and random weight initialization, most input information is preserved within the output embeddings
    • that would explain why neural networks can improve by training
  • for example, Word2vec or BERT embeddings are trained on word prediction tasks, but they are also useful for e.g. text classification tasks
inter-task affinity for multi-task learning task grouping
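
Reusing embeddings for a different task can be sketched as below. The word vectors are made-up toy values standing in for frozen pretrained embeddings (they are not real Word2vec or BERT output), and the nearest-centroid classifier is just one simple choice for the downstream task.

```python
import numpy as np

# Hypothetical frozen "pretrained" word vectors (toy values).
word_vectors = {
    "apple": np.array([1.0, 0.1]),
    "banana": np.array([0.9, 0.2]),
    "goal": np.array([0.1, 1.0]),
    "match": np.array([0.2, 0.9]),
}


def embed_text(words):
    """Represent a text as the average of its frozen word vectors."""
    return np.mean([word_vectors[word] for word in words], axis=0)


# Tiny labeled set for the new task (text classification).
training_texts = [
    (["apple", "banana"], "food"),
    (["goal", "match"], "sport"),
]

# Nearest-centroid classifier: one centroid per label in embedding space.
labels = sorted({label for _, label in training_texts})
centroids = {
    label: np.mean(
        [embed_text(words) for words, text_label in training_texts
         if text_label == label],
        axis=0,
    )
    for label in labels
}


def classify(words):
    """Return the label whose centroid is closest to the text embedding."""
    text_vector = embed_text(words)
    return min(centroids,
               key=lambda label: np.linalg.norm(text_vector - centroids[label]))


prediction = classify(["banana"])
```

The embeddings are never updated here; only the centroids are computed from the new task's labels.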

Created on 11 Sep 2022. Updated on: 11 Sep 2022.

Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.