Embeddings in Machine Learning Explained

An embedding is a task-specific, lower-dimensional vector representation of a piece of data such as a word, an image, a document, or a user.
  • We want to represent data as numbers so that we can compute our tasks.
  • Start with simple, high-dimensional feature vectors created from the input data, e.g., a vocabulary word index.
  • Then find lower-dimensional vectors optimized for our task, called embeddings.
  • Embeddings can be trained with both unsupervised and supervised tasks:
    • How similar are these two product images? (similarity, e.g., student-teacher)
    • How similar is this image to this abstract image class? (classification)
  • Before representing the full data, we often split it into meaningful parts called tokens (see the sketch in the Input Tokenization section below).

Input Tokenization
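A minimal sketch of tokenization: a toy whitespace tokenizer and a vocabulary index. Real systems usually use subword tokenizers (BPE, WordPiece); the corpus and function names below are only illustrative.

```python
# Minimal sketch: whitespace tokenization and a vocabulary index.
# Real systems typically use subword tokenizers (BPE, WordPiece) instead.

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (toy stand-in for a real tokenizer)."""
    return text.lower().split()

corpus = ["The fruit fell from the tree", "The fruit salad was fresh"]

# Build a vocabulary: each unique token gets an integer index.
vocab: dict[str, int] = {}
for sentence in corpus:
    for token in tokenize(sentence):
        vocab.setdefault(token, len(vocab))

# A sentence becomes a sequence of sparse, high-dimensional features,
# represented compactly here as indices into the vocabulary.
indices = [vocab[t] for t in tokenize(corpus[0])]
print(vocab)
print(indices)  # e.g. [0, 1, 2, 3, 0, 4]
```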

Embedding Tokens

  • Map tokens to their representations, e.g., word (token) embeddings or image patch (token) embeddings.
  • Step by step, pool the sequence of embeddings into shorter sequences, until we get a single, fully contextual representation of the data for the output.
  • We can pool via averaging, summation, segmentation, or by just taking the output embedding at a single sequence position (the class token), as in the sketch below.
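A minimal sketch of the embedding-lookup and pooling step, assuming NumPy; the randomly initialized embedding table stands in for trained weights, and the token ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 10_000, 300
# Embedding table: one dense 300-d vector per vocabulary token
# (randomly initialized here; in practice these weights are trained).
embedding_table = rng.normal(size=(vocab_size, dim))

token_ids = np.array([12, 845, 3, 3971])        # a tokenized input
token_embeddings = embedding_table[token_ids]   # shape (4, 300)

# Pooling options for a single full-input representation:
mean_pooled = token_embeddings.mean(axis=0)     # averaging, shape (300,)
sum_pooled = token_embeddings.sum(axis=0)       # summation, shape (300,)
class_token = token_embeddings[0]               # single position (a [CLS]-style token)
```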

Simple Document Representations

Latent semantic analysis (LSA) - CC BY-SA 4.0 Christoph Carl Kling
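A possible LSA-style sketch, assuming scikit-learn: a sparse term-document matrix is factorized with truncated SVD into dense, low-dimensional document vectors. The toy documents are made up.

```python
# LSA sketch: factorize a term-document matrix with truncated SVD,
# so each document gets a low-dimensional dense vector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the fruit of the tree is a flower product",
    "fruit salad is a healthy food",
    "the tree flowers in spring",
]

# Sparse, high-dimensional bag-of-words features ...
tfidf = TfidfVectorizer().fit_transform(docs)

# ... compressed into dense, low-dimensional document embeddings.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_embeddings = lsa.fit_transform(tfidf)   # shape (3, 2)
print(doc_embeddings)
```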

Non-Contextual Word Vectors

  • the document is split with a sentence-sized running window of 10 words
  • each of the 10k sparsely coded vocabulary words is mapped (embedded) to a vector in a 300-dimensional space
  • the embeddings are compressed: only 300 dimensions, far fewer than the 10k-dimensional vocabulary feature vectors
  • the embeddings are dense, as the vector norm is not allowed to grow too large
  • these word vectors are non-contextual (global), so we cannot disambiguate fruit (flowering) from fruit (food); see the sketch after this list
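To make the compression and non-contextuality concrete, here is a small sketch contrasting a 10k-dimensional one-hot feature with a 300-dimensional dense vector; the index of "fruit" and the vector values are invented.

```python
import numpy as np

vocab_size, dim = 10_000, 300

# Sparse, high-dimensional input feature: a one-hot vocabulary vector.
fruit_id = 4721                    # hypothetical index of "fruit" in the 10k vocabulary
one_hot = np.zeros(vocab_size)
one_hot[fruit_id] = 1.0

# Dense, compressed embedding: 300 mostly non-zero values instead of 10,000,
# with a bounded norm (here simply normalized to unit length).
rng = np.random.default_rng(0)
dense = rng.normal(size=dim)
dense /= np.linalg.norm(dense)

# The same dense vector is used for "fruit" in every sentence,
# so flowering fruit vs. edible fruit cannot be told apart.
print(one_hot.shape, dense.shape)  # (10000,) (300,)
```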

word2vec

Word2vec Method for Non-contextual Word Vectors

  • word2vec (Mikolov 2013): the embeddings of the 10 surrounding words are trained to sum up close to the middle word's vector (see the sketch after this list)
  • an even simpler method, GloVe (Pennington 2014), just counts co-occurrences within a 10-word window
  • other similar methods: FastText, StarSpace
  • words appearing in similar contexts have similar embedding vectors
  • word-sense disambiguation is not supported
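A sketch of training such non-contextual vectors with gensim's Word2Vec implementation; the toy corpus and hyperparameters are illustrative only (window=5 gives 10 surrounding words in total, matching the description above).

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "fruit", "fell", "from", "the", "tree"],
    ["she", "ate", "fresh", "fruit", "for", "breakfast"],
    ["the", "tree", "flowered", "in", "spring"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension
    window=5,         # context words on each side (10 surrounding words total)
    min_count=1,      # keep every token in this tiny corpus
    sg=0,             # CBOW: context embeddings predict the middle word
)

fruit_vector = model.wv["fruit"]   # a single, non-contextual 300-d vector
print(model.wv.most_similar("fruit", topn=3))
```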

word2vec operation

Knowledge Graph’s Nodes Are Disambiguated

  • a knowledge graph (KG), e.g., Wikidata: each node is specific, fruit (flowering) vs. fruit (food)
  • a KG is an imperfect tradeoff between a database and training data samples
  • Wikipedia and the internet are something between a knowledge graph and a set of documents
  • random walks over a KG are valid “sentences” that can be used to train node embeddings, e.g., with word2vec (see “link prediction” and the sketch after this list)
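A DeepWalk-style sketch of this idea, assuming gensim: random walks over a toy graph (the node names are invented here) are fed to word2vec as sentences.

```python
# Random walks over a toy knowledge graph become "sentences"
# that word2vec can turn into node embeddings.
import random
from gensim.models import Word2Vec

graph = {
    "fruit_(food)": ["apple", "salad"],
    "fruit_(flowering)": ["flower", "tree"],
    "apple": ["fruit_(food)", "tree"],
    "salad": ["fruit_(food)"],
    "flower": ["fruit_(flowering)"],
    "tree": ["fruit_(flowering)", "apple"],
}

def random_walk(start: str, length: int = 8) -> list[str]:
    """Walk randomly over graph edges, producing a node 'sentence'."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=64, window=3, min_count=1, sg=1)
print(model.wv.most_similar("fruit_(food)", topn=2))
```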

knowledge graph visualization from wikipedia

Contextual Word Vectors

  • imagine there is a node for each specific meaning of each word in a hypothetical knowledge graph
  • given a word in a text of hundreds of words, the specific surrounding words locate our position within that knowledge graph and identify the word’s meaning (see the sketch below the figure)
  • two popular model architectures incorporate context:

transformer from word2vec
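As one possible illustration of contextual word vectors with a transformer, here is a sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint; the model choice and example sentences are assumptions, not part of the original post.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_food = word_vector("she ate the sweet fruit for breakfast", "fruit")
v_plant = word_vector("the fruit of the tree contains the seeds", "fruit")

# Unlike word2vec, the two "fruit" vectors differ with their context.
print(torch.cosine_similarity(v_food, v_plant, dim=0))
```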

Image Embeddings

  • instead of word tokens, we embed image patches
  • convolutional networks embed overlapping patches and progressively pool them into a single image embedding
  • the Vision Transformer (ViT) uses the transformer architecture, and the output class-token embedding is used as the image embedding (see the sketch after this list)
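A minimal sketch of the ViT input pipeline, assuming PyTorch; the patch size and embedding dimension follow the common ViT-Base configuration, and the transformer encoder itself is omitted.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size, dim = 16, 768

# A stride-16 convolution embeds each non-overlapping 16x16 patch into 768 dims.
patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

class_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [class] token
tokens = torch.cat([class_token.expand(1, -1, -1), patches], dim=1)  # (1, 197, 768)

# After the transformer encoder layers (omitted here), the output at position 0
# (the class token) is taken as the image embedding.
print(tokens.shape)
```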

vision transformer (ViT) architecture

Reusing Embeddings

  • Embeddings are trained to represent the data in a way that makes the training task easy
  • Embeddings often perform better than the input feature vectors, at least on related tasks
  • some tasks are more related than others: multi-task learning
  • speculation: because of high numerical precision, the smoothness of neural network layers, and random weight initialization, most input information is preserved within the output embeddings
    • that would explain why neural networks can improve by training
  • for example, word2vec or BERT embeddings are trained on word-prediction tasks, but they are also useful for other tasks, e.g., text classification (see the sketch after this list)
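A small sketch of such reuse, assuming scikit-learn: frozen "pre-trained" word vectors (randomly generated stand-ins here, where in practice they would come from word2vec or BERT) are averaged into document features for a downstream classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for pre-trained word vectors (e.g. from word2vec).
pretrained = {w: rng.normal(size=300) for w in
              ["great", "movie", "boring", "plot", "loved", "it"]}

def embed(text: str) -> np.ndarray:
    """Average the (pre-trained) word vectors of a text: a simple document embedding."""
    vectors = [pretrained[w] for w in text.lower().split() if w in pretrained]
    return np.mean(vectors, axis=0)

texts = ["great movie loved it", "boring plot", "loved it great plot", "boring movie"]
labels = [1, 0, 1, 0]   # toy sentiment labels

features = np.stack([embed(t) for t in texts])
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict([embed("great plot")]))
```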

inter-task affinity for multi-task learning task grouping

Created on 11 Sep 2022. Updated on 11 Sep 2022.