- We want to represent data as numbers so that we can compute our tasks.
- Start with simple high-dimensional feature vectors created from the input data, e.g. a vocabulary word index.
- Then find lower-dimensional vectors optimized for our task, called embeddings.
- Can train with both unsupervised and supervised tasks:
  - How similar are these two product images? (similarity, e.g. student-teacher)
  - How similar is this image to this abstract image class? (classification)
- Before representing the full data, we often split it into meaningful parts called tokens.
- Tokenization is cutting the input data into meaningful parts that can be embedded into a vector space.
- An image is split into patches; text is split into tokens (frequent words and subwords), e.g. transformer tokenization (a tokenization sketch follows this list).
- Token position can be added to the token embeddings (positional encoding).
- Can add tokens for pooling purposes, e.g. the class token ([CLS]) used for text classification in the BERT transformer.
- Map tokens to their representations, e.g. word (token) embeddings or image patch (token) embeddings.
- Step by step, pool the sequence of embeddings into shorter sequences until we get a single contextual representation of the full data for the output.
- Can pool via averaging, summation, segmentation, or by just taking the output embedding at a single sequence position (the class token); a pooling sketch follows this list.
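A minimal tokenization sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the example sentence and the printed subword pieces are illustrative.

```python
# Subword tokenization sketch with Hugging Face `transformers`
# (assumes `pip install transformers`).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare words are cut into frequent subword pieces:
print(tokenizer.tokenize("Tokenization of embeddings"))
# e.g. ['token', '##ization', 'of', 'em', '##bed', '##ding', '##s']

# Full encoding also adds special tokens such as [CLS] and [SEP]:
ids = tokenizer("Tokenization of embeddings")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```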
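A toy pooling sketch in NumPy; the shapes are made up, and the class-token position 0 is illustrative of BERT-style models.

```python
import numpy as np

seq_len, dim = 12, 300
token_embeddings = np.random.randn(seq_len, dim)  # one row per token

mean_pooled = token_embeddings.mean(axis=0)  # average pooling
sum_pooled = token_embeddings.sum(axis=0)    # summation
cls_pooled = token_embeddings[0]             # class-token output
# (position 0 holds the [CLS] token in BERT-style models)
```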
Simple Document Representations
- Once paper archives were replaced with databases of textual documents, some tasks became cheaper: searching by a list of words (a query) ~1970s, finding document topics ~1980s.
- simplest methods: counting word occurrences at the document level into sparse matrices used as feature vectors, as in term frequency–inverse document frequency (TF-IDF) and latent semantic analysis (LSA); a sketch follows
- this co-occurrence of words in documents was later used to embed words
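A minimal sketch of both methods with scikit-learn; the three toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on inflation fears",
]

# Sparse document-level word-count matrix, reweighted by TF-IDF:
# distinctive words get high weight, frequent words like "the" low weight.
X = TfidfVectorizer().fit_transform(docs)  # shape (3, vocab_size)

# LSA is essentially a truncated SVD of this matrix: documents
# land in a low-dimensional "topic" space.
topics = TruncatedSVD(n_components=2).fit_transform(X)
print(X.shape, topics.shape)
```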
Non-Contextual Word Vectors
- the document is split by a sentence-sized running window of 10 words
- each of the 10k sparsely coded vocabulary words is mapped (embedded) to a vector in a 300-dimensional space (a lookup sketch follows this list)
- the embeddings are compressed, as 300 dimensions is much less than the 10k-dimensional vocabulary feature vectors
- the embeddings are dense, as the vector norm is not allowed to grow too large
- these word vectors are non-contextual (global), so we cannot disambiguate fruit (flowering) from fruit (food)
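A toy sketch of this lookup in NumPy, assuming the 10k vocabulary and 300 dimensions above; the matrix is randomly initialized here, trained in practice.

```python
import numpy as np

vocab_size, dim = 10_000, 300
embedding_matrix = np.random.randn(vocab_size, dim) * 0.01

word_index = 42                             # sparse code: a position in the vocabulary
word_vector = embedding_matrix[word_index]  # dense 300-dim embedding
print(word_vector.shape)                    # (300,)
```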
Word2vec Method for Non-contextual Word Vectors
- word2vec (Mikolov 2013): the embeddings of the 10 surrounding words are trained to sum up close to the middle word's vector (the CBOW variant); a training sketch follows this list
- an even simpler method, GloVe (Pennington 2014): just counting co-occurrences in a 10-word window
- other similar methods: FastText, StarSpace
- words appearing in similar contexts have similar embedding vectors
- word disambiguation is not supported
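A minimal training sketch, assuming the `gensim` library; the two toy sentences stand in for a real corpus. Note that "fruit" still gets one global vector regardless of its sense.

```python
# word2vec training sketch (assumes `pip install gensim`).
from gensim.models import Word2Vec

sentences = [
    ["the", "fruit", "fell", "from", "the", "tree"],
    ["she", "ate", "the", "fruit", "for", "breakfast"],
]
model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension
    window=5,         # 5 words on each side = 10 surrounding words
    min_count=1,      # keep every word in this tiny corpus
    sg=0,             # 0 = CBOW: context words predict the middle word
)
vec = model.wv["fruit"]                # one global vector per word
print(model.wv.most_similar("fruit"))  # nearest words in the space
```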
Knowledge Graph’s Nodes Are Disambiguated
- a knowledge graph (KG), e.g. Wikidata: each node is a specific meaning, fruit (flowering) vs fruit (food)
- a KG is an imperfect tradeoff between a database and training data samples
- Wikipedia and the internet are something between a knowledge graph and a set of documents
- random walks over a KG are valid “sentences”, which can be used to train node embeddings, e.g. with word2vec (see “link prediction”); a toy walk generator follows
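A toy walk generator over a made-up miniature KG (cf. DeepWalk/node2vec); the resulting walks can be fed to word2vec as sentences.

```python
import random

graph = {  # adjacency lists of a miniature, made-up KG
    "fruit_(food)": ["apple", "breakfast"],
    "fruit_(flowering)": ["apple_tree", "blossom"],
    "apple": ["fruit_(food)", "apple_tree"],
    "apple_tree": ["fruit_(flowering)", "apple"],
    "breakfast": ["fruit_(food)"],
    "blossom": ["fruit_(flowering)"],
}

def random_walk(start, length=8):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

walks = [random_walk(node) for node in graph for _ in range(10)]
# `walks` can now be fed to Word2Vec as sentences, giving each
# disambiguated node (fruit_(food) vs fruit_(flowering)) its own vector.
```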
Contextual Word Vectors
- imagine there is a node for each specific meaning of each word in a hypothetical knowledge graph
- given a word in a text of 100s of words, the specific surrounding words locate our position within the knowledge graph and identify the word’s meaning
- two popular model architectures incorporate context; illustrated here on images, where instead of tokens (words) we embed image patches:
- convolutional networks embed overlapping patches and progressively pool them into a single image embedding
- the Vision Transformer (ViT) uses the transformer architecture, and its output class-token embedding serves as the image embedding; a patch-cutting sketch follows
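A toy sketch of the patch-cutting step in NumPy, assuming ViT-Base-style 16×16 patches on a 224×224 image; the learned linear projection and the class token are omitted.

```python
import numpy as np

image = np.random.rand(224, 224, 3)  # H x W x RGB
patch = 16                           # ViT-Base uses 16x16 patches

# Cut the image into non-overlapping patches and flatten each one.
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)  # (196, 768): 196 patch tokens of 768 raw features
```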
- Embeddings are trained to represent the data in a way that makes the training task easy
- Embeddings often perform better than the input feature vectors, at least on related tasks
- some tasks are more related than others: multi-task learning
- speculation: because of high numerical precision, the smoothness of neural network layers, and random weight initialization, most input information is preserved within the output embeddings
- that would explain why neural networks can improve by training
- for example, word2vec and BERT embeddings are trained on word prediction tasks, yet the embeddings are useful for other tasks, e.g. text classification (a reuse sketch follows)
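A minimal reuse sketch with scikit-learn; the word vectors are random stand-ins for pretrained word2vec/BERT embeddings, and the tiny labeled set is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for pretrained word vectors (random here, trained in practice).
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=300) for w in ["great", "terrible", "fruit"]}

def doc_vector(tokens):
    # Average-pool the word vectors into one document feature vector.
    return np.mean([wv[t] for t in tokens if t in wv], axis=0)

docs = [["great", "fruit"], ["terrible", "fruit"]]
labels = [1, 0]  # e.g. positive vs negative sentiment

X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
# The embeddings were trained on word prediction, yet they carry
# enough information to serve as classification features.
```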