Map tokens to their representations, e.g. word (token) embeddings or image patch (token) embeddings.
Step by step, pool the sequence of embeddings into shorter sequences, until we get a single contextual representation of the full input as the output.
We can pool via averaging, summation, segmentation, or by just taking the output embedding at a single sequence position (class token), as sketched below.
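A minimal NumPy sketch of the tokenize, embed, pool pipeline above; the toy vocabulary, the 8-dimensional embedding table, and the random initialization are illustrative assumptions, not any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"[CLS]": 0, "the": 1, "cat": 2, "sat": 3}
embedding_table = rng.normal(size=(len(vocab), 8))    # one 8-dim vector per token id

tokens = ["[CLS]", "the", "cat", "sat"]                         # tokenized input
token_embeddings = embedding_table[[vocab[t] for t in tokens]]  # shape (4, 8)

# Pool the sequence of token embeddings into a single representation:
mean_pooled = token_embeddings.mean(axis=0)   # averaging
sum_pooled = token_embeddings.sum(axis=0)     # summation
cls_pooled = token_embeddings[0]              # single position output ("class token")

print(mean_pooled.shape, sum_pooled.shape, cls_pooled.shape)  # (8,) (8,) (8,)
```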
Simple Document Representations
Once paper archives were replaced with databases of textual documents, some tasks became cheaper: search by a list of words (query) ~1970s, finding document topics ~1980s.
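As a sketch of such simple document representations, the snippet below builds bag-of-words-style TF-IDF vectors and ranks documents against a word-list query; scikit-learn and the toy documents are assumptions for illustration, not something the notes prescribe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "apples and oranges are fruit",
    "databases replaced paper archives",
    "search engines rank documents for a query",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)           # one sparse vector per document
query_vector = vectorizer.transform(["fruit query search"]) # the query is just a list of words

# Rank documents by similarity to the query's vector.
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores.argsort()[::-1])  # document indices, best match first
```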
knowledge graph (KG), e.g. Wikidata: each node is one specific meaning, e.g. fruit (the flowering-plant part) vs fruit (the food)
a KG is a tradeoff between a structured database and raw training data samples
Wikipedia and the internet are something between a knowledge graph and a set of documents
random walks over a KG yield valid “sentences”, which can be used to train node embeddings, e.g. with Word2vec (see “link prediction”), as sketched below
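A minimal DeepWalk-style sketch of that idea: sample random walks over a toy graph and feed them to gensim's Word2Vec as sentences; the toy adjacency list and all hyperparameters are illustrative assumptions.

```python
import random
from gensim.models import Word2Vec

graph = {  # adjacency list of a tiny toy knowledge graph
    "fruit_(food)": ["apple", "banana"],
    "fruit_(flowering)": ["apple", "blossom"],
    "apple": ["fruit_(food)", "fruit_(flowering)"],
    "banana": ["fruit_(food)"],
    "blossom": ["fruit_(flowering)"],
}

def random_walk(start, length=10):
    # Walk the graph by repeatedly hopping to a random neighbor.
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

walks = [random_walk(node) for node in graph for _ in range(50)]  # walks act as "sentences"

model = Word2Vec(sentences=walks, vector_size=16, window=3, min_count=1, sg=1, epochs=5)
print(model.wv.most_similar("apple"))  # nodes that co-occur on walks end up nearby
```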
Contextual Word Vectors with Transformer
imagine there is a node for each specific meaning of each word in a hypothetical knowledge graph
given a word in a text of hundreds of words, the specific surrounding words locate our position within this knowledge graph and identify the word’s meaning
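A hedged sketch of this with a Transformer encoder via Hugging Face transformers: the same surface word gets different output vectors in different contexts; the model name, example sentences, and similarity check are assumptions chosen for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I deposited cash at the bank.", "We sat on the bank of the river."]
bank_id = tokenizer.convert_tokens_to_ids("bank")

vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]         # (seq_len, 768)
        position = inputs.input_ids[0].tolist().index(bank_id)
        vectors.append(hidden[position])                      # vector for "bank" in this context

similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(float(similarity))  # well below 1.0: the surrounding words changed the vector
```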
two popular model architectures incorporate context:
speculation: because of high numerical precision, the smoothness of neural network layers, and random weight initialization, most input information is preserved within the output embeddings
that would explain why neural networks can improve by training
for example, Word2vec and BERT are trained on word prediction tasks, but their embeddings are useful for other tasks, e.g. text classification
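A small sketch of that reuse, assuming GloVe vectors loaded through gensim's downloader and scikit-learn for the classifier: embeddings trained only from word co-occurrence serve as features for a toy sentiment classifier; the texts and labels are made up for illustration.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Vectors trained on word co-occurrence, never on sentiment labels.
word_vectors = api.load("glove-wiki-gigaword-50")

def embed(text):
    # Average the vectors of in-vocabulary words (zero vector if none are found).
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

texts = ["great wonderful movie", "terrible boring film",
         "lovely charming story", "awful dreadful plot"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = LogisticRegression().fit([embed(t) for t in texts], labels)
print(clf.predict([embed("wonderful charming film")]))  # expected: [1]
```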