- ideas existed at least since the 1700s, but there was not enough compute or computer science
- Current computers do almost what was predicted, but how?
- How to instruct computer to perform tasks?
- How to represent knowledge in computers?
- How to generate the answers?
"by his contrivance, the most ignorant person, at a reasonable charge, and with a little bodily labour, might write books in philosophy, poetry, politics, laws, mathematics, and theology, without the least assistance from genius or study. ... to read the several lines softly, as they appeared upon the frame" (Gulliver's Travels by Jonathan Swift, 1726, making fun of Ramon Llull, b. 1232)
Prompt as an Interface
- HAL 9000 in 2001: A Space Odyssey
- the user inputs textual instructions, e.g. asks to explain a riddle
- based on its knowledge, the computer generates the answer text
Simple Document Representations
- Once paper archives were replaced with databases of textual documents, some tasks became cheaper: search by a list of words (a query) ~1970s, finding document topics ~1980s
- simplest methods: counting word occurrences at the document level into sparse matrices used as feature vectors, e.g. term frequency–inverse document frequency (TF-IDF) and latent semantic analysis (LSA); see the sketch after this list
- this co-occurrence of words in documents was later used to embed words
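A minimal sketch of the document-level counting and query search mentioned above, using scikit-learn's TfidfVectorizer; the toy documents and query are illustrative assumptions, not part of the original methods.

```python
# Sketch: documents as sparse TF-IDF vectors, plus a simple word-list query search.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",            # toy documents, purely illustrative
    "the dog chased the cat",
    "stocks fell on the news today",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)        # sparse matrix: one row per document

query = vectorizer.transform(["cat on a mat"])     # query mapped into the same vocabulary
scores = (doc_matrix @ query.T).toarray().ravel()  # cosine similarity (rows are L2-normalized)
print(scores.argmax(), scores)                     # index of the best-matching document
```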
Non-Contextual Word Vectors
- the document is split into sentence-sized running windows of 10 words
- each of the 10k sparsely coded vocabulary words is mapped (embedded) to a vector in a 300-dimensional space (see the sketch after this list)
- the embeddings are compressed: 300 dimensions is much less than the 10k-dimensional vocabulary feature vectors
- the embeddings are dense, as the vector norm is not allowed to grow too large
- these word vectors are non-contextual (global), so we cannot disambiguate fruit (flowering) from fruit (food)
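A minimal numpy sketch of the mapping above: a sparse one-hot word over a 10k vocabulary becomes a dense 300-dimensional vector via an embedding matrix (here random, standing in for trained values).

```python
# Sketch: mapping a sparse one-hot vocabulary word to a dense 300-d vector.
import numpy as np

vocab_size, dim = 10_000, 300
rng = np.random.default_rng(0)

# In practice this matrix is learned; here it is random just to show the shapes.
embedding_matrix = rng.normal(scale=0.1, size=(vocab_size, dim))

word_id = 42                        # index of some word in the 10k vocabulary
one_hot = np.zeros(vocab_size)      # sparse coding: all zeros except one position
one_hot[word_id] = 1.0

dense_vector = one_hot @ embedding_matrix          # equivalent to a row lookup
assert np.allclose(dense_vector, embedding_matrix[word_id])
print(dense_vector.shape)                          # (300,)
```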
Word2vec Method for Non-contextual Word Vectors
- word2vec (Mikolov et al., 2013): the embeddings of the 10 surrounding words are trained to sum up close to the middle word's vector (see the sketch after this list)
- an even simpler method, GloVe (Pennington et al., 2014): just counting co-occurrences in a 10-word window
- other similar methods: FastText, StarSpace
- words appearing in similar contexts have similar embedding vectors
- word disambiguation is not supported
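A minimal sketch of training such non-contextual vectors with the gensim library's Word2Vec implementation; the toy corpus is an illustrative assumption, while the 300 dimensions and 10-word context follow the text.

```python
# Sketch: training word2vec embeddings with gensim (assumed installed).
from gensim.models import Word2Vec

sentences = [                        # toy tokenized corpus; real training uses millions of sentences
    ["the", "fruit", "is", "sweet", "to", "eat"],
    ["the", "tree", "bears", "fruit", "in", "summer"],
    ["we", "eat", "an", "apple", "every", "day"],
]

model = Word2Vec(
    sentences,
    vector_size=300,   # 300-dimensional embeddings, as in the text
    window=5,          # 5 words on each side, i.e. a 10-word context
    min_count=1,
    sg=0,              # CBOW: surrounding words predict the middle word
)

vector = model.wv["fruit"]                        # one global vector per word
print(model.wv.most_similar("fruit", topn=3))     # nearest words in embedding space
```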
Knowledge Graph’s Nodes Are Disambiguated
- knowledge graph (KG), e.g. Wikidata: each node is a specific meaning, e.g. fruit (flowering) vs. fruit (food)
- a KG is an imperfect tradeoff between a database and training data samples
- Wikipedia and the internet are something between a knowledge graph and a set of documents
- random walks over a KG are valid “sentences”, which can be used to train node embeddings, e.g. with word2vec (see “link prediction” and the sketch below)
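A minimal DeepWalk-style sketch of that idea: random walks over a small toy graph (not Wikidata) serve as "sentences" for word2vec; networkx and gensim are assumed to be available.

```python
# Sketch: node embeddings from random walks over a small graph (DeepWalk-style).
import random
import networkx as nx
from gensim.models import Word2Vec

graph = nx.karate_club_graph()               # small built-in example graph

def random_walk(g, start, length=10):
    """Walk from `start`, stepping to a uniformly random neighbor each time."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(node) for node in walk]      # word2vec expects string "tokens"

# Many walks per node play the role of a text corpus.
walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(20)]

model = Word2Vec(walks, vector_size=64, window=5, min_count=1)
print(model.wv.most_similar("0", topn=3))    # nodes embedded closest to node 0
```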
Contextual Word Vectors
- imagine there is a node for each specific meaning of each word in a hypothetical knowledge graph
- given a word in a text of hundreds of words, the specific surrounding words locate our position within the knowledge graph and identify the word’s meaning
- two popular model architectures incorporate context:
- recurrent neural networks (LSTM, GRU) are sequential models with memory units
- the transformer architecture consumes the entire input sequence at once and is state of the art as of 2022 (see the sketch after this list)
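A minimal sketch of contextual vectors using the Hugging Face transformers library with a pretrained BERT model (an assumption, not a model named in the text); the same surface word receives a different vector in each context.

```python
# Sketch: contextual word vectors from a pretrained transformer (BERT assumed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the hidden-state vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                        # vector of that token

v1 = vector_for("the tree is in fruit this spring", "fruit")    # flowering sense
v2 = vector_for("she ate a piece of fruit for lunch", "fruit")  # food sense
print(torch.cosine_similarity(v1, v2, dim=0))   # related but not identical vectors
```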
Big Transformer Models
- generate text by predicting the continuation of the input text (see the sketch after this list)
- transformers costing on the order of $10M, trained on large amounts of text from the internet (2022)
- can solve a wide variety of problems, such as explaining jokes, sometimes with human-level performance
- examples: PaLM (2022), RETRO (2021), GPT-3, …
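A minimal sketch of prompting a generative transformer to continue input text, using the small publicly available GPT-2 through the Hugging Face pipeline API (an assumption; the models listed above are far larger and mostly not downloadable).

```python
# Sketch: generating text by predicting the continuation of a prompt (GPT-2 as a small stand-in).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Explain the riddle: what has keys but cannot open locks? Answer:"
result = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])   # prompt plus the model's predicted continuation
```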