Understand Large Language Models like ChatGPT

In 9 slides, from TF-IDF, word2vec, knowledge graphs, and transformers to LLMs and the basics of ChatGPT.

The presentation explains the development of large language models like ChatGPT, which generate text by predicting the continuation of the input text. The idea of a talking machine has been around since the 1700s, but it took the development of powerful computers and computer science to make it a reality. Simple document representations counted word occurrences to build sparse feature-vector matrices in methods like term frequency–inverse document frequency (TF-IDF) and latent semantic analysis (LSA). Non-contextual word vectors were then created with word2vec, which trains the embeddings of surrounding words to sum up close to the middle word's vector. Later, contextual word vectors arrived with the transformer architecture, which consumes the entire input sequence and is the state of the art as of 2022.

Dream of a Talking Machine

  • Idea of a talking machine since the 1700s, but computers and computer science were too weak
  • ChatGPT does almost what was predicted, but how?
  • How to instruct a large language model to perform tasks?
  • How to represent knowledge in computers?
  • How to generate the answers?

“by his contrivance, the most ignorant person, at a reasonable charge, and with a little bodily labour, might write books in philosophy, poetry, politics, laws, mathematics, and theology, without the least assistance from genius or study. ... to read the several lines softly, as they appeared upon the frame” (Gulliver's Travels by Jonathan Swift, 1726, making fun of Ramon Llull, b. 1232)

Text Prompt as an Interface

  • For example, HAL 9000 in 2001: A Space Odyssey
  • input textual instructions, e.g. explain a riddle
  • based on its knowledge, the computer generates the answer text

2001 A Space Odyssey HAL-9000 Interface

Simple Document Representations

Latent semantic analysis (LSA) - CC BY-SA 4.0 Christoph Carl Kling
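The TF-IDF weighting mentioned in the introduction can be sketched in a few lines of plain Python; the toy corpus below is made up for illustration:

```python
import math

# toy corpus: each document is a list of tokens (made-up examples)
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs are friendly pets".split(),
]

def tf(word, doc):
    """Term frequency: how often the word occurs in this document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: words in fewer documents score higher.
    Assumes the word occurs in at least one document."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

A word like “the” that appears in every document gets weight zero, while a rarer content word like “cat” keeps a positive weight; stacking these weights over the vocabulary gives the sparse feature vectors that LSA then factorizes.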

Non-Contextual Word Vectors

  • document split into a sentence-sized running window of 10 words
  • each of the 10k sparsely coded vocabulary words is mapped (embedded) to a vector in a 300-dimensional space
  • the embeddings are compressed: 300 dimensions is much less than the 10k dimensions of the sparse vocabulary feature vectors
  • the embeddings are dense, as the vector norm is not allowed to grow too large
  • these word vectors are non-contextual (global), so we cannot disambiguate fruit (flowering) from fruit (food)
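The embedding step itself is just a table lookup. A minimal sketch, with a made-up four-word vocabulary standing in for the 10k words from the slide:

```python
import random

random.seed(0)
DIM = 300  # embedding size from the slide; real vocabularies hold ~10k words
# toy vocabulary: word -> integer id (the index used by the sparse one-hot coding)
vocab = {w: i for i, w in enumerate(["fruit", "apple", "tree", "eat"])}
# dense embedding table: one 300-dimensional vector per vocabulary word
embeddings = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in vocab]

def embed(word):
    """Map a sparsely coded word to its dense vector (a table lookup)."""
    return embeddings[vocab[word]]
```

`embed("fruit")` returns the same vector wherever the word occurs, which is exactly why this representation cannot tell the two senses of “fruit” apart.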


Word2vec Method for Non-contextual Word Vectors

  • word2vec (Mikolov 2013): embeddings of the 10 surrounding words are trained to sum up close to the middle word's vector
  • an even simpler method, GloVe (Pennington 2014), just counts co-occurrences in a 10-word window
  • other similar methods: FastText, StarSpace
  • words appearing in similar context have similar embedding vectors
  • word disambiguation is not supported

word2vec operation
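The training idea can be sketched as a toy continuous-bag-of-words update in plain Python. This uses a simplified squared-error objective (real word2vec trains with a softmax or negative sampling over a 10-word window); the corpus and dimensions are made up:

```python
import random

random.seed(1)
DIM = 8  # tiny embedding size for illustration; word2vec used ~300
sentences = [["the", "cat", "eats", "fish"],
             ["the", "dog", "eats", "meat"]]
vocab = sorted({w for s in sentences for w in s})
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}

def train_step(context, middle, lr=0.1):
    """Nudge the context embeddings so their mean moves toward the middle word."""
    pred = [sum(emb[w][d] for w in context) / len(context) for d in range(DIM)]
    err = [pred[d] - emb[middle][d] for d in range(DIM)]
    for w in context:
        for d in range(DIM):
            emb[w][d] -= lr * err[d] / len(context)
    for d in range(DIM):  # the middle word also moves toward the context mean
        emb[middle][d] += lr * err[d]

for _ in range(200):  # a few passes over the toy corpus
    for s in sentences:
        for i, middle in enumerate(s):
            train_step([w for j, w in enumerate(s) if j != i], middle)
```

Words that appear in similar contexts (here “cat” and “dog”) tend to drift toward similar vectors, which is the property the slide describes.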

Knowledge Graph’s Nodes Are Disambiguated

  • in a knowledge graph (KG), e.g. Wikidata, each node is a specific meaning: fruit (flowering) vs fruit (food)
  • a KG is a tradeoff between a database and training data samples
  • Wikipedia and the internet are something between a knowledge graph and a set of documents
  • random walks over a KG are valid “sentences”, which can be used to train node embeddings, e.g. with word2vec (see “link prediction”)

knowledge graph visualization from wikipedia
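The random-walk idea can be sketched on a made-up miniature graph (the node names are hypothetical, loosely Wikidata-flavoured):

```python
import random

random.seed(0)
# toy knowledge graph as adjacency lists; disambiguated nodes are distinct keys
graph = {
    "fruit_(food)": ["apple", "banana"],
    "apple": ["fruit_(food)", "tree"],
    "banana": ["fruit_(food)"],
    "tree": ["apple", "fruit_(flowering)"],
    "fruit_(flowering)": ["tree"],
}

def random_walk(start, length=5):
    """A random walk over the KG yields a 'sentence' of node names."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

walks = [random_walk("apple") for _ in range(100)]
# feeding these walks to word2vec learns node embeddings (the DeepWalk idea)
```

Because every step follows an edge, co-occurrence in a walk reflects graph proximity, so word2vec-style training on the walks places connected nodes close together.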

Contextual Word Vectors with Transformer

  • imagine a node for each specific meaning of each word in a hypothetical knowledge graph
  • given a word in a text of hundreds of words, the specific surrounding words locate our position within this knowledge graph and identify the word's meaning
  • two popular model architectures incorporate context: recurrent networks (e.g. ELMo) and transformers (e.g. BERT)

transformer from word2vec
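The transformer's central operation, self-attention, can be sketched in plain Python without the learned parts (real transformers add query/key/value projection matrices, multiple heads, and feed-forward layers):

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Each output vector is a context-weighted mix of all input vectors."""
    out = []
    for q in vectors:
        # scaled dot-product scores of this word against every word in context
        weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in vectors])
        out.append([sum(w * v[d] for w, v in zip(weights, vectors))
                    for d in range(len(q))])
    return out
```

Because each output mixes in the surrounding vectors, the same input word comes out as a different vector in different contexts, which is exactly what disambiguating the two senses of “fruit” requires.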

Large Language Models

  • generate text by predicting the continuation of the input text
  • in 2022, transformers costing ~$10M are trained on large amounts of text from the internet
  • can solve a wide variety of problems, like explaining jokes, sometimes with human-level performance
  • examples: PaLM (2022), RETRO (2021), hybrids with algorithms
  • ChatGPT is additionally trained to chat using the RLHF alignment method

transformer next token prediction
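The generation loop can be illustrated with a toy stand-in for the model: a hypothetical bigram table plays the role of the trained transformer, which would instead score the entire context at every step:

```python
# toy 'language model': a bigram table standing in for a transformer
bigram = {
    "the": "model",
    "model": "predicts",
    "predicts": "the",  # loops back; a real LLM conditions on the whole context
}

def generate(prompt, n_tokens=4):
    """Generate by repeatedly predicting the next token and appending it."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        tokens.append(bigram[tokens[-1]])
    return " ".join(tokens)

print(generate("the"))  # -> "the model predicts the model"
```

The structure is the same as in an LLM: predict one token, append it to the input, and repeat; only the predictor differs.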

Future: Hybridizing Text with Algorithms

hybridizing neural networks with code

Instructing ChatGPT and Large Language Models

chain-of-thought prompting technique

Created on 18 Apr 2022. Updated on: 11 Jun 2023.
Thank you

Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.