DALL-E 1 uses a discrete variational autoencoder (dVAE), next token prediction, and CLIP model re-ranking, while DALL-E 2 uses CLIP embeddings directly and decodes images via diffusion, similar to GLIDE.
- CLIP: Connecting Text and Images (Jan 2021): encodes images and text into a shared embedding space, where matching pairs get similar embeddings
- dataset was the proprietary WebImageText (WIT, not to be confused with the Wikipedia-based Image Text dataset, also abbreviated WIT): 400M images from the internet, each with a text caption
- open-source image-text datasets such as LAION-400M are now available, as are open-source CLIP models
- trained with contrastive learning, maximizing cosine similarity of corresponding image and text pairs and minimizing it for mismatched pairs in the batch
- CLIP’s output image embeddings contain both style and semantics
- enables zero-shot classification, but struggles with abstract or systematic tasks like counting
- text and image have separate transformer encoders
- visual encoder is ViT (vision transformer)
- text encoder is a GPT-2 style transformer
- the fixed-length text embedding is extracted from the [EOS] token position
- per-token text embeddings and per-patch image embeddings are also available
- trained on 256 GPUs for 2 weeks
- DALL-E 1 uses a discrete variational autoencoder (dVAE), next token prediction, and a CLIP model for re-ranking
- DALL-E 2 uses the CLIP embedding directly and decodes images via diffusion, similar to GLIDE
- zero-shot image classification:
- for each class, create a text prompt and encode it to a text embedding
- pick the class whose text embedding has the highest cosine similarity with the image embedding
- image-text classification
- sum the two output class token embeddings and classify in the same zero-shot way
- or feed the two output class token embeddings into a shallow MLP classification head
- or feed the two output sequences into a transformer with a classification head
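The contrastive objective and the zero-shot classification steps above can be sketched in NumPy. This is a minimal illustration, not CLIP's actual training code: the function names are hypothetical, the embeddings are assumed to come from some image and text encoders, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the batch similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # (batch, batch) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return (loss_i2t + loss_t2i) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    sims = l2_normalize(class_text_embs) @ l2_normalize(image_emb)
    return int(np.argmax(sims))
```

In practice the class prompts would be short texts such as "a photo of a dog", encoded once and reused for every image.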
Variational Auto-encoder (VAE) Models
- model the image distribution via lower bound on maximum likelihood
- encode each image as a Gaussian distribution on the latent space
- random sampling from latents is not differentiable
- => re-parametrization trick \( z = \sigma \odot r + \mu \), where \( r \sim \mathcal{N}(0, I) \) is a random vector
- loss combines image reconstruction error (L2) with a KL term that pushes the latents toward a standard normal distribution
- sample, or interpolate from the latent normal distribution and generate images - may find disentangled representations
- vector quantization (VQ) is related to tokenization in that it outputs a finite number of items from a dictionary
- is used in Wav2vec, DALL-E 1, and VQ-VAE
- replaces the input vector with the closest vector from a finite dictionary of vectors called a codebook
- during training, the backward pass uses Gumbel-softmax over the codebook to propagate the gradient
- product quantization: concatenation of several quantizations followed by a linear transformation
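The re-parametrization trick and the two-term VAE loss from the list above can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names; it assumes the encoder outputs `mu` and `log_var` (the log of \( \sigma^2 \)), which is a common but not the only parametrization.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * r with r ~ N(0, I).

    The randomness lives in r, so z is a deterministic, differentiable
    function of mu and sigma.
    """
    sigma = np.exp(0.5 * log_var)
    r = rng.standard_normal(mu.shape)
    return mu + sigma * r

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def vae_loss(x, x_reconstructed, mu, log_var, kl_weight=1.0):
    """L2 reconstruction error plus the weighted KL term, averaged over the batch."""
    recon = np.sum((x - x_reconstructed) ** 2, axis=-1)
    return np.mean(recon + kl_weight * kl_to_standard_normal(mu, log_var))
```

The `kl_weight` parameter is an assumption added for illustration; setting it above 1 trades reconstruction quality for latents closer to the prior.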
Discrete Variational Auto-Encoder (dVAE)
- introduced in VQ-VAE 1 and VQ-VAE-2 (dVAE, up-scaling)
- image encoder maps to latent 32x32 grid of embeddings
- vector quantization maps to 8k code words (visual codebook)
- decoder maps from quantized grid to the image
- copy gradients from the decoder input z to the encoder output (straight-through estimator), because the nearest-code lookup is not differentiable
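The quantization step can be sketched as a nearest-neighbor lookup in the codebook. This is a minimal NumPy illustration with a hypothetical `quantize` function; NumPy has no autograd, so the gradient copy is only described in a comment.

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each encoder vector with its nearest codebook vector.

    z_e:      (n, d) encoder outputs, e.g. a flattened 32x32 grid (n = 1024)
    codebook: (k, d) code words, e.g. k = 8192
    Returns the code indices (n,) and the quantized vectors (n, d).
    """
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2, shape (n, k).
    d2 = (
        (z_e**2).sum(axis=1, keepdims=True)
        - 2.0 * z_e @ codebook.T
        + (codebook**2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)
    z_q = codebook[indices]
    # In an autograd framework the gradient copy is commonly written as
    #   z_q = z_e + stop_gradient(z_q - z_e)
    # so the forward pass uses z_q while gradients flow to z_e unchanged.
    return indices, z_q
```

The returned indices are the discrete image tokens; the quantized vectors are what the decoder consumes.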
OpenAI’s DALL-E 1
- OpenAI introduced the DALL-E 1 text-to-image generator in a paper with accompanying code.
- generates 256×256 images from text via dVAE inspired by VQ-VAE-2.
- autoregressively generates image tokens from textual tokens on a discrete latent space.
DALL-E 1 Training:
- train the image encoder and decoder to map an image into a 32x32 grid of tokens drawn from 8k possible code words (dVAE)
- concatenate encoded text tokens with image tokens into single array
- train to predict next image token from the preceding tokens (autoregressive transformer)
- discard the image encoder, keep only image decoder and next token predictor
DALL-E 1 Prediction:
- encode input text to text tokens
- iteratively predict next image token from the learned codebook
- decode the image tokens using dVAE decoder
- select the best image using CLIP model ranker
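The prediction steps above can be sketched as a sampling loop followed by re-ranking. This is a toy illustration only: `next_token_logits` is a hypothetical stand-in for the trained transformer, the scoring function is a placeholder for CLIP similarity, and the dVAE decoding step that would turn each token grid into pixels before scoring is omitted.

```python
import numpy as np

def sample_image_tokens(next_token_logits, text_tokens, n_image_tokens, rng):
    """Autoregressively sample image tokens conditioned on the text tokens.

    next_token_logits takes the sequence so far and returns logits
    over the visual codebook.
    """
    sequence = list(text_tokens)
    for _ in range(n_image_tokens):
        logits = next_token_logits(sequence)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sequence.append(int(rng.choice(len(probs), p=probs)))
    return sequence[len(text_tokens):]

def rerank(candidates, score):
    """Sort candidates best-first by a scoring function (CLIP similarity in DALL-E 1)."""
    return sorted(candidates, key=score, reverse=True)

# Toy usage with placeholder components.
rng = np.random.default_rng(0)
codebook_size = 8                                  # DALL-E 1 uses 8k code words
toy_model = lambda seq: np.zeros(codebook_size)    # uniform logits, placeholder only
candidates = [sample_image_tokens(toy_model, [3, 1, 4], 16, rng) for _ in range(4)]
best = rerank(candidates, score=lambda tokens: -sum(tokens))[0]  # placeholder score
```

With the real model, each candidate would hold 32x32 = 1024 image tokens.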
DALL-E 1 Discrete Variational Auto-Encoder (dVAE)
- instead of copying gradients, uses the Gumbel-softmax relaxation with an annealed temperature (categorical reparameterization)
- promote codebook utilization using higher KL-divergence weight
- decoder is a repeated stack of conv2d, decoder block (4x ReLU + conv), and upsampling (tiling into a bigger array)
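The Gumbel-softmax relaxation mentioned above can be sketched in NumPy. This is a minimal illustration with a hypothetical function name; the temperature schedule is left to the caller and is not taken from the paper.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over the codebook.

    As temperature -> 0 the sample approaches a one-hot vector;
    higher temperatures give smoother samples.
    """
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)   # stabilize the exponent
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the output is a smooth function of the logits, gradients can flow from the decoder back to the encoder without the gradient copy; annealing the temperature downward makes the samples progressively closer to discrete codes.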
DALL-E 1 Results
- competitive with prior models when evaluated zero-shot
- in zero-shot human evaluation against DF-GAN, DALL-E samples were preferred about 90% of the time
DALL-E 1 Examples
Diffusion Models
- Diffusion models reverse the addition of Gaussian noise to an image.
- An image arises from iterative denoising e.g. after 100 steps.
- Training task is to predict the added noise with mean-squared error loss.
- Similar to normalizing flow models like OpenAI’s Glow, which additionally are single-step and invertible.
- Diffusion model can be formulated as an ODE solution, where de-noising step represents time dimension step. The training image data form a manifold. Adding noise to the images expands the manifold volume. The expansion direction and step size of the expansion define the ODE. The ODE’s solution is the probability density function. We link gradient of the density function to the L2 loss of denoising function. The step size is scaled with a function dependent on the noise level.
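The noise-prediction training task described above can be sketched in NumPy. This is a minimal illustration with hypothetical function names; the linear noise schedule and its endpoints are an assumed, commonly used choice, and `predict_noise` stands in for the trained network.

```python
import numpy as np

def make_alpha_bar(n_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear noise schedule (an assumed choice)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: jump directly to step t by mixing the image with Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def denoising_loss(predict_noise, x0, t, alpha_bar, rng):
    """Mean-squared error between the true added noise and the model's prediction."""
    x_t, noise = add_noise(x0, t, alpha_bar, rng)
    return np.mean((predict_noise(x_t, t) - noise) ** 2)
```

During training, `t` would be drawn at random for each example so that the network learns to denoise at every noise level.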
OpenAI’s GLIDE
- GLIDE is a diffusion text-to-image (256 × 256) generator introduced in a paper by OpenAI.
- GLIDE outperforms DALL-E 1 in human preference evaluations.
- CLIP guided diffusion
- task: “predict the added noise given that the image has this caption”
- the diffusion model is trained to predict the noise; guidance towards the CLIP text embedding is applied at sampling time
- each denoising step is shifted along the gradient of the dot product between the CLIP embedding of the noisy image and the CLIP text embedding
- CLIP encoders are trained on noised images to stay in distribution
- text-conditional diffusion model
- GLIDE diffusion model is a U-Net with attention layers (ADM architecture)
- text is embedded via a separate transformer
- text token embeddings are projected and concatenated to the attention context in each attention layer of the diffusion model
OpenAI’s DALL-E 2
- OpenAI introduced DALL-E 2 in a paper
- the model name is unCLIP, while DALL-E 2 seems to be a marketing name
- generates 1024 x 1024 images from text using diffusion models.
- generates more diverse and higher resolution images than GLIDE.
DALL-E 2 Training
- generate a CLIP model text embedding for text caption
- “prior” network generates CLIP image embedding from text embedding
- diffusion decoder generates image from the image embedding
- can vary images while preserving the style and semantics captured in the embeddings
- authors found diffusion models more efficient and higher quality than autoregressive models
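The data flow above (caption, CLIP text embedding, prior, decoder) can be sketched as a composition of three components. This is a toy illustration only: all three callables and their signatures are hypothetical placeholders, not the actual unCLIP networks.

```python
import numpy as np

def unclip_generate(caption, clip_text_encoder, prior, decoder, rng, n_variations=1):
    """Caption -> CLIP text embedding -> CLIP image embedding -> image(s)."""
    text_emb = clip_text_encoder(caption)      # fixed-length CLIP text embedding
    image_emb = prior(text_emb, rng)           # prior maps it to a CLIP image embedding
    # Decoding the same image embedding several times yields variations
    # that share the semantics and style stored in the embedding.
    return [decoder(image_emb, caption, rng) for _ in range(n_variations)]

# Toy stand-ins so the sketch runs end to end (placeholders, not real models).
dim = 8
toy_text_encoder = lambda caption: np.full(dim, float(len(caption) % 7))
toy_prior = lambda text_emb, rng: text_emb + 0.1 * rng.standard_normal(dim)
toy_decoder = lambda image_emb, caption, rng: image_emb.mean() + rng.standard_normal((4, 4))

rng = np.random.default_rng(0)
images = unclip_generate("a teddy bear", toy_text_encoder, toy_prior, toy_decoder, rng, 3)
```

Keeping the prior and decoder as separate stages is what allows the decoder to be reused for image variations: an existing image can be encoded with CLIP and passed straight to the decoder.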
DALL-E 2 Image Generation
DALL-E 2 “Prior” Network
- prior generates a CLIP image embedding from the text
- authors tested autoregressive and diffusion priors, with similar results
- autoregressive prior uses quantization to discrete codes
- diffusion prior is more compute efficient
- diffusion prior is a Gaussian diffusion model conditioned on the caption text
DALL-E 2 Decoder
- diffusion decoder similar to GLIDE
- additionally conditioned on the CLIP image embedding
- the embedding is projected into 4 extra context tokens
- these are used in addition to the text tokens present in the original GLIDE
DALL-E 2 Evaluation Results
DALL-E 2 achieves competitive photo-realism while generating more diverse images than GLIDE
DALL-E 2 Examples
Sample (“A teddybear on a skateboard in Times Square.”):