DALL-E 1 uses a discrete variational autoencoder (dVAE), next-token prediction, and CLIP re-ranking, while DALL-E 2 uses CLIP embeddings directly and decodes images via diffusion, similar to GLIDE.
OpenAI’s CLIP
- paper: encodes images and text into a shared embedding space
- trained on 400M image-caption pairs collected from the internet
- trained with contrastive learning, maximizing cosine similarity of corresponding image-text pairs (sketched after this list)
- image representations contain both style and semantics
- zero-shot classification, but fails on abstract or systematic tasks like counting
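A minimal sketch of the symmetric contrastive objective described above, assuming pre-computed image and text features and a learned temperature (names and sizes are illustrative, not CLIP's actual code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities scaled by a learned temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching pair sits on the diagonal: i-th image <-> i-th caption.
    targets = torch.arange(image_features.size(0))

    # Symmetric cross-entropy over images and over texts.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Toy usage with random features for a batch of 8 pairs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt, logit_scale=torch.tensor(100.0)))
```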
CLIP Architecture
- text and image have separate transformer encoders
- visual encoder is ViT (vision transformer)
- text encoder is a GPT-2-style transformer
- the fixed-length text embedding is extracted from the [EOS] token position
- trained on 256 GPUs for 2 weeks
Variational Auto-encoder (VAE) Models
- model the image distribution via a lower bound on the maximum likelihood (ELBO)
- encode each image as a Gaussian distribution in the latent space
- random sampling from the latents is not differentiable
- => re-parametrization trick \( z = \sigma * r + \mu \), where \( r \sim \mathcal{N}(0, I) \) is a random vector
- loss combines image reconstruction (L2) and a KL term pushing the latents towards a normal distribution (sketched after this list)
- sample or interpolate in the latent normal distribution to generate images; may find disentangled representations
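A minimal sketch of the re-parametrization trick and the reconstruction + KL loss, with illustrative layer sizes (not from any specific paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # predicts mu and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        sigma = torch.exp(0.5 * logvar)
        r = torch.randn_like(sigma)              # r ~ N(0, I)
        z = mu + sigma * r                       # re-parametrization trick, differentiable
        x_rec = self.dec(z)

        rec_loss = F.mse_loss(x_rec, x)          # L2 reconstruction
        # KL divergence between N(mu, sigma^2) and the standard normal prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec_loss + kl

vae = TinyVAE()
loss = vae(torch.rand(32, 784))
loss.backward()
```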
Discrete Variational Auto-Encoder (dVAE)
- introduced in VQ-VAE (the discrete VAE) and VQ-VAE-2 (which adds up-scaling)
- the image encoder maps to a latent 32x32 grid of embeddings
- vector quantization maps each embedding to one of 8k code words (visual codebook)
- the decoder maps the quantized grid back to the image
- gradients are copied from the decoder input z to the encoder output (straight-through estimator; sketched after this list)
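A minimal sketch of vector quantization with the straight-through gradient copy; the codebook size and grid shape follow the notes above, everything else is illustrative:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e):
        # z_e: encoder output of shape (batch, height, width, dim).
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Nearest code word for every grid position.
        dist = torch.cdist(flat, self.codebook.weight)
        codes = dist.argmin(dim=-1)
        z_q = self.codebook(codes).reshape(z_e.shape)
        # Straight-through estimator: forward pass uses z_q, the backward pass
        # copies gradients from the decoder input to the encoder output.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.reshape(z_e.shape[:-1])

vq = VectorQuantizer()
z_q, codes = vq(torch.randn(1, 32, 32, 256))   # 32x32 grid of 8k-way tokens
```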
OpenAI’s DALL-E 1
- OpenAI introduced the DALL-E 1 text-to-image generator in a paper and code.
- generates 256×256 images from text via a dVAE inspired by VQ-VAE-2.
- autoregressively generates image tokens from text tokens in a discrete latent space.
DALL-E 1 Training:
- train an image encoder and decoder that map each image into a 32x32 grid of tokens from 8k possible code words (dVAE)
- concatenate encoded text tokens with image tokens into single array
- train to predict the next image token from the preceding tokens (autoregressive transformer; sketched after this list)
- discard the image encoder, keep only image decoder and next token predictor
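A minimal sketch of the autoregressive training step, assuming a toy decoder-only transformer over a joint text + image token vocabulary (sizes and model are placeholders, not the actual DALL-E code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB, DIM = 16384, 8192, 512        # illustrative sizes

embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)    # joint text + image vocabulary
layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)   # causal mask below makes it decoder-only
to_logits = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (2, 64))                    # encoded caption tokens
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 1024)) + TEXT_VOCAB   # 32x32 grid of dVAE tokens
tokens = torch.cat([text_tokens, image_tokens], dim=1)                 # single concatenated array

inputs, targets = tokens[:, :-1], tokens[:, 1:]
mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))  # causal attention mask
hidden = transformer(embed(inputs), mask=mask)
loss = F.cross_entropy(to_logits(hidden).transpose(1, 2), targets)     # next-token prediction
loss.backward()
```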
DALL-E 1 Prediction:
- encode input text to text tokens
- iteratively predict next image token from the learned codebook
- decode the image tokens using dVAE decoder
- select the best image using CLIP as a ranker (sketched after this list)
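A minimal sketch of the CLIP re-ranking step; `clip_image_embed` and `clip_text_embed` are hypothetical stand-ins for the real CLIP encoders:

```python
import torch
import torch.nn.functional as F

def rerank_with_clip(candidate_images, caption, clip_image_embed, clip_text_embed):
    """Score each decoded candidate against the caption and return the best one."""
    text_emb = clip_text_embed(caption)                       # assumed L2-normalized, shape (dim,)
    image_embs = torch.stack([clip_image_embed(img) for img in candidate_images])
    scores = image_embs @ text_emb                            # cosine similarity per candidate
    return candidate_images[scores.argmax().item()]

# Toy usage with stand-in embedders (random projections instead of real CLIP).
proj = torch.randn(3 * 64 * 64, 512)
clip_image_embed = lambda img: F.normalize(img.flatten() @ proj, dim=0)
clip_text_embed = lambda txt: F.normalize(torch.randn(512), dim=0)
candidates = [torch.rand(3, 64, 64) for _ in range(8)]
best = rerank_with_clip(candidates, "a cat wearing a hat", clip_image_embed, clip_text_embed)
```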
DALL-E 1 Discrete Variational Auto-Encoder (dVAE)
- instead of copying gradients, uses temperature annealing (categorical reparameterization with Gumbel-Softmax; sketched after this list)
- promote codebook utilization using higher KL-divergence weight
- decoder is: conv2d, decoder block (4x ReLU + conv), upsample (tile into a bigger array), repeat
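A minimal sketch of the Gumbel-Softmax relaxation used instead of the gradient copy: the encoder predicts 8192-way logits per grid cell, a relaxed sample mixes the codebook, and the temperature is annealed towards a hard choice (schedule and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CODES, DIM = 8192, 256
codebook = nn.Embedding(NUM_CODES, DIM)

def quantize_gumbel(logits, temperature):
    # logits: (batch, 32, 32, NUM_CODES) predicted by the encoder.
    # Relaxed one-hot sample; differentiable with respect to the logits.
    soft_one_hot = F.gumbel_softmax(logits, tau=temperature, dim=-1)
    # Soft mixture of code words; approaches a single code word as tau -> 0.
    return soft_one_hot @ codebook.weight

logits = torch.randn(1, 32, 32, NUM_CODES)
for tau in torch.linspace(1.0, 0.0625, steps=5):   # illustrative annealing schedule
    z_q = quantize_gumbel(logits, float(tau))
```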
DALL-E 1 Results
- competitive in a zero-shot fashion, preferred 90% of the time by humans
- human evaluation of which image is preferred: DALL-E (zero-shot) vs DF-GAN
DALL-E 1 Examples
Diffusion Models
- diffusion models reverse the gradual addition of Gaussian noise to an image
- an image arises from iterative denoising, e.g. after 100 steps
- the training task is to predict the added noise with a mean-squared error loss (sketched after this list)
- similar to normalizing flow models like OpenAI’s Glow, which are additionally single-step and invertible
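A minimal sketch of the diffusion training objective: noise an image at a random timestep and train the model to predict that noise with MSE (the linear noise schedule is an assumption, and the stand-in model omits timestep conditioning for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # illustrative linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Stand-in noise-prediction network (real models use a U-Net or transformer conditioned on t).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))

def diffusion_loss(x0):
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # forward process: noised image
    pred = model(x_t)                                  # predict the added noise
    return F.mse_loss(pred, noise)

loss = diffusion_loss(torch.rand(4, 3, 32, 32))
loss.backward()
```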
OpenAI’s GLIDE
- diffusion text-to-image (256×256) generator introduced in a paper.
- GLIDE outperforms DALL-E 1 on human preference.
- CLIP-guided diffusion
- task: “predict the added noise given that the image has this caption”
- the base training task is still noise prediction; CLIP guidance steers sampling towards the caption’s CLIP embedding
- at sampling time, the predicted mean is shifted by the gradient of the dot product of the CLIP image and text embeddings with respect to the noised image (sketched after this list)
- CLIP encoders are trained on noised images to stay in distribution
- text-conditional diffusion model
- the GLIDE diffusion model uses the ADM architecture (a U-Net with attention layers)
- text is embedded via another transformer
- the text token embeddings are appended to the attention context in each layer of the diffusion model
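A minimal sketch of the CLIP guidance described above, applied at sampling time: the predicted reverse-process mean is shifted by the gradient of the CLIP image-text dot product with respect to the noised image; `image_encoder`, `text_emb`, and the guidance scale are illustrative stand-ins for the noised CLIP encoders:

```python
import torch

def clip_guided_mean(mean, x_t, image_encoder, text_emb, guidance_scale=3.0):
    """Shift the reverse-process mean towards images that CLIP scores higher for the caption."""
    x = x_t.detach().requires_grad_(True)
    similarity = (image_encoder(x) * text_emb).sum()          # dot product with the caption embedding
    grad = torch.autograd.grad(similarity, x)[0]              # d(similarity) / d(noised image)
    return mean + guidance_scale * grad

# Toy usage with a stand-in encoder (a random linear projection instead of noised CLIP).
proj = torch.randn(3 * 64 * 64, 512)
image_encoder = lambda x: x.flatten(1) @ proj
text_emb = torch.randn(512)
mean = torch.zeros(1, 3, 64, 64)
x_t = torch.randn(1, 3, 64, 64)
guided = clip_guided_mean(mean, x_t, image_encoder, text_emb)
```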
OpenAI’s DALL-E 2
- OpenAI introduced DALL-E 2 in a paper
- the model name is unCLIP, while DALL-E 2 seems to be a marketing name
- generates 1024×1024 images from text using diffusion models.
- generates more diverse and higher resolution images than GLIDE.
DALL-E 2 Training
- generate a CLIP model text embedding for text caption
- “prior” network generates CLIP image embedding from text embedding
- a diffusion decoder generates the image from the image embedding (pipeline sketched after this list)
- Can vary images while preserving style and semantics in the embeddings
- the authors found diffusion models more compute-efficient and higher quality than autoregressive models
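A minimal sketch of the two-stage pipeline, with the CLIP text encoder, prior, and decoder as hypothetical callables (the real prior and decoder are diffusion models):

```python
import torch

def dalle2_generate(caption, clip_text_embed, prior, decoder):
    """Text -> CLIP text embedding -> predicted CLIP image embedding -> image."""
    text_emb = clip_text_embed(caption)        # frozen CLIP text encoder
    image_emb = prior(text_emb)                # "prior" network predicts a CLIP image embedding
    return decoder(image_emb, caption)         # diffusion decoder renders an image from the embedding

# Toy usage with random stand-ins for each stage.
clip_text_embed = lambda c: torch.randn(512)
prior = lambda t: torch.randn(512)
decoder = lambda i, c: torch.rand(3, 1024, 1024)
image = dalle2_generate("A teddybear on a skateboard in Times Square.",
                        clip_text_embed, prior, decoder)
```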
DALL-E 2 Image Generation
DALL-E 2 “Prior” Network
- the prior network generates a CLIP image embedding from the text
- tested autoregressive and diffusion prior generation with similar results
- autoregressive prior uses quantization to discrete codes
- diffusion prior is more compute efficient
- Gaussian diffusion model conditioned on the caption text
DALL-E 2 Decoder
- diffusion decoder similar to GLIDE
- additionally conditions on the CLIP image embedding
- projected as 4 extra tokens (sketched after this list)
- in addition to the text tokens present in the original GLIDE
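A minimal sketch of this extra conditioning path: the CLIP image embedding is projected into 4 extra tokens and appended to the text-encoder tokens that the diffusion decoder attends to (dimensions are illustrative):

```python
import torch
import torch.nn as nn

CLIP_DIM, MODEL_DIM, N_EXTRA = 512, 768, 4
to_extra_tokens = nn.Linear(CLIP_DIM, N_EXTRA * MODEL_DIM)

def conditioning_sequence(clip_image_emb, text_tokens):
    # clip_image_emb: (batch, CLIP_DIM); text_tokens: (batch, seq, MODEL_DIM) from the text transformer.
    extra = to_extra_tokens(clip_image_emb).view(-1, N_EXTRA, MODEL_DIM)   # 4 extra tokens
    return torch.cat([text_tokens, extra], dim=1)                          # appended to the text context

seq = conditioning_sequence(torch.randn(2, CLIP_DIM), torch.randn(2, 77, MODEL_DIM))
```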
DALL-E 2 Evaluation Results
DALL-E 2 achieves competitive photo-realism while generating more diverse images than GLIDE
DALL-E 2 Examples
Comparison:
Sample (“A teddybear on a skateboard in Times Square.”):