Since its initial release, the CLIP architecture has entered the hall of fame. One of the most performant versions today is likely the ViT-H/14 OpenCLIP variant trained on LAION-2B, which achieves 78.0% zero-shot top-1 accuracy on ImageNet and 73.4% zero-shot image retrieval (Recall@5) on MS COCO. Both the vision and the text part of the CLIP architecture are available in various implementations; for example, the vision encoder can be a ViT or a ResNet-50.
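As a quick orientation, here is a minimal sketch of loading that OpenCLIP variant and scoring image-text similarity with the open_clip library; the model name `ViT-H-14` and the pretrained tag `laion2b_s32b_b79k` are assumptions that may differ across open_clip releases.

```python
# Hedged sketch: load the LAION-2B-trained ViT-H/14 OpenCLIP model and score
# image-text similarity. Model/pretrained tags are assumptions and may vary
# across open_clip versions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

image = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, H, W)
texts = tokenizer(["a photo of a cat", "a photo of a dog"])            # (2, 77) token ids

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)
    # L2-normalize so the dot product equals cosine similarity
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)  # probabilities over the captions

print(probs)
```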
OpenAI’s CLIP
- CLIP: Connecting Text and Images (Jan 2021): encodes images and text into a shared embedding space, so matching pairs get similar embeddings
- the training dataset was the proprietary WebImageText (WIT; not to be confused with the Wikipedia-based Image Text dataset, also abbreviated WIT): 400M diverse image-caption pairs collected from the internet
- open-source image-text datasets such as LAION-400M are now available, as are open-source CLIP models
- trained with contrastive learning, maximizing cosine similarity between corresponding image and text embeddings while minimizing it for non-matching pairs (see the loss sketch after this list)
- CLIP’s output image embeddings contain both style and semantics
- supports zero-shot classification, but fails on abstract or systematic tasks such as counting
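A minimal PyTorch sketch of that contrastive objective, following the symmetric cross-entropy formulation from the CLIP paper; tensor names and the temperature handling here are illustrative assumptions.

```python
# Minimal sketch of CLIP's contrastive objective: symmetric cross-entropy over
# the cosine-similarity matrix of a batch of matching image-text pairs.
# Tensor names and the temperature handling are assumptions for illustration.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (N, d) embeddings of N matching image-text pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) cosine similarities scaled by a temperature
    logits_per_image = logit_scale * image_emb @ text_emb.T
    logits_per_text = logits_per_image.T

    # the i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```

In the paper the temperature is a learned parameter, and the similarity matrix is computed only over the pairs inside each batch.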
CLIP Architecture
- text and image have separate transformer encoders
- visual encoder is ViT (vision transformer)
- text encoder is a GPT-2-style transformer
- the fixed-length text embedding is taken from the [EOS] token position (see the schematic after this list)
- per-token text embeddings and per-patch image embeddings are also available
- trained on 256 GPUs for 2 weeks
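To make the pooling explicit, here is a schematic sketch of the dual-encoder forward pass with [class]-token and [EOS]-position pooling; module names, dimensions, and the eos_token_id handling are assumptions for illustration, not CLIP's reference code.

```python
# Schematic sketch of a CLIP-style dual encoder (not the reference implementation):
# module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vision_encoder, text_encoder,
                 vision_width, text_width, embed_dim, eos_token_id):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning (B, num_patches + 1, vision_width)
        self.text_encoder = text_encoder      # e.g. a GPT-2-style transformer returning (B, seq_len, text_width)
        self.image_proj = nn.Linear(vision_width, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_width, embed_dim, bias=False)
        self.eos_token_id = eos_token_id

    def encode_image(self, pixels):
        tokens = self.vision_encoder(pixels)   # (B, num_patches + 1, vision_width)
        cls = tokens[:, 0]                     # [class] token summarizes the image
        return self.image_proj(cls)            # (B, embed_dim)

    def encode_text(self, token_ids):
        tokens = self.text_encoder(token_ids)  # (B, seq_len, text_width)
        # the fixed-length text embedding is the hidden state at the [EOS] position
        eos_pos = (token_ids == self.eos_token_id).int().argmax(dim=-1)
        eos = tokens[torch.arange(tokens.size(0)), eos_pos]
        return self.text_proj(eos)             # (B, embed_dim)
```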
CLIP Applications
- DALL-E 1 uses
  - a discrete variational autoencoder (dVAE) with next-token prediction,
  - and a CLIP model for re-ranking the generated samples
- DALL-E 2
  - uses the CLIP embedding directly,
  - and decodes images via diffusion, similar to GLIDE
- zero-shot image classification (see the first sketch after this list):
  - create a text prompt for each class and embed it
  - classify by cosine similarity between the image embedding and the class text embeddings
- image-text classification (see the fusion-head sketch after this list):
  - sum the two output [class] token embeddings and classify zero-shot by similarity, as above
  - or feed the two output [class] token embeddings into a shallow MLP classification head
  - or feed the two output sequences into a transformer with a classification head
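A sketch of the zero-shot classification recipe from the list above, assuming encoders with `encode_image`/`encode_text` methods and a tokenizer as in the earlier sketches; the prompt template is illustrative.

```python
# Sketch of zero-shot image classification: one prompt embedding per class,
# then cosine similarity against the image embedding. Prompt template and
# function names are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    # create for each class a text prompt -> embedding
    prompts = tokenizer([f"a photo of a {name}" for name in class_names])
    text_emb = F.normalize(model.encode_text(prompts), dim=-1)         # (C, d)
    image_emb = F.normalize(model.encode_image(image_tensor), dim=-1)  # (1, d)

    # cosine similarity between image and class text embeddings
    sims = image_emb @ text_emb.T                                      # (1, C)
    return class_names[sims.argmax(dim=-1).item()]
```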
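And a sketch of the shallow-MLP variant of image-text classification: the two pooled embeddings from frozen CLIP encoders are concatenated (summing them is the even simpler option) and fed to a small classification head; layer sizes and names are assumptions.

```python
# Sketch of a shallow-MLP fusion head for image-text classification:
# the two pooled CLIP embeddings are concatenated and classified.
# Layer sizes and names are assumptions, not CLIP's own code.
import torch
import torch.nn as nn

class ImageTextMLPHead(nn.Module):
    def __init__(self, embed_dim=1024, hidden_dim=512, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),  # concatenated image + text embeddings
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_emb, text_emb):
        # image_emb, text_emb: (B, embed_dim) pooled outputs of frozen CLIP encoders
        return self.mlp(torch.cat([image_emb, text_emb], dim=-1))  # (B, num_classes) logits
```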