OpenAI's Image-Text Model CLIP

Encode images and text into a shared embedding space for multimodal tasks.

Since its initial release, the CLIP architecture has entered the hall of fame. One of the most performant versions now is likely the ViT-H/14 OpenCLIP variant trained on LAION-2B, which achieves 78.0% zero-shot top-1 accuracy on ImageNet and 73.4% zero-shot image retrieval Recall@5 on MS COCO. Both the vision and the text parts of the CLIP architecture are available in various implementations; for example, the vision encoder can be a ViT or a ResNet-50.


CLIP contrastive pretraining
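
The contrastive objective named in the caption above can be sketched in a few lines. This is a minimal illustration, not the training code: it uses plain Python lists to stay dependency-free, the function names are made up for this example, and the temperature is held fixed at 0.07 here, whereas in CLIP it is a learned parameter. Row i of the image batch is assumed to match row i of the text batch.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable: log-sum-exp of the logits minus the target logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over pairwise cosine similarities.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image -> text: each row should pick its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should pick its own row.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2

# Toy 2-d embeddings (hypothetical values, not real encoder outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = clip_contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

With these toy values the correctly paired batch gives a much lower loss than the batch with swapped captions, which is the signal the encoders are trained on.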

CLIP Architecture

  • text and image have separate transformer encoders
  • the visual encoder is a ViT (vision transformer)
  • the text encoder is a GPT-2 style transformer
  • the fixed-length text embedding is extracted from the [EOS] token position
  • text token embeddings and image patch embeddings are also available
  • the largest ViT variant was trained on 256 GPUs for about 2 weeks
CLIP architecture
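
The [EOS] pooling step from the list above can be sketched as follows. This is an illustration under assumptions: the special token ids are the ones I believe OpenAI's released tokenizer uses (start 49406, end 49407, padding 0), the word-token ids and hidden states are arbitrary toy values, and the final linear projection is an assumption not mentioned in the list (an identity matrix stands in for it here).

```python
import math

EOS_ID = 49407  # assumed end-of-text id; toy inputs below rely on it

def pool_eos(token_ids, hidden_states):
    """Return the per-token hidden state at the [EOS] position."""
    eos_pos = token_ids.index(EOS_ID)
    return hidden_states[eos_pos]

def project_and_normalize(vec, projection):
    # projection has shape (hidden_dim, embed_dim), stored as a list of rows.
    embed_dim = len(projection[0])
    out = [
        sum(vec[i] * projection[i][j] for i in range(len(vec)))
        for j in range(embed_dim)
    ]
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out]

# Toy sequence: [SOS], two arbitrary word tokens, [EOS], padding.
token_ids = [49406, 320, 1929, 49407, 0, 0]
# Hypothetical transformer outputs, one 2-d vector per token.
hidden_states = [
    [0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0],
]
projection = [[1.0, 0.0], [0.0, 1.0]]  # identity, stands in for learned weights

text_embedding = project_and_normalize(
    pool_eos(token_ids, hidden_states), projection
)
```

The per-token rows of `hidden_states` are the "text token embeddings" from the list; only the row at the [EOS] position becomes the fixed-length embedding.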

CLIP Applications

  • DALL-E 1 uses CLIP to rerank generated images
  • DALL-E 2
    • uses CLIP embeddings directly
    • and decodes images via diffusion, similar to GLIDE
  • zero-shot image classification:
    • for each class, create a text prompt and encode it into an embedding
    • compute the cosine similarity between the image embedding and each class text embedding, and pick the most similar class
  • image-text classification:
    • sum the two output class token embeddings and classify zero-shot as above
    • or feed the two output class token embeddings into a shallow MLP classification head
    • or feed the two output sequences into a transformer with a classification head
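
The zero-shot recipe and the sum-based fusion from the list above might look roughly like this. It is a sketch with made-up names and hypothetical 3-d vectors standing in for real encoder outputs; a real setup would obtain the embeddings from a CLIP image and text encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(query_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the query."""
    scores = {
        name: cosine(query_emb, emb) for name, emb in class_text_embs.items()
    }
    return max(scores, key=scores.get), scores

def fuse_by_sum(image_emb, text_emb):
    # Simplest image-text fusion: element-wise sum of the two embeddings.
    return [i + t for i, t in zip(image_emb, text_emb)]

# Hypothetical embeddings of one text prompt per class.
class_text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Zero-shot image classification.
image_emb = [0.2, 0.8, 0.1]
label, scores = zero_shot_classify(image_emb, class_text_embs)

# Image-text classification: fuse the image with its accompanying text first.
caption_emb = [0.1, 0.7, 0.2]
fused_label, _ = zero_shot_classify(
    fuse_by_sum(image_emb, caption_emb), class_text_embs
)
```

The MLP and transformer variants replace `fuse_by_sum` with a trained head, so they need labeled data, while the sum variant stays zero-shot.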

Created on 03 Jul 2023. Updated on: 03 Jul 2023.
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.