- input is an image and text pair (multiple modalities); output is a class or an embedding vector
- used in product classification into product taxonomies, e.g. the Google product taxonomy
- multi-modal models are increasingly important, e.g. CoCa achieved SoTA on ImageNet
OpenAI’s CLIP
- CLIP: Connecting Text and Images (Jan 2021): encodes images and texts into similar embeddings
- the dataset was the proprietary WebImageText (this WIT is not the Wikipedia-based Image Text Dataset, also abbreviated WIT): 400M varied images from the internet, each with a caption text
- open-source image-text datasets like LAION-400M are now available, as are open-source CLIP models
- trained with contrastive learning, maximizing the cosine similarity of corresponding image-text pairs (see the loss sketch after this list)
- CLIP's output image embeddings contain both style and semantics
- enables zero-shot classification, but fails on abstract or systematic tasks like counting
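As a minimal illustration of the two points above, the PyTorch sketch below (illustrative only: the function names and the temperature value are placeholders, not CLIP's actual code) computes the symmetric contrastive loss over a batch of paired embeddings and then classifies images zero-shot by cosine similarity:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0))             # i-th image matches i-th text
    loss_images = F.cross_entropy(logits, targets)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_images + loss_texts) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding is most cosine-similar to each image."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return (image_emb @ class_text_embs.t()).argmax(dim=-1)

# toy usage with random tensors standing in for CLIP encoder outputs
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(images, texts))
print(zero_shot_classify(images, torch.randn(5, 512)))  # 5 candidate classes
```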
CLIP Architecture
- text and image have separate transformer encoders
- the visual encoder is a ViT (Vision Transformer)
- the text encoder is a GPT-2-style transformer
- the fixed-length text embedding is extracted from the [EOS] token position
- per-token text embeddings and per-patch image embeddings are also available (see the sketch below)
- trained on 256 GPUs for 2 weeks
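The open-source checkpoints expose both the pooled and the sequence-level embeddings, e.g. through the Hugging Face transformers wrappers. This is a hedged sketch; the exact API may differ between library versions, and the image URL is just a public test image:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # fixed-length embeddings: image from the ViT, text pooled at the [EOS] position
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # the per-token / per-patch sequences are also accessible
    text_tokens = model.text_model(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"]).last_hidden_state
    image_patches = model.vision_model(pixel_values=inputs["pixel_values"]).last_hidden_state
```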
CLIP Applications
- DALL-E 1 uses
- a discrete variational autoencoder (dVAE) with next-token prediction,
- and a CLIP model for re-ranking the generated images
- DALL-E 2
- uses the CLIP embedding directly,
- and decodes images via diffusion, similar to GLIDE
- zero-shot image classification:
- create a text prompt for each class and embed it
- classify by cosine similarity between the image embedding and the class text embeddings
- image-text classification
- sum the two output class token embeddings and classify zero-shot as above,
- or feed the two output class token embeddings into a shallow MLP classification head (see the sketch below),
- or feed the two output sequences into a transformer with a classification head
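A minimal sketch of the shallow MLP option (the class name and dimensions below are made up for illustration): freeze CLIP, concatenate the two global embeddings, and train a small head on top.

```python
import torch
import torch.nn as nn

class ClipMlpHead(nn.Module):
    """Shallow MLP over concatenated (frozen) CLIP image and text embeddings."""
    def __init__(self, emb_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)  # simple late fusion
        return self.mlp(fused)

head = ClipMlpHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512))  # (4, num_classes)
```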
Amazon’s CMA-CLIP Model
- a model for the image-text classification task
- CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, published by Amazon in Dec 2021
- fuses the image and text modalities with task-wise attention for multi-task classification
- beats two-stream models (which use a global image embedding):
- CLIP (keeps the modalities separate; only a shallow head is trained) on Amazon's proprietary MRWPA dataset,
- and the MMBT model (see below) on Food101 by +1%
- no comparison with EmbraceNet (see below)
- strongly beats one-stream models (which select local, fine-grained image patches):
- KaleidoBERT (pretrains by aligning image tokens with text tokens, followed by a transformer) on the Fashion-Gen dataset
CMA-CLIP Architecture
- split the image into patches and embed them with CLIP into a sequence
- embed the text with CLIP into a sequence of text token embeddings
- concatenate both sequences into a single sequence and feed it into a transformer
- the transformer outputs aggregate (global) image and text embeddings
- modality-wise attention per task: a learned weighted sum of the two embeddings (see the sketch after this list)
- it asks: is this modality relevant to the task?
- each weight is a softmax of a dot product with a learned vector
- resists noise and missing data better, similar to EmbraceNet's feature dropout
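A minimal sketch of the modality-wise attention idea, reconstructed from the description above (not the authors' code; names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ModalityWiseAttention(nn.Module):
    """Weighted sum of the modality embeddings; the weights are a softmax over
    dot products with a learned query vector, so an irrelevant modality can be
    down-weighted per task."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(emb_dim))

    def forward(self, image_emb, text_emb):
        modalities = torch.stack([image_emb, text_emb], dim=1)  # (batch, 2, emb_dim)
        scores = modalities @ self.query                        # (batch, 2) relevance scores
        weights = scores.softmax(dim=1).unsqueeze(-1)           # (batch, 2, 1)
        return (weights * modalities).sum(dim=1)                # (batch, emb_dim) fused embedding

# one attention module per task, feeding that task's classification head
fuse_for_color_task = ModalityWiseAttention()
fused = fuse_for_color_task(torch.randn(4, 512), torch.randn(4, 512))
```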
CMA-CLIP Datasets
- Amazon’s proprietary MRWPA dataset contains labels for Color, Pattern, and Style
- Fashion-Gen Dataset with 325k images, 78k texts, single-label, and 41 categories.
- UPMC-Food101 Dataset with 90k images, 86k texts, 101 categories.
CMA-CLIP Results
- Overall, CMA-CLIP is only slightly better than MMBT, but speculatively it could outperform it more clearly on multi-task setups
- a parameter count comparison is missing
CMA-CLIP vs KaleidoBERT vs ImageBERT on Fashion-Gen
- CMA-CLIP outperforms KaleidoBERT, ImageBERT, and other models
- There is no benchmark available for MMBT or CLIP on Fashion-Gen
CMA-CLIP vs MMBT vs CLIP on Food101
- CMA-CLIP outperforms MMBT and CLIP
- MMBT significantly outperforms CLIP, likely due to its tuned transformer head
- BERT does better than ViT on this dataset
CMA-CLIP Results on MRWPA dataset
- WIT below refers to the proprietary WebImageText dataset
- Since CMA-CLIP has more parameters, the performance gain is expected
- Multi-task classification is usable for learning disentangled representations
CMA-CLIP Image-text Alignment on MRWPA dataset
Text-to-image attention map alignment suggests that CMA-CLIP can find cross-modality correlations.
CMA-CLIP Ablation Results
- modality-wise attention helps the most on Style labels, then Pattern, then Color
- likely because the text feature goes from irrelevant to relevant in this order
Google’s CoCa Model
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- State-of-the-art model on many image and image-text tasks
- Combines a contrastive loss (similar to CLIP) with a captioning loss (see the sketch below)
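A hedged sketch of the combined objective; the loss weights and temperature below are illustrative placeholders, not necessarily the paper's values:

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    contrastive_weight=1.0, caption_weight=2.0, temperature=0.07):
    """CLIP-style symmetric contrastive loss plus an autoregressive captioning
    (next-token cross-entropy) loss, combined as a weighted sum."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # caption_logits: (batch, seq_len, vocab_size); caption_targets: (batch, seq_len)
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return contrastive_weight * contrastive + caption_weight * captioning
```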
CoCa Results
- achieved SoTA on ImageNet!
Facebook’s MMBT Model
- Supervised Multimodal Bitransformers for Classifying Images and Text
- concatenates linear projections of the ResNet output with BERT token embeddings into a single sequence used as the transformer input (see the sketch below)
- MMBT has a similar architecture to CMA-CLIP, except it lacks the CLIP backbone and the modality-wise attention that is useful in multi-tasking
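A hedged sketch of the fusion step (class and parameter names are illustrative; segment and position embeddings are omitted for brevity):

```python
import torch
import torch.nn as nn

class MmbtStyleFusion(nn.Module):
    """Project pooled ResNet features into the BERT embedding space and prepend
    them to the text token embeddings, so a single transformer attends over
    both modalities."""
    def __init__(self, visual_dim=2048, hidden_dim=768, num_image_tokens=3):
        super().__init__()
        self.num_image_tokens = num_image_tokens
        self.visual_proj = nn.Linear(visual_dim, num_image_tokens * hidden_dim)

    def forward(self, resnet_features, text_token_embeddings):
        # resnet_features: (batch, visual_dim); text_token_embeddings: (batch, seq_len, hidden_dim)
        batch_size = resnet_features.size(0)
        image_tokens = self.visual_proj(resnet_features).view(
            batch_size, self.num_image_tokens, -1)
        return torch.cat([image_tokens, text_token_embeddings], dim=1)  # joint input sequence

fusion = MmbtStyleFusion()
joint_sequence = fusion(torch.randn(2, 2048), torch.randn(2, 16, 768))  # (2, 19, 768)
```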
EmbraceNet Model
- EmbraceNet: A robust deep learning architecture for multimodal classification (2019)
- feature fusion via a weighted summation with normalization and "feature dropout" (see the sketch below)
- has performance similar to concatenation, but performs better when some modalities are missing or noisy
- We used it on the GLAMI-1M dataset
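A minimal sketch of the fusion mechanism as described in the paper (docking layers plus a per-dimension multinomial selection of one modality); all names and sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EmbraceNetStyleFusion(nn.Module):
    """Project each modality to a common size, then for every feature dimension
    randomly pick exactly one modality to contribute it (a multinomial
    "feature dropout"); missing modalities get zero selection probability."""
    def __init__(self, dims=(512, 512), fused_dim=256):
        super().__init__()
        self.docking = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.fused_dim = fused_dim

    def forward(self, inputs, availability=None):
        # inputs: list of (batch, dim_m) tensors; availability: (batch, num_modalities) 0/1 mask
        docked = torch.stack([dock(x) for dock, x in zip(self.docking, inputs)], dim=1)
        batch_size, num_modalities, _ = docked.shape
        probs = torch.ones(batch_size, num_modalities) if availability is None else availability.float()
        probs = probs / probs.sum(dim=1, keepdim=True)
        # sample, per feature dimension, which modality survives
        choice = torch.multinomial(probs, self.fused_dim, replacement=True)  # (batch, fused_dim)
        mask = torch.zeros_like(docked).scatter_(1, choice.unsqueeze(1), 1.0)
        return (docked * mask).sum(dim=1)  # (batch, fused_dim)

fuse = EmbraceNetStyleFusion()
fused = fuse([torch.randn(4, 512), torch.randn(4, 512)])
```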
DeepMind’s Perceiver Model
Perceiver IO is a general-purpose multi-modal architecture that can handle a wide variety of inputs as well as outputs. Perceiver can be applied, for example, to image-text classification. Perceiver IO uses cross-attention for merging:
- multimodal input sequences (e.g. image, text, audio) into a low-dimensional latent sequence
- an "output query" or "command" to decode the output value, e.g. predict this masked word
An advantage of the Perceiver architecture is that, in general, you can work with very large inputs. The Hierarchical Perceiver architecture can process even longer input sequences by splitting them into subsequences and then merging them. Hierarchical Perceiver also learns the positional encodings in a separate training step with a reconstruction loss.
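A minimal sketch of the cross-attention read-in step (names and sizes are illustrative; the real Perceiver IO also uses latent self-attention blocks and an output-query decoder, omitted here):

```python
import torch
import torch.nn as nn

class CrossAttentionReadIn(nn.Module):
    """A small set of learned latent vectors cross-attends to an arbitrarily
    long multimodal input sequence, so compute scales with the latent length
    rather than with the input length."""
    def __init__(self, latent_len=64, dim=256, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(latent_len, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):
        # inputs: (batch, input_len, dim), e.g. concatenated image, text, and audio tokens
        latents = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        fused, _ = self.cross_attn(query=latents, key=inputs, value=inputs)
        return fused  # (batch, latent_len, dim) compressed latent sequence

read_in = CrossAttentionReadIn()
latent_seq = read_in(torch.randn(2, 4096, 256))  # long input, short latent output
```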
Multilingual Image-Text Classification
Our new 13-lingual dataset GLAMI-1M leaves a lot of room for research. The task requires a multilingual language encoder, while images are usually international by default. The language distribution requires additional consideration.
GLAMI-1M Colab Notebook
Try a hands-on exercise with the dataset in this Google Colab notebook.