Multimodal Image-text Classification

Understand the top deep learning image and text classification models CMA-CLIP, CLIP, CoCa, and MMBT used in e-commerce.

OpenAI’s CLIP

CLIP contrastive pretraining

CLIP Architecture

  • text and image have separate transformer encoders
  • visual encoder is ViT (vision transformer)
  • text encoder is GPT-2 transformer
  • the fixed-length text embedding is extracted from the [EOS] token position
  • text token embeddings and image patch embeddings also available
  • trained on 256 GPUs for 2 weeks
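The contrastive pretraining objective can be sketched minimally as a symmetric cross-entropy (InfoNCE) over a batch of paired image and text embeddings; the arrays here stand in for the outputs of the real CLIP encoders:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits (InfoNCE sketch)."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    n = logits.shape[0]

    def cross_entropy(l):
        # matching image-text pairs lie on the diagonal, so label i is i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Matched pairs on the diagonal are pulled together while all other pairs in the batch act as negatives, which is why large batch sizes help CLIP training.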

CLIP architecture

CLIP Applications

  • DALL-E 1 uses CLIP to rerank its generated images
  • DALL-E 2
    • uses the CLIP embedding directly,
    • and decodes images via diffusion similar to GLIDE.
  • zero-shot image classification:
    • create a text prompt for each class (e.g. “a photo of a dog”) and embed it
    • classify by cosine similarity between the image embedding and each class text embedding
  • image-text classification
    • sum the two output class token embeddings and classify zero-shot by similarity
    • or feed the two output class token embeddings into a shallow MLP classification head
    • or feed the two output sequences into a transformer with a classification head
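The zero-shot recipe above can be sketched as follows; the toy 3-dimensional vectors stand in for real CLIP text and image embeddings:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is most cosine-similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True)
    sims = class_text_embs @ image_emb  # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# toy embeddings standing in for CLIP outputs of "a photo of a cat" / "... dog"
texts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
img = np.array([0.9, 0.1, 0.0])
zero_shot_classify(img, texts, ["cat", "dog"])  # → "cat"
```

No task-specific training is needed: adding a class only requires embedding one more text prompt.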

Amazon’s CMA-CLIP Model

CMA-CLIP model architecture

CMA-CLIP Architecture

  • split image into patches, and embed with CLIP into sequence
  • embed text with CLIP into sequence of text token embeddings
  • concatenate both embedding sequences into a single sequence and feed it into a transformer
  • the transformer outputs aggregate (global) image and text embedding
  • modality-wise attention per task: learned weighted sum of the two embeddings
    • asks: is this input relevant?
    • the weight is a softmax of a dot product to a learned vector
    • resists noise and missing data better, similar to EmbraceNet’s feature dropout
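The modality-wise attention step above can be sketched roughly as follows; the task query vector is a learned parameter in the real model, here just an assumed input:

```python
import numpy as np

def modality_wise_attention(image_emb, text_emb, task_query):
    """Fuse global image and text embeddings with a per-task weighting.

    Each modality's weight is a softmax of the dot product between its
    embedding and a learned task-specific query vector, so an irrelevant
    or noisy modality receives a small weight."""
    scores = np.array([image_emb @ task_query, text_emb @ task_query])
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights[0] * image_emb + weights[1] * text_emb
    return fused, weights
```

If the text embedding barely correlates with the task query, its weight shrinks toward zero, which is how the mechanism resists noisy or missing text input.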

CMA-CLIP model architecture

CMA-CLIP Datasets

  • Amazon’s proprietary MRWPA dataset contains labels for Color, Pattern, and Style
  • Fashion-Gen Dataset with 325k images, 78k texts, single-label, and 41 categories.
  • UPMC-Food101 Dataset with 90k images, 86k texts, 101 categories.

CMA-CLIP datasets

CMA-CLIP Results

  • Overall, CMA-CLIP performs slightly better than MMBT, and speculatively could outperform it more clearly on multitask learning
  • Parameter count comparison is missing

CMA-CLIP model results

CMA-CLIP vs KaleidoBERT vs ImageBERT on Fashion-Gen

  • CMA-CLIP outperforms KaleidoBERT, ImageBERT, and other models.
  • There is no benchmark available for MMBT or CLIP on Fashion-Gen

CMA-CLIP vs KaleidoBERT vs ImageBERT on Fashion-Gen

CMA-CLIP vs MMBT vs CLIP on Food101

  • CMA-CLIP outperforms MMBT and CLIP
  • MMBT significantly outperforms CLIP, likely due to its tuned transformer head
  • BERT does better than ViT on this dataset

CMA-CLIP vs MMBT vs CLIP vs BERT vs ViT on Food101

CMA-CLIP Results on MRWPA dataset

  • WIT below refers to OpenAI’s proprietary WebImageText dataset
  • Since CMA-CLIP has more parameters than CLIP, its better performance is expected
  • Multitask classification is usable for learning disentangled representations

CMA-CLIP vs CLIP Results on MRWPA dataset

CMA-CLIP Image-text Alignment on MRWPA dataset

Text-to-image attention map alignment suggests that CMA-CLIP can find cross-modality correlations.

CMA-CLIP text-image token attention map

CMA-CLIP Ablation Results

  • modality-wise attention helps the most on Style labels, then Pattern, then Color
  • likely because the text feature goes from irrelevant to relevant in this order

CMA-CLIP ablation results

Google’s CoCa Model

CoCa model pretraining

CoCa Results

  • achieved SoTA on ImageNet!

CoCa results

Facebook’s MMBT Model

MMBT model architecture

EmbraceNet Model

EmbraceNet model architecture

DeepMind’s Perceiver Model

Perceiver IO is a general-purpose multimodal architecture that can handle a wide variety of inputs as well as outputs. Perceiver can be applied, for example, to image-text classification. Perceiver IO uses cross-attention for merging:

  • multimodal input sequences (e.g. image, text, audio) into a low-dimensional latent sequence
  • an “output query” or “command” to decode the output value, e.g. predict this masked word
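A simplified single-head sketch of these two cross-attention steps, with learned projection matrices omitted and random arrays standing in for the learned latent array and real token embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention: one sequence attends over another."""
    attn = softmax(queries @ keys_values.T / np.sqrt(d_k), axis=-1)
    return attn @ keys_values

rng = np.random.default_rng(0)
d = 16
# encode: a short latent array attends over a long multimodal input sequence
inputs = rng.normal(size=(1000, d))  # e.g. concatenated image + text tokens
latents = rng.normal(size=(8, d))    # stands in for the learned latent array
encoded = cross_attention(latents, inputs, d)  # (8, 16): compressed latents

# decode: an output query reads the prediction out of the latent sequence
output_query = rng.normal(size=(1, d))  # e.g. "predict this masked word"
decoded = cross_attention(output_query, encoded, d)  # (1, 16)
```

Because the latent sequence is short and fixed-length, the expensive attention cost scales linearly with input length rather than quadratically.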

Perceiver IO architecture

An advantage of the Perceiver architecture is that in general it can work with very large inputs. The Hierarchical Perceiver architecture can process even longer input sequences by splitting them into subsequences and then merging them. Hierarchical Perceiver also learns the positional encodings in a separate training step with a reconstruction loss.

Created on 01 Sep 2022. Updated on: 07 Sep 2022.