Manipulate Item Attributes via Disentangled Representation

Using attribute-specific embedding subspaces for attribute manipulation retrieval, conditional similarity retrieval, and outfit completion.
  • Tasks:
    • Given a product’s image, find a different color variant of the same product within a dataset.
    • Generate an image of the product but with a flower pattern.
    • Complete this fashion outfit with an additional product.
  • What is disentangled representation (embedding)?
    • Entangled representation = hard to preserve some attributes while changing others
    • Disentangled = each of the object’s attributes has its own separate dimensions
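A toy sketch of the idea: if color and shape live in separate, known dimension blocks (the 8-dim layout below is hypothetical), changing one attribute is just overwriting its block:

```python
import numpy as np

# Hypothetical disentangled layout: dims 0-3 encode color, dims 4-7 encode shape.
COLOR, SHAPE = slice(0, 4), slice(4, 8)

red_shirt = np.array([1.0, 0.0, 0.0, 0.0, 0.2, 0.9, 0.1, 0.0])
blue_dress = np.array([0.0, 0.0, 1.0, 0.0, 0.8, 0.1, 0.0, 0.5])

# With disentangled dimensions, changing the color means overwriting
# only the color block; the shape block stays untouched.
blue_shirt = red_shirt.copy()
blue_shirt[COLOR] = blue_dress[COLOR]

assert np.allclose(blue_shirt[SHAPE], red_shirt[SHAPE])  # shape preserved
assert np.allclose(blue_shirt[COLOR], blue_dress[COLOR])  # color swapped
```

With an entangled embedding there is no such block to overwrite, which is why search and manipulation become hard.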

Unsupervised Disentangling Methods

  • The methods below are generative
    • so instead of only searching, they can manipulate the image itself
  • Variational Auto-encoders
    • speculation: some disentanglement arises thanks to the architecture
      • compressing into a low-dimensional space close around zero (the KL regularization term)
      • only high-level factors get through the compression
      • products with similar high-level factors are encoded close together in the embedding space
    • methods: penalizing mutual information between latents or total correlation, e.g. the unsupervised Relevance Factor VAE
  • GANs with an encoder and decoder, e.g. DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images
  • Flow-Based models e.g. OpenAI’s Glow - Flow-Based Model Teardown
    • like a VAE, but the decoder is the exact inverse of the encoder
    • reversibly encodes the image into independent Gaussian factors
    • the attribute vectors are then found using labeled data
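A minimal sketch of the VAE objective behind the speculation above, with a β-VAE-style weight on the KL regularization term that pulls codes close around zero (the β value, dimensions, and Gaussian-decoder reconstruction are illustrative assumptions, not from any specific paper above):

```python
import numpy as np

def vae_kl(mu, log_var):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder, summed over
    # latent dims -- the regularization term squeezing codes around zero.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    # beta > 1 strengthens the pull toward the prior, the usual knob
    # for encouraging disentanglement (beta-VAE).
    recon = np.sum((x - x_recon) ** 2)  # Gaussian decoder -> squared error
    return recon + beta * vae_kl(mu, log_var)
```

The KL term is zero exactly when the encoder outputs the standard-normal prior (`mu = 0`, `log_var = 0`), so only factors worth the reconstruction gain survive the compression.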

Glow model smiling vector
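The labeled-data recipe for such attribute vectors can be sketched as a difference of class means in latent space (the toy latents below stand in for images encoded by the flow; the dimension and statistics are made up):

```python
import numpy as np

def attribute_vector(z_with, z_without):
    # Difference of class means in latent space, as with Glow's smiling
    # vector: mean of smiling latents minus mean of non-smiling latents.
    return z_with.mean(axis=0) - z_without.mean(axis=0)

# Toy latents (real ones come from encoding labeled images with the flow).
rng = np.random.default_rng(0)
z_smiling = rng.normal(1.0, 0.1, size=(100, 4))
z_neutral = rng.normal(-1.0, 0.1, size=(100, 4))

v = attribute_vector(z_smiling, z_neutral)
z_edit = z_neutral[0] + v  # decoding z_edit would add the attribute
```

Because the flow is invertible, the edited latent decodes to an actual image, which is what the search-only methods cannot do.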

Unsupervised Disentangled Representations

  • Google ICML 2019 Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
  • A large-scale evaluation of various unsupervised methods (12k models)
  • On the Shapes3D dataset they try to separate all attributes of the scene
    • into 10 dimensions: object shape, object size, camera rotation, colors
  • No model reliably disentangled into the factors above
  • Theorem: there are infinitely many entangled transformations of the true distribution that look identical
    • so the true dimensions can never be found without a guide
    • but they could be found with additional data?
  • Assumptions about the data have to be incorporated into the model (inductive bias)
  • Each unsupervised model has to be specialized

Shapes3D dataset for disentangling factors: floor color, wall color, object color, object size, camera angle

Multi-Task Learning

  • Multi-task learning may improve performance
  • Google NeurIPS 2021 paper on a method for grouping tasks
  • meta-learning
  • usually the tasks have to be related
  • inter-task affinity:
    • measures how much a gradient step on one task affects another task’s loss
    • correlates with overall model performance
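A simplified sketch of inter-task affinity under the description above: take a lookahead gradient step on task i and measure the relative change in task j's loss (the quadratic toy tasks and learning rate are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def inter_task_affinity(w, grad_i, loss_j, lr=0.1):
    # Lookahead step on task i's gradient, then measure the relative
    # reduction in task j's loss: positive = the step helped task j.
    w_after = w - lr * grad_i(w)
    return 1.0 - loss_j(w_after) / loss_j(w)

# Toy tasks sharing weights w, with nearby optima (related tasks):
grad_i = lambda w: 2.0 * (w - 1.0)         # gradient of sum((w - 1)^2)
loss_j = lambda w: np.sum((w - 1.2) ** 2)  # a related task's loss

aff = inter_task_affinity(np.zeros(3), grad_i, loss_j)
# aff > 0: improving task i also improves task j, so they group well
```

Tasks with mutually positive affinity are then grouped into one multi-task network.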

inter-task affinity for multi-task learning task grouping

Supervised-Disentangling: Attribute-driven Disentangled Representations

  • Amazon 2021 paper Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval
  • SoTA on the fashion tasks (Attribute manipulation retrieval, Conditional similarity retrieval, Outfit completion)
  • supervised disentangled representation learning
    • each attribute has multiple possible values
    • split embedding into sections corresponding to attributes
    • multi-task learning allows disentangling
    • store prototype embeddings of each attribute value in memory module
    • a prototype can then be swapped in for an item’s attribute vector
  • Read more about related research in image-text classification

disentangled representation using attribute-specific encoder


  • image representation (AlexNet, Resnet18)
  • per attribute:
    • fully-connected two-layer network
    • map into an attribute-specific subspace
    • producing image’s attribute embedding
  • disentangled representation
  • called Attribute-Driven Disentangled Encoder (ADDE)
  • memory block
    • stores prototype embeddings for all values of the attributes
    • e.g. each color value has one prototype embedding
    • stored in a matrix whose off-block-diagonal elements are forced to be small
    • trained via triplet loss
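A minimal numpy sketch of the encoder structure described above — one two-layer fully-connected head per attribute on top of a backbone feature, with the per-attribute outputs concatenated into the disentangled representation. The dimensions, attribute names, and random initialization are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ADDESketch:
    """Sketch: one two-layer FC head per attribute maps the backbone
    feature (e.g. a ResNet-18 pooled output) into an attribute-specific
    subspace; concatenation gives the disentangled representation."""

    def __init__(self, feat_dim, attr_dims, rng):
        # attr_dims: e.g. {"color": 32, "category": 32} (illustrative)
        self.heads = {
            name: (rng.normal(0, 0.01, (feat_dim, feat_dim)),
                   rng.normal(0, 0.01, (feat_dim, dim)))
            for name, dim in attr_dims.items()
        }

    def encode(self, feat):
        # Each head: hidden layer with ReLU, then projection to subspace.
        parts = {name: relu(feat @ w1) @ w2
                 for name, (w1, w2) in self.heads.items()}
        return np.concatenate([parts[n] for n in sorted(parts)]), parts

rng = np.random.default_rng(0)
enc = ADDESketch(feat_dim=512, attr_dims={"color": 32, "category": 32}, rng=rng)
z, parts = enc.encode(rng.normal(size=512))
# z holds one 32-dim block per attribute -> 64 dims total
```

Multi-task training across the attribute labels is what pushes each head to keep only its own attribute's information.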

Attribute-Driven Disentangled Encoder (ADDE)

Loss Function

  • Label triplet loss
    • pulls representations with the same attribute labels toward the same vectors
  • Consistency triplet loss
    • attribute representations of an image close to corresponding memory vectors
    • align prototype embeddings with representations
  • Compositional triplet loss
    • generates a change in attributes
    • creates a manipulation vector using the prototype vectors
    • samples positive and negative examples based on their labels
  • Memory block loss
    • forces off-block-diagonal elements toward zero
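The compositional manipulation and the triplet comparison can be sketched as below, assuming a hypothetical layout where dims 0–3 hold the color subspace (the real model learns its subspaces and takes the target prototype from the memory block):

```python
import numpy as np

# Hypothetical block layout: dims 0-3 = color subspace.
COLOR = slice(0, 4)

def manipulated_query(z, target_color_prototype):
    # Compositional manipulation: swap the item's color block for the
    # memory prototype of the desired color; other blocks stay intact.
    q = z.copy()
    q[COLOR] = target_color_prototype
    return q

def triplet_loss(anchor, pos, neg, margin=0.2):
    # Standard triplet loss on squared Euclidean distances: the anchor
    # (manipulated query) should be closer to an item with the target
    # attributes (pos) than to a mismatched item (neg).
    d = lambda a, b: np.sum((a - b) ** 2)
    return max(0.0, d(anchor, pos) - d(anchor, neg) + margin)
```

The positive and negative items are sampled by their attribute labels, exactly as the bullet list above describes.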

Experiments and Results


  • Shopping100k: 100k samples, 12 attributes
  • DeepFashion: 100k samples, 3 attributes: category, texture, shape

Attribute manipulation retrieval examples on Shopping100k and DeepFashion

Attribute Manipulation Retrieval

Attribute manipulation top-k retrieval on Shopping100k and DeepFashion

Outfit Completion

ADDE outfit complementary retrieval

Outfit Ranking Loss
  • operates on entire outfit
  • calculates average distance from all members in the outfit to the proposed addition
  • feeds these distances into a triplet loss
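The three steps above can be sketched as follows (squared Euclidean distance and the margin value are assumptions):

```python
import numpy as np

def avg_outfit_distance(outfit, candidate):
    # Average distance from every item already in the outfit to the
    # proposed addition.
    return np.mean([np.sum((item - candidate) ** 2) for item in outfit])

def outfit_ranking_loss(outfit, pos_item, neg_item, margin=0.2):
    # Triplet loss on the averaged distances: the true complement should
    # sit closer to the whole outfit than a random item, by the margin.
    return max(0.0, avg_outfit_distance(outfit, pos_item)
                    - avg_outfit_distance(outfit, neg_item) + margin)
```

Averaging over all outfit members is what makes the loss operate on the entire outfit rather than on a single anchor item.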

Outfit Ranking Loss

Created on 25 Oct 2021.
Thank you

Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.