Manipulate Item Attributes via Disentangled Representation

Using attribute-specific embedding subspaces for attribute manipulation retrieval, outfit completion, and conditional similarity retrieval.
  • Tasks:
    • Given a product’s image, find a different color variant of the same product within a dataset.
    • Generate an image of the product but with a flower pattern.
    • Complete a fashion outfit with an additional product.
  • What is a disentangled representation (embedding)?
    • Entangled representation = hard to preserve some attributes while changing others
    • Disentangled = each of the object’s attributes occupies its own separate dimensions
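The idea can be sketched with a toy numpy example. Everything here is invented for illustration: the subspace layout (dims 0-3 for color, 4-7 for shape) and the vectors are assumptions, not from any real model.

```python
import numpy as np

# Hypothetical disentangled embedding: dims 0-3 encode color, 4-7 encode shape.
COLOR = slice(0, 4)
SHAPE = slice(4, 8)

def swap_color(item_embedding, target_color_embedding):
    """Replace only the color subspace, leaving the shape subspace intact."""
    result = item_embedding.copy()
    result[COLOR] = target_color_embedding
    return result

red_shirt = np.array([1.0, 0, 0, 0, 0.5, 0.5, 0, 0])
blue = np.array([0, 0, 1.0, 0])
blue_shirt = swap_color(red_shirt, blue)
# Shape dims are untouched, so nearest-neighbor search in the shape
# subspace still retrieves the same product silhouette, now in blue.
```

With an entangled embedding no such slice exists, so changing color without disturbing shape is much harder.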

Unsupervised Disentangling Methods

  • The methods below are generative,
    • so instead of searching, they can manipulate (condition) the image directly
  • Variational Auto-encoders
    • speculation: some disentanglement emerges thanks to the architecture
      • compressing into a low-dimensional space kept close around zero (by the regularization term)
      • only high-level factors get through the compression
      • products with similar high-level factors are encoded close together in the embedding space
    • methods: mutual information between latents, total correlation, e.g. unsupervised Relevance Factor VAE
  • GANs (with an encoder and a decoder), e.g. DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images
  • Flow-based models, e.g. OpenAI’s Glow - Flow-Based Model Teardown
    • like a VAE, but the decoder is the inverse of the encoder
    • reversibly encodes images into independent Gaussian factors
    • the attribute vectors are then found using labeled data
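A minimal sketch of how such an attribute vector (like Glow’s smiling vector) is found from labeled data: it is the difference of latent-space means between the two label groups. The latents below are random stand-ins, not real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in latents from a flow model's encoder for two labeled image groups.
z_smiling = rng.normal(loc=0.5, size=(100, 16))   # images labeled "smiling"
z_neutral = rng.normal(loc=-0.5, size=(100, 16))  # images labeled "not smiling"

# Attribute vector = difference of the class means in latent space.
smile_vector = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

def add_smile(z, alpha=1.0):
    """Shift a latent along the smiling direction; decoding the shifted
    latent through the (invertible) model would add a smile."""
    return z + alpha * smile_vector
```

Because the flow is invertible, the manipulated latent decodes back to a full image, which is what the smiling-vector figure below illustrates.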

Glow model smiling vector

Unsupervised Disentangled Representations

  • Google ICML 2019 Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
  • A large-scale evaluation of various unsupervised methods (12k models)
  • On the Shapes3D dataset, try to separate all attributes of the scene
    • into 10 dimensions: object shape, object size, camera rotation, colors
  • No model reliably disentangled into the factors above
  • Theorem: there are infinitely many transformations of the true distribution,
    • so the true dimensions can never be identified without a guide
    • but they could be found with additional data?
  • Assumptions about the data have to be incorporated into the model (inductive bias)
  • Each unsupervised model has to be specialized

Shapes3D dataset for disentangling factors: floor color, wall color, object color, object size, camera angle

Multi-Task Learning

  • Multi-task learning may improve performance
  • Google NeurIPS 2021 paper on a method for grouping tasks
  • meta-learning
  • usually the tasks have to be related
  • inter-task affinity:
    • measures how a gradient step on one task affects the other task’s loss
    • correlates with overall model performance
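A toy sketch of inter-task affinity in that spirit: take a gradient step on task A’s loss and measure the relative change in task B’s loss. The quadratic losses and the exact formula here are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def inter_task_affinity(params, grad_i, loss_j, lr=0.1):
    """Affinity of task i onto task j: the relative drop in task j's loss
    after one gradient step on task i's loss. Positive affinity means the
    step on task i also helped task j."""
    updated = params - lr * grad_i(params)
    return 1.0 - loss_j(updated) / loss_j(params)

# Shared parameters and two toy quadratic task losses with nearby optima.
theta = np.array([2.0, -1.0])
grad_a = lambda p: 2 * (p - 1.0)              # gradient of task A's loss
loss_b = lambda p: np.sum((p - 0.5) ** 2)     # related task B

affinity = inter_task_affinity(theta, grad_a, loss_b)
# Related tasks yield positive affinity; conflicting tasks yield negative,
# suggesting they should be trained in separate groups.
```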

inter-task affinity for multi-task learning task grouping

Supervised-Disentangling: Attribute-driven Disentangled Representations

  • Amazon 2021 paper Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval
  • SoTA on the fashion tasks (Attribute manipulation retrieval, Conditional similarity retrieval, Outfit completion)
  • supervised disentangled representation learning
    • each attribute can take multiple values
    • split the embedding into sections corresponding to the attributes
    • multi-task learning allows disentangling
    • store prototype embeddings of each attribute value in a memory module
    • prototypes can then be swapped in for an item’s attribute vector
  • Read more about related research in image-text classification

disentangled representation using attribute-specific encoder


  • image representation from a backbone (AlexNet, ResNet18)
  • per attribute:
    • a fully-connected two-layer network
    • maps into an attribute-specific subspace
    • producing the image’s attribute embedding
  • together these form the disentangled representation
  • called Attribute-Driven Disentangled Encoder (ADDE)
  • memory block
    • stores prototype embeddings for all values of all attributes
    • e.g. each color has one prototype embedding
    • stored in a matrix whose off-block-diagonal elements are forced to be small
    • trained via triplet losses
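The per-attribute heads can be sketched in numpy as below. The attribute names, layer widths, and random weights are placeholders; in the paper the backbone feature would come from a trained AlexNet or ResNet18 and the heads would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURE_DIM, HIDDEN_DIM, SUBSPACE_DIM = 512, 128, 64   # assumed sizes
ATTRIBUTES = ["color", "pattern", "sleeve"]            # hypothetical attributes

# One two-layer fully-connected head per attribute (untrained stand-ins).
heads = {
    a: (rng.normal(size=(FEATURE_DIM, HIDDEN_DIM)) * 0.01,
        rng.normal(size=(HIDDEN_DIM, SUBSPACE_DIM)) * 0.01)
    for a in ATTRIBUTES
}

def adde_encode(backbone_feature):
    """Map one backbone image feature into per-attribute subspaces;
    together the parts form the disentangled representation."""
    parts = {}
    for a, (w1, w2) in heads.items():
        hidden = np.maximum(backbone_feature @ w1, 0.0)  # ReLU
        parts[a] = hidden @ w2
    return parts

feature = rng.normal(size=FEATURE_DIM)   # stand-in for a ResNet18 feature
embedding = adde_encode(feature)         # dict: attribute -> 64-dim vector
```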

Attribute-Driven Disentangled Encoder (ADDE)

Loss Function

  • Label triplet loss
    • pulls representations with the same labels toward the same vectors
  • Consistency triplet loss
    • keeps an image’s attribute representations close to the corresponding memory vectors
    • aligns the prototype embeddings with the representations
  • Compositional triplet loss
    • generates a change in attributes
    • creates a manipulation vector using the prototype vectors
    • samples positive and negative examples based on labels
  • Memory block loss
    • pushes off-block-diagonal elements to zero
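The shared building blocks of these losses can be sketched as follows. This is a generic triplet loss plus a prototype-swap manipulation; the slice layout and margin are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the anchor toward the positive
    example and push it away from the negative one."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def manipulate(embedding, attr_slice, target_prototype):
    """Compositional manipulation: overwrite one attribute's subspace
    with the target value's prototype from the memory block."""
    out = embedding.copy()
    out[attr_slice] = target_prototype
    return out

source = np.zeros(8)                                   # item embedding
target = manipulate(source, slice(0, 4), np.ones(4))   # swap in a prototype
# `target` then serves as the anchor in the compositional triplet loss,
# with positives/negatives sampled by their attribute labels.
```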

Experiments and Results


  • Shopping100k: 100k samples, 12 attributes
  • DeepFashion: 100k samples, 3 attributes: category, texture, shape

Attribute manipulation retrieval examples on Shopping100k and DeepFashion

Attribute Manipulation Retrieval

Attribute manipulation top-k retrieval on Shopping100k and DeepFashion

Outfit Completion

ADDE outfit complementary retrieval

Outfit Ranking Loss
  • operates on the entire outfit
  • calculates the average distance from all members of the outfit to the proposed addition
  • feeds these distances into a triplet loss
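The three steps above can be sketched directly (a minimal version with Euclidean distances and an assumed margin, not the paper’s exact implementation):

```python
import numpy as np

def outfit_distance(outfit_embeddings, candidate):
    """Average distance from every item already in the outfit to the candidate."""
    return float(np.mean([np.linalg.norm(e - candidate) for e in outfit_embeddings]))

def outfit_ranking_loss(outfit, positive_item, negative_item, margin=0.2):
    """Triplet loss over outfit-averaged distances: the true complementary
    item should sit closer to the outfit than a mismatched one."""
    d_pos = outfit_distance(outfit, positive_item)
    d_neg = outfit_distance(outfit, negative_item)
    return max(0.0, d_pos - d_neg + margin)

outfit = [np.zeros(3), np.zeros(3)]     # embeddings of items in the outfit
good, bad = np.zeros(3), 2 * np.ones(3)
loss = outfit_ranking_loss(outfit, good, bad)  # 0.0: good item already ranks higher
```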

Outfit Ranking Loss

Output Conditioning

In diffusion models, we manipulate the output image with a conditioning input, e.g. conditioning text. This relies on a certain degree of disentanglement in the representations for the model to be able to manipulate them. For example, a feature-wise linear modulation (FiLM) layer can be used for this purpose.
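A FiLM layer itself is tiny: a per-channel scale and shift predicted from the conditioning input. In the sketch below the scale and shift are hard-coded; in a real model a conditioning network (e.g. a text encoder) would predict them.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel of a
    feature map using parameters predicted from the conditioning input."""
    # features: (channels, height, width); gamma, beta: (channels,)
    return gamma[:, None, None] * features + beta[:, None, None]

feats = np.ones((2, 4, 4))          # toy 2-channel feature map
gamma = np.array([2.0, 0.5])        # would come from the conditioning network
beta = np.array([1.0, -1.0])
out = film(feats, gamma, beta)
# Channel 0 becomes 2*1 + 1 = 3, channel 1 becomes 0.5*1 - 1 = -0.5.
```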

Created on 25 Oct 2021.
Thank you
