- Tasks:
- Given a product’s image, find the product’s different color variant within a dataset.
- Generate an image of the product but with a flower pattern.
- Complete this fashion outfit with an additional product.
- What is disentangled representation (embedding)?
- Entangled representation = hard to preserve some attributes and change others
- Disentangled = object’s attributes have separate dimensions
Unsupervised Disentangling Methods
- Below methods are generative
- so instead of only searching, one can manipulate (condition) the image
- Variational Auto-encoders
- speculation: some disentanglement thanks to the architecture
- compressing into a low-dimensional space kept close around zero (the regularization term)
- only high-level factors get through the compression
- products with similar high level factors are encoded close in the embedding space
- methods: penalizing mutual information between latents or total correlation, e.g. Relevance Factor VAE
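The regularization term above can be sketched as the KL divergence of the approximate posterior from a standard normal prior; a minimal numpy illustration (not any paper's code):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL divergence between a diagonal Gaussian q(z|x) = N(mu, sigma^2)
    and the prior N(0, I) -- the VAE regularization term that pulls
    latent codes toward the origin."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# A latent code near the origin is penalized far less than a distant one,
# so only high-level factors "pay for" moving the embedding away from zero.
near = kl_to_standard_normal(np.zeros(8), np.zeros(8))   # 0.0
far = kl_to_standard_normal(np.full(8, 3.0), np.zeros(8))
print(near, far)
```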
- GANs (with an encoder and a decoder), e.g. DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images
- Flow-Based models e.g. OpenAI’s Glow - Flow-Based Model Teardown
- like a VAE, but the decoder is the exact inverse of the encoder
- reversibly encodes images into independent Gaussian factors
- attribute manipulation vectors are found using labeled data
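A minimal sketch of how such attribute vectors might be found from labeled latents, assuming a Glow-style invertible encoder; the latents here are synthetic stand-ins, not outputs of a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latents from an invertible flow encoder: rows are encoded
# images, grouped by whether the attribute (e.g. "smiling") is present.
z_with = rng.normal(loc=0.5, scale=1.0, size=(100, 16))
z_without = rng.normal(loc=-0.5, scale=1.0, size=(100, 16))

# The attribute vector is the difference of the label-group means.
attr_vec = z_with.mean(axis=0) - z_without.mean(axis=0)

# Manipulation: move a latent along the attribute direction, then decode
# by running the encoder in reverse (decoding not shown here).
z = z_without[0]
z_edited = z + 1.0 * attr_vec
```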
Unsupervised Disentangled Representations
- Google ICML 2019 Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
- A large-scale evaluation of various unsupervised methods (12k models)
- On the Shapes3D dataset, models try to separate all attributes of the scene
- into 10 dimensions: object shape, object size, camera rotation, colors
- No model reliably disentangled into these dimensions
- Theorem: infinitely many entangled transformations of the latent factors yield the same observed distribution
- so the true dimensions can never be identified without some guidance
- but could perhaps be found with additional data?
- Assumptions about the data have to be incorporated into the model (inductive bias)
- Each unsupervised model has to be specialized
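The identifiability problem can be illustrated with a rotation: independent Gaussian factors rotated in latent space have exactly the same joint distribution, so no purely unsupervised criterion can prefer the true axes over the rotated ones. A toy numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((100000, 2))  # "true" factors, independent N(0, 1)

# Rotate the latent space by 45 degrees: each new axis mixes both factors.
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
z_rot = z @ rot.T

# Both versions have (near-)identity covariance: the distributions are
# identical, so the data alone cannot reveal which axes are the true ones.
print(np.cov(z.T).round(2))
print(np.cov(z_rot.T).round(2))
```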
Multi-Task Learning
- Multi-task learning may improve performance
- Google NeurIPS 2021 paper on a method for grouping tasks
- meta-learning
- usually the tasks have to be related
- inter-task affinity:
- measures how a gradient step on one task affects the other task's loss
- correlates with overall multi-task model performance
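A toy sketch of the inter-task affinity idea, using hypothetical quadratic losses over shared weights (a simplified version, not the paper's exact formulation):

```python
import numpy as np

# Toy shared model: two quadratic task losses over the same shared weights.
def loss_a(w): return np.sum((w - 1.0) ** 2)
def loss_b(w): return np.sum((w - 0.9) ** 2)  # similar optimum -> related task

def grad(loss, w, eps=1e-6):
    """Central finite-difference gradient (enough for this toy example)."""
    return np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                     for e in np.eye(len(w))])

def affinity(loss_i, loss_j, w, lr=0.1):
    """Inter-task affinity: relative drop in task j's loss after one
    gradient step on task i's loss (positive = task i helps task j)."""
    w_step = w - lr * grad(loss_i, w)
    return 1.0 - loss_j(w_step) / loss_j(w)

w = np.zeros(4)
print(affinity(loss_a, loss_b, w))  # positive: a step on A lowers B's loss
```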
Supervised-Disentangling: Attribute-driven Disentangled Representations
- Amazon 2021 paper Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval
- SoTA on the fashion tasks (Attribute manipulation retrieval, Conditional similarity retrieval, Outfit completion)
- supervised disentangled representation learning
- each attribute takes one of multiple values
- split embedding into sections corresponding to attributes
- multi-task learning allows disentangling
- store prototype embeddings of each attribute value in memory module
- prototypes can then be swapped in for an item's attribute vector
- Read more about related research in image-text classification
Architecture
- image representation (AlexNet, Resnet18)
- per attribute:
- fully-connected two-layer network
- maps into an attribute-specific subspace
- producing image’s attribute embedding
- disentangled representation
- called Attribute-Driven Disentangled Encoder (ADDE)
- memory block
- stores prototype embeddings for all values of the attributes
- e.g. each color has one prototype embedding
- stored in a matrix whose off-block-diagonal elements are forced to be small
- trained via triplet loss
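The attribute-split embedding and prototype swap can be sketched as follows; the dimensions, attribute names, and helper functions are hypothetical, not the paper's code:

```python
import numpy as np

# Hypothetical setup: the disentangled embedding is a concatenation of
# fixed-size attribute sub-embeddings, e.g. [color | sleeve | category].
DIM = 8
ATTRS = ["color", "sleeve", "category"]

def split(embedding):
    """View the flat embedding as one sub-vector per attribute."""
    return {a: embedding[i * DIM:(i + 1) * DIM] for i, a in enumerate(ATTRS)}

def manipulate(embedding, attr, prototype):
    """Attribute manipulation: swap one attribute's sub-embedding for the
    target value's prototype from the memory block; the rest stays intact."""
    out = embedding.copy()
    i = ATTRS.index(attr)
    out[i * DIM:(i + 1) * DIM] = prototype
    return out

rng = np.random.default_rng(0)
item = rng.normal(size=DIM * len(ATTRS))
red_prototype = rng.normal(size=DIM)      # memory-block entry for "red"
query = manipulate(item, "color", red_prototype)  # "same item, but red"
```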
Loss Function
- Label triplet loss
- pulls representations with the same labels toward similar vectors
- Consistency triplet loss
- attribute representations of an image close to corresponding memory vectors
- align prototype embeddings with representations
- Compositional triplet loss
- generate change in attributes
- create manipulation vector using prototype vectors
- sample positive and negative samples based on labels
- Memory block loss
- forces the memory matrix's off-block-diagonal elements toward zero
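A minimal sketch of the triplet machinery behind the compositional loss, with hypothetical embeddings and a simplified manipulation vector (the paper builds it per attribute subspace from memory prototypes):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive to within `margin` of the
    anchor relative to the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
# Compositional triplet: the anchor is the query embedding plus a
# manipulation vector built from prototypes (target value - source value).
query, proto_src, proto_tgt = rng.normal(size=(3, 16))
anchor = query + (proto_tgt - proto_src)
positive = anchor + 0.01 * rng.normal(size=16)  # item with the target labels
negative = rng.normal(size=16)                  # item with mismatched labels
print(triplet_loss(anchor, positive, negative))
```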
Experiments and Results
Datasets
- Shopping100k: 100k samples, 12 attributes
- DeepFashion: 100k samples, 3 attributes: category, texture, shape
Attribute Manipulation Retrieval
- Previous approaches
- AMNet: Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search
- no disentangling
- target attribute value is represented by a prototype vector
- specialized NN layer fuses the prototype attribute vector into the representation
- FSN: Learning attribute representations with localization for flexible fashion search
- localizes regions of attributes within the image
- using attribute activation maps
- then weighted pooling on an earlier convolutional layer (5 instead of 7)
- AlexNet used as the backbone network for comparable results
- Loss function ablations included
Outfit Completion
- backbone network Resnet18
- previous Amazon paper 2020 Fashion Outfit Complementary Item Retrieval
- introduced CSA-Net, a similar architecture but without disentanglement
Outfit Ranking Loss
- operates on entire outfit
- calculates average distance from all members in the outfit to the proposed addition
- input these distances into a triplet loss
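The outfit ranking loss described above can be sketched as follows, with hypothetical embeddings:

```python
import numpy as np

def outfit_distance(outfit, candidate):
    """Average embedding distance from every item already in the outfit
    to the proposed additional item."""
    return np.mean([np.linalg.norm(item - candidate) for item in outfit])

def outfit_ranking_loss(outfit, pos_item, neg_item, margin=0.2):
    """Triplet loss over whole-outfit distances: the true complementary
    item should sit closer to the outfit than a mismatched one."""
    return max(0.0, outfit_distance(outfit, pos_item)
                    - outfit_distance(outfit, neg_item) + margin)

outfit = [np.zeros(2), np.ones(2)]       # items already in the outfit
good = np.array([0.5, 0.5])              # plausible complementary item
bad = np.array([10.0, 10.0])             # mismatched item
loss = outfit_ranking_loss(outfit, good, bad)
```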
Output Conditioning
In diffusion models, we manipulate the output image with a conditioning input, e.g. conditioning text. This relies on some degree of disentanglement in the representations for the model to be able to manipulate them. For example, a feature-wise linear modulation (FiLM) layer can be used for this purpose.
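A minimal sketch of a FiLM layer; the shapes and projection weights are hypothetical:

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Feature-wise Linear Modulation (FiLM): the conditioning vector is
    projected to a per-channel scale (gamma) and shift (beta) that
    modulate the feature map -- one way conditioning steers the output."""
    gamma = cond @ w_gamma   # shape: (channels,)
    beta = cond @ w_beta     # shape: (channels,)
    return gamma[None, :] * features + beta[None, :]

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))    # e.g. 5 spatial positions x 8 channels
cond = rng.normal(size=4)          # e.g. a text-conditioning embedding
w_gamma, w_beta = rng.normal(size=(2, 4, 8))  # learned projections
out = film(feats, cond, w_gamma, w_beta)
```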