A Feature-wise Linear Modulation (FiLM) layer changes the outputs of a general model based on a specific conditioning input.
A FiLM layer takes two inputs: an input feature and a conditioning feature. The conditioning is trained to change the model's behavior on demand (it shifts the output probability distribution). For example, you condition a diffusion model on the words “brown cat” to generate images of brown cats. The conditioning has a smaller impact than the input feature and the overall model training, but it is essential for applying the model.
An example implementation of FiLM applied to a U-Net is here.
- Feature-wise transformations condition each input feature separately.
- Multiplicative conditioning (scaling) seems to be more useful than additive conditioning.
- But to avoid a loss of generality, conditioning with both addition and multiplication (an affine transformation) is used.
- This affine conditioning is called a Feature-wise Linear Modulation (FiLM) layer.
- Conditioning is often applied across multiple layers.
- Cross-attention is a more complex feature-wise transformation, in which the conditioning feature is an input sequence. Cross-attention has quadratic complexity in sequence length.
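The affine conditioning above can be sketched in a few lines. This is a minimal NumPy illustration, not a specific library's API: `film_params` stands in for a small trained network that maps the conditioning embedding to per-channel scale (gamma) and shift (beta); the weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: a per-channel affine transform.

    features: (batch, channels, height, width) activations
    gamma, beta: (batch, channels) scale and shift predicted
                 from the conditioning input
    """
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

# A stand-in conditioning network: one linear map (random weights here)
# from a conditioning embedding to per-channel gamma and beta.
cond_dim, channels = 16, 8
W = rng.standard_normal((cond_dim, 2 * channels)) * 0.01

def film_params(cond):
    out = cond @ W              # (batch, 2 * channels)
    gamma, beta = np.split(out, 2, axis=1)
    return 1.0 + gamma, beta    # start near the identity transform

x = rng.standard_normal((2, channels, 4, 4))
cond = rng.standard_normal((2, cond_dim))
gamma, beta = film_params(cond)
y = film(x, gamma, beta)        # same shape as x
```

Note that with gamma fixed to 1 and beta to 0 the layer is the identity, which is why FiLM can be dropped into an existing architecture without initially disturbing it.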
Q-Transformer applies FiLM to a visual EfficientNet backbone, conditioning on embeddings of textual instructions, in order to predict Q-values.
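The pattern of conditioning a vision backbone at multiple depths can be sketched as follows. This is a hypothetical toy, not Q-Transformer's actual code: `conv_stub` stands in for a convolutional stage (e.g. an EfficientNet block), and each stage gets its own randomly initialized projection from the instruction embedding to per-channel FiLM parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_stub(x):
    # Placeholder for a convolutional stage; just a ReLU so the
    # sketch stays self-contained.
    return np.maximum(x, 0.0)

def make_film_head(cond_dim, channels):
    # Each conditioned stage owns a projection from the instruction
    # embedding to its per-channel (gamma, beta).
    W = rng.standard_normal((cond_dim, 2 * channels)) * 0.01
    def head(cond):
        gamma, beta = np.split(cond @ W, 2, axis=1)
        return 1.0 + gamma, beta
    return head

channels, cond_dim = 8, 16
heads = [make_film_head(cond_dim, channels) for _ in range(3)]

def backbone(x, text_emb):
    # FiLM is applied after every stage, so the instruction can steer
    # the visual features at multiple depths.
    for head in heads:
        x = conv_stub(x)
        gamma, beta = head(text_emb)
        x = gamma[:, :, None, None] * x + beta[:, :, None, None]
    return x

img = rng.standard_normal((1, channels, 4, 4))
text_emb = rng.standard_normal((1, cond_dim))
features = backbone(img, text_emb)  # features then feed a Q-value head
```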
Try-on in Fashion
TryOnDiffusion: A Tale of Two UNets uses FiLM layers to condition a U-Net so that, given an input image of a person, it generates a new image of that person wearing a garment taken from another, conditioning image.
A FiLM layer relies on the model's ability to disentangle attributes of the input features and change them using the conditioning.