Introduction

In this work, we contribute an unsupervised method that effectively learns from raw observations and disentangles its latent space into content and style representations. Unlike most disentanglement algorithms, which rely on domain-specific labels and knowledge, our method is based on a meta-level insight about the domain-general statistical differences between content and style: content varies more among different fragments within a sample but maintains an invariant vocabulary across data samples, whereas style remains relatively invariant within a sample but varies more significantly across different samples. We integrate this inductive bias into an encoder-decoder architecture and name our method V3 (variance-versus-invariance).
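As a rough illustration of this inductive bias (a minimal sketch, not the exact training objective from the paper), the snippet below computes the two statistics that V3 contrasts: the variance of content codes within a sample versus across samples, and the same for style codes. The shapes and the stand-in encoder outputs are placeholders for illustration only.

```python
import torch

# Hypothetical shapes: N samples, each split into T fragments, encoded into
# D-dimensional content codes z_c and style codes z_s (stand-ins for encoder outputs).
N, T, D = 8, 16, 32
z_c = torch.randn(N, T, D)
z_s = torch.randn(N, T, D)

def within_sample_var(z):
    # Variance over the fragments of each sample, averaged over samples and dims.
    return z.var(dim=1).mean()

def across_sample_var(z):
    # Variance over samples of each sample's mean code, averaged over dims.
    return z.mean(dim=1).var(dim=0).mean()

# The V3 intuition: content should vary within a sample but stay comparable
# across samples, while style should be stable within a sample but differ
# across samples.  A simple illustrative regularizer (not the paper's loss)
# rewards exactly that pattern:
eps = 1e-6
loss_v3_like = (
    across_sample_var(z_c) / (within_sample_var(z_c) + eps)   # content: within > across
    + within_sample_var(z_s) / (across_sample_var(z_s) + eps)  # style: across > within
)
print(float(loss_v3_like))
```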

This demonstration page serves as a supplement to Sections 4.2 and 4.4 of the paper. Specifically, we provide interactive demos for the experiments on the two synthetic datasets used in the paper: PhoneNums and InsNotes. PhoneNums contains images of written digit strings with different colors and other style variations, and InsNotes contains monophonic music audio played by single instruments with different pitches and expression variations.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper. Note that all models shown in the demos achieve good reconstruction performance, which we omit here for simplicity.


PhoneNums: Learning Digits and Colors

Example data

In this demo, we show that V3 learns to disentangle digits and colors from images of written digit strings. The figure above shows eight samples of different colors from the dataset. All 10 digits are involved, written in eight different colors. Note that although each line is written in a single color, there are rich variations in digit position, noise, foreground and background color jitter, and blurriness.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the image dataset without any supervision except segmentation. Below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook size settings.
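The content-style recombinations shown further down follow the usual encode-swap-decode recipe. The sketch below is a minimal, hedged illustration of that step; `encode_content`, `encode_style`, and `decode` are hypothetical stand-ins rather than the exact interfaces used in the paper.

```python
import torch

def recombine(x_content_src, x_style_src, encode_content, encode_style, decode):
    # Take the fragment-wise content codes from one sample and the sample-level
    # style code from another, then decode the pair into a new sample.
    with torch.no_grad():
        z_c = encode_content(x_content_src)
        z_s = encode_style(x_style_src)
        return decode(z_c, z_s)

# Toy stand-ins so the sketch runs end to end; in practice these would be the
# trained encoders and decoder of each method.
enc_c = lambda x: x.mean(dim=-1, keepdim=True)
enc_s = lambda x: x.mean(dim=-2, keepdim=True)
dec = lambda z_c, z_s: z_c + z_s

x_a, x_b = torch.randn(1, 28, 120), torch.randn(1, 28, 120)
x_hat = recombine(x_a, x_b, enc_c, enc_s, dec)
print(x_hat.shape)
```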

Please choose codebook size:

K=10 K=20 K=40

Visualizations of symbol-level interpretability & latent representations

Method Codebook Confusion Matrix Content Latent Space Style Latent Space
V3 (Proposed)
MINE-based
Cycle loss
Legends

Please choose a content or style for traversal:

Fix content index, traverse all styles: Or Fix style, traverse all content indices:

Synthesized results via content-style recombination

V3 (Proposed)
MINE-based
Cycle loss
From both the visualizations and the recombination results, we can see that V3 successfully disentangles digits and colors. The content and style representations show clear locality with respect to the ground-truth labels. The confusion matrices show a near one-to-one alignment with human knowledge when there is no codebook redundancy (K=10), and full coverage with interpretable entries when there is codebook redundancy (K=20 and K=40). The style transfer results are also correct and semantically meaningful compared to the baselines.




InsNotes: Learning Pitches and Timbres

Example data

In this demo, we show that V3 learns to disentangle pitch and timbre from raw music audio played by single instruments. Above are the spectrograms and audio files of some samples from the dataset. There are 12 pitches (a full octave) and 12 timbres in the dataset. Note that although each sample is played by only one instrument, there are rich velocity and amplitude-envelope variations.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the music dataset without any supervision except segmentation. Below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook size settings.
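As a reminder of how the codebook confusion matrices can be read, the sketch below counts, for each discrete content code, how often it co-occurs with each ground-truth label (pitch here, digit in PhoneNums); a near one-to-one pattern indicates interpretable codebook entries. The variable names, shapes, and toy data are placeholders, not the paper's evaluation code.

```python
import numpy as np

def codebook_confusion(code_ids, labels, num_codes, num_labels):
    # Count co-occurrences of discrete content codes and ground-truth labels;
    # rows are codebook entries, columns are labels (e.g. the 12 pitches).
    m = np.zeros((num_codes, num_labels), dtype=np.int64)
    for c, y in zip(code_ids, labels):
        m[c, y] += 1
    return m

# Toy example: in practice, code_ids come from the trained model's quantized
# content branch and labels from the dataset annotations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 12, size=1000)
code_ids = labels.copy()  # a perfectly aligned codebook would look like this
conf = codebook_confusion(code_ids, labels, num_codes=12, num_labels=12)
print(conf.diagonal())
```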

Please choose codebook size:

K=12 K=24 K=48

Visualizations of symbol-level interpretability & latent representations

Method Codebook Confusion Matrix Content Latent Space Style Latent Space
V3 (Proposed)
MINE-based
Cycle loss
Legends

Please choose a content or style for traversal:

Fix content index, traverse all styles: Or Fix style, traverse all content indices:

Synthesized results via content-style recombination

V3 (Proposed)
MINE-based
Cycle loss
From both the visualizations and the recombination results, we can see that V3 successfully disentangles pitches and timbres. The content and style representations show clear locality with respect to the ground-truth labels. The confusion matrices show a clear one-to-one alignment with human knowledge when there is no codebook redundancy (K=12), and most codebook entries remain interpretable when there is codebook redundancy (K=24 and K=48). The style transfer results are also correct and semantically meaningful at K=12 compared to the baselines. Even though there are imperfections at K=24 and K=48, the results are still better than the baselines: all pitches are covered when traversing all content indices with a fixed style, and all notes are produced as expected when traversing all timbres with a fixed content index that aligns well with human knowledge.