Introduction

In this work, we contribute an unsupervised method that effectively learns from raw observations and disentangles its latent space into content and style representations. Unlike most disentanglement algorithms, which rely on domain-specific labels and knowledge, our method is based on a meta-level insight about the domain-general statistical differences between content and style: content varies more among different fragments within a sample but maintains an invariant vocabulary across data samples, whereas style remains relatively invariant within a sample but varies more significantly across different samples. We integrate this inductive bias into an encoder-decoder architecture and name our method V3 (variance-versus-invariance).
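As a rough illustration of this inductive bias (a minimal sketch, not the exact training objective from the paper), the snippet below computes the two statistics that V3 contrasts: the variance of content codes within a sample versus across samples, and the same for style codes. The shapes and the stand-in encoder outputs are placeholders for illustration only.

```python
import torch

# Hypothetical shapes: N samples, each split into T fragments, encoded into
# D-dimensional content codes z_c and style codes z_s (stand-ins for encoder outputs).
N, T, D = 8, 16, 32
z_c = torch.randn(N, T, D)
z_s = torch.randn(N, T, D)

def within_sample_var(z):
    # Variance over the fragments of each sample, averaged over samples and dims.
    return z.var(dim=1).mean()

def across_sample_var(z):
    # Variance over samples of each sample's mean code, averaged over dims.
    return z.mean(dim=1).var(dim=0).mean()

# The V3 intuition: content should vary within a sample but stay comparable
# across samples, while style should be stable within a sample but differ
# across samples.  A simple illustrative regularizer (not the paper's loss)
# rewards exactly that pattern:
eps = 1e-6
loss_v3_like = (
    across_sample_var(z_c) / (within_sample_var(z_c) + eps)   # content: within > across
    + within_sample_var(z_s) / (across_sample_var(z_s) + eps)  # style: across > within
)
print(float(loss_v3_like))
```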

This demonstration page serves as a supplement to Sections 4.2 and 4.4 of the paper. Specifically, we provide interactive demos for the experiments on the two synthetic datasets used in the paper: PhoneNums and InsNotes. PhoneNums contains images of written digit strings with different colors and other style variations, and InsNotes contains monophonic music audio played by single instruments with different pitches and expression variations.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper. Note that all models shown in the demos achieve good reconstruction performance, which we omit here for simplicity.


PhoneNums: Learning Digits and Colors

Example data

In this demo, we show that V3 learns to disentangle digits and colors from images of written digit strings. The figure above shows eight samples of different colors from the dataset. All 10 digits are involved, written in eight different colors. Note that although each line is written in a single color, there are rich variations in digit position, noise, foreground and background color jitter, and blurriness.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the image dataset without any supervision except segmentation. Below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook size settings.
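The content-style recombinations shown further down follow the usual encode-swap-decode recipe. The sketch below is a minimal, hedged illustration of that step; `encode_content`, `encode_style`, and `decode` are hypothetical stand-ins rather than the exact interfaces used in the paper.

```python
import torch

def recombine(x_content_src, x_style_src, encode_content, encode_style, decode):
    # Take the fragment-wise content codes from one sample and the sample-level
    # style code from another, then decode the pair into a new sample.
    with torch.no_grad():
        z_c = encode_content(x_content_src)
        z_s = encode_style(x_style_src)
        return decode(z_c, z_s)

# Toy stand-ins so the sketch runs end to end; in practice these would be the
# trained encoders and decoder of each method.
enc_c = lambda x: x.mean(dim=-1, keepdim=True)
enc_s = lambda x: x.mean(dim=-2, keepdim=True)
dec = lambda z_c, z_s: z_c + z_s

x_a, x_b = torch.randn(1, 28, 120), torch.randn(1, 28, 120)
x_hat = recombine(x_a, x_b, enc_c, enc_s, dec)
print(x_hat.shape)
```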

Please choose codebook size:

K=10 K=20 K=40

Visualizations of symbol-level interpretability & latent representations

Method Codebook Confusion Matrix Content Latent Space Style Latent Space
V3 (Proposed)
MINE-based
Cycle loss
Legends

Please choose a content or style for traversal:

Fix content index, traverse all styles: Or Fix style, traverse all content indices:

Synthesized results via content-style recombination

V3 (Proposed)
MINE-based
Cycle loss
From both the visualizations and the recombination results, we can see that V3 successfully disentangles digits and colors. The content and style representations show clear locality with respect to the ground-truth labels. The confusion matrices show a near one-to-one alignment with human knowledge when there is no codebook redundancy (K=10), and full coverage with interpretable entries when there is codebook redundancy (K=20 and K=40). The style transfer results are also correct and semantically meaningful compared to the baselines.




InsNotes: Learning Pitches and Timbres

Example data

In this demo, we show that V3 learns to disentangle pitch and timbre from raw music audio played by single instruments. Above are the spectrograms and audio files of some samples from the dataset. There are 12 pitches (a full octave) and 12 timbres in the dataset. Note that although each sample is played by only one instrument, there are rich velocity and amplitude-envelope variations.

We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the music dataset without any supervision except segmentation. Below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook size settings.
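As a reminder of how the codebook confusion matrices can be read, the sketch below counts, for each discrete content code, how often it co-occurs with each ground-truth label (pitch here, digit in PhoneNums); a near one-to-one pattern indicates interpretable codebook entries. The variable names, shapes, and toy data are placeholders, not the paper's evaluation code.

```python
import numpy as np

def codebook_confusion(code_ids, labels, num_codes, num_labels):
    # Count co-occurrences of discrete content codes and ground-truth labels;
    # rows are codebook entries, columns are labels (e.g. the 12 pitches).
    m = np.zeros((num_codes, num_labels), dtype=np.int64)
    for c, y in zip(code_ids, labels):
        m[c, y] += 1
    return m

# Toy example: in practice, code_ids come from the trained model's quantized
# content branch and labels from the dataset annotations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 12, size=1000)
code_ids = labels.copy()  # a perfectly aligned codebook would look like this
conf = codebook_confusion(code_ids, labels, num_codes=12, num_labels=12)
print(conf.diagonal())
```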

Please choose codebook size:

K=12 K=24 K=48

Visualizations of symbol-level interpretability & latent representations

Method Codebook Confusion Matrix Content Latent Space Style Latent Space
V3 (Proposed)
MINE-based
Cycle loss
Legends

Please choose a content or style for traversal:

Fix content index, traverse all styles: Or Fix style, traverse all content indices:

Synthesized results via content-style recombination

V3 (Proposed)
MINE-based
Cycle loss
From both the visualizations and the recombination results, we can see that V3 successfully disentangles pitches and timbres. The content and style representations show clear locality with respect to the ground-truth labels. The confusion matrices show a clear one-to-one alignment with human knowledge when there is no codebook redundancy (K=12), and most codebook entries remain interpretable when there is codebook redundancy (K=24 and K=48). The style transfer results are also correct and semantically meaningful at K=12 compared to the baselines. Even though there are imperfections at K=24 and K=48, the results are still better than the baselines: all pitches are covered when traversing all content indices with a fixed style, and all notes are produced as expected when traversing all timbres with a fixed content index that aligns well with human knowledge.