Introduction
In this work, we contribute an unsupervised method that learns from raw observations and disentangles the latent space into content and style representations. Unlike most disentanglement algorithms that rely on domain-specific labels and knowledge, our method is based on a meta-level insight about domain-general statistical differences between content and style: content varies more among different fragments within a sample but maintains an invariant vocabulary across data samples, whereas style remains relatively invariant within a sample but varies more significantly across different samples. We integrate this inductive bias into an encoder-decoder architecture and name our method V3 (variance-versus-invariance).
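To make this inductive bias concrete, the following is a minimal sketch of the within-sample and across-sample variance statistics it refers to. The tensor layout, the function name, and the use of a simple per-sample mean as the across-sample content statistic are illustrative assumptions, not the paper's actual objective.

```python
import torch

def v3_variance_stats(z_content, z_style):
    """Illustrative within-/across-sample variance statistics (not the exact V3 loss).

    z_content, z_style: tensors of shape (batch, n_fragments, dim), where each
    sample is split into fragments (e.g. digits in a string, notes in a clip).
    """
    # Content should vary across fragments within a sample ...
    content_within = z_content.var(dim=1).mean()
    # ... while style stays roughly constant within a sample.
    style_within = z_style.var(dim=1).mean()

    # Style should vary across different samples ...
    style_across = z_style.mean(dim=1).var(dim=0).mean()
    # ... while content draws on a shared vocabulary across samples, so the
    # per-sample average content embedding varies comparatively little
    # (a simplified proxy for vocabulary invariance).
    content_across = z_content.mean(dim=1).var(dim=0).mean()

    return content_within, style_within, content_across, style_across
```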
This demonstration page serves as a supplement to Sections 4.2 and 4.4 in the paper. Specifically, we provide interactive demos for the experiments on the two synthetic datasets used in the paper, PhoneNums and InsNotes. PhoneNums contains images of written digit strings with different colors and other style variations, and InsNotes contains monophonic music audio played by single instruments with different pitches and expression variations.
- Visualizations of Latent Representations and Codebook Confusion Matrix: As a supplement to Sections 4.2 and 4.4, we provide a visual exploration of the learned latent content and style representations, together with confusion matrices between the learned codebook and ground-truth content labels. Latent representations are visualized using t-SNE in 3D space, and the colors of data points denote ground-truth content or style labels; tighter, better-separated groupings indicate better disentanglement performance. The codebook confusion matrix shows how well the learned vocabulary aligns with human knowledge: the horizontal axis represents the ground-truth contents, the vertical axis represents the learned codebook entries, and the deeper the color, the higher the correlation. (A minimal sketch of these two computations follows this list.)
- Synthesized Results via Content-Style Recombination: As a supplement to Section 4.4, we illustrate the interpretability of the codebook and the success of content-style disentanglement by synthesizing new samples that recombine latent content and style factors. Under a specific codebook size, you can select an individual content or style source to see the corresponding recombination results. If a content index is selected, we show the style transfer results of this content recombined with all styles (taking the mean of the style representations); if a style index is selected, we show the style transfer results of this style recombined with all contents. Note that the content indices are sorted in the order of the codebook. (The recombination step is sketched below.)
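For readers who want to reproduce the diagnostics described in the first item above on their own models, here is a minimal sketch of the codebook-versus-label confusion matrix and the 3D t-SNE projection. The function names and array layouts are assumptions for illustration; the demo's exact plotting code may differ.

```python
import numpy as np
from sklearn.manifold import TSNE

def codebook_confusion(codebook_ids, content_labels, n_codes, n_contents):
    """Co-occurrence counts between learned codebook entries (rows) and
    ground-truth content labels (columns), row-normalized for plotting."""
    conf = np.zeros((n_codes, n_contents))
    for code, label in zip(codebook_ids, content_labels):
        conf[code, label] += 1
    return conf / conf.sum(axis=1, keepdims=True).clip(min=1)

def tsne_3d(latents, seed=0):
    """Project latent vectors of shape (n_points, dim) to 3D for visualization."""
    return TSNE(n_components=3, random_state=seed).fit_transform(latents)
```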
We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper. Note that all models shown in the demos already achieve good reconstruction performance, which we omit here for simplicity.
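The recombination step described in the second item above can be sketched as follows, assuming an encoder-decoder in which the decoder consumes concatenated content and style vectors per fragment. The `decoder` interface and tensor shapes are hypothetical, but the averaging of style representations matches the description above.

```python
import torch

@torch.no_grad()
def recombine(decoder, content_codes, style_vectors):
    """Decode one content sequence with the mean of another source's styles.

    content_codes:  (n_fragments, content_dim) codebook vectors of the chosen
                    content index, one per fragment.
    style_vectors:  (n_samples, style_dim) style representations of the chosen
                    style source, averaged before decoding as in the demo.
    """
    style = style_vectors.mean(dim=0, keepdim=True)      # mean style vector
    style = style.expand(content_codes.size(0), -1)      # broadcast to all fragments
    latent = torch.cat([content_codes, style], dim=-1)   # recombined latent codes
    return decoder(latent)                               # hypothetical decoder call
```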
PhoneNums: Learning Digits and Colors
Example data
In this demo, we show that V3 can learn to disentangle digits and colors from images of written digit strings. Shown above are eight samples from the dataset, covering all 10 digits written in eight different colors. Note that although every line is written in a single color, there are rich variations in digit position, noise, foreground and background color jitter, and blurriness.
We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the image dataset without any supervision except segmentation. Shown below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook-size settings.
Please choose codebook size: K=10 / K=20 / K=40
Visualizations of symbol-level interpretability & latent representations
Method | Codebook Confusion Matrix | Content Latent Space | Style Latent Space
---|---|---|---
V3 (Proposed) | (figure) | (figure) | (figure)
MINE-based | (figure) | (figure) | (figure)
Cycle loss | (figure) | (figure) | (figure)
Legends | (legend) | (legend) | (legend)
Please choose a content or style for traversal: fix a content index and traverse all styles, or fix a style and traverse all content indices.
Synthesized results via content-style recombination
Method | Synthesized results
---|---
V3 (Proposed) | (figures)
MINE-based | (figures)
Cycle loss | (figures)
InsNotes: Learning Pitches and Timbres
Example data
In this demo, we show that V3 can learn to disentangle pitch and timbre from raw music audio, each clip played by a single instrument. Shown above are the spectrograms and audio of some samples in the dataset, which covers 12 pitches (a full octave) and 12 timbres. Note that although each sample is played by only one instrument, there are rich variations in velocity and amplitude envelope.
We compare V3 with the MINE-based method and the cycle loss-based method mentioned in the paper, training all three methods on the music dataset without any supervision except segmentation. Shown below are the codebook confusion matrices, latent space visualizations, and content-style recombinations of all methods under different codebook-size settings.
Please choose codebook size: K=12 / K=24 / K=48
Visualizations of symbol-level interpretability & latent representations
Method | Codebook Confusion Matrix | Content Latent Space | Style Latent Space
---|---|---|---
V3 (Proposed) | (figure) | (figure) | (figure)
MINE-based | (figure) | (figure) | (figure)
Cycle loss | (figure) | (figure) | (figure)
Legends | (legend) | (legend) | (legend)
Please choose a content or style for traversal: fix a content index and traverse all styles, or fix a style and traverse all content indices.
Synthesized results via content-style recombination
Method | Synthesized results
---|---
V3 (Proposed) | (audio)
MINE-based | (audio)
Cycle loss | (audio)