Domain Adversarial Training on C-VAE for Controllable Music Generation


Introduction

Welcome to the demo page of our ISMIR 2022 paper: Domain Adversarial Training on Conditional Variational Auto-Encoder for Controllable Music Generation. We contribute a generalized form of domain adversarial training that improves disentanglement and controllability in music generation, especially when complex sequential conditions are involved. In this paper, we focus on the task of chord representation learning conditioned on melody. A well-trained model with good controllability can harmonize a new melody using the representation (style) of an existing chord progression. The following demos showcase our model's controllability in chord generation.
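For readers who want a concrete picture, below is a minimal PyTorch sketch of the general setup: a conditional VAE whose latent chord representation is trained adversarially, via gradient reversal, to shed information about the melody condition. This illustrates the technique rather than our exact architecture; the module names, feature dimensions, and GRU backbones are all assumptions.

```python
# Illustrative sketch only: module names, dimensions, and the GRU
# backbones below are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity forward, negated gradient backward."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out


class ChordCVAE(nn.Module):
    def __init__(self, chord_dim=36, melody_dim=130, z_dim=128, hidden=256):
        super().__init__()
        self.enc = nn.GRU(chord_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        # The decoder reconstructs chords from z plus the melody condition.
        self.dec = nn.GRU(melody_dim + z_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, chord_dim)
        # The adversary tries to recover a summary of the condition from z
        # alone; the reversed gradient pushes the encoder to defeat it.
        self.adv = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, melody_dim))

    def forward(self, chords, melody):
        _, h = self.enc(chords)                                # (1, B, hidden)
        mu, logvar = self.mu(h[-1]), self.logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        T = melody.size(1)
        dec_in = torch.cat([melody, z.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        recon = self.out(self.dec(dec_in)[0])                  # chord logits
        cond_pred = self.adv(GradReverse.apply(z))             # adversarial head
        return recon, mu, logvar, cond_pred
```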

First, let's listen to a source lead sheet (chord + melody). This is an 8-bar pop-song phrase selected from our validation set. Our model extracts its chord representation and then reconstructs the chords conditioned on varied melody conditions.
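All demos on this page follow the same recipe: encode the source chord progression once, then decode the resulting representation under a different melody condition. A minimal sketch using the illustrative ChordCVAE above, with random tensors standing in for real lead-sheet features:

```python
# Sketch of the procedure, reusing the illustrative ChordCVAE above;
# random tensors stand in for real lead-sheet features.
import torch

model = ChordCVAE().eval()            # a trained model would be loaded here
src_chords = torch.randn(1, 32, 36)   # source chord progression (B, T, feat)
new_melody = torch.randn(1, 32, 130)  # the melody condition to harmonize

with torch.no_grad():
    _, h = model.enc(src_chords)
    z = model.mu(h[-1])               # chord "style" representation
    T = new_melody.size(1)
    dec_in = torch.cat([new_melody, z.unsqueeze(1).expand(-1, T, -1)], dim=-1)
    harmonized = model.out(model.dec(dec_in)[0])
print(harmonized.shape)               # torch.Size([1, 32, 36])
```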

MIDI 1. Source lead sheet (paper Fig. 5(b))


Here is the reconstruction conditioned on the melody transposed down a tritone:

MIDI 2. Generation conditioned on transposition
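Constructing this condition is simple: a tritone is six semitones, so every melody pitch is shifted down by six. A small sketch, assuming (purely for illustration) that the melody is given as a list of MIDI note numbers with None for rests:

```python
# A tritone is six semitones, so the condition for MIDI 2 amounts to
# shifting every pitch down by 6. Rests are represented as None here.
def transpose(midi_pitches, semitones=-6):
    return [p + semitones if p is not None else None for p in midi_pitches]

melody = [60, 62, 64, 65, 67, None, 72]   # C D E F G (rest) C
print(transpose(melody))                  # [54, 56, 58, 59, 61, None, 66]
```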


Next, what happens if we change the mode of the melody from major to minor:

MIDI 3. Generation conditioned on modal change
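One plausible way to realize such a modal change (not necessarily the procedure we used) is to lower scale degrees 3, 6, and 7 of a major-key melody by a semitone, mapping it to the parallel natural minor:

```python
# Hypothetical realization of a major-to-minor modal change: lower scale
# degrees 3, 6, and 7 by one semitone (parallel natural minor). Assumes
# MIDI note numbers; the tonic pitch class is a parameter (0 = C).
LOWERED = {4, 9, 11}  # pitch classes of degrees 3, 6, 7 relative to the tonic

def to_parallel_minor(midi_pitches, tonic_pc=0):
    return [p - 1 if p is not None and (p - tonic_pc) % 12 in LOWERED else p
            for p in midi_pitches]

print(to_parallel_minor([60, 62, 64, 65, 67, 69, 71, 72]))
# [60, 62, 63, 65, 67, 68, 70, 72]  (C major scale -> C natural minor)
```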


Now we introduce a new melody sample as the condition. The new melody and the corresponding generation are as follows:

MIDI 4. New melody source (paper Fig. 5(a))

MIDI 5. Generation conditioned on new melody (paper Fig. 6(b))


Finally, let's switch the chord and melody sources of the previous samples and see how it works:

MIDI 6. Generation with swapped chord and melody sources (paper Fig. 6(a))

Ablation Study

The main novelty of our paper is a generalized domain adversarial objective with condition corruption, which contextualizes the exact dependency between the representation and the condition and thereby assists disentanglement and control. To validate our design, we compare our model against three baselines; for the definitions of the baseline models (non-DAT, mask-CR, and non-CR), please refer to our paper.
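As a rough, non-authoritative sketch of what such an objective can look like: the discriminator scores (representation, condition) pairs as matched versus corrupted, and gradient reversal trains the encoder to hide exactly that dependency. The corruption scheme below (shuffling conditions within a batch) is an assumption chosen for brevity; see the paper for the actual objective and for how the non-DAT, mask-CR, and non-CR variants differ. GradReverse is reused from the sketch in the introduction.

```python
# Non-authoritative sketch: the discriminator scores (z, condition)
# pairs as matched vs. corrupted; GradReverse (from the introduction's
# sketch) flips gradients so the encoder hides that dependency. The
# within-batch shuffle used as "corruption" here is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dat_loss(z, melody, discriminator):
    """z: (B, z_dim); melody: (B, T, D). Returns the adversarial loss."""
    cond = melody.mean(dim=1)                     # crude condition summary
    corrupt = cond[torch.randperm(cond.size(0))]  # corrupted pairings
    z_rev = GradReverse.apply(z)
    real = discriminator(torch.cat([z_rev, cond], dim=-1))
    fake = discriminator(torch.cat([z_rev, corrupt], dim=-1))
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

disc = nn.Sequential(nn.Linear(128 + 130, 256), nn.ReLU(), nn.Linear(256, 1))
loss = dat_loss(torch.randn(8, 128), torch.randn(8, 32, 130), disc)
```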

Demos for the ablation study are as follows:


Case 1

Chord source

Melody source

Generation-Ours

Generation-non-DAT

Generation-mask-CR

Generation-non-CR


Case 2

Chord source

Melody source

Generation-Ours

Generation-non-DAT

Generation-mask-CR

Generation-non-CR


Case 3

Chord source

Melody source

Generation-Ours

Generation-non-DAT

Generation-mask-CR

Generation-non-CR


Case 4

Chord source

Melody source

Generation-Ours

Generation-non-DAT

Generation-mask-CR

Generation-non-CR


Case 5

Chord source

Melody source

Generation-Ours

Generation-non-DAT

Generation-mask-CR

Generation-non-CR


Failure Case

While our model is controllable in terms of tonality, it is not yet enforced to adapt the chord progression locally. As a result, when the progression of the chord source differs substantially from the one implied by the source melody (e.g., when the chord source has a ii-IVM7-ii-V progression while the melody implies vi-IIM7-V7-V7(sus4)), our model can fail to generate a satisfying harmony. One such failure case is shown below. We will seek to address these problems in our future research.

Chord source

Melody source

Generation-Ours


Conclusion

In conclusion, we contribute a generalized form of domain adversarial training for controllable music generation, especially when complex sequential conditions are involved. Our method performs well on chord representation learning, where we learn a pitch-invariant representation conditioned on the melody and develop a novel harmonization strategy. The improvement in disentanglement and controllability is demonstrated through extensive subjective and objective evaluation. With this methodology, we hope to bring a new perspective not only to music generation but also to more general scenarios of conditional representation learning.

Our code is available in our GitHub Repo. For more information about our algorithm and experiments, please refer to our paper!