Summary of Research Achievements
The main challenge of representation learning across different modalities is the heterogeneity gap. A classical line of methods is the family of CCA-based approaches, which aim to find transformations that maximize the correlation between input pairs drawn from two different variable sets. We propose an unsupervised generative adversarial alignment representation (UGAAR) model to learn deep discriminative representations shared across three major musical modalities: sheet music, lyrics, and audio, in which a three-branch deep neural network architecture is trained jointly. In particular, the proposed model can transfer the strong relationship between audio and sheet music to audio-lyrics and sheet-lyrics pairs by learning the correlation in a shared latent subspace. We apply CCA components of audio and sheet music to establish new ground truth. The generative (G) model learns the correlation of the two transferred pairs to generate a new audio-sheet pair for fixed lyrics, challenging the discriminative (D) model. The discriminative model aims to distinguish whether its input comes from the generative model or from the ground truth. The two models are trained simultaneously in an adversarial manner to strengthen the learned alignment representations. Our experimental results demonstrate the feasibility of the proposed UGAAR for alignment representation learning among sheet music, audio, and lyrics.
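To make the CCA ground-truth step concrete, the sketch below projects precomputed audio and sheet-music embeddings into a shared correlated subspace; the projected pairs then serve as the "new ground truth" the discriminator compares against. This is a minimal sketch, not the paper's implementation: the embedding arrays, their dimensions, and the number of CCA components are all assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical precomputed branch embeddings: one row per aligned piece.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal((500, 128))   # audio-branch features (assumed)
sheet_emb = rng.standard_normal((500, 128))   # sheet-music-branch features (assumed)

# Fit CCA on the strongly related audio-sheet pairs and project both
# views into the shared correlated subspace.
cca = CCA(n_components=32)
audio_c, sheet_c = cca.fit_transform(audio_emb, sheet_emb)

# The correlated components act as the new ground-truth audio-sheet
# pairs that the discriminator later contrasts with generated pairs.
ground_truth_pairs = np.concatenate([audio_c, sheet_c], axis=1)  # shape (500, 64)
```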
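The adversarial interplay between G and D can likewise be illustrated with a minimal PyTorch training step. This is a hedged sketch under assumed names and shapes: `generator` maps a lyrics embedding to a concatenated audio-sheet pair (e.g., the 64-dimensional CCA pair from the sketch above), `discriminator` scores pair authenticity, and all modules and dimensions are hypothetical stand-ins rather than the actual UGAAR architecture.

```python
import torch
import torch.nn as nn

LYRICS_DIM, PAIR_DIM = 64, 64  # hypothetical embedding sizes

# G: generates an audio-sheet pair conditioned on a fixed lyrics embedding.
generator = nn.Sequential(nn.Linear(LYRICS_DIM, 128), nn.ReLU(), nn.Linear(128, PAIR_DIM))
# D: judges whether an audio-sheet pair is ground truth or generated.
discriminator = nn.Sequential(nn.Linear(PAIR_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(lyrics_emb, real_pair):
    """One adversarial update: D learns to separate ground-truth pairs
    from generated ones; G learns to fool D for the same fixed lyrics."""
    # --- discriminator update (generated pair is detached from G) ---
    fake_pair = generator(lyrics_emb).detach()
    d_loss = (bce(discriminator(real_pair), torch.ones(real_pair.size(0), 1)) +
              bce(discriminator(fake_pair), torch.zeros(fake_pair.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator update (tries to make D label its pair as real) ---
    fake_pair = generator(lyrics_emb)
    g_loss = bce(discriminator(fake_pair), torch.ones(lyrics_emb.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage with random stand-in batches:
lyrics = torch.randn(16, LYRICS_DIM)
real = torch.randn(16, PAIR_DIM)   # stands in for CCA ground-truth pairs
print(train_step(lyrics, real))
```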
Strategy for Future Research Activity
Future work will aim to develop novel cross-modal learning algorithms along the following lines: (i) developing transformer-based techniques to learn stronger correlations, (ii) developing multimodal metric learning to enhance system performance, and (iii) conducting extensive experiments to compare against existing state-of-the-art methods.