2021 Fiscal Year Annual Research Report
Zero-shot Cross-modal Embedding Learning
Project/Area Number | 19K11987 |
Research Institution | National Institute of Informatics |
Principal Investigator | ュ イ (National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor) (00754681) |
Project Period (FY) | 2019-04-01 – 2022-03-31 |
Keywords | Cross-Modal Correlation / Cross-Modal Embedding |
Outline of Annual Research Achievements |
We study cross-modal embedding learning for two tasks. (1) We present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal learning. i) An audio encoder and a visual encoder separately encode audio and visual data into two different spaces, and the resulting features are further mapped to a common subspace by canonical correlation analysis (CCA); a minimal sketch of this correlation-maximizing step is given after the table. ii) Probabilistic modeling is used to handle possible noise and missing information in the data. In this way, the cross-modal discrepancy arising from both intra-modal and inter-modal information is reduced simultaneously in the joint embedding subspace. (2) Besides cross-modal retrieval, embedding learning is also very important for cross-modal generative modeling. We focus on learning the relationship between melody and lyrics, which lie in different feature spaces, in order to generate the best-matching pair between them. The challenging issue is how to resolve the discrepancy between real training samples and generated ones. In this project, for melody generation, we propose a novel architecture, a Three-Branch Conditional LSTM-GAN conditioned on lyrics, which is composed of an LSTM-based generator and an LSTM-based discriminator. The generator consists of three identical and independent lyrics-conditioned LSTM-based sub-networks, each responsible for generating one attribute of a melody. Because melodies are discrete-valued sequences, we leverage the Gumbel-Softmax technique to train the GAN; a sketch of this sampling step also follows the table.
|
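The following is a minimal sketch of the common-subspace projection described in (1), assuming pre-extracted audio and visual feature matrices. It uses scikit-learn's linear CCA in place of the project's VAE-based probabilistic model, so it illustrates only the shared-embedding and retrieval step, not the full architecture; all array shapes and names are illustrative.

```python
# Hypothetical sketch: project paired audio/visual features into a joint
# subspace with linear CCA, then do cross-modal retrieval by nearest neighbour.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 128))   # stand-in for audio-encoder outputs
visual_feats = rng.normal(size=(500, 256))  # stand-in for visual-encoder outputs

cca = CCA(n_components=32)                  # dimensionality of the joint subspace
cca.fit(audio_feats, visual_feats)
audio_emb, visual_emb = cca.transform(audio_feats, visual_feats)

# In the joint space, paired audio/visual samples are maximally correlated,
# so cross-modal retrieval reduces to nearest-neighbour search.
def retrieve(query_emb, gallery_emb, k=5):
    dists = np.linalg.norm(gallery_emb - query_emb, axis=1)
    return np.argsort(dists)[:k]

top5_visual = retrieve(audio_emb[0], visual_emb)  # visual items matching audio query 0
```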
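The next sketch illustrates the Gumbel-Softmax trick mentioned in (2): passing gradients through discrete sequence sampling so an LSTM generator can be trained adversarially. It shows a single lyrics-conditioned branch producing one discrete melody attribute; the class name, vocabulary size, and lyrics-embedding dimensionality are assumptions for illustration, not the project's actual implementation.

```python
# Hypothetical sketch of straight-through Gumbel-Softmax sampling in an
# LSTM-based, lyrics-conditioned generator for a discrete melody attribute.
import torch
import torch.nn as nn
import torch.nn.functional as F

PITCH_VOCAB = 128   # assumed discrete pitch alphabet (e.g. MIDI note numbers)
LYRICS_DIM = 300    # assumed dimensionality of the per-syllable lyrics embedding

class LyricsConditionedBranch(nn.Module):
    """One generator branch: noise + lyrics condition -> discrete pitch sequence."""
    def __init__(self, noise_dim=64, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim + LYRICS_DIM, hidden_dim, batch_first=True)
        self.to_logits = nn.Linear(hidden_dim, PITCH_VOCAB)

    def forward(self, noise, lyrics_emb, tau=1.0):
        # noise:      (batch, seq_len, noise_dim)   random input sequence
        # lyrics_emb: (batch, seq_len, LYRICS_DIM)  per-syllable lyrics condition
        h, _ = self.lstm(torch.cat([noise, lyrics_emb], dim=-1))
        logits = self.to_logits(h)
        # Straight-through Gumbel-Softmax: one-hot samples in the forward pass,
        # differentiable soft samples in the backward pass, so the generator
        # still receives gradients from the discriminator despite discrete outputs.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)

# Usage: generate a batch of 20-step discrete pitch sequences.
branch = LyricsConditionedBranch()
noise = torch.randn(8, 20, 64)
lyrics = torch.randn(8, 20, LYRICS_DIM)     # placeholder lyrics embeddings
fake_pitches = branch(noise, lyrics)        # (8, 20, PITCH_VOCAB) one-hot vectors
```

The temperature `tau` trades off between sample discreteness and gradient smoothness; annealing it during training is a common choice when using this trick.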