2021 Fiscal Year Final Research Report
Zero-shot Cross-modal Embedding Learning
Project/Area Number | 19K11987
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 60080: Database-related
Research Institution | National Institute of Informatics
Principal Investigator | Yu Yi, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (00754681)
Project Period (FY) | 2019-04-01 – 2022-03-31
Keywords | Cross-Modal Correlation / Cross-Modal Embedding
Outline of Final Research Achievements |
This project focused on cross-modal embedding learning for cross-modal retrieval. The main challenge is learning joint embeddings in a shared subspace so that similarity can be computed across different modalities. 1) We proposed a novel deep triplet neural network with cluster canonical correlation analysis (TNN-C-CCA), an end-to-end supervised learning architecture with an audio branch and a video branch. 2) We proposed a variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval that learns paired audio-visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio-visual information. 3) We proposed an unsupervised generative adversarial alignment representation (UGAAR) model that learns deep discriminative representations shared across three major musical modalities: sheet music, lyrics, and audio, in which a deep neural network architecture with three branches is jointly trained.
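As a rough illustration only (not the project's actual TNN-C-CCA, VAE, or UGAAR implementations), the following minimal PyTorch sketch shows the shared-subspace idea: two branch encoders map audio and visual features into a common embedding space, and a triplet loss pulls paired audio-visual embeddings together while pushing mismatched pairs apart. All feature dimensions, layer sizes, and the margin value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchEncoder(nn.Module):
    """Projects one modality's features into the shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        # L2-normalize so distances in the shared space behave like cosine distances.
        return F.normalize(self.net(x), dim=-1)

# Assumed input feature dimensions for the two branches (illustrative only).
audio_enc = BranchEncoder(in_dim=128)
video_enc = BranchEncoder(in_dim=1024)
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Toy batch: paired audio/video features plus mismatched (negative) video features.
audio = torch.randn(8, 128)
video_pos = torch.randn(8, 1024)
video_neg = torch.randn(8, 1024)

loss = triplet_loss(audio_enc(audio), video_enc(video_pos), video_enc(video_neg))
loss.backward()  # gradients flow into both branches, i.e. end-to-end training
```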
Free Research Field | Database-related
Academic Significance and Societal Importance of the Research Achievements |
The distributions of data in different modalities are inconsistent, which makes it difficult to directly measure similarity across modalities. The proposed cross-modal embedding learning techniques can help improve the performance of cross-modal retrieval, recognition, and generation.
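Once joint embeddings are learned, cross-modal similarity reduces to a simple comparison in the shared space. The short sketch below, with assumed shapes and an assumed 128-dimensional embedding, ranks candidate video embeddings by cosine similarity to an audio query.

```python
import torch
import torch.nn.functional as F

emb_dim = 128
query_audio = F.normalize(torch.randn(1, emb_dim), dim=-1)      # one audio query embedding
gallery_video = F.normalize(torch.randn(100, emb_dim), dim=-1)  # 100 candidate video embeddings

scores = (query_audio @ gallery_video.T).squeeze(0)  # cosine similarity for unit vectors
ranking = scores.argsort(descending=True)
print(ranking[:5])  # indices of the top-5 retrieved videos
```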