Project/Area Number | 19K11987 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Section | General |
Review Section | Basic Section 60080: Database-related |
Research Institution | National Institute of Informatics |
Principal Investigator | Yu Yi (National Institute of Informatics, Digital Content and Media Sciences Research Division, Specially Appointed Assistant Professor) (00754681) |
Project Period (FY) | 2019-04-01 – 2022-03-31 |
Project Status | Completed (Fiscal Year 2021) |
Budget Amount | ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2021: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2020: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2019: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
|
Keywords | Cross-Modal Correlation / Cross-Modal Embedding / cross-modal embedding / zero-shot / cross-modal retrieval |
Outline of Research at the Start |
Considerable effort has been devoted to learning cross-modal correlations between data in different modalities. However, existing cross-modal embedding models usually do not work well when the query or the database contains new data from unknown categories. To solve this problem, this project aims to develop zero-shot cross-modal embedding learning algorithms along the following lines: (i) compute modality-invariant embeddings; (ii) predict unknown categories based on external knowledge describing their correlation with known categories; and (iii) apply adversarial learning to enhance system performance. An illustrative sketch of this setup is given below.
|
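As a rough, non-authoritative illustration of aspects (i) and (ii) only, the following sketch assumes PyTorch and pairs two modality encoders with class embeddings derived from external knowledge (for example, word vectors of category names). All module names, layer sizes, and the hinge loss are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch only (assumed PyTorch; not the project's code): two
# modality encoders map into a shared space, and categories (including
# unseen ones) are predicted by matching against externally derived
# class embeddings, e.g. word vectors of the category names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps one modality's features into the shared embedding space."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        # L2-normalize so cosine similarity becomes a dot product.
        return F.normalize(self.net(x), dim=-1)

class ZeroShotCrossModalModel(nn.Module):
    def __init__(self, audio_dim, visual_dim, class_word_vectors, emb_dim=256):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, emb_dim)
        self.visual_enc = ModalityEncoder(visual_dim, emb_dim)
        # External knowledge: word embeddings of all category names
        # (known and unknown), projected into the shared space.
        self.register_buffer("class_vectors", class_word_vectors)
        self.class_proj = nn.Linear(class_word_vectors.size(1), emb_dim)

    def class_embeddings(self):
        return F.normalize(self.class_proj(self.class_vectors), dim=-1)

    def forward(self, audio, visual):
        return self.audio_enc(audio), self.visual_enc(visual)

    def predict_category(self, emb):
        # Zero-shot prediction: nearest class embedding, which may belong
        # to a category never seen during embedding training.
        return (emb @ self.class_embeddings().t()).argmax(dim=-1)

def alignment_loss(a_emb, v_emb, margin=0.2):
    """Hinge loss pulling paired audio/visual embeddings together and
    pushing mismatched pairs in the batch apart (modality invariance)."""
    sim = a_emb @ v_emb.t()                # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)          # matched pairs sit on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return F.relu(margin + sim - pos).masked_fill(mask, 0.0).mean()
```

Aspect (iii), adversarial learning, would typically add a modality discriminator on top of the shared embeddings; it is omitted from this sketch for brevity.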
Outline of Final Research Achievements |
This project focused on cross-modal embedding learning for cross-modal retrieval. The main challenge is learning joint embeddings in a shared subspace so that similarity can be computed across different modalities. 1) We proposed a novel deep triplet neural network with cluster canonical correlation analysis (TNN-C-CCA), an end-to-end supervised learning architecture with an audio branch and a video branch. 2) We proposed a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval that learns paired audio-visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio-visual information. 3) We proposed an unsupervised generative adversarial alignment representation (UGAAR) model to learn deep discriminative representations shared across three major musical modalities, namely sheet music, lyrics, and audio, in which a deep neural network architecture with three branches is trained jointly. An illustrative sketch of the paired-VAE idea is given below.
|
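As a rough illustration of item 2) only, the following sketch, assuming PyTorch, shows a paired variational autoencoder in which each modality branch has its own VAE and the paired latents are constrained to stay close so that retrieval can compare them in one space. The layer sizes, the simple MSE pairing constraint, and all names are assumptions for illustration, not the published architecture.

```python
# Minimal paired audio-visual VAE sketch (assumed PyTorch; not the
# project's published model): per-branch reconstruction + KL terms plus
# a constraint that paired latent means stay close across modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchVAE(nn.Module):
    """One modality branch: encode to a Gaussian latent, decode back."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

def paired_av_loss(audio_vae, video_vae, audio, video, corr_weight=1.0):
    """Per-branch VAE losses plus a cross-modal closeness constraint on
    the paired latent means, yielding a shared retrieval space."""
    a_recon, a_mu, a_logvar = audio_vae(audio)
    v_recon, v_mu, v_logvar = video_vae(video)
    corr = F.mse_loss(a_mu, v_mu)   # simple stand-in for the correlation constraint
    return (vae_loss(audio, a_recon, a_mu, a_logvar)
            + vae_loss(video, v_recon, v_mu, v_logvar)
            + corr_weight * corr)
```

The published models additionally use category-level constraints (for example, cluster CCA in TNN-C-CCA and category correlation embeddings in the VAE model), which this sketch omits.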
Academic Significance and Societal Importance of the Research Achievements |
The distributions of data in different modalities are inconsistent, which makes it difficult to measure similarity across modalities directly. The proposed cross-modal embedding learning techniques can help improve the performance of cross-modal retrieval, recognition, and generation.
|