Project/Area Number | 19K11987 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Section | General |
Review Section | Basic Section 60080: Database-related |
Research Institution | National Institute of Informatics |
Principal Investigator | Yu Yi (National Institute of Informatics, Digital Content and Media Sciences Research Division, Specially Appointed Assistant Professor) (00754681) |
Project Period (FY) | 2019-04-01 – 2022-03-31 |
Project Status | Completed (Fiscal Year 2021) |
Budget Amount | ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2021: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2020: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2019: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
|
Keywords | Cross-Modal Correlation / Cross-Modal Embedding / cross-modal embedding / zero-shot / cross-modal retrieval |
Outline of Research at the Start |
Considerable effort has been devoted to learning cross-modal correlations between data in different modalities. However, existing cross-modal embedding models usually do not work well when the query or the database contains new data from unknown categories. To solve this problem, this project aims to develop zero-shot cross-modal embedding learning algorithms along the following lines: (i) compute modality-invariant embeddings; (ii) predict unknown categories based on external knowledge describing their correlation with known categories; and (iii) apply adversarial learning to enhance system performance. An illustrative sketch of this setup is given below.
|
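As a rough, non-authoritative illustration of aspects (i) and (ii) only, the following sketch assumes PyTorch and pairs two modality encoders with class embeddings derived from external knowledge (for example, word vectors of category names). All module names, layer sizes, and the hinge loss are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch only (assumed PyTorch; not the project's code): two
# modality encoders map into a shared space, and categories (including
# unseen ones) are predicted by matching against externally derived
# class embeddings, e.g. word vectors of the category names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps one modality's features into the shared embedding space."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        # L2-normalize so cosine similarity becomes a dot product.
        return F.normalize(self.net(x), dim=-1)

class ZeroShotCrossModalModel(nn.Module):
    def __init__(self, audio_dim, visual_dim, class_word_vectors, emb_dim=256):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, emb_dim)
        self.visual_enc = ModalityEncoder(visual_dim, emb_dim)
        # External knowledge: word embeddings of all category names
        # (known and unknown), projected into the shared space.
        self.register_buffer("class_vectors", class_word_vectors)
        self.class_proj = nn.Linear(class_word_vectors.size(1), emb_dim)

    def class_embeddings(self):
        return F.normalize(self.class_proj(self.class_vectors), dim=-1)

    def forward(self, audio, visual):
        return self.audio_enc(audio), self.visual_enc(visual)

    def predict_category(self, emb):
        # Zero-shot prediction: nearest class embedding, which may belong
        # to a category never seen during embedding training.
        return (emb @ self.class_embeddings().t()).argmax(dim=-1)

def alignment_loss(a_emb, v_emb, margin=0.2):
    """Hinge loss pulling paired audio/visual embeddings together and
    pushing mismatched pairs in the batch apart (modality invariance)."""
    sim = a_emb @ v_emb.t()                # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)          # matched pairs sit on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return F.relu(margin + sim - pos).masked_fill(mask, 0.0).mean()
```

Aspect (iii), adversarial learning, would typically add a modality discriminator on top of the shared embeddings; it is omitted from this sketch for brevity.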
Outline of Final Research Achievements |
This project focused on cross-modal embedding learning for cross-modal retrieval. The main challenge is learning joint embeddings in a shared subspace so that similarity can be computed across different modalities. 1) We proposed a novel deep triplet neural network with cluster canonical correlation analysis (TNN-C-CCA), an end-to-end supervised learning architecture with an audio branch and a video branch. 2) We proposed a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval that learns paired audio-visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio-visual information. 3) We proposed an unsupervised generative adversarial alignment representation (UGAAR) model to learn deep discriminative representations shared across three major musical modalities, namely sheet music, lyrics, and audio, in which a deep neural network architecture with three branches is trained jointly. An illustrative sketch of the paired-VAE idea is given below.
|
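As a rough illustration of item 2) only, the following sketch, assuming PyTorch, shows a paired variational autoencoder in which each modality branch has its own VAE and the paired latents are constrained to stay close so that retrieval can compare them in one space. The layer sizes, the simple MSE pairing constraint, and all names are assumptions for illustration, not the published architecture.

```python
# Minimal paired audio-visual VAE sketch (assumed PyTorch; not the
# project's published model): per-branch reconstruction + KL terms plus
# a constraint that paired latent means stay close across modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchVAE(nn.Module):
    """One modality branch: encode to a Gaussian latent, decode back."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

def paired_av_loss(audio_vae, video_vae, audio, video, corr_weight=1.0):
    """Per-branch VAE losses plus a cross-modal closeness constraint on
    the paired latent means, yielding a shared retrieval space."""
    a_recon, a_mu, a_logvar = audio_vae(audio)
    v_recon, v_mu, v_logvar = video_vae(video)
    corr = F.mse_loss(a_mu, v_mu)   # simple stand-in for the correlation constraint
    return (vae_loss(audio, a_recon, a_mu, a_logvar)
            + vae_loss(video, v_recon, v_mu, v_logvar)
            + corr_weight * corr)
```

The published models additionally use category-level constraints (for example, cluster CCA in TNN-C-CCA and category correlation embeddings in the VAE model), which this sketch omits.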
Academic Significance and Societal Importance of the Research Achievements |
The distributions of data in different modalities are inconsistent, which makes it difficult to measure similarity across modalities directly. The proposed cross-modal embedding learning techniques can help improve the performance of cross-modal retrieval, recognition, and generation.
|