Outline of Annual Research Achievements
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval. Unfortunately, little research has focused on cross-modal correlation learning that takes into account the temporal structures of different data modalities such as audio and lyrics. Motivated by the inherently temporal nature of music, we aim to learn cross-modal embeddings between audio, lyrics, video, and sheet music.
We have proposed a deep cross-modal embedding learning architecture involving two-branch deep neural networks for the audio and video modalities. Data from the different modalities are projected into a common canonical space, where inter-modal canonical correlation analysis serves as the objective function for measuring the similarity of temporal structures. We propose a novel triplet neural network with cluster-based canonical correlation analysis (TNN-C-CCA), an end-to-end supervised learning architecture with an audio branch and a video branch. When maximizing the correlation, we consider not only the matching pairs in the common space but also the mismatching pairs. This work makes two significant contributions: (i) constructing a triplet neural network with a triplet loss yields better representations, producing optimal projections that maximize correlation in the shared subspace; (ii) both positive and negative examples are used during training to improve the quality of the embedding learned between audio and video.
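The following is a minimal sketch of the two-branch embedding and triplet-based training step described above, assuming PyTorch. The branch architectures, feature dimensions, margin value, and the use of a plain triplet margin loss on L2-normalized embeddings in place of the cluster-based CCA objective are illustrative assumptions, not the actual TNN-C-CCA implementation.

```python
# Minimal sketch (not the authors' implementation): a two-branch embedding
# network trained with a triplet margin loss. All layer sizes and the margin
# are hypothetical choices for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchEmbedding(nn.Module):
    """Projects audio and video features into a shared embedding space."""

    def __init__(self, audio_dim=128, video_dim=2048, embed_dim=256):
        super().__init__()
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )
        self.video_branch = nn.Sequential(
            nn.Linear(video_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, audio, video):
        # L2-normalize so distances in the shared space are comparable.
        a = F.normalize(self.audio_branch(audio), dim=-1)
        v = F.normalize(self.video_branch(video), dim=-1)
        return a, v


def triplet_step(model, audio, video_pos, video_neg, margin=0.2):
    """One training step: pull matching audio-video pairs together and
    push mismatching pairs apart by at least `margin`."""
    anchor, positive = model(audio, video_pos)
    _, negative = model(audio, video_neg)
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)


if __name__ == "__main__":
    model = TwoBranchEmbedding()
    audio = torch.randn(8, 128)        # batch of audio features
    video_pos = torch.randn(8, 2048)   # matching video features
    video_neg = torch.randn(8, 2048)   # mismatching video features
    print(triplet_step(model, audio, video_pos, video_neg).item())
```

In the actual architecture, the similarity used to maximize correlation between branches is based on cluster-based canonical correlation analysis rather than the simple margin loss shown here.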
Strategy for Future Research Activity
Besides studying cross-modal embedding learning between audio and video, other modalities such as sheet music will also be investigated. Future work will aim to develop novel cross-modal learning algorithms along the following lines: (i) computing modality-invariant embeddings, (ii) using transfer learning techniques to learn stronger correlations, and (iii) applying adversarial learning to enhance system performance.