2020 Fiscal Year Research-status Report
A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction
Project/Area Number | 20K19833
Research Institution | Institute of Physical and Chemical Research
Principal Investigator | Nugraha Aditya (Arie), RIKEN Center for Advanced Intelligence Project, Special Postdoctoral Researcher (60858025)
Project Period (FY) | 2020-04-01 – 2023-03-31
Keywords | deep speech model / deep generative model / latent variable model / variational autoencoder / normalizing flow
Outline of Annual Research Achievements
This study aims to formulate a probabilistic computational model of audio-visual scene understanding. In FY2020, as planned, we developed a model that generates time-varying speech from different latent variables: (1) a time-invariant speaker label, encoding the voice characteristics; (2) a time-invariant phone label, conveying the linguistic content; and (3) additional time-varying latent variables, capturing other unspecified aspects of speech. The generated speech sounded natural given a good set of latent variables, and voice conversion could be performed by modifying the speaker label. The model also worked well for speech separation, outperforming a classical model that uses no phone or speaker information. In FY2021, we will focus on the visual part by developing a generative model for lip movement.
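The conditioning scheme described above can be sketched in code. The following is a minimal illustrative decoder, not the actual model: the dimensions, weights, and the `decode` function are hypothetical, and a real system would use a trained deep network rather than random weights. It shows how a time-invariant speaker label and phone label are shared across all frames while a time-varying latent differs per frame, and how swapping only the speaker label realizes voice conversion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
N_SPK, N_PHN, D_Z, D_HID, D_FFT = 4, 10, 8, 32, 64
T = 20  # number of time frames

# Randomly initialized decoder weights (a trained model would learn these).
W_in = rng.standard_normal((N_SPK + N_PHN + D_Z, D_HID)) * 0.1
W_out = rng.standard_normal((D_HID, D_FFT)) * 0.1

def decode(spk_id, phn_id, z):
    """Map a time-invariant speaker label, a time-invariant phone label,
    and time-varying latents z of shape [T, D_Z] to a nonnegative
    power-spectrogram-like output of shape [T, D_FFT]."""
    spk = np.eye(N_SPK)[spk_id]        # one-hot, shared by all frames
    phn = np.eye(N_PHN)[phn_id]        # one-hot, shared by all frames
    cond = np.concatenate([spk, phn])  # time-invariant conditioning vector
    x = np.concatenate([np.tile(cond, (z.shape[0], 1)), z], axis=1)
    h = np.tanh(x @ W_in)              # hidden layer
    return np.exp(h @ W_out)           # exp keeps the output nonnegative

z = rng.standard_normal((T, D_Z))
spec_a = decode(spk_id=0, phn_id=3, z=z)

# "Voice conversion" in this scheme: keep the phone label and the
# time-varying latents fixed, and change only the speaker label.
spec_b = decode(spk_id=1, phn_id=3, z=z)

print(spec_a.shape)  # (20, 64)
```

Keeping the nonnegative output lets it act as the variance of a complex Gaussian speech model, which is the usual way such decoders are plugged into speech enhancement and separation back ends.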
Current Status of Research Progress
3: Progress in research has been slightly delayed.
Reason
For FY2020, we originally planned to work on speech generative modeling and to start working on lip-movement generative modeling. However, we focused too heavily on developing and improving the speech model, causing the development of the lip-movement model to fall behind schedule. In addition to the phone- and speaker-aware speech model based on the widely used variational autoencoder (VAE), we also developed another model based on a combination of the VAE and a normalizing flow (NF). We showed that this novel model represents and reproduces speech harmonics better and improves a speech enhancement system that utilizes it. We also developed a novel NF-based independent vector analysis for source separation. Our experience with NFs will be valuable for the development of the lip-movement model.
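The defining property that makes NFs attractive for generative modeling is exact invertibility with a tractable Jacobian determinant. The sketch below is a generic affine coupling layer, the standard NF building block; it is not the model developed in this project, and all dimensions and weights are illustrative. The first half of the features is passed through unchanged and conditions an affine transform of the second half, so the inverse and the log-determinant come out in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6  # feature dimension; the first half conditions the second half

# Randomly initialized coupling networks (trained in a real model).
Ws = rng.standard_normal((D // 2, D // 2)) * 0.1
Wt = rng.standard_normal((D // 2, D // 2)) * 0.1

def forward(x):
    """Affine coupling: y1 = x1; y2 = x2 * exp(s(x1)) + t(x1).
    Returns y and the exact log|det Jacobian|, which is just sum(s)."""
    x1, x2 = x[: D // 2], x[D // 2 :]
    s, t = np.tanh(x1 @ Ws), x1 @ Wt
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2]), s.sum()

def inverse(y):
    """Exact inverse: recompute s, t from the unchanged half y1."""
    y1, y2 = y[: D // 2], y[D // 2 :]
    s, t = np.tanh(y1 @ Ws), y1 @ Wt
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

x = rng.standard_normal(D)
y, logdet = forward(x)
print(np.allclose(inverse(y), x))  # True: the flow is exactly invertible
```

Stacking several such layers (permuting which half is conditioned on) yields an expressive yet exactly invertible density model, which is what lets NFs be combined with a VAE or used inside independent vector analysis.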
Strategy for Future Research Activity
In FY2021, we will focus on developing a generative model for lip movement. Since the research proposal was written, other researchers have published work on audio-visual speech enhancement systems that exploit lip-movement images. We believe these recent works will accelerate the progress of our study. However, since this research area evolves very quickly, we may need to adjust our research objectives to keep our work novel and original.
Causes of Carryover
Because of the global COVID-19 pandemic, international and domestic conferences were held online, so most of the travel budget remained unused. We decided not to spend it hastily on other purposes but to preserve it for possible critical needs in the future.
Research Products
(6 results)