Project/Area Number |
20K19833
|
Research Institution | RIKEN |
Principal Investigator |
Nugraha Aditya (Arie), RIKEN, Center for Advanced Intelligence Project, Postdoctoral Researcher (60858025)
|
Project Period (FY) |
2020-04-01 – 2023-03-31
|
Keywords | deep speech model / deep generative model / latent variable model / variational autoencoder / normalizing flow |
Outline of Research Achievements |
This study aims to formulate a probabilistic computational model of audio-visual scene understanding. In FY2020, as planned, we developed a model that generates time-varying speech from three kinds of latent variables: (1) a time-invariant speaker label, which encodes the voice characteristics, (2) a time-invariant phone label, which conveys the message, and (3) additional time-varying latent variables, which capture other unspecified aspects of speech. The generated speech sounded natural given a good set of latent variables, and voice conversion could be performed by modifying the speaker label. The model also worked well for speech separation, outperforming the classical model that uses no phone or speaker information. In FY2021, we will focus on the visual part by developing a generative model for lip movement.
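As a rough illustration of the structure described above, one possible factorization is sketched here. The notation is an assumption made for this summary, not the project's own: $x_t$ is the speech frame at time $t$, $s$ the speaker label, $c$ the phone label, and $z_t$ the time-varying latent variable.

```latex
p_\theta(x_{1:T}, z_{1:T} \mid s, c)
  = \prod_{t=1}^{T} p_\theta(x_t \mid z_t, s, c)\, p(z_t)
```

Under this sketch, voice conversion would amount to inferring $z_{1:T}$ from a source utterance and regenerating $x_{1:T}$ with a different $s$.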
|
Current Status of Research Progress |
Category
3: Slightly delayed
Reason
For FY2020, we originally planned to work on speech generative modeling and to start working on lip-movement generative modeling. However, we concentrated on developing and improving the speech model, which put the development of the lip-movement model behind schedule. In addition to the phone- and speaker-aware speech model based on the widely used variational autoencoder (VAE), we developed another model that combines a VAE with a normalizing flow (NF). We showed that this novel model represents and produces speech harmonics better and improves a speech enhancement system that uses it. We also developed a novel NF-based independent vector analysis for source separation. Our knowledge of NF should be valuable for developing the lip-movement model.
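For reference, a normalizing flow defines a density through an invertible map. With hypothetical notation $f_\phi$ for the flow and $p_U$ for the base density (both symbols are assumptions introduced here), the standard change-of-variables identity reads:

```latex
\log p_X(x) = \log p_U\bigl(f_\phi(x)\bigr)
  + \log \left| \det \frac{\partial f_\phi(x)}{\partial x} \right|
```

How the flow is combined with the VAE in this project is not specified in this report.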
|
Strategy for Future Research Activity |
In FY2021, we will focus on developing a generative model for lip movement. Since the research proposal was written, other researchers have published work on audio-visual speech enhancement systems that exploit lip-movement images. We believe that these recent works will help accelerate our study. However, since this research area evolves very quickly, we might need to adjust our research objectives to work on something novel and original.
|
Causes of Carryover |
Because of the global COVID-19 pandemic, international and domestic conferences were held online, so most of the travel budget remains unused. We decided not to rush to spend it on other purposes and to preserve it for possible critical needs in the future.
|