A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction
Project/Area Number | 20K19833
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | Institute of Physical and Chemical Research
Principal Investigator | NUGRAHA Aditya Arie, RIKEN Center for Advanced Intelligence Project, Researcher (60858025)
Project Period (FY) | 2020-04-01 – 2023-03-31
Project Status | Completed (Fiscal Year 2022)
Budget Amount | ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2022: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2021: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
Fiscal Year 2020: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Keywords | Audio-visual processing / Smart glasses / Adaptive system / Blind source separation / Speech enhancement / Speech recognition / Neural spatial model / Generative model / Normalizing flow / Dereverberation / Deep spatial model / Deep speech model / Deep generative model / Latent variable model / Variational autoencoder / Probabilistic model / Speaker diarization
Outline of Research at the Start |
We aim to build a unified computational model of audio-visual scene understanding that mimics the human ability to exploit audio and visual cues. We expect the model to improve front-end processes (e.g., speech enhancement) and back-end processes (e.g., speech recognition) in a mutually beneficial manner.
Outline of Final Research Achievements |
We aimed at a probabilistic computational model of audio-visual information processing for understanding human verbal communication. We proposed a model that generates speech signals from speaker labels controlling the voice characteristics and phone labels controlling the speech content; when used for speech enhancement, it can potentially improve not only the signal quality but also the speech intelligibility. We also introduced principled time-varying extensions of time-invariant blind source separation (BSS) methods, including the classical independent vector analysis (IVA) and the state-of-the-art FastMNMF, based on normalizing flows, a class of deep generative models. Finally, we developed adaptive audio-visual speech enhancement with augmented-reality smart glasses: camera images allow the speakers of interest to be identified, which controls the direction-aware enhancement, and robust low-latency enhancement is achieved by a fast environment-sensitive beamformer governed by a slow environment-agnostic BSS.
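To make the label-conditioned speech generation concrete, below is a minimal, hypothetical PyTorch sketch of a decoder that produces log-power spectrogram frames conditioned on a speaker label (voice characteristics) and frame-wise phone labels (speech content), in the spirit of a conditional deep generative speech model. All names, label vocabularies, and layer sizes (N_SPEAKERS, N_PHONES, LATENT_DIM, N_FREQ) are illustrative assumptions, not the implementation developed in this project.

# Hypothetical sketch of a label-conditioned deep speech model (not the
# authors' implementation). Sizes and names are assumptions for illustration.
import torch
import torch.nn as nn

N_SPEAKERS, N_PHONES = 100, 40   # assumed label vocabularies
LATENT_DIM, N_FREQ = 16, 257     # assumed latent size and number of STFT bins

class ConditionalSpeechDecoder(nn.Module):
    """Maps (latent frame, speaker label, phone label) to a log-power spectrum."""
    def __init__(self):
        super().__init__()
        self.spk_emb = nn.Embedding(N_SPEAKERS, 32)   # voice characteristics
        self.phn_emb = nn.Embedding(N_PHONES, 32)     # speech content
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 32 + 32, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, N_FREQ),                    # log-power per frequency bin
        )

    def forward(self, z, speaker, phone):
        # z: (T, LATENT_DIM); speaker, phone: (T,) integer labels per frame
        h = torch.cat([z, self.spk_emb(speaker), self.phn_emb(phone)], dim=-1)
        return self.net(h)

# Usage: draw latent frames and generate a spectrogram for one speaker and
# a phone sequence.
decoder = ConditionalSpeechDecoder()
T = 50
z = torch.randn(T, LATENT_DIM)
speaker = torch.full((T,), 3, dtype=torch.long)
phones = torch.randint(0, N_PHONES, (T,))
log_power = decoder(z, speaker, phones)   # shape (T, N_FREQ)

In a probabilistic enhancement setting, such a decoder would act as a learned speech prior whose latent variables and labels are inferred jointly with the separation parameters.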
Academic Significance and Societal Importance of the Research Achievements |
One key achievement is a prototype of adaptive speech enhancement for real-time speech transcription with head-worn smart glasses, which involves challenging egocentric information processing with non-stationary sensors. This technology may benefit older adults and people with hearing impairment.
Report (4 results)
Research Products (24 results)