2022 Fiscal Year Final Research Report
A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction
Project/Area Number | 20K19833 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 61010: Perceptual information processing-related |
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator | NUGRAHA Aditya Arie, RIKEN Center for Advanced Intelligence Project, Research Scientist (60858025) |
Project Period (FY) | 2020-04-01 – 2023-03-31 |
Keywords | Audio-visual processing / Smart glasses / Adaptive system / Blind source separation / Speech enhancement / Speech recognition / Neural spatial model / Generative model |
Outline of Final Research Achievements |
We aimed to develop a probabilistic computational model of audio-visual information processing for understanding human verbal communication. We proposed a model that generates speech signals from speaker labels, which control the voice characteristics, and phone labels, which control the speech content; applied to speech enhancement, it can potentially improve not only the signal quality but also the speech intelligibility. We also introduced principled time-varying extensions of time-invariant blind source separation (BSS) methods, including the classical independent vector analysis (IVA) and the state-of-the-art FastMNMF, based on a deep generative model called a normalizing flow. Finally, we developed adaptive audio-visual speech enhancement with augmented-reality smart glasses: camera images allow speakers of interest to be identified to control direction-aware enhancement, and we achieved robust low-latency enhancement via fast environment-sensitive beamforming governed by slow environment-agnostic BSS.
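The two-timescale design mentioned above (slow, environment-agnostic estimation governing fast, per-frame filtering) can be illustrated with a minimal NumPy sketch. This is not the project's implementation: it assumes an MVDR-style beamformer, a known steering vector for the speaker of interest (in practice supplied by the camera-based direction estimation), and a background process that refreshes the spatial statistics; the function names and the refresh schedule are illustrative.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights: w = R^{-1} d / (d^H R^{-1} d)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

def enhance(frames, steering, noise_cov, alpha=0.05, refresh=16):
    """Hypothetical two-timescale loop for one frequency bin.

    Each multichannel STFT frame is filtered immediately with the
    current weights (fast, low latency), while the spatial covariance
    is averaged over time and the weights are recomputed only every
    `refresh` frames (slow, could run in a background process).
    """
    out = []
    w = mvdr_weights(noise_cov, steering)
    for t, x in enumerate(frames):            # x: (n_mics,) complex bin
        out.append(w.conj() @ x)              # fast per-frame filtering
        noise_cov = (1 - alpha) * noise_cov + alpha * np.outer(x, x.conj())
        if (t + 1) % refresh == 0:            # slow weight refresh
            w = mvdr_weights(noise_cov, steering)
    return np.array(out)
```

Averaging the covariance over all frames, as done here for brevity, actually yields an MPDR-style filter; a real system would estimate noise-only statistics, e.g. from the BSS output, before inverting.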
Free Research Field | Audio-visual speech enhancement for smart glasses |
Academic Significance and Societal Importance of the Research Achievements |
One key achievement is a prototype of adaptive speech enhancement for real-time speech transcription with head-worn smart glasses, which involves challenging egocentric information processing with non-stationary sensors. This technology may benefit older adults and people with hearing impairment.