2022 Fiscal Year Final Research Report
A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction
Project/Area Number | 20K19833 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 61010: Perceptual information processing-related |
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator | NUGRAHA Aditya Arie, RIKEN Center for Advanced Intelligence Project, Research Scientist (60858025) |
Project Period (FY) | 2020-04-01 – 2023-03-31 |
Keywords | Audio-visual processing / Smart glasses / Adaptive system / Blind source separation / Speech enhancement / Speech recognition / Neural spatial model / Generative model |
Outline of Final Research Achievements |
We aimed to develop a probabilistic computational model of audio-visual information processing for understanding human verbal communication. We proposed a model that generates speech signals from speaker labels, which control the voice characteristics, and phone labels, which control the speech content; applied to speech enhancement, it can potentially improve not only the signal quality but also the speech intelligibility. We also introduced principled time-varying extensions of time-invariant blind source separation (BSS) methods, including the classical independent vector analysis (IVA) and the state-of-the-art FastMNMF, based on a deep generative model called a normalizing flow. Finally, we developed adaptive audio-visual speech enhancement with augmented-reality smart glasses: camera images allow speakers of interest to be identified to control direction-aware enhancement, and we achieved robust low-latency enhancement via fast environment-sensitive beamforming governed by slow environment-agnostic BSS.
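The two-timescale design mentioned above (slow, environment-agnostic estimation governing fast, per-frame filtering) can be illustrated with a minimal NumPy sketch. This is not the project's implementation: it assumes an MVDR-style beamformer, a known steering vector for the speaker of interest (in practice supplied by the camera-based direction estimation), and a background process that refreshes the spatial statistics; the function names and the refresh schedule are illustrative.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights: w = R^{-1} d / (d^H R^{-1} d)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

def enhance(frames, steering, noise_cov, alpha=0.05, refresh=16):
    """Hypothetical two-timescale loop for one frequency bin.

    Each multichannel STFT frame is filtered immediately with the
    current weights (fast, low latency), while the spatial covariance
    is averaged over time and the weights are recomputed only every
    `refresh` frames (slow, could run in a background process).
    """
    out = []
    w = mvdr_weights(noise_cov, steering)
    for t, x in enumerate(frames):            # x: (n_mics,) complex bin
        out.append(w.conj() @ x)              # fast per-frame filtering
        noise_cov = (1 - alpha) * noise_cov + alpha * np.outer(x, x.conj())
        if (t + 1) % refresh == 0:            # slow weight refresh
            w = mvdr_weights(noise_cov, steering)
    return np.array(out)
```

Averaging the covariance over all frames, as done here for brevity, actually yields an MPDR-style filter; a real system would estimate noise-only statistics, e.g. from the BSS output, before inverting.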
Free Research Field | Audio-visual speech enhancement for smart glasses |
Academic Significance and Societal Importance of the Research Achievements |
One key achievement is a prototype of adaptive speech enhancement for real-time speech transcription with head-worn smart glasses, which involves challenging egocentric information processing with non-stationary sensors. This technology may benefit older adults and people with hearing impairment.