Outline of Annual Research Achievements
This study aims to formulate a probabilistic computational model of audio-visual information processing for understanding verbal communication in human social interactions. The model builds on the probabilistic local Gaussian model of a multichannel audio signal, which uses spectral parameters to characterize each source and spatial parameters to represent the locations of sources and sensors in the environment. We initially assumed high-resolution recordings made with stationary sensors (microphones and cameras), as in round-table discussion scenarios, but shifted our focus to recordings captured by the non-stationary sensors of head-worn smart glasses. In real-world scenarios, users naturally move their heads and bodies, especially when interacting with multiple people in a group, so handling such recordings requires a highly adaptive system that is robust to noise and reverberation.

In FY2020, we developed several deep spectral models, including one based on speaker and phone disentanglement. In FY2021, we worked on deep spatial models, including one that integrates a normalizing flow with state-of-the-art joint diagonalization techniques for spatial covariance matrices, and began incorporating visual cues into our audio-visual information processing. In FY2022, we continued working on adaptive, visually informed audio signal processing, in which probable speaker locations guide the optimization of the spatial parameters for audio source separation and speech enhancement aimed at speech recognition.
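For reference, the sketch below gives the standard local Gaussian formulation commonly used in multichannel source separation, which the spectral/spatial parameterization mentioned above follows; the notation (number of sources J, power spectral densities v_{j,ft}, spatial covariance matrices R_{j,f}) is generic and chosen here for illustration, not taken from the project's publications.

```latex
% Local Gaussian model of an M-channel mixture x_{ft} at frequency f and time frame t.
% Each source image c_{j,ft} is zero-mean complex Gaussian whose covariance factorizes into
% a scalar power spectral density v_{j,ft} (spectral parameter) and an M-by-M spatial
% covariance matrix R_{j,f} (spatial parameter).
\begin{align}
  \mathbf{x}_{ft} &= \sum_{j=1}^{J} \mathbf{c}_{j,ft},
  &
  \mathbf{c}_{j,ft} &\sim \mathcal{N}_{\mathbb{C}}\!\bigl(\mathbf{0},\, v_{j,ft}\,\mathbf{R}_{j,f}\bigr),
  \\
  % By additivity of independent Gaussians, the observed mixture is itself Gaussian,
  % so the parameters can be estimated by maximizing the mixture likelihood.
  \mathbf{x}_{ft} &\sim \mathcal{N}_{\mathbb{C}}\!\Bigl(\mathbf{0},\; \textstyle\sum_{j=1}^{J} v_{j,ft}\,\mathbf{R}_{j,f}\Bigr).
\end{align}
```

In this picture, the deep spectral models developed in FY2020 correspond to learned priors on v_{j,ft}, while the deep spatial models and joint diagonalization work in FY2021 concern the estimation of R_{j,f}; visual information about speaker locations constrains the latter.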