2021 Fiscal Year Research-status Report
A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction
Project/Area Number | 20K19833 |
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator | Nugraha Aditya (Arie), RIKEN, Center for Advanced Intelligence Project, Special Researcher (60858025) |
Project Period (FY) | 2020-04-01 – 2023-03-31 |
Keywords | Audio-visual processing / Blind source separation / Speech enhancement / Dereverberation / Deep spatial model / Normalizing flow |
Outline of Annual Research Achievements
This study aims to formulate a probabilistic computational model of audio-visual information processing. We build upon the probabilistic local Gaussian model for multichannel audio signals, which is parameterized by spectral parameters portraying the source characteristics and spatial parameters representing the source and microphone locations in an environment. In FY2020, we developed multiple deep spectral models, including one based on speaker and phone disentanglement. In FY2021, we continued developing deep spatial models based on the normalizing flow while integrating our state-of-the-art joint diagonalization techniques for spatial covariance matrices. Most importantly, we started to incorporate visual aspects into our audio-visual information processing. Although we originally planned to model the lip movement specifically, we decided to shift our focus to tackling issues in real-world scenarios, where lip movement detection is often unreliable, e.g., due to low-quality images, and thus the typical lip-movement-informed speech enhancement is not possible. In contrast, human faces or bodies can still be detected relatively easily. We regard the detected humans as possible speaker locations that govern the spatial parameter optimization of an audio source separation or speech enhancement technique. Building upon this visually-informed audio signal processing, in FY2022, we will attempt to explore more visual aspects and develop techniques that exploit both audio and visual data in a mutually dependent manner.
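For reference, the local Gaussian model referred to above can be summarized as follows; this is an illustrative formulation in notation chosen here (v_{nft} for the source power spectral densities, R_{nf} for the spatial covariance matrices), not necessarily that of the published papers:

\[
\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{c}_{nft}, \qquad
\mathbf{c}_{nft} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0},\; v_{nft}\,\mathbf{R}_{nf}\right),
\]

where \(\mathbf{x}_{ft}\) is the observed multichannel mixture at frequency f and time t, \(\mathbf{c}_{nft}\) is the spatial image of source n, \(v_{nft} \ge 0\) is its spectral parameter, and \(\mathbf{R}_{nf}\) is its spatial covariance matrix. The deep spectral and spatial models parameterize \(v_{nft}\) and \(\mathbf{R}_{nf}\) with neural networks, and the joint diagonalization constraint, roughly \(\mathbf{R}_{nf} \approx \mathbf{Q}_f^{-1}\,\mathrm{diag}(\mathbf{g}_{nf})\,\mathbf{Q}_f^{-\mathsf{H}}\) with a matrix \(\mathbf{Q}_f\) shared by all sources, keeps the parameter updates computationally tractable.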
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
In FY2021, following the research implementation plan, we worked on the visual aspect of audio-visual information processing. We originally planned to model the lip movement, one of the prominent visual aspects in documenting human social interaction. However, learning from multiple recently published works on lip-movement-informed speech enhancement, we decided to shift our focus to tackling issues in real-world scenarios where lip movement detection is unreliable, e.g., because the target speaker is too far from the camera or the image resolution is too low due to hardware limitations of the camera. In such cases, instead of lip movement, human faces or bodies can still be detected in the camera images to inform an audio source separation technique about the possible locations of the target speakers. The multichannel audio signals are assumed to follow the probabilistic local Gaussian model parameterized by spectral parameters portraying the source characteristics and spatial parameters representing the relative source and microphone locations in an environment. The detected possible speaker locations can then be exploited to govern the spatial parameter optimization, as sketched below. Unfortunately, the publication list does not yet reflect this study on visually-informed audio source separation because multiple papers are currently under review. Additionally, we continued modeling the spectral and spatial parameters with deep neural networks while integrating our state-of-the-art joint diagonalization techniques for spatial covariance matrices.
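The following minimal sketch illustrates how detected speaker positions could govern the spatial parameter optimization. It assumes a known linear microphone array, far-field plane-wave propagation, and a hypothetical face/body detector whose outputs have been mapped to azimuth angles in the array coordinates; it initializes per-source spatial covariance matrices from steering vectors before a standard iterative refinement, and is only a sketch of the general idea, not the exact implementation in the papers under review.

import numpy as np

def steering_vector(azimuth_rad, freqs_hz, mic_pos_m, c=343.0):
    """Far-field steering vectors for a linear array, shape (F, M), plane-wave assumption."""
    delays = mic_pos_m * np.sin(azimuth_rad) / c               # (M,) per-microphone delays
    return np.exp(-2j * np.pi * freqs_hz[:, None] * delays[None, :])

def init_spatial_covariances(azimuths_rad, freqs_hz, mic_pos_m, eps=1e-3):
    """Initialize one spatial covariance matrix per detected speaker and frequency.

    azimuths_rad : azimuths of the speakers detected in the camera images
                   (hypothetical detector output). Returns an array (N, F, M, M).
    """
    M = len(mic_pos_m)
    scms = []
    for az in azimuths_rad:
        a = steering_vector(az, freqs_hz, mic_pos_m)           # (F, M)
        rank1 = a[:, :, None] * a[:, None, :].conj()            # (F, M, M) rank-1 model
        scms.append(rank1 + eps * np.eye(M))                    # regularize to full rank
    return np.stack(scms)                                       # (N, F, M, M)

# Example: two speakers detected at -30 and +45 degrees, 4-microphone linear array.
freqs = np.linspace(0.0, 8000.0, 513)
mics = np.array([0.00, 0.05, 0.10, 0.15])
R_init = init_spatial_covariances(np.deg2rad([-30.0, 45.0]), freqs, mics)
print(R_init.shape)  # (2, 513, 4, 4)

The resulting matrices would serve only as an informed starting point; the spatial parameters are then optimized jointly with the spectral parameters under the local Gaussian model.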
Strategy for Future Research Activity
In FY2022, we will explore more visual aspects, possibly including the lip movement as planned earlier, to be incorporated into our audio-visual information processing. Although lip movements might not be sufficient for lip reading in low-resolution images, they might still be useful for speech activity detection, as sketched below. Building upon our visually-informed audio signal processing, we will attempt to develop signal processing techniques that exploit both audio and visual data in a mutually dependent manner. It is also important that the techniques be applicable to real environments, where the noise signals are mostly non-stationary and the target speakers may move; to be effective, the techniques must therefore be versatile. As always, since the research field evolves rapidly, we might need to adjust our research objectives to keep working on something novel and original.
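As a simple illustration of how lip motion could aid speech activity detection even when lip reading is infeasible, the sketch below converts mouth-region motion energy into a soft per-frame activity score. It is hypothetical: it assumes per-frame mouth-region crops are already provided by a face landmark detector and uses frame differencing merely as a stand-in for a learned visual model.

import numpy as np

def visual_activity_scores(mouth_crops, smooth=5):
    """Soft speech-activity scores in [0, 1] from mouth-region motion energy.

    mouth_crops : array (T, H, W), grayscale crops of the mouth region
                  (hypothetical output of a face landmark detector).
    """
    diffs = np.abs(np.diff(mouth_crops.astype(np.float32), axis=0)).mean(axis=(1, 2))
    diffs = np.concatenate([[diffs[0]], diffs])                 # keep length T
    kernel = np.ones(smooth) / smooth
    energy = np.convolve(diffs, kernel, mode="same")            # temporal smoothing
    lo, hi = np.percentile(energy, [5, 95])
    return np.clip((energy - lo) / (hi - lo + 1e-8), 0.0, 1.0)

Such scores could, for instance, softly gate the spectral parameters of the corresponding source during the parameter optimization, which is one of the audio-visual integration strategies to be examined.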
Causes of Carryover
Because of the global COVID-19 pandemic, international and domestic conferences are still mostly held online, so most of the travel budget remains unused. We decided not to spend it hastily on other purposes but to preserve it for possible critical needs in the future. Considering the improving pandemic situation and the easing of travel restrictions, we expect more opportunities to participate in on-site conferences soon.
Research Products
(9 results)