
User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild

Research Project

Project/Area Number 23K16912
Research Category

Grant-in-Aid for Early-Career Scientists

Allocation Type Multi-year Fund
Review Section Basic Section 61010: Perceptual information processing-related
Research Institution Institute of Physical and Chemical Research

Principal Investigator

Nugraha Aditya Arie  RIKEN (Institute of Physical and Chemical Research), Center for Advanced Intelligence Project, Research Scientist (60858025)

Project Period (FY) 2023-04-01 – 2026-03-31
Project Status Granted (Fiscal Year 2023)
Budget Amount
¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
Fiscal Year 2025: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2024: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2023: ¥1,950,000 (Direct Cost: ¥1,500,000, Indirect Cost: ¥450,000)
Keywords Audio-visual processing / Augmented reality / Smart glasses / Audio source separation / Speech enhancement / Dereverberation / Auditory perception / Speech recognition / Non-stationary sensors / Hearing aids
Outline of Research at the Start

This study aims to build a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data (scientific contribution) and to apply it to hearing aids using wearable augmented reality (AR) smart glasses in the wild (practical contribution).

Outline of Annual Research Achievements

This study aims to formulate a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data. We intend to apply this formulation to hearing aids using wearable augmented reality (AR) smart glasses equipped with multiple microphones and cameras. The system's ability to help a user with hearing loss perceive the surroundings will rely on audio-visual information processing, especially audio source separation or speech enhancement technology. As the system's key feature, the processing should work well without any user involvement but should also be able to take user controls into account. In FY2023, we focused on advancing audio source separation and speech enhancement approaches. We published works on (1) time-domain audio source separation based on a probabilistic deep generative model, (2) a probabilistically motivated unsupervised training approach for source separation based on deep neural networks (DNNs), and (3) a signal-processing-inspired semi-supervised training approach for DNN-based estimation of source directions. In FY2024, we will continue improving these approaches so that they are robust to different real-life noisy-reverberant conditions. We will also attempt to develop techniques that can take user controls into account.
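
As a rough, hedged illustration of the kind of processing the system builds on (not the project's actual model), the sketch below applies a Wiener-like time-frequency mask to a noisy mixture; in the envisioned pipeline, a deep neural network would estimate such a mask from the observed (multichannel) mixture alone. The function name and the oracle mask computed from a known clean/noise pair are illustrative assumptions.

import numpy as np
from scipy.signal import stft, istft

def enhance_with_oracle_mask(clean, noise, fs=16000, nperseg=512):
    """Enhance the mixture clean + noise with an oracle Wiener-like mask (illustration only)."""
    mixture = clean + noise
    _, _, S = stft(clean, fs, nperseg=nperseg)    # clean-speech spectrogram
    _, _, N = stft(noise, fs, nperseg=nperseg)    # noise spectrogram
    _, _, X = stft(mixture, fs, nperseg=nperseg)  # observed mixture spectrogram
    # Ratio mask in [0, 1]; a DNN would have to estimate this from X only.
    mask = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-10)
    _, enhanced = istft(mask * X, fs, nperseg=nperseg)
    return enhanced[: len(mixture)]

Note that the mask is applied to the mixture spectrogram while reusing the mixture phase, which is precisely the limitation that motivates the time-domain modeling discussed in the progress report below.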

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

In FY2023, following the research implementation plan, we worked on advancing our audio source separation and speech enhancement approaches. We published our work on a time-domain audio source separation approach based on a probabilistic deep generative model. It demonstrates that pure time-domain modeling improves the perceptual quality of the separated speech signals, likely owing to better phase consistency, which is challenging to handle in time-frequency-domain models. This study would benefit applications intended for human listening. In addition, we published works on an unsupervised training approach for source separation based on deep neural networks (DNNs) and a semi-supervised training approach for DNN-based estimation of source directions. These techniques are useful for real-world applications where obtaining the parallel clean-noisy data or annotations required for supervised training is costly. We also worked on room impulse response (RIR) estimation from three-dimensional mesh data of a room obtained using augmented reality (AR) smart glasses. The RIR can be exploited to improve speech enhancement, e.g., via dereverberation, and knowledge of the RIR can also be used to preserve spatial perception in binaural spatialization for hearing aid users. This work is currently under review and thus not listed as an FY2023 research achievement.
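
To make the downstream use of an estimated RIR concrete, here is a minimal, hedged sketch (an illustration with assumed inputs, not the mesh-based estimator under review): convolving a dry source signal with a pair of estimated left/right RIRs reproduces the reverberant binaural signal expected at each ear, which is the basis of RIR-aware spatialization for hearing aid users.

import numpy as np
from scipy.signal import fftconvolve

def binauralize(dry_source, rir_left, rir_right):
    """Spatialize a single-channel dry signal with a pair of (estimated) RIRs of equal length."""
    left = fftconvolve(dry_source, rir_left)    # signal reaching the left ear
    right = fftconvolve(dry_source, rir_right)  # signal reaching the right ear
    return np.stack([left, right], axis=0)      # shape: (2, n_samples)

Dereverberation can be viewed as the corresponding inverse problem: given the reverberant observation and an estimate of the RIR, recover a signal close to the dry source.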

Strategy for Future Research Activity

In FY2024, we will continue improving our audio source separation techniques to make them robust to unseen noisy-reverberant environments while still benefiting from data-driven approaches. This boils down to developing semi-supervised training approaches or adaptation approaches for models pre-trained in a supervised manner. We will further explore the use of additional multimodal data. We have been using three-dimensional mesh data of a room, obtained with the depth sensors of augmented reality (AR) smart glasses, to estimate room impulse responses (RIRs). We could additionally infer the room surface materials from the camera images and use this information to condition the RIR estimation, since the RIR physically depends on the absorption characteristics of the surfaces. We will also attempt to develop techniques that can take user controls into account. However, perceptually related user controls, e.g., for adjusting the trade-off between residual noise and speech distortion, will only become feasible once good speech separation is achieved. As always, since the research field evolves rapidly, we might need to adjust our research objectives to keep working on something novel and original.
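
As a hedged sketch of what such a user control could look like (an assumption for illustration, not the project's design), a single "aggressiveness" knob can re-shape an estimated ratio mask: low values keep more residual noise but introduce less speech distortion, while high values suppress noise more strongly at the risk of distorting speech.

import numpy as np

def apply_user_control(mask, aggressiveness=1.0, floor=0.05):
    """Re-shape a [0, 1] ratio mask according to a user-set aggressiveness value."""
    shaped = np.power(np.clip(mask, 0.0, 1.0), aggressiveness)
    # A small spectral floor limits over-suppression and audible musical noise.
    return np.maximum(shaped, floor)

A user could, for example, lower the value during quiet conversations and raise it in noisy environments, with the default behaving like a standard ratio mask.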

Report (1 result)

  • 2023 Research-status Report

Research Products (5 results)

Presentation (4 results) (of which Int'l Joint Research: 4 results), Remarks (1 result)

  • [Presentation] Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning (2023)

    • Author(s)
      Nugraha Aditya Arie, Di Carlo Diego, Bando Yoshiaki, Fontaine Mathieu, Yoshii Kazuyoshi
    • Organizer
      IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
    • Related Report
      2023 Research-status Report
    • Int'l Joint Research
  • [Presentation] Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation (2023)

    • Author(s)
      Bando Yoshiaki, Masuyama Yoshiki, Nugraha Aditya Arie, Yoshii Kazuyoshi
    • Organizer
      European Signal Processing Conference (EUSIPCO)
    • Related Report
      2023 Research-status Report
    • Int'l Joint Research
  • [Presentation] Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation (2023)

    • Author(s)
      Ali Murtiza, Nugraha Aditya Arie, Nathwani Karan
    • Organizer
      IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
    • Related Report
      2023 Research-status Report
    • Int'l Joint Research
  • [Presentation] Tutorial: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models (2023)

    • Author(s)
      Yoshii Kazuyoshi, Nugraha Aditya Arie, Fontaine Mathieu, Bando Yoshiaki
    • Organizer
      INTERSPEECH
    • Related Report
      2023 Research-status Report
    • Int'l Joint Research
  • [Remarks] Time-Domain Audio Source Separation Based on GPDKL

    • URL

      https://aanugraha.github.io/demo/gpdkl/

    • Related Report
      2023 Research-status Report

Published: 2023-04-13   Modified: 2024-12-25  
