
FY2023 Research Progress Report

User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild

Research Project

Project/Area Number: 23K16912
Research Institution: RIKEN

Principal Investigator

Nugraha Aditya Arie, RIKEN, Center for Advanced Intelligence Project, Researcher (60858025)

Project Period (FY): 2023-04-01 – 2026-03-31
Keywords: Audio-visual processing / Augmented reality / Smart glasses / Audio source separation / Speech enhancement / Dereverberation / Auditory perception / Speech recognition
Outline of Annual Research Achievements

This study aims to formulate a user-centric computational model of audio-visual scene understanding for first-person, non-stationary sensor data. We intend to apply this formulation to hearing aids using wearable augmented reality (AR) smart glasses equipped with multiple microphones and cameras. The system's ability to improve how a person with hearing loss perceives the surroundings relies on audio-visual information processing, especially audio source separation and speech enhancement technology. As a key feature, the processing should work well without any user involvement while also being able to take user controls into account. In FY2023, we focused on advancing our audio source separation and speech enhancement approaches. We published works on (1) a time-domain audio source separation method based on a probabilistic deep generative model, (2) a probabilistically motivated unsupervised training approach for source separation based on deep neural networks (DNNs), and (3) a signal-processing-inspired semi-supervised training approach for DNN-based estimation of source directions. In FY2024, we will continue improving these approaches to make them robust to diverse real-life noisy and reverberant conditions, and we will attempt to develop techniques that can take user controls into account.
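
For orientation only, the following is a minimal sketch of a conventional multichannel mask-based enhancement baseline of the kind this processing layer builds on. It is not the published probabilistic deep generative model or DNN approaches; the channel averaging, STFT settings, noise-floor estimate, and gain rule are illustrative assumptions.

```python
# Minimal multichannel speech-enhancement sketch (illustrative baseline only).
import numpy as np
from scipy.signal import stft, istft

def enhance(mix, fs=16000, n_fft=512, hop=128, noise_frames=20):
    """mix: (n_mics, n_samples) time-domain recording from the glasses' microphones."""
    # Multichannel STFT: (n_mics, n_freq, n_frames).
    _, _, X = stft(mix, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    # Naive spatial combination: average the channels (assumes zero relative delay,
    # i.e., a broadside target; a real system would beamform or separate sources).
    Y = X.mean(axis=0)
    # Noise PSD estimated from the first frames, assumed to be speech-free.
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    # Wiener-like spectral gain; a trained DNN mask would replace this in a learned system.
    gain = np.maximum(1.0 - noise_psd / (np.abs(Y) ** 2 + 1e-10), 0.1)
    _, y = istft(gain * Y, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return y
```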

Current Status of Research Progress (Category)

2: Research has progressed rather smoothly

Reason

In FY2023, following the research implementation plan, we worked on advancing our audio source separation and speech enhancement approaches. We published our work on a time-domain audio source separation approach based on a probabilistic deep generative model, which demonstrates that pure time-domain modeling improves the perceptual quality of separated speech signals. This improvement is likely due to better phase consistency, which is difficult to maintain in time-frequency-domain models. This finding benefits applications intended for human listening. In addition, we published works on an unsupervised training approach for source separation based on deep neural networks (DNNs) and a semi-supervised training approach for DNN-based estimation of source directions. These techniques are useful for real-world applications where obtaining the parallel clean-noisy data or annotations required for supervised training is costly. We also worked on room impulse response (RIR) estimation from three-dimensional mesh data of a room captured with augmented reality (AR) smart glasses. The estimated RIR can be exploited to improve speech enhancement, e.g., via dereverberation, and to preserve spatial perception in binaural spatialization for hearing aid users. This work is currently under review and is therefore not listed as an FY2023 research achievement.
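
To illustrate the binaural-spatialization use of an estimated RIR mentioned above, here is a minimal sketch: a dry (enhanced) signal is convolved with left/right RIRs to restore spatial cues. The RIRs are hypothetical placeholders; the mesh-based RIR estimation under review is not reproduced here.

```python
# Illustrative binaural spatialization with given (placeholder) left/right RIRs.
import numpy as np
from scipy.signal import fftconvolve

def spatialize(dry_speech, rir_left, rir_right):
    """Convolve a single-channel dry signal with left/right RIRs -> (2, n) binaural output."""
    left = fftconvolve(dry_speech, rir_left)
    right = fftconvolve(dry_speech, rir_right)
    n = max(len(left), len(right))
    out = np.zeros((2, n))
    out[0, :len(left)] = left
    out[1, :len(right)] = right
    return out
```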

Strategy for Future Research Activity

In FY2024, we will continue improving our audio source separation techniques to make them robust to unseen noisy and reverberant environments while still benefiting from data-driven approaches. This boils down to developing semi-supervised training approaches or adaptation approaches for models pre-trained in a supervised manner. We will also explore the use of additional multimodal data. We have been using three-dimensional mesh data of a room, obtained with the depth sensors of augmented reality (AR) smart glasses, to estimate room impulse responses (RIRs). We could additionally infer the room surface materials from the camera images and use this information to condition the RIR estimation, since an RIR physically depends on the absorption characteristics of the surfaces. We will also attempt to develop techniques that can take user controls into account. However, perception-related user controls, e.g., for adjusting the trade-off between residual noise and speech distortion, will become feasible only once sufficiently good speech separation is achieved. As always, since the research field evolves rapidly, we may need to adjust our research objectives to keep the work novel and original.
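
As a rough illustration of the kind of perceptual user control considered (not a finalized design), the sketch below exposes a single knob that trades residual noise against speech distortion by adjusting a parametric spectral gain. The mapping from the knob to the over-subtraction factor and spectral floor is an assumed example.

```python
# Sketch of a user-controlled noise-reduction trade-off via a parametric spectral gain.
import numpy as np

def user_controlled_gain(speech_psd, noise_psd, knob=0.5):
    """knob in [0, 1]: 0 = mild (less distortion, more residual noise),
                       1 = aggressive (more noise removal, more distortion)."""
    beta = 1.0 + 2.0 * knob            # over-subtraction factor (assumed range 1..3)
    floor = 0.3 * (1.0 - knob)         # spectral floor (assumed range 0.3..0)
    wiener = speech_psd / (speech_psd + beta * noise_psd + 1e-10)
    return np.maximum(wiener, floor)   # per-bin gain to apply to the noisy spectrum
```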

Causes of Carryover

A small part of the FY2023 budget remains unused. We will use it to supplement the travel expenses in the FY2024 budget, as the cost of participating in international conferences has been rising.

  • Research Products

    (5 results)

Presentations: 4 (all international) / Remarks: 1

  • [Presentation] Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning (2023)

    • Author(s)/Presenter(s)
      Nugraha Aditya Arie, Di Carlo Diego, Bando Yoshiaki, Fontaine Mathieu, Yoshii Kazuyoshi
    • Conference
      IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
    • International conference
  • [Presentation] Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation (2023)

    • Author(s)/Presenter(s)
      Bando Yoshiaki, Masuyama Yoshiki, Nugraha Aditya Arie, Yoshii Kazuyoshi
    • Conference
      European Signal Processing Conference (EUSIPCO)
    • International conference
  • [Presentation] Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation (2023)

    • Author(s)/Presenter(s)
      Ali Murtiza, Nugraha Aditya Arie, Nathwani Karan
    • Conference
      IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
    • International conference
  • [Presentation] Tutorial: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models (2023)

    • Author(s)/Presenter(s)
      Yoshii Kazuyoshi, Nugraha Aditya Arie, Fontaine Mathieu, Bando Yoshiaki
    • Conference
      INTERSPEECH
    • International conference
  • [Remarks] Time-Domain Audio Source Separation Based on GPDKL

    • URL

      https://aanugraha.github.io/demo/gpdkl/


Publication date: 2024-12-25
