User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild

Research Project

Project/Area Number: 23K16912
Research Category

Grant-in-Aid for Early-Career Scientists

Allocation Type: Multi-year Fund
Review Section: Basic Section 61010: Perceptual information processing and related fields
Research Institution: RIKEN

Principal Investigator

Nugraha Aditya Arie, RIKEN Center for Advanced Intelligence Project, Researcher (60858025)

Project Period (FY): 2023-04-01 – 2026-03-31
Project Status: Granted (FY2023)
Budget Amount *Note
Total: ¥4,680,000 (Direct: ¥3,600,000; Indirect: ¥1,080,000)
FY2025: ¥1,040,000 (Direct: ¥800,000; Indirect: ¥240,000)
FY2024: ¥1,690,000 (Direct: ¥1,300,000; Indirect: ¥390,000)
FY2023: ¥1,950,000 (Direct: ¥1,500,000; Indirect: ¥450,000)
Keywords: Audio-visual processing / Augmented reality / Smart glasses / Audio source separation / Speech enhancement / Dereverberation / Auditory perception / Speech recognition / Non-stationary sensors / Hearing aids
Outline of Research at the Start

This study aims to build a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data (scientific contribution) and to apply it to hearing aids based on wearable augmented reality (AR) smart glasses in the wild (practical contribution).

Outline of Annual Research Achievements

This study aims to formulate a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data. We intend to apply this formulation to hearing aids using wearable augmented reality (AR) smart glasses equipped with multiple microphones and cameras. The system's ability to improve how a person with hearing loss perceives the surroundings will rely on audio-visual information processing, especially audio source separation or speech enhancement technology. As the system's key feature, the processing should work well without any user involvement but should also be able to take user controls into account. In FY2023, we focused on advancing audio source separation and speech enhancement approaches. We successfully published works on (1) a time-domain audio source separation method based on a probabilistic deep generative model, (2) a probabilistically motivated unsupervised training approach for source separation based on deep neural networks (DNNs), and (3) a signal-processing-inspired semi-supervised training approach for DNN-based estimation of source directions. In FY2024, we will continue improving these approaches to be robust to various real-life noisy-reverberant conditions. We will also attempt to develop techniques that can take user controls into account.
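
To make item (1) concrete, the sketch below illustrates the core idea behind Gaussian-process-based time-domain source separation on a toy mixture: when two zero-mean sources follow Gaussian processes with different kernels, the minimum-mean-square-error estimate of one source given their sum has a closed form. This is only a minimal sketch; in the published method the kernels are parameterized by deep neural networks (deep kernel learning) and applied to real speech, whereas the fixed RBF kernels, toy signals, and the helper name rbf_kernel here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(t, lengthscale, variance):
    """Squared-exponential (RBF) kernel matrix over time indices t."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Toy mixture of two zero-mean Gaussian-process sources with different
# time scales (slow "speech-like" vs. fast "noise-like").
rng = np.random.default_rng(0)
t = np.arange(256, dtype=float)
K1 = rbf_kernel(t, lengthscale=10.0, variance=1.0)  # slow source
K2 = rbf_kernel(t, lengthscale=1.0, variance=0.5)   # fast source
jitter = 1e-6 * np.eye(len(t))                      # numerical stabilizer
s1 = rng.multivariate_normal(np.zeros(len(t)), K1 + jitter)
s2 = rng.multivariate_normal(np.zeros(len(t)), K2 + jitter)
x = s1 + s2                                         # observed mixture

# MMSE estimate of source 1 given the mixture of two independent GPs:
# E[s1 | x] = K1 (K1 + K2)^{-1} x.
s1_hat = K1 @ np.linalg.solve(K1 + K2 + jitter, x)

print("SNR of mixture w.r.t. source 1 (dB):",
      10 * np.log10(np.sum(s1**2) / np.sum((x - s1) ** 2)))
print("SNR after separation (dB):",
      10 * np.log10(np.sum(s1**2) / np.sum((s1_hat - s1) ** 2)))
```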

Current Status of Research Progress

2: Progressing smoothly on the whole

Reason

In FY2023, following the research implementation plan, we worked on advancing our audio source separation and speech enhancement approaches. We published our work on a time-domain audio source separation approach based on a probabilistic deep generative model. It demonstrates that pure time-domain modeling improves the perceptual quality of separated speech signals. This improvement is likely due to better phase consistency, which is challenging to handle in time-frequency-domain models. This study would benefit applications intended for human listening. In addition, we published works on an unsupervised training approach for source separation based on deep neural networks (DNNs) and a semi-supervised training approach for DNN-based estimation of source directions. These techniques are useful for real-world applications where obtaining the parallel clean-noisy data or annotations required for supervised training is costly. We also worked on room impulse response (RIR) estimation from three-dimensional mesh data of a room obtained using augmented reality (AR) smart glasses. The estimated RIR can be exploited to improve speech enhancement, e.g., via dereverberation, and to preserve spatial perception in binaural spatialization for hearing aid users. This work is currently under review and thus not listed as an FY2023 research achievement.
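
The phase-consistency issue mentioned above can be illustrated numerically: a masked magnitude spectrogram that keeps the mixture phase is generally not the STFT of any time-domain signal, so projecting it through an iSTFT/STFT round trip changes it. The sketch below uses scipy.signal and a random mask as a stand-in for a separation mask (an illustrative assumption, not our published model) to measure that inconsistency.

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
x = rng.standard_normal(fs)                 # 1 s of noise as a stand-in mixture

# STFT of the mixture, then a hypothetical separation mask applied to the
# magnitude while reusing the mixture phase (the usual T-F-domain recipe).
freqs, frames, X = stft(x, fs=fs, nperseg=512)
mask = rng.uniform(0.0, 1.0, size=X.shape)
S = mask * X

# Round trip: iSTFT to the time domain, then STFT again. This projects S
# onto the set of "consistent" spectrograms, i.e., those that some
# time-domain signal actually produces.
_, s_hat = istft(S, fs=fs, nperseg=512)
_, _, S_proj = stft(s_hat[:len(x)], fs=fs, nperseg=512)

# A nonzero residual means the masked spectrogram was inconsistent:
# no waveform has exactly this magnitude/phase combination.
inconsistency = np.linalg.norm(S - S_proj) / np.linalg.norm(S)
print(f"Relative STFT inconsistency: {inconsistency:.3f}")
```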

Strategy for Future Research Activity

In FY2024, we will continue improving our audio source separation techniques to make them robust to unseen noisy-reverberant environments while still benefiting from data-driven approaches. This boils down to developing semi-supervised training approaches or adaptation approaches for models pre-trained in a supervised manner. We will also explore the use of additional multimodal data. We have been using three-dimensional mesh data of a room, obtained using the depth sensors of augmented reality (AR) smart glasses, to estimate room impulse responses (RIRs). We could additionally infer the room surface materials from the camera images and use this information to condition the RIR estimation, since an RIR physically depends on the absorption characteristics of the surfaces. We will also attempt to develop techniques that can take user controls into account, as sketched below. However, perceptually relevant user controls, e.g., for adjusting the trade-off between residual noise and speech distortion, will become feasible only once good speech separation is achieved. As always, since the research field evolves rapidly, we may need to adjust our research objectives to keep our work novel and original.
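
As one plausible form of such a control (an illustrative assumption, not a committed design), a single trade-off knob can be exposed through a speech-distortion-weighted Wiener gain; the function name sdw_wiener_gain and the toy PSD values below are hypothetical, e.g., the PSDs would come from a DNN in practice.

```python
import numpy as np

def sdw_wiener_gain(speech_psd, noise_psd, mu=1.0):
    """Speech-distortion-weighted Wiener gain per time-frequency bin.

    mu < 1.0 keeps speech intact at the cost of more residual noise;
    mu > 1.0 suppresses noise harder at the cost of speech distortion;
    mu = 1.0 recovers the standard Wiener filter.
    """
    return speech_psd / (speech_psd + mu * noise_psd)

# Hypothetical per-bin PSD estimates, e.g., produced by a DNN.
speech_psd = np.array([1.0, 0.5, 0.1, 2.0])
noise_psd = np.array([0.2, 0.4, 0.3, 0.1])

# Sweeping the user-controlled knob: gains shrink as mu grows.
for mu in (0.5, 1.0, 2.0):
    print(f"mu={mu}:", np.round(sdw_wiener_gain(speech_psd, noise_psd, mu), 3))
```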

Reports

(1 result)
  • 2023 Annual Research Report
  • Research Products

    (5 results)

Presentations (4 results, all at international conferences) / Remarks (1 result)

  • [Presentation] Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning (2023)

    • Author(s)/Presenter(s)
      Nugraha Aditya Arie, Di Carlo Diego, Bando Yoshiaki, Fontaine Mathieu, Yoshii Kazuyoshi
    • Conference
      IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
    • Related Report
      2023 Annual Research Report
    • International conference
  • [Presentation] Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation (2023)

    • Author(s)/Presenter(s)
      Bando Yoshiaki, Masuyama Yoshiki, Nugraha Aditya Arie, Yoshii Kazuyoshi
    • Conference
      European Signal Processing Conference (EUSIPCO)
    • Related Report
      2023 Annual Research Report
    • International conference
  • [Presentation] Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation (2023)

    • Author(s)/Presenter(s)
      Ali Murtiza, Nugraha Aditya Arie, Nathwani Karan
    • Conference
      IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
    • Related Report
      2023 Annual Research Report
    • International conference
  • [Presentation] Tutorial: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models (2023)

    • Author(s)/Presenter(s)
      Yoshii Kazuyoshi, Nugraha Aditya Arie, Fontaine Mathieu, Bando Yoshiaki
    • Conference
      INTERSPEECH
    • Related Report
      2023 Annual Research Report
    • International conference
  • [Remarks] Time-Domain Audio Source Separation Based on GPDKL

    • URL

      https://aanugraha.github.io/demo/gpdkl/

    • Related Report
      2023 Annual Research Report

Published: 2023-04-13   Last Updated: 2024-12-25
