Research Project/Area Number | 23K16912
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | RIKEN
Principal Investigator | Nugraha Aditya Arie, RIKEN Center for Advanced Intelligence Project, Researcher (60858025)
Project Period (FY) | 2023-04-01 – 2026-03-31
Project Status | Granted (FY2023)
Budget Amount *Note | Total: ¥4,680,000 (Direct: ¥3,600,000; Indirect: ¥1,080,000)
  FY2025: ¥1,040,000 (Direct: ¥800,000; Indirect: ¥240,000)
  FY2024: ¥1,690,000 (Direct: ¥1,300,000; Indirect: ¥390,000)
  FY2023: ¥1,950,000 (Direct: ¥1,500,000; Indirect: ¥450,000)
Keywords | Audio-visual processing / Augmented reality / Smart glasses / Audio source separation / Speech enhancement / Dereverberation / Auditory perception / Speech recognition / Non-stationary sensors / Hearing aids

Outline of Research at the Start |
This study aims to build a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data (scientific contribution) and to apply it to hearing aids based on wearable augmented reality (AR) smart glasses operating in the wild (practical contribution).

Outline of Annual Research Achievements |
This study aims to formulate a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data. We intend to apply this formulation to hearing aids using wearable augmented reality (AR) smart glasses equipped with multiple microphones and cameras. The system's ability to improve how a person with hearing loss perceives the surroundings will rely on audio-visual information processing, especially audio source separation and speech enhancement technology. As the system's key feature, the processing should work well without any user involvement while also being able to take user controls into account. In FY2023, we focused on advancing audio source separation and speech enhancement approaches. We published works on (1) a time-domain audio source separation method based on a probabilistic deep generative model, (2) a probabilistically motivated unsupervised training approach for source separation based on deep neural networks (DNNs), and (3) a signal-processing-inspired semi-supervised training approach for DNN-based estimation of source directions. In FY2024, we will continue improving these approaches to make them robust to diverse real-life noisy-reverberant conditions. We will also attempt to develop techniques that can take user controls into account.
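
To make item (2) concrete, the following minimal sketch illustrates one probabilistically motivated unsupervised objective under a local Gaussian model; it is an assumed, simplified single-channel formulation for exposition, not the published model. Each source's STFT coefficient is modeled as zero-mean complex Gaussian with a DNN-predicted power, so the observed mixture alone supervises the network and no clean targets are needed.

    # Minimal illustrative sketch (assumed, simplified): unsupervised
    # training loss under a local Gaussian source model.
    import torch

    def unsupervised_nll(mix_stft: torch.Tensor, log_v: torch.Tensor) -> torch.Tensor:
        # mix_stft: complex mixture STFT, shape (F, T)
        # log_v: DNN-predicted log source powers, shape (K, F, T)
        v_sum = log_v.exp().sum(dim=0)  # mixture power: sum_k v_k[f, t]
        # Negative log-likelihood of the observed mixture (constants dropped);
        # only the mixture itself is needed, no clean source targets.
        return (v_sum.log() + mix_stft.abs() ** 2 / v_sum).mean()

    def wiener_separate(mix_stft: torch.Tensor, log_v: torch.Tensor) -> torch.Tensor:
        # Posterior-mean source estimates via Wiener filtering, shape (K, F, T).
        v = log_v.exp()
        return v / v.sum(dim=0, keepdim=True) * mix_stft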

Current Status of Research Progress (Category) |
2: Research has progressed rather smoothly.
Reason |
In FY2023, following the research implementation plan, we worked on advancing our audio source separation and speech enhancement approaches. We published our work on a time-domain audio source separation approach based on a probabilistic deep generative model. It demonstrates that pure time-domain modeling improves the perceptual quality of separated speech signals, likely owing to better phase consistency, which is challenging to handle in time-frequency-domain models. This finding should benefit applications intended for human listening. In addition, we published works on an unsupervised training approach for source separation based on deep neural networks (DNNs) and a semi-supervised training approach for DNN-based estimation of source directions. These techniques are useful for real-world applications where obtaining the parallel clean-noisy data or annotations required for supervised training is costly. We also worked on room impulse response (RIR) estimation from three-dimensional (3D) mesh data of a room obtained using augmented reality (AR) smart glasses. The estimated RIR can be exploited to improve speech enhancement, e.g., via dereverberation, and to preserve spatial perception in binaural spatialization for hearing aid users, as sketched below. This work is currently under review and is thus not listed as an FY2023 research achievement.
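
As a concrete illustration of the spatialization step, the following minimal sketch (an assumed, simplified rendering pipeline, not the system under review) convolves an enhanced dry signal with estimated left/right RIRs to restore the room's spatial cues:

    # Minimal illustrative sketch (assumed): binaural spatialization by
    # convolving an enhanced (dry) source with estimated left/right RIRs.
    import numpy as np
    from scipy.signal import fftconvolve

    def binauralize(dry: np.ndarray, rir_left: np.ndarray, rir_right: np.ndarray) -> np.ndarray:
        # dry: enhanced mono signal; rir_*: RIRs estimated, e.g., from the
        # 3D room mesh. Returns a (num_samples, 2) binaural signal.
        left = fftconvolve(dry, rir_left)[: len(dry)]
        right = fftconvolve(dry, rir_right)[: len(dry)]
        return np.stack([left, right], axis=-1)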

Strategy for Future Research Activity |
In FY2024, we will continue improving our audio source separation techniques to make them robust to unseen noisy-reverberant environments while still benefiting from data-driven approaches. This boils down to developing semi-supervised training approaches or adaptation approaches for models pre-trained in a supervised manner. We will also explore the use of additional multimodal data. We have been using three-dimensional (3D) mesh data of a room obtained with the depth sensors of augmented reality (AR) smart glasses to estimate room impulse responses (RIRs). We could additionally infer the room surface materials from the camera images and use this information to condition the RIR estimation, since the RIR physically depends on the absorption characteristics of those surfaces. We will also attempt to develop techniques that can take user controls into account. However, perception-related user controls, e.g., for adjusting the trade-off between residual noise and speech distortion, will only become feasible once good speech separation is achieved. Since the research field evolves rapidly, we may need to adjust our research objectives to keep the work novel and original.
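
To illustrate what such a perception-related user control could look like, here is a minimal, hypothetical sketch (a possible design, not a committed one): a parametric Wiener gain whose single knob mu trades residual noise against speech distortion.

    # Minimal hypothetical sketch: a user-adjustable parametric Wiener gain.
    # mu > 1 removes more noise at the cost of more speech distortion;
    # mu < 1 keeps speech more intact but leaves more residual noise.
    import numpy as np

    def parametric_wiener_gain(mix_power: np.ndarray, noise_power: np.ndarray,
                               mu: float = 1.0, gain_floor: float = 0.1) -> np.ndarray:
        # mix_power, noise_power: per-bin power spectra of the mixture and
        # the noise estimate. Returns a time-frequency gain in [gain_floor, 1].
        eps = 1e-10
        snr = np.maximum(mix_power - noise_power, eps) / np.maximum(noise_power, eps)
        gain = snr / (snr + mu)  # mu = 1 recovers the standard Wiener filter
        return np.maximum(gain, gain_floor)  # flooring limits musical noise

A slider exposed on the smart glasses could then map directly to mu, giving the user perceptual control without retraining the separation model.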