Project/Area Number | 23K16912
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | Institute of Physical and Chemical Research
Principal Investigator | Nugraha Aditya Arie, RIKEN Center for Advanced Intelligence Project, Researcher (60858025)
Project Period (FY) | 2023-04-01 – 2026-03-31
Project Status | Granted (Fiscal Year 2023)
Budget Amount | ¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
  Fiscal Year 2025: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
  Fiscal Year 2024: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
  Fiscal Year 2023: ¥1,950,000 (Direct Cost: ¥1,500,000, Indirect Cost: ¥450,000)
Keywords | Audio-visual processing / Augmented reality / Smart glasses / Audio source separation / Speech enhancement / Dereverberation / Auditory perception / Speech recognition / Non-stationary sensors / Hearing aids
Outline of Research at the Start |
This study aims to build a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data (scientific contribution) and to apply it to hearing aids based on wearable augmented reality (AR) smart glasses in the wild (practical contribution).
Outline of Annual Research Achievements |
This study aims to formulate a user-centric computational model of audio-visual scene understanding for first-person non-stationary sensor data. We intend to apply this formulation to hearing aids based on wearable augmented reality (AR) smart glasses equipped with multiple microphones and cameras. The system's ability to improve how a listener with hearing loss perceives the surroundings relies on audio-visual information processing, especially audio source separation and speech enhancement technology. As a key feature, the processing should work well without any user involvement but should also be able to take user controls into account. In FY2023, we focused on advancing audio source separation and speech enhancement approaches. We published works on (1) a time-domain audio source separation method based on a probabilistic deep generative model, (2) a probabilistically motivated unsupervised training approach for source separation based on deep neural networks (DNNs), and (3) a signal-processing-inspired semi-supervised training approach for DNN-based estimation of source directions. In FY2024, we will continue improving these approaches so that they are robust to diverse real-life noisy-reverberant conditions. We will also attempt to develop techniques that can take user controls into account.
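As a rough illustration of the probabilistic formulation that such deep-generative separation methods typically build on (a generic sketch under common assumptions, not necessarily the exact model of the published works), the observed waveform mixture is modeled as a sum of source waveforms, each governed by a DNN-parameterized latent-variable prior, and separation amounts to inference under this model:

\[
\mathbf{x} = \sum_{j=1}^{J} \mathbf{s}_j + \mathbf{n}, \qquad
\mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad
\mathbf{s}_j \sim p_\theta(\mathbf{s}_j \mid \mathbf{z}_j),
\]
\[
\hat{\mathbf{s}}_{1:J} = \operatorname*{arg\,max}_{\mathbf{s}_{1:J}} \;
p(\mathbf{x} \mid \mathbf{s}_{1:J}) \prod_{j=1}^{J} p_\theta(\mathbf{s}_j),
\quad \text{with} \quad
p_\theta(\mathbf{s}_j) = \int p_\theta(\mathbf{s}_j \mid \mathbf{z}_j)\, p(\mathbf{z}_j)\, d\mathbf{z}_j,
\]

where \(\mathbf{x}\) is the mixture waveform, \(\mathbf{s}_j\) the \(j\)-th source, \(\mathbf{n}\) residual noise, and \(p_\theta\) the deep generative prior. Modeling the waveform directly means phase is handled implicitly, which is consistent with the perceptual-quality gains observed over time-frequency-domain models.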
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
In FY2023, following the research implementation plan, we worked on advancing our audio source separation and speech enhancement approaches. We published our work on a time-domain audio source separation approach based on a probabilistic deep generative model. It demonstrates that pure time-domain modeling improves the perceptual quality of the separated speech signals, likely owing to better phase consistency, which is difficult to achieve in time-frequency-domain models. This result benefits applications intended for human listening. In addition, we published works on an unsupervised training approach for DNN-based source separation and a semi-supervised training approach for DNN-based estimation of source directions. These techniques are useful for real-world applications where obtaining the parallel clean-noisy data or annotations required for supervised training is costly. We also worked on room impulse response (RIR) estimation from three-dimensional mesh data of a room obtained using augmented reality (AR) smart glasses. The estimated RIR can be exploited to improve speech enhancement, e.g., via dereverberation, and to preserve spatial perception in binaural spatialization for hearing aid users. This work is currently under review and is therefore not listed as an FY2023 research achievement.
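For context, the role of an estimated RIR in dereverberation and binaural spatialization can be summarized with the standard convolutive observation model (a textbook-style sketch, not the specific formulation of the work under review): each source's RIR is split into a direct/early part and a late-reverberation part, and enhancement aims to retain the former while suppressing the latter:

\[
\mathbf{x}(t) = \sum_{j} (\mathbf{h}_j * s_j)(t) + \mathbf{n}(t), \qquad
\mathbf{h}_j(\tau) = \mathbf{h}_j^{\mathrm{early}}(\tau) + \mathbf{h}_j^{\mathrm{late}}(\tau), \qquad
\mathbf{d}_j(t) = (\mathbf{h}_j^{\mathrm{early}} * s_j)(t),
\]

where \(\mathbf{x}(t)\) is the multichannel observation, \(s_j(t)\) the dry source, \(\mathbf{h}_j\) its RIR, \(\mathbf{d}_j(t)\) the desired early-reflection image, and the early/late boundary is commonly set around 50 ms. An RIR predicted from the room mesh constrains this decomposition, and its early part carries the interaural cues needed to preserve spatial perception in binaural rendering.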
Strategy for Future Research Activity |
In FY2024, we will continue improving our audio source separation techniques to make them robust to unseen noisy-reverberant environments while still benefiting from data-driven approaches. This boils down to developing semi-supervised training approaches or adaptation approaches for models pre-trained in a supervised manner. We will also explore the use of additional multimodal data. We have been using three-dimensional mesh data of a room, obtained with the depth sensors of augmented reality (AR) smart glasses, to estimate room impulse responses (RIRs). We could additionally infer the room surface materials from the camera images and use this information to condition the RIR estimation, since an RIR physically depends on the absorption characteristics of the surfaces. We will also attempt to develop techniques that can take user controls into account; however, perceptually relevant user controls, e.g., for adjusting the trade-off between residual noise and speech distortion, will be feasible only once sufficiently good speech separation is achieved. Finally, since the research field evolves rapidly, we may need to adjust our research objectives to keep the work novel and original.
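One standard way such a perceptual control could eventually be exposed (an illustrative sketch of a common signal-processing construction, not a committed design for this project) is a parametric Wiener-type gain whose aggressiveness the user adjusts:

\[
G_{ft}(\beta) = \frac{\xi_{ft}}{\xi_{ft} + \beta}, \qquad
\hat{S}_{ft} = G_{ft}(\beta)\, X_{ft},
\]

where \(X_{ft}\) is the noisy observation at time-frequency bin \((f, t)\), \(\xi_{ft}\) the estimated a priori signal-to-noise ratio, and \(\beta > 0\) the user-controlled parameter: a larger \(\beta\) removes more residual noise at the cost of more speech distortion, a smaller \(\beta\) does the opposite, and \(\beta = 1\) recovers the standard Wiener gain.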