2019 Fiscal Year Research-status Report
Construction of a computational model to deal with the cocktail-party problem for intelligent speech interface
Project/Area Number |
19K12035
|
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator |
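LU Xugang, National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Speech Technology Laboratory, Senior Researcher (20362022)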
|
Project Period (FY) |
2019-04-01 – 2022-03-31
|
Keywords | Acoustic event detection / Speaker embedding |
Outline of Annual Research Achievements |
To construct a smart speech interface for real applications, we need to discriminate among several sound sources that play different roles by conveying different information: acoustic environments (different sound events and scenes) and speaker attributes (gender, identity, and speaking segmentation, as well as spoken language). Accordingly, we first constructed a deep learning system for acoustic event detection (to identify the acoustic sources), and then built a speaker embedding system to characterize speakers' attributes. More specifically, for acoustic event detection we proposed a learning algorithm based on a class-wise centroid distance metric, which showed improved performance in discriminating acoustic events.
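The report does not give the formulation of the class-wise centroid distance metric, so the following is only a minimal sketch of the general idea under stated assumptions: each class is summarized by the mean of its embeddings, and a loss rewards embeddings that lie closer to their own class centroid than to other centroids. The softmax over negative squared Euclidean distances used here is a common centroid-based (prototype-style) loss chosen for illustration, not necessarily the one used in this project; the function names and the toy event labels are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Return {label: mean embedding} for every class present in `labels`."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_distance_loss(embeddings, labels, centroids):
    """Mean cross-entropy of a softmax over negative squared centroid distances.

    ASSUMPTION: this particular loss is an illustrative stand-in, not the
    formulation from the report.
    """
    total = 0.0
    for vec, lab in zip(embeddings, labels):
        logits = {c: -squared_distance(vec, mu) for c, mu in centroids.items()}
        top = max(logits.values())  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += log_norm - logits[lab]
    return total / len(embeddings)


def nearest_centroid(vec, centroids):
    """Classify an embedding by its closest class centroid."""
    return min(centroids, key=lambda c: squared_distance(vec, centroids[c]))


if __name__ == "__main__":
    # Toy 2-D embeddings with hypothetical acoustic event labels.
    embeddings = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
    labels = ["dog_bark", "dog_bark", "siren", "siren"]
    centroids = class_centroids(embeddings, labels)
    print(centroid_distance_loss(embeddings, labels, centroids))
    print(nearest_centroid([0.1, 0.1], centroids))
```

In a real system the embeddings would come from a neural network and the loss would be minimized by gradient descent, so that training pulls each class toward its centroid and pushes different classes apart.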
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
As we carried out the research according to our original plan, we found more detailed and specific problems that need to be solved before proceeding to the next step. In real application scenarios, the acoustic environments are rather complex (non-speech acoustic events, multiple speakers). Rather than studying speech separation for different speakers in a simple, controlled acoustic environment (as originally planned), we dug deeper into real acoustic environments with unexpected acoustic events and unknown speakers. Therefore, we investigated acoustic event and scene detection and speaker embedding techniques in order to utilize them for accurate source separation.
|
Strategy for Future Research Activity |
Speaker attribute description is important for speech separation. For unknown speakers, we need to investigate a universal speaker feature description for separating different speakers. Our initial experiments showed that speaker embedding is one of the most efficient approaches. In learning speaker embeddings, the choice of loss metric is essential. Going forward, we will focus on investigating efficient distance metric learning for discriminating speakers.
|
Causes of Carryover |
Due to COVID-19, I could not make the business trips planned for international conferences. Planned usage: attending the international conferences INTERSPEECH 2020 (Oct.), SLT 2020 (may be delayed to Jan. 2021), and Odyssey 2020 (Nov.).
|
Research Products
(1 results)