2019 Fiscal Year Research-status Report

Construction of a computational model to deal with the cocktail-party problem for intelligent speech interface

Research Project

Project/Area Number	19K12035
Research Institution	National Institute of Information and Communications Technology
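Principal Investigator	LU Xugang, National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Speech Technology Laboratory, Senior Researcher (20362022)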
Project Period (FY)	2019-04-01 – 2022-03-31
Keywords	Acoustic event detection / Speaker embedding
Outline of Annual Research Achievements	To construct a smart speech interface for real applications, we need to discriminate between several sound sources that play different roles by conveying different information: acoustic environments (different sound events and scenes) and speaker attributes (gender, identity, and speech segmentation, as well as spoken language). Accordingly, we first constructed a deep learning system for acoustic event detection (to identify the acoustic sources), and then built a speaker embedding system to characterize speakers' attributes. More specifically, we proposed a class-wise centroid distance metric learning algorithm, which showed improved performance in discriminating acoustic events.
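The idea of a class-wise centroid distance objective can be illustrated with a minimal sketch. This is not the formulation from the Interspeech 2019 presentation, which this report does not spell out; it only assumes a generic loss that pulls each embedding toward the centroid of its own class and pushes class centroids at least a margin apart. The function names, the `margin` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def class_centroids(embeddings, labels):
    """Return the sorted class ids and the mean embedding (centroid) of each class."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def centroid_distance_loss(embeddings, labels, margin=1.0):
    """Hypothetical loss: intra-class compactness plus a hinge on centroid separation."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes, centroids = class_centroids(embeddings, labels)

    # Row index of each sample's own class centroid (classes is sorted).
    idx = np.searchsorted(classes, labels)

    # Intra-class term: mean squared distance of each sample to its centroid.
    intra = np.mean(np.sum((embeddings - centroids[idx]) ** 2, axis=1))

    # Inter-class term: penalize centroid pairs that are closer than the margin.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    rows, cols = np.triu_indices(len(classes), k=1)
    inter = np.mean(np.maximum(0.0, margin - dists[rows, cols])) if len(rows) else 0.0

    return intra + inter

# Toy 2-D "embeddings" for two assumed event classes.
separated = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]])
overlapping = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 0.0], [5.0, 5.0]])
toy_labels = [0, 0, 1, 1]

loss_separated = centroid_distance_loss(separated, toy_labels)
loss_overlapping = centroid_distance_loss(overlapping, toy_labels)
```

In this toy setting the compact, well-separated classes give a lower loss than the overlapping ones, which is the behavior a centroid-based metric is meant to reward. In an actual system the embeddings would come from a neural network and the loss would be minimized by gradient descent.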
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: As we carried out the research based on our original plan, we found more detailed and specific problems that need to be solved before proceeding to the next step. In real application scenarios, the acoustic environments are rather complex (non-speech acoustic events, multiple speakers). Rather than conducting a simple study of speech separation for different speakers in a controlled acoustic environment (as originally planned), we dug deeper into real acoustic environments with unexpected acoustic events and unknown speakers. Therefore, we investigated acoustic event and scene detection and speaker embedding techniques in order to utilize them for accurate source separation.
Strategy for Future Research Activity	Speaker attribute description is important for speech separation. For unknown speakers, we need to investigate a universal speaker feature description that can separate different speakers. Our initial experiments showed that speaker embedding is one of the most efficient approaches. In learning speaker embeddings, the choice of loss metric is an essential question. Going forward, we will focus on investigating efficient distance metric learning for discriminating speakers.
Causes of Carryover	Due to COVID-19, I could not make the planned business trips to international conferences. Planned usage: attending the international conferences INTERSPEECH 2020 (Oct.), SLT 2020 (may be delayed to Jan. 2021), and Odyssey 2020 (Nov.).

Research Products
(1 results)

All Presentation (1 results) (of which Int'l Joint Research: 1 results)

[Presentation] Class-Wise Centroid Distance Metric Learning for Acoustic Event Detection (2019)
- Author(s)
  Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai
- Organizer
  Interspeech 2019
- Int'l Joint Research