Research Project/Area Number |
21K17776
|
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator |
沈 鵬 (SHEN Peng), National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Universal Communication Research Institute, Senior Researcher (80773118)
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Keywords | speaker recognition / language identification / cross-domain / self-supervised learning / pre-training model |
Outline of Annual Research Achievements |
I focused on investigating how to better represent speech signals for both speaker and language recognition tasks. Specifically, the following work was carried out to advance this project.
1. Utilizing generative and discriminative models for speaker verification: We proposed a hybrid learning framework that couples the structure and parameters of a joint Bayesian generative model with a neural discriminative learning framework to improve recognition performance. The related results were published in IEEE/ACM TASLP (journal) and at APSIPA (international conference).
2. Improving the representation of speech signals for language identification (LID): We proposed a novel transducer-based language embedding approach for LID that integrates an RNN transducer model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically aware acoustic features and explicit linguistic features for LID. To reduce the influence of the cross-domain problem, we also proposed a joint distribution alignment model based on a partial optimal transport algorithm. Two papers were submitted to international conferences.
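As an illustration only (not the project's actual partial-OT model), the distribution-alignment idea behind item 2 can be sketched with plain entropic-regularized optimal transport (Sinkhorn iterations) in NumPy; all function names, marginals, and costs below are hypothetical toy values:

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.1, n_iter=200):
    """Entropic-regularized optimal transport between marginals a and b
    (1-D histograms) under cost matrix C; returns the transport plan P,
    whose rows sum to a and columns sum to b."""
    K = np.exp(-C / reg)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):            # alternating Sinkhorn scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# toy "domains": source (e.g., in-domain) vs. target (e.g., out-of-domain)
a = np.array([0.6, 0.4])               # source marginal
b = np.array([0.5, 0.5])               # target marginal
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])             # pairwise transport cost
P = sinkhorn_plan(a, b, C)
```

The partial-OT variant used in the project additionally transports only a fraction of the total mass, which makes the alignment robust to outlier samples; the full-mass Sinkhorn above is just the underlying building block.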
|
Current Status of Research Progress |
2: Progressing rather smoothly
Reason
Following the plan, I conducted research to investigate and understand how to better represent the languages and speakers of a speech signal. The work progressed smoothly, and the results were published in, or submitted to, top-level journals and international conferences.
|
Strategy for Future Research Activity |
I will further investigate how to better utilize structural phonetic information to represent a speech signal for speaker, language, and speech recognition tasks, focusing on the following problems. 1. How to better exploit both acoustic and linguistic features, especially in a self-supervised or pre-training manner, remains a challenging task in language recognition. 2. Tokenization methods are now widely used in natural language processing; I will investigate how to use tokens to build a universal model that represents multilingual speech signals.
|
Causes of Carryover |
Because of the COVID-19 pandemic, the budgets for running machines and for travel were not spent in FY2021. In FY2022, these funds will be used to purchase high-performance computing machines for training large-scale pre-trained models.
|