Research Project/Area Number: 21K17776
Research Institution: National Institute of Information and Communications Technology (NICT)
Principal Investigator: 沈 鵬 (Peng Shen), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Senior Researcher (80773118)
Project Period (FY): 2021-04-01 – 2024-03-31
Keywords: language identification / speech recognition / cross-domain / pre-trained model / self-supervised learning
Outline of Annual Research Achievements:
I focused on investigating how to better represent speech signals for both language identification and speech recognition tasks. In detail, the following work was done to advance this project:
1. Improving speech-signal representation for language identification (LID): We proposed a novel transducer-based language embedding approach for LID by integrating an RNN transducer model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method exploits both phonetically-aware acoustic features and explicit linguistic features for LID. This paper was accepted by Interspeech 2022. Additionally, we further evaluated these techniques in the NICT LID system, where they also demonstrated robustness on cross-channel data.
2. The second piece of work focused on improving RNN-T for Mandarin ASR. I proposed a novel pronunciation-aware unique character encoding for building end-to-end RNN-T-based Mandarin ASR systems. The proposed encoding combines a pronunciation-based syllable with a character index (CI). By introducing the CI, the RNN-T model can overcome the homophone problem while still utilizing pronunciation information to define its modeling units. With the proposed encoding, the model outputs can be converted into the final recognition result through a one-to-one mapping. This paper was accepted by IEEE SLT 2022.
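The core idea of the transducer-based language embedding in item 1 can be illustrated, in heavily simplified form, as pooling two per-frame streams (phonetically-aware acoustic features and linguistic features from an RNN-T prediction network) and concatenating them into one utterance-level embedding that a classifier scores per language. The following pure-Python sketch is illustrative only; the dimensions, mean pooling, and linear scorer are toy assumptions, not the architecture from the paper:

```python
import random

random.seed(0)

def mean_pool(frames):
    """Average a list of equal-length vectors into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

# Hypothetical per-frame embeddings for one 20-frame utterance (toy sizes):
acoustic = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]    # acoustic stream
linguistic = [[random.gauss(0, 1) for _ in range(4)] for _ in range(20)]  # RNN-T prediction-network stream

# Fuse both streams into one language embedding: pool each, then concatenate
embedding = mean_pool(acoustic) + mean_pool(linguistic)
assert len(embedding) == 12

# A linear scorer over candidate languages picks the best-matching one
weights = {lang: [random.gauss(0, 1) for _ in range(12)] for lang in ["ja", "zh", "en"]}
scores = {lang: sum(w * e for w, e in zip(wv, embedding)) for lang, wv in weights.items()}
best = max(scores, key=scores.get)
```

The point of the fusion is that the final embedding carries both acoustic and explicit linguistic evidence, which is what the transducer integration contributes over a purely acoustic LID embedding.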
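The pronunciation-aware unique character encoding in item 2 can be sketched concretely: each character maps to its syllable plus a character index within that syllable group, so homophones receive distinct tokens and decoding is a one-to-one lookup. The tiny lexicon and the tone-numbered pinyin spelling below are illustrative assumptions, not the actual unit inventory from the paper:

```python
# Toy lexicon: character -> pinyin syllable (homophones share a syllable)
lexicon = {"是": "shi4", "事": "shi4", "市": "shi4", "你": "ni3", "好": "hao3"}

# Assign each character a character index (CI) within its syllable group,
# making syllable+CI a unique, invertible token per character.
encode_map, decode_map, counters = {}, {}, {}
for char, syllable in lexicon.items():
    ci = counters.get(syllable, 0)
    counters[syllable] = ci + 1
    token = f"{syllable}_{ci}"
    encode_map[char] = token
    decode_map[token] = char

def encode(text):
    """Map text to pronunciation-aware syllable+CI tokens."""
    return [encode_map[c] for c in text]

def decode(tokens):
    """One-to-one mapping: model outputs convert directly back to characters."""
    return "".join(decode_map[t] for t in tokens)

assert decode(encode("你好")) == "你好"
# Homophones get distinct tokens despite identical pronunciation:
assert encode_map["是"] != encode_map["事"]
```

Because every token decodes to exactly one character, recognition output never needs an extra disambiguation pass, which is the property the one-to-one mapping in the summary refers to.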
Current Status of Research Progress:
2: Progressing smoothly on the whole
Reason:
Following the plan, I investigated how to better represent languages for language identification and modeling units for ASR tasks. The related work progressed smoothly, and the research results were published at top-level international conferences.
Strategy for Future Research Activity:
I will focus on investigating how to handle speaker, language, and speech recognition tasks with a single universal model. I will concentrate on the following problems: 1. Investigating large universal models, such as Whisper, and attempting to train or fine-tune similar large models. 2. Building a joint-task training framework on top of a pre-trained large speech representation model.
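One possible shape for the planned joint-task framework is a shared pre-trained encoder computed once per utterance, with lightweight task-specific heads on top. The sketch below is a structural illustration only; the class names, the stand-in encoder, and the label sets are all hypothetical, not a description of the system to be built:

```python
import random

random.seed(0)

class SharedSpeechEncoder:
    """Stand-in for a pre-trained speech representation model (e.g. a
    self-supervised encoder); here it just pools frames into one vector."""
    def __init__(self, dim=16):
        self.dim = dim

    def __call__(self, audio_frames):
        # Pretend-encode: average each feature dimension across frames
        return [sum(col) / len(col) for col in zip(*audio_frames)][: self.dim]

class TaskHead:
    """Lightweight per-task linear classifier attached to the shared encoder."""
    def __init__(self, dim, labels):
        self.weights = {l: [random.gauss(0, 1) for _ in range(dim)] for l in labels}

    def __call__(self, embedding):
        scores = {l: sum(w * e for w, e in zip(wv, embedding))
                  for l, wv in self.weights.items()}
        return max(scores, key=scores.get)

encoder = SharedSpeechEncoder(dim=16)
heads = {
    "language": TaskHead(16, ["ja", "zh", "en"]),
    "speaker": TaskHead(16, ["spk0", "spk1"]),
}

audio = [[random.gauss(0, 1) for _ in range(16)] for _ in range(10)]
shared = encoder(audio)  # computed once, reused by every task head
results = {task: head(shared) for task, head in heads.items()}
```

The design point is that all tasks share one expensive representation, so adding a task costs only a small head rather than a full model.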
Reason for Carryover to the Next Fiscal Year:
Due to the COVID-19 pandemic, the budget allocated for travel expenses was not used. This year, these funds will be allocated to purchasing machines and devices for data preparation and for building demo systems.