2023 Fiscal Year Annual Research Report

Self-supervised graph-based representation for language and speaker detection

Research Project

Project/Area Number	21K17776
Research Institution	National Institute of Information and Communications Technology
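Principal Investigator	Peng Shen (沈 鵬), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Senior Researcher (80773118)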
Project Period (FY)	2021-04-01 – 2024-03-31
Keywords	language identification / speech recognition / pre-training model / large language models / speaker diarization
Outline of Annual Research Achievements	In FY2023, I focused on how to better use pre-trained and self-supervised models to improve the performance of language identification (LID) and automatic speech recognition (ASR). Specifically, the following work was carried out: 1. Following the success of ChatGPT, I began investigating generative models and explored using the knowledge in such models to improve LID performance. These investigations are important for understanding the behavior of large language models. This work was published at IEEE ASRU 2023. 2. I also worked on improving cross-domain ASR. We explored using large pre-trained models, such as BERT, and proposed optimal transport techniques to better utilize the knowledge transferred from them. This work was published at IEEE ICASSP 2022 and 2024 and at IEEE ASRU 2023. 3. I also proposed a novel speaker mask branch to detect the speech segments of individual speakers. With the proposed model, both ASR and speaker diarization can be performed simultaneously using a single model. Over the course of the project, I investigated and used models trained with self-supervised or pre-training techniques for LID, speaker recognition, and ASR. Through this research, we clarified how to better utilize the knowledge inside pre-trained models and proposed several techniques, such as RNN-T-based LID and optimal transport-based ASR, to improve performance on these tasks. In particular, the proposed techniques were successfully used to build the NICT LID system, which showed very robust performance.

Research Products
(4 results)

All Presentation (4 results) (of which Int'l Joint Research: 3 results)

[Presentation] Hierarchical cross-modality knowledge transfer with Sinkhorn attention for CTC-based ASR (2024)
- Author(s)
  X. Lu, P. Shen, Y. Tsao, H. Kawai
- Organizer
  IEEE ICASSP
- Int'l Joint Research
[Presentation] Generative linguistic representation for spoken language identification (2023)
- Author(s)
  P. Shen, X. Lu, H. Kawai
- Organizer
  IEEE ASRU
- Int'l Joint Research
[Presentation] Cross-modal alignment with optimal transport for CTC-based ASR (2023)
- Author(s)
  X. Lu, P. Shen, Y. Tsao, H. Kawai
- Organizer
  IEEE ASRU
- Int'l Joint Research
[Presentation] Investigation on Multi-task Universal Speech Models (2023)
- Author(s)
  P. Shen, X. Lu, H. Kawai
- Organizer
  Autumn Meeting of Acoustical Society of Japan
