Self-supervised graph-based representation for language and speaker detection
Project/Area Number | 21K17776 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section |
Basic Section 61010:Perceptual information processing-related
|
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator | 沈 鵬 (SHEN Peng), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Senior Researcher (80773118) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Project Status | Completed (Fiscal Year 2023) |
Budget Amount | ¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000)
Fiscal Year 2023: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2022: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2021: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
|
Keywords | language identification / speech recognition / pre-trained model / large language models / speaker diarization / cross-domain / self-supervised learning / speaker recognition / language recognition |
Outline of Research at the Start |
Developing spoken language and speaker detection techniques is an important step toward improving the usability of real-time multilingual speech translation systems. However, current state-of-the-art spoken language and speaker detection techniques do not perform well on cross-channel and cross-domain data. In this project, investigations will be conducted into how to better represent the languages and speakers in a speech signal by developing self-supervised, graph-based learning techniques for robust spoken language and speaker detection.
|
Outline of Annual Research Achievements |
In fiscal year 2023, I focused on investigating how to better use pre-trained or self-supervised models to improve the performance of language identification (LID) and automatic speech recognition (ASR). In detail, the following work was carried out to advance this project:
1. Motivated by the success of ChatGPT, I investigated generative models and used the knowledge from such models to improve LID performance. These investigations are important for understanding the behavior of large language models. This work was published at IEEE ASRU 2023.
2. I also worked on improving cross-domain ASR. We used pre-trained large models, such as BERT, and proposed optimal transport techniques to better utilize the knowledge transferred from these models (a minimal sketch follows this list). This work was published at IEEE ICASSP 2022 and 2024 and at IEEE ASRU 2023.
3. I also proposed a novel speaker mask branch to detect the speech segments of individual speakers. With the proposed model, we can perform ASR and speaker diarization simultaneously using a single model (a sketch follows the project summary below).
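To make the optimal transport idea in item 2 above concrete, the following is a minimal, illustrative sketch and not the published implementation: an entropy-regularized optimal transport (Sinkhorn) loss that softly aligns ASR encoder frames with embeddings from a pre-trained text encoder such as BERT, which could be added to the usual ASR training loss. The function and variable names (sinkhorn_ot_loss, acoustic_feats, text_feats) are assumptions made for illustration.

import torch

def sinkhorn_ot_loss(acoustic_feats, text_feats, eps=0.1, n_iters=50):
    # acoustic_feats: (T, d) ASR encoder frames; text_feats: (L, d) text-encoder token embeddings.
    a = torch.nn.functional.normalize(acoustic_feats, dim=-1)
    b = torch.nn.functional.normalize(text_feats, dim=-1)
    cost = 1.0 - a @ b.t()  # (T, L) cosine distances between frames and tokens

    # Uniform marginals over frames and tokens.
    mu = torch.full((cost.size(0),), 1.0 / cost.size(0))
    nu = torch.full((cost.size(1),), 1.0 / cost.size(1))

    # Log-domain Sinkhorn iterations for the entropy-regularized transport plan.
    f = torch.zeros_like(mu)
    g = torch.zeros_like(nu)
    for _ in range(n_iters):
        f = eps * torch.log(mu) - eps * torch.logsumexp((g.unsqueeze(0) - cost) / eps, dim=1)
        g = eps * torch.log(nu) - eps * torch.logsumexp((f.unsqueeze(1) - cost) / eps, dim=0)
    plan = torch.exp((f.unsqueeze(1) + g.unsqueeze(0) - cost) / eps)

    # Expected transport cost under the soft alignment.
    return torch.sum(plan * cost)

# Example: 120 encoder frames and 20 token embeddings, both 256-dimensional.
loss = sinkhorn_ot_loss(torch.randn(120, 256), torch.randn(20, 256))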
Over the whole project, I investigated and utilized models trained with self-supervised or pre-training techniques for LID, speaker recognition, and ASR. Through this research, we clarified how to better utilize the knowledge contained in pre-trained models and proposed several techniques, such as RNN-T-based LID and optimal transport-based ASR, to improve the performance of these tasks. In particular, the proposed techniques were successfully used to build the NICT LID system, which showed very robust performance.
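As an illustration of the speaker mask branch mentioned in item 3 above, here is a minimal sketch under assumed dimensions; it is not the project's exact architecture, and encoder_dim and max_speakers are placeholder parameters. The idea shown is a small head on top of a shared ASR encoder that predicts a per-frame activity probability for each speaker, so a single model can produce both transcripts and diarization segments.

import torch
import torch.nn as nn

class SpeakerMaskBranch(nn.Module):
    def __init__(self, encoder_dim=256, max_speakers=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, encoder_dim),
            nn.ReLU(),
            nn.Linear(encoder_dim, max_speakers),
        )

    def forward(self, encoder_out):
        # encoder_out: (batch, frames, encoder_dim) from the shared ASR encoder.
        # Returns per-frame, per-speaker activity probabilities in [0, 1];
        # thresholding them yields diarization segments, while the ASR decoder
        # consumes the same encoder output.
        return torch.sigmoid(self.proj(encoder_out))

# Example: 1 utterance, 300 encoder frames, 256-dim features -> (1, 300, 4) masks.
masks = SpeakerMaskBranch()(torch.randn(1, 300, 256))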
|