Self-supervised graph-based representation for language and speaker detection
Project/Area Number |
21K17776
|
Research Category |
Grant-in-Aid for Early-Career Scientists
|
Allocation Type | Multi-year Fund |
Review Section |
Basic Section 61010:Perceptual information processing-related
|
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator |
SHEN Peng (沈 鵬), National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Universal Communication Research Institute, Senior Researcher (80773118)
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Project Status |
Granted (Fiscal Year 2022)
|
Budget Amount |
¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000)
Fiscal Year 2023: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2022: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2021: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
|
Keywords | language identification / speech recognition / cross-domain / pre-training model / self-supervised learning / speaker recognition / language recognition |
Outline of Research at the Start |
Developing spoken language and speaker detection techniques is one of the important tasks for improving the usability of real-time multilingual speech translation systems. However, current advanced spoken language and speaker detection techniques cannot perform well on cross-channel and cross-domain data. In this project, investigations will be conducted to understand how to better represent languages and speakers of a speech signal by developing self-supervised graph-based learning techniques for robust spoken language and speaker detection tasks.
|
Outline of Annual Research Achievements |
I focused on investigating how to better represent speech signals for both language recognition and speech recognition tasks. In detail, the following work was done to advance this project:
1. Improving the representation of speech signals for language identification (LID): We proposed a novel transducer-based language embedding approach for LID by integrating an RNN transducer (RNN-T) model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method exploits both phonetically aware acoustic features and explicit linguistic features for LID. The research paper was accepted at Interspeech 2022. Additionally, we further evaluated these techniques in the NICT LID system, where they also demonstrated robustness on cross-channel data.
2. Improving RNN-T for Mandarin ASR: I proposed a novel pronunciation-aware unique character encoding for building end-to-end RNN-T-based Mandarin ASR systems. The proposed encoding combines a pronunciation-based syllable with a character index (CI). By introducing the CI, the RNN-T model can overcome the homophone problem while utilizing pronunciation information to derive the modeling units. With the proposed encoding, the model outputs can be converted into the final recognition result through a one-to-one mapping. This paper was accepted at IEEE SLT 2022.
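The homophone-resolving idea behind the encoding in item 2 can be illustrated with a toy sketch (the mini-lexicon, the `syllable_index` unit format, and all helper names here are hypothetical; the actual unit derivation follows the SLT 2022 paper): each character is encoded as its syllable plus an index that distinguishes homophones sharing that syllable, so decoding back to characters is an exact one-to-one table lookup.

```python
# Toy sketch of a pronunciation-aware unique character encoding.
# Hypothetical mini-lexicon; a real system derives syllables from a
# pronunciation lexicon. Each character maps to (syllable, index),
# where the index separates homophones with the same syllable.

def build_encoding(lexicon):
    """lexicon: dict char -> pinyin syllable. Returns encode/decode tables."""
    counts = {}
    char2unit, unit2char = {}, {}
    for char, syllable in lexicon.items():
        idx = counts.get(syllable, 0)      # next free index for this syllable
        counts[syllable] = idx + 1
        unit = f"{syllable}_{idx}"         # e.g. "shi4_1"
        char2unit[char] = unit
        unit2char[unit] = char             # one-to-one, so decoding is exact
    return char2unit, unit2char

# "shi4" is shared by the homophones 是 and 事.
lexicon = {"是": "shi4", "事": "shi4", "好": "hao3"}
enc, dec = build_encoding(lexicon)

encoded = [enc[c] for c in "是好事"]
decoded = "".join(dec[u] for u in encoded)  # one-to-one mapping recovers text
```

Because every unit keeps its syllable prefix, the model still sees pronunciation information, while the index makes the character sequence recoverable without a language-model disambiguation step.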
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
Following the plan, I investigated how to better represent languages for language identification and modeling units for ASR tasks. The related work progressed smoothly, and the research results were published at top-tier international conferences.
|
Strategy for Future Research Activity |
I will focus on investigating how to build a universal model that handles speaker, language, and speech recognition tasks within a single model. I will concentrate on the following problems: 1. Investigating large universal models, such as Whisper, and attempting to train or fine-tune similar large models. 2. Building a joint-task training framework on top of a pre-trained large speech representation model.
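One common way to realize such a joint-task framework is to attach task-specific heads to a shared pre-trained encoder and optimize a weighted sum of per-task losses; the numeric sketch below illustrates only that objective structure (the encoder, heads, targets, and weights are all hypothetical stand-ins, not the planned system).

```python
# Minimal sketch of a joint multi-task objective over a shared encoder.
# All components are placeholders; a real system would use a pre-trained
# speech encoder (e.g. Whisper-style) and proper per-task losses.

def shared_encoder(frames):
    # Stand-in for a pre-trained encoder: mean-pool the input frames.
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def task_loss(embedding, target):
    # Placeholder squared-error "loss" for any task head.
    return sum((e - t) ** 2 for e, t in zip(embedding, target))

frames = [[0.1, 0.2], [0.3, 0.4]]          # toy acoustic features

emb = shared_encoder(frames)               # shared representation

# Weighted multi-task objective over LID, speaker, and ASR heads.
weights = {"lid": 1.0, "speaker": 0.5, "asr": 1.0}
targets = {"lid": [0.2, 0.3], "speaker": [0.0, 0.1], "asr": [0.2, 0.2]}
joint_loss = sum(w * task_loss(emb, targets[t]) for t, w in weights.items())
```

The per-task weights let one task (e.g. ASR) dominate the gradient while auxiliary tasks regularize the shared representation, which is the usual trade-off such a framework has to tune.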
|