Research Project/Area Number | 21K17776
Research Category | Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | National Institute of Information and Communications Technology (NICT)
Principal Investigator | Shen Peng, NICT, Advanced Speech Translation Research and Development Promotion Center, Universal Communication Research Institute, Senior Researcher (80773118)
Project Period (FY) | 2021-04-01 – 2024-03-31
Project Status | Granted (FY2022)
Budget Amount *Note | 4,550 thousand yen (Direct: 3,500 thousand yen, Indirect: 1,050 thousand yen)
FY2023: 780 thousand yen (Direct: 600 thousand yen, Indirect: 180 thousand yen)
FY2022: 1,690 thousand yen (Direct: 1,300 thousand yen, Indirect: 390 thousand yen)
FY2021: 2,080 thousand yen (Direct: 1,600 thousand yen, Indirect: 480 thousand yen)
Keywords | language identification / speech recognition / cross-domain / pre-training model / self-supervised learning / speaker recognition / language recognition
Outline of Research at the Start |
Developing spoken language and speaker detection techniques is an important task for improving the usability of real-time multilingual speech translation systems. However, current advanced spoken language and speaker detection techniques do not perform well on cross-channel and cross-domain data. In this project, investigations will be conducted into how to better represent the languages and speakers of a speech signal by developing self-supervised, graph-based learning techniques for robust spoken language and speaker detection.
|
Outline of Annual Research Achievements |
I focused on investigating how to better represent speech signals for both language recognition and speech recognition tasks. Specifically, the following work was carried out to advance this project:
1. Improving speech-signal representation for language identification (LID): we proposed a novel transducer-based language embedding approach for LID by integrating an RNN transducer (RNN-T) model into a language embedding framework. Benefiting from the RNN-T's linguistic representation capability, the proposed method exploits both phonetically aware acoustic features and explicit linguistic features for LID. The research paper was accepted by Interspeech 2022. We also applied these techniques to the NICT LID system, where they demonstrated robustness on cross-channel data.
2. Improving RNN-T for Mandarin ASR: we proposed a novel pronunciation-aware unique character encoding for building end-to-end RNN-T-based Mandarin ASR systems. The proposed encoding combines a pronunciation-based syllable with a character index (CI). By introducing the CI, the RNN-T model overcomes the homophone problem while still exploiting pronunciation information when extracting modeling units. With the proposed encoding, model outputs can be converted into the final recognition result through a one-to-one mapping. This paper was accepted by IEEE SLT 2022.
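The syllable-plus-character-index idea in item 2 can be illustrated with a toy sketch. This is not the published system: the mini-lexicon, unit naming, and helper functions below are hypothetical stand-ins (a real system would derive syllables from a full pronunciation lexicon), but the sketch shows how a CI disambiguates homophones while keeping a one-to-one mapping from output units back to characters.

```python
# Toy sketch of a pronunciation-aware unique character encoding
# (syllable + character index). Assumption: a small hand-built
# character -> pinyin-syllable lexicon (tones omitted).
LEXICON = {"是": "shi", "事": "shi", "市": "shi", "你": "ni", "好": "hao"}

def build_codebook(lexicon):
    # Give each character a character index (CI) among its homophones,
    # so every (syllable, CI) pair maps back to exactly one character.
    encode, decode, counts = {}, {}, {}
    for char, syl in lexicon.items():
        ci = counts.get(syl, 0)
        counts[syl] = ci + 1
        unit = f"{syl}_{ci}"          # modeling unit, e.g. "shi_1"
        encode[char] = unit
        decode[unit] = char
    return encode, decode

def encode_text(text, encode):
    return [encode[c] for c in text]

def decode_units(units, decode):
    # One-to-one mapping: recognition output units -> character sequence.
    return "".join(decode[u] for u in units)

encode, decode = build_codebook(LEXICON)
units = encode_text("你好", encode)
assert decode_units(units, decode) == "你好"
# Homophones share a syllable but receive distinct modeling units:
assert encode["是"].split("_")[0] == encode["事"].split("_")[0]
assert encode["是"] != encode["事"]
```

A plain character inventory would merge homophones only by luck of the lexicon; a plain syllable inventory cannot recover the written form. The CI keeps the pronunciation structure of the unit while restoring invertibility.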
|
Current Status of Research Progress (Section) |
2: Research has progressed rather smoothly
Reason
Following the plan, I investigated how to better represent languages for language identification and modeling units for ASR tasks. The work progressed smoothly, and the research results were published at top-tier international conferences.
|
Strategy for Future Research Activity |
I will focus on investigating how to build a single universal model that handles speaker, language, and speech recognition tasks. I will concentrate on the following problems: 1. Investigating large universal models, such as Whisper, and attempting to train or fine-tune similar large models. 2. Building a joint-task training framework on top of a pre-trained large speech representation model.
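The joint-task framework in item 2 can be sketched in miniature. Everything here is a hypothetical stand-in: the "encoder" is a random projection in place of a large pre-trained speech representation model (e.g. a Whisper-style encoder), and the task heads and dimensions are illustrative only. The sketch shows the shared-embedding structure: one frozen representation, several task-specific heads, one summed loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(features):
    # Stand-in for a frozen pre-trained speech encoder:
    # (time, feat_dim) -> (time, hid_dim) via a fixed random projection.
    w = rng.standard_normal((features.shape[1], 16))
    return features @ w

def pooled(rep):
    # Utterance-level embedding by mean pooling over time.
    return rep.mean(axis=0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Task-specific linear heads sharing the same utterance embedding
# (class counts are arbitrary for the sketch).
heads = {
    "language": rng.standard_normal((16, 3)),  # e.g. 3 languages
    "speaker": rng.standard_normal((16, 5)),   # e.g. 5 speakers
}

def joint_loss(features, targets):
    # Sum of per-task cross-entropy losses on the shared embedding;
    # in real training, gradients would update the heads (and
    # optionally fine-tune the encoder).
    emb = pooled(encoder(features))
    loss = 0.0
    for task, w in heads.items():
        probs = softmax(emb @ w)
        loss += -np.log(probs[targets[task]] + 1e-12)
    return loss

feats = rng.standard_normal((50, 8))           # dummy utterance features
loss = joint_loss(feats, {"language": 0, "speaker": 2})
assert np.isfinite(loss)
```

The design point is that all tasks read the same utterance embedding, so improving the shared representation (by pre-training or fine-tuning) benefits every head at once.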
|