Project/Area Number | 21K17776 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 61010: Perceptual information processing-related |
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator | Shen Peng, National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Universal Communication Research Institute, Senior Researcher (80773118) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Project Status | Completed (Fiscal Year 2023) |
Budget Amount |
¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000)
Fiscal Year 2023: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2022: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2021: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
|
Keywords | language identification / speech recognition / self-supervised learning / speaker recognition / pre-training model / large language models / speaker diarization / cross-domain / language recognition |
Outline of Research at the Start |
Developing spoken-language and speaker detection techniques is an important task for improving the usability of real-time multilingual speech translation systems. However, current state-of-the-art spoken-language and speaker detection techniques do not perform well on cross-channel and cross-domain data. In this project, we will investigate how to better represent the languages and speakers in a speech signal by developing self-supervised, graph-based learning techniques for robust spoken-language and speaker detection.
|
Outline of Final Research Achievements |
In this project, we focused on developing self-supervised and pre-training techniques to enhance spoken-language and speaker recognition. We experimented with different methods to better capture the characteristics of languages and speakers in speech signals. Our proposed techniques include transducer-based language embeddings, pronunciation-aware character encoding, cross-modal alignment, and generative linguistic representations. These innovations aim to improve language recognition, speaker recognition, and speech recognition. Furthermore, we explored multi-task recognition to perform language, speaker, and speech recognition with a single model. The results of this project have been published at top international conferences, including IEEE ICASSP, SLT, ASRU, and Interspeech.
|
Academic Significance and Societal Importance of the Research Achievements |
The overarching goal of this project is to advance the understanding and representation of speech signals, which carries significant scientific value. Techniques that improve the performance of language and speaker recognition also help to advance practical technological applications.
|