Project/Area Number | 23K11227
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Application Category | General
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | National Institute of Information and Communications Technology
Principal Investigator |
李 勝  National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Researcher (70840940)
Co-Investigators |
李 吉屹  University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Associate Professor (30726667)
チョ シンキ  Kyoto University, Graduate School of Informatics, Program-Specific Associate Professor (70784891)
Project Period (FY) | 2023-04-01 – 2026-03-31
Project Status | Granted (FY2023)
Budget Amount *Note |
Total: 4,810 thousand yen (Direct Cost: 3,700 thousand yen, Indirect Cost: 1,110 thousand yen)
FY2025: 1,300 thousand yen (Direct Cost: 1,000 thousand yen, Indirect Cost: 300 thousand yen)
FY2024: 1,300 thousand yen (Direct Cost: 1,000 thousand yen, Indirect Cost: 300 thousand yen)
FY2023: 2,210 thousand yen (Direct Cost: 1,700 thousand yen, Indirect Cost: 510 thousand yen)
Keywords | speech recognition / Multitask / Multimodal / Multilingual / Low-resource / quality estimation / federated learning
Outline of Research at the Start |
Cross-modality, general-purpose multitask modeling, and cross-lingual communication are three key features of next-generation artificial intelligence. This research focuses on advancing these three features simultaneously in automatic speech recognition (ASR) systems to answer three questions: (1) Can information from rich-resource languages aid the understanding of low-resource languages? (2) Can information from other modalities aid the understanding of low-resource languages? (3) Can additional information from other tasks aid the understanding of low-resource languages?
Outline of Annual Research Achievements |
This research project aims to solve the classic low-resource problem in the speech recognition field by drawing on solutions from natural language processing (NLP), multimodal modeling, and the big-data community. The research achievements of FY2023 were fruitful. Our publications appeared not only in traditional speech conferences (ICASSP, ASRU) and journals (Speech Communication) but also in top NLP conferences (ACL, IWSLT), a big-data conference (DASFAA), a neural network conference (ICANN), and a multimedia conference (ACM Multimedia Asia). The achievements were also reported at domestic conferences in both the speech and NLP communities. I also participated in challenges on speech recognition and on quality estimation for speech synthesis, achieving top-ranking scores in both.
Current Status of Research Progress (Category) |
2: Progressing generally smoothly
Reason |
To address the low-resource problems of speech recognition, we proposed the following methods concerning multilingual, multimodal, and multitask modeling:
1. For the multilingual problem, we proposed universal language modeling technology. In FY2023, an enhanced hierarchical softmax modeling method was used to encode hundreds of languages, and we reported on it at the ASJ 2023 autumn meeting. We also held a workshop to promote data collection and sharing for low-resource languages.
2. For multimodal modeling, we introduced multimodal modeling techniques, such as model reprogramming, into speech processing.
3. Pretrained speech and language models were used together within my proposed multitask downstream framework. I successfully combined the wav2vec 2.0 model with GPT and BERT models for dialectal speech recognition; a minimal sketch of this style of combination is given after this list. Moreover, I proposed combining the current state-of-the-art speech recognition model, OpenAI Whisper, with a large language model, Meta Llama 2.
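As a minimal illustrative sketch of item 3 above, and not the project's actual implementation, the following Python snippet rescores ASR hypotheses with a pretrained causal language model; the model name "gpt2", the interpolation weight, and the example hypotheses are assumptions for illustration only.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load a small causal LM; in practice a stronger model would replace "gpt2".
lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_prob(text: str) -> float:
    # Approximate total log-probability of a hypothesis under the causal LM
    # (the model's loss is the mean negative log-likelihood per token).
    ids = lm_tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm_model(ids, labels=ids)
    return -out.loss.item() * ids.size(1)

def rescore(nbest, lm_weight=0.3):
    # nbest: list of (hypothesis_text, acoustic_score) pairs produced by any
    # ASR decoder (e.g., a wav2vec 2.0 CTC model); returns the hypothesis with
    # the best linear interpolation of acoustic and LM scores.
    scored = [(hyp, ac + lm_weight * lm_log_prob(hyp)) for hyp, ac in nbest]
    return max(scored, key=lambda item: item[1])[0]

# Toy usage with made-up hypotheses and acoustic scores:
print(rescore([("he red the book", -4.1), ("he read the book", -4.3)]))

The snippet is meant only to show the general shape of such a speech-model/language-model combination, not how the models were combined in the published work.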
Strategy for Future Research Activity |
In FY2023, large language models attracted significant attention from both industry and academia. In my research, I also empirically showed that they can substantially improve performance on most speech tasks. In FY2024, I will therefore integrate a large language model into our speech recognition task, while continuing to follow developments in multimodal modeling technology.
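As a rough sketch of this FY2024 direction, and not an implemented system, the following Python snippet passes a Whisper transcript to an instruction-tuned large language model for error correction; the model identifiers, the prompt wording, and the file name "audio.wav" are placeholders rather than the project's actual settings.

from transformers import pipeline

# Speech recognition front end (Whisper) and a text-generation LLM back end.
# The Llama 2 checkpoint is gated on the Hugging Face Hub; any instruction-tuned
# LLM can substitute here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

transcript = asr("audio.wav")["text"]
prompt = (
    "The following automatic speech recognition transcript may contain errors. "
    "Rewrite it with the errors corrected:\n"
    f"{transcript}\nCorrected:"
)
# The pipeline output includes the prompt; the corrected transcript follows "Corrected:".
corrected = llm(prompt, max_new_tokens=128)[0]["generated_text"]
print(corrected)

A loose coupling like this keeps the recognizer and the language model independent; tighter integrations of the two models are also possible.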