Research Project Number | 21K11951 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Application Category | General |
Review Section | Basic Section 61010: Perceptual information processing-related |
Research Institution | National Institute of Informatics |
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156) |
Co-Investigator | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Project Status | Granted (FY2022) |
Budget Amount *Note | ¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000) |
FY2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
FY2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
FY2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords | speech synthesis / self-supervised learning / low-resource languages / speech assessment / mean opinion score / text-to-speech / vocoder / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks |
Outline of Research at the Start |
Language technology has improved due to advances in neural-network-based approaches; for example, speech synthesis has reached the quality of human speech. However, neural models require large quantities of data. Speech technologies bring the social benefits of accessibility and communication; to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory-class-based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures.
|
Outline of Annual Research Achievements |
In this second year of the project, we focused on two main topics: language-independent, data-efficient text-to-speech synthesis for low-resource languages using self-supervised speech representations, and automatic mean opinion score prediction for synthesized speech.
Self-supervised speech representations have proven remarkably useful for many downstream speech tasks and have been shown to encode phonetic information. We therefore chose them as an intermediate representation for a text-to-speech model trained on data from many languages, which can then be fine-tuned to a new language using only a small amount of data. This work is ongoing, in collaboration with researchers from the National Research Council of Canada and the University of Edinburgh.
We have also identified automatic evaluation of synthesized speech as an important topic for low-resource languages, since recruiting listeners for listening tests can be especially difficult for these languages. In collaboration with Nagoya University and Academia Sinica, we co-organized the first VoiceMOS Challenge, a shared task on automatic mean opinion score (MOS) prediction for synthesized speech. The challenge attracted 22 participating teams from academia and industry, and we ran a special session about it at Interspeech 2022. The challenge has advanced the field by generating a great deal of interest in this topic.
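At its core, the MOS-prediction task described above is regression from a speech utterance to an average listener rating on a 1-5 scale. The sketch below illustrates that framing only; it is not a VoiceMOS baseline or any participant's system. All names and the data are invented: synthetic random vectors stand in for frame-level features (which in practice would come from a self-supervised speech model), synthetic scores stand in for listening-test ratings, and a simple ridge regression on mean-pooled features stands in for a learned predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_features(frames):
    """Mean-pool frame-level features (T x D) into one utterance-level vector (D,)."""
    return frames.mean(axis=0)

# Synthetic stand-in data: variable-length utterances of D-dim frame features,
# and MOS-like targets clipped to the 1-5 rating scale.
n_utts, dim = 200, 16
true_w = rng.normal(size=dim)
utts = [rng.normal(size=(rng.integers(50, 200), dim)) for _ in range(n_utts)]
X = np.stack([pool_features(u) for u in utts])
y = np.clip(X @ true_w + 3.0 + 0.1 * rng.normal(size=n_utts), 1.0, 5.0)

# Ridge regression in closed form: w = (Xb^T Xb + lam*I)^(-1) Xb^T y,
# with a bias column appended to the pooled features.
lam = 1e-2
Xb = np.hstack([X, np.ones((n_utts, 1))])
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(dim + 1), Xb.T @ y)

pred = Xb @ w
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"training RMSE: {rmse:.3f}")
```

Real MOS predictors replace the pooled-linear model with a neural regressor fine-tuned end-to-end, but the input/output contract — utterance in, scalar score out — is the same, which is why held-out and out-of-domain generalization (the focus of the 2023 challenge edition) is the hard part.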
|
Current Status of Research Progress (Category) |
2: Progressing on the whole smoothly
Reason
In the first year of the project, we had proposed to work on language-independent speech synthesis, but instead worked on efficient speech synthesis architectures using neural-network pruning (originally scheduled for the third year). We therefore worked this year on language-independent approaches to speech synthesis. While we had originally planned to investigate articulatory features for this purpose, we shifted our focus to self-supervised speech representations, since these appear very promising and well suited to our task.
Although it was outside our original proposal, automatic speech quality assessment has emerged during this project as an important and relevant topic. The ability to automatically predict the quality of synthesized speech, especially for low-resource languages, will facilitate future research in low-resource speech synthesis.
|
Strategy for Future Research Activity |
Although we have somewhat changed the order of topics from our original plan, the topic of multimodal text and speech modeling for low-resource languages still remains, and we will therefore focus on it in the third year. We will also continue our ongoing research on language-independent speech synthesis adaptable to low-resource languages, and we will run the 2023 edition of the VoiceMOS Challenge, which focuses on zero-shot MOS prediction for out-of-domain synthesized speech.
|