Project/Area Number |
21K11951
|
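Research Institution | National Institute of Informatics |
Principal Investigator |
Cooper Erica National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)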
|
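Co-Investigator |
Kruengkrai Canasai National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)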
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Keywords | speech synthesis / self-supervised learning / low-resource languages / speech assessment / mean opinion score |
Outline of Annual Research Achievements |
In this second year of the project, we looked at two main topics: language-independent, data-efficient text-to-speech synthesis for low-resource languages using self-supervised speech representations, and automatic mean opinion score prediction.
Self-supervised speech representations have proven useful for many downstream speech tasks and have been shown to contain phonetic information. We therefore chose them as the intermediate representation for a text-to-speech synthesis system trained on data from many languages, which can then be fine-tuned to a new language using only a small amount of data. This work is ongoing, and we are collaborating with researchers from the National Research Council of Canada and the University of Edinburgh.
We have also identified automatic evaluation of synthesized speech as an important topic for low-resource languages, since finding listeners to participate in listening tests can be especially difficult for these languages. In collaboration with Nagoya University and Academia Sinica, we co-organized the first VoiceMOS Challenge, a shared task for automatic mean opinion score (MOS) prediction for synthesized speech. The challenge attracted 22 participating teams from academia and industry, and we ran a special session about the challenge at Interspeech 2022. This challenge has advanced the field by generating a great deal of interest in this topic.
|
Current Status of Research Progress (Category) |
2: Research is progressing mostly smoothly
Reason
We had originally proposed to work on language-independent speech synthesis in the first year of the project, but instead worked on efficient speech synthesis architectures using neural network pruning (originally scheduled for the third year). We therefore worked on language-independent approaches to speech synthesis this year. While we had originally planned to investigate articulatory features for this purpose, we shifted our focus to self-supervised speech representations, since these seem very promising and well suited to our task.
Although it was outside of our original proposal, automatic speech quality assessment has arisen during this project as an important and relevant topic. The ability to automatically predict the quality of synthesized speech, especially for low-resource languages, will facilitate future research in low-resource speech synthesis.
|
Strategy for Future Research Activity |
Although we changed the order of the topics in the original plan somewhat, the topic of multimodal text and speech modeling for low-resource languages remains, so we will focus on it in the third year. We will also continue our ongoing research on language-independent speech synthesis that is adaptable to low-resource languages, and we will run the 2023 edition of the VoiceMOS Challenge, which focuses on zero-shot prediction of out-of-domain synthesized speech.
|
Reason for Carryover to the Next Fiscal Year |
Travel expenses were not used due to the ongoing coronavirus situation in 2022.
The remaining budget will be used to attend international conferences in 2023.
|
Remarks |
Official website for the VoiceMOS Challenge 2022, and open-source code for the SSL-based MOS predictor, which was one of the baseline systems for the challenge.
|