Project/Area Number | 21K11951
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator (Kenkyū-buntansha) | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) | 2021-04-01 – 2024-03-31
Project Status | Granted (Fiscal Year 2022)
Budget Amount | ¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords | speech synthesis / self-supervised learning / low-resource languages / speech assessment / mean opinion score / text-to-speech / vocoder / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks
Outline of Research at the Start |
Language technology has improved due to advances in neural-network-based approaches; for example, speech synthesis has reached quality comparable to that of human speech. However, neural models require large quantities of data. Speech technologies bring the social benefits of accessibility and communication; to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory-class-based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures.
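As a rough illustration of the first proposed topic (a toy sketch only; the phone inventory and class assignments below are hypothetical examples, not the project's actual feature set), articulatory classes can replace language-specific phoneme identities as TTS input features, so that the representation transfers across languages:

    # Toy sketch: articulatory-class input features for TTS.
    # Phones are described by coarse articulatory classes rather than
    # language-specific phoneme identities, so the features can transfer
    # across languages sharing the same articulatory inventory.
    ARTICULATORY_CLASSES = {
        # phone: (manner, place, voicing)
        "p": ("plosive", "bilabial", "voiceless"),
        "b": ("plosive", "bilabial", "voiced"),
        "s": ("fricative", "alveolar", "voiceless"),
        "a": ("vowel", "open", "voiced"),
        "i": ("vowel", "close", "voiced"),
    }

    MANNERS = ["plosive", "fricative", "vowel"]
    PLACES = ["bilabial", "alveolar", "open", "close"]
    VOICINGS = ["voiceless", "voiced"]

    def phone_to_vector(phone):
        """Concatenated one-hot encoding of manner, place, and voicing."""
        manner, place, voicing = ARTICULATORY_CLASSES[phone]
        return ([int(m == manner) for m in MANNERS]
                + [int(p == place) for p in PLACES]
                + [int(v == voicing) for v in VOICINGS])

    # A word becomes a sequence of articulatory vectors for the TTS encoder.
    print([phone_to_vector(p) for p in "pasi"])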
Outline of Annual Research Achievements |
In this second year of the project, we looked at two main topics: language-independent, data-efficient text-to-speech synthesis for low-resource languages using self-supervised speech representations, and automatic mean opinion score prediction.
Self-supervised representations of speech have shown remarkable usefulness for many downstream speech tasks and have been shown to contain phonetic information. We therefore chose them as an intermediate representation for text-to-speech synthesis trained on data from many languages, which can then be fine-tuned to a new language using only a small amount of data. This work is ongoing, in collaboration with researchers from the National Research Council of Canada and the University of Edinburgh.
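As a rough sketch of this kind of pipeline (not the project's actual model; the file name and choice of layer are placeholders), self-supervised features can be extracted with a pretrained HuBERT model from torchaudio and used as the intermediate targets that a text encoder learns to predict:

    # Minimal sketch: extracting self-supervised speech features as
    # intermediate TTS targets. Requires torch and torchaudio; "sample.wav"
    # is a placeholder for any speech recording.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.HUBERT_BASE   # pretrained 12-layer HuBERT
    model = bundle.get_model().eval()

    waveform, sr = torchaudio.load("sample.wav")
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.inference_mode():
        # One (batch, frames, 768) tensor per transformer layer.
        features, _ = model.extract_features(waveform)

    # A middle layer is a common choice of target representation; the
    # text-to-feature acoustic model and the feature-to-waveform decoder
    # are trained separately and omitted here.
    targets = features[6]
    print(targets.shape)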
We have also identified automatic evaluation of synthesized speech as an important topic for low-resource languages, since recruiting listeners for listening tests can be especially difficult for these languages. In collaboration with Nagoya University and Academia Sinica, we co-organized the first VoiceMOS Challenge, a shared task on automatic mean opinion score (MOS) prediction for synthesized speech. The challenge attracted 22 participating teams from academia and industry, and we ran a special session about it at Interspeech 2022; it has advanced the field by drawing a great deal of interest to this topic.
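A minimal sketch of an SSL-based MOS predictor in this spirit (a common recipe: a pretrained wav2vec 2.0 encoder, mean pooling over time, and a linear regression head trained with MSE; the hyperparameters and dummy data below are illustrative, not the challenge baseline's exact configuration):

    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    encoder = bundle.get_model()
    head = torch.nn.Linear(768, 1)   # 768 = wav2vec 2.0 base feature size

    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(head.parameters()), lr=1e-5)

    def training_step(waveform, mos_label):
        """One fine-tuning step on a (waveform, listener MOS) pair."""
        features, _ = encoder.extract_features(waveform)
        pooled = features[-1].mean(dim=1)   # mean pooling over frames
        pred = head(pooled).squeeze(-1)     # scalar MOS prediction
        loss = torch.nn.functional.mse_loss(pred, mos_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Dummy example: a one-second waveform with an average listener score of 3.5.
    wav = torch.randn(1, bundle.sample_rate)
    print(training_step(wav, torch.tensor([3.5])))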
Current Status of Research Progress |
2: Research has progressed, on the whole, more than originally planned.
Reason
We had originally proposed to work on language-independent speech synthesis in the first year of the project, but instead worked on efficient speech synthesis architectures using neural network pruning, a topic originally scheduled for the third year (a minimal sketch of such pruning appears below). This year we therefore turned to language-independent approaches for speech synthesis. While we had originally planned to investigate articulatory features for this purpose, we shifted our focus to self-supervised speech representations instead, since these seem very promising and well-suited to our task.
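For reference, magnitude-based pruning of the kind used for efficient architectures can be sketched as follows (illustrative only; the project's actual pruning method and sparsity levels are not reproduced here):

    import torch
    import torch.nn.utils.prune as prune

    layer = torch.nn.Linear(512, 512)   # stand-in for one TTS model layer

    # Zero out the 70% of weights with the smallest L1 magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.7)

    # Bake the pruning mask into the weight tensor permanently.
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"sparsity: {sparsity:.0%}")   # about 70% of weights are now zero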
Although it was outside our original proposal, automatic speech quality assessment has emerged during this project as an important and relevant topic. The ability to automatically predict the quality of synthesized speech, especially for low-resource languages, will facilitate future research in low-resource speech synthesis.
Strategy for Future Research Activity |
Although we somewhat changed the order of topics from the original plan, the topic of multimodal text and speech modeling for low-resource languages remains; we will therefore focus on it in the third year. We will also continue our ongoing research on language-independent speech synthesis adaptable to low-resource languages, and we will run the 2023 edition of the VoiceMOS Challenge, which focuses on zero-shot MOS prediction for out-of-domain synthesized speech.