Project/Area Number | 21K11951 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Section | General |
Review Section | Basic Section 61010: Perceptual information processing-related |
Research Institution | National Institute of Informatics |
Principal Investigator |
Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Associate Professor (30843156) |
Co-Investigator (Kenkyū-buntansha) |
Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Project Status | Completed (Fiscal Year 2023) |
Budget Amount | ¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000) |
Keywords | text-to-speech synthesis / low-resource languages / speech evaluation / speech synthesis / self-supervised learning / speech assessment / mean opinion score / text-to-speech / vocoder / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks |
Outline of Research at the Start |
Language technology has improved due to advances in neural-network-based approaches; for example, speech synthesis has reached the quality of human speech. However, neural models require large quantities of data. Speech technologies bring the social benefits of accessibility and communication; to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory-class-based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures. |
Outline of Annual Research Achievements |
We developed methods for text-to-speech (TTS) synthesis for low-resource languages using smaller amounts of data as well as data from less traditional sources. First, we developed an approach to building TTS corpora from podcast data, using Hebrew as a case study and releasing a publicly available dataset. We then built a data processing pipeline and TTS system that can be repurposed for other low-resource languages with similar available data, resulting in a peer-reviewed publication at Interspeech 2023. Finally, we continued investigating self-supervised speech representations as an intermediate representation for multilingual TTS that can be fine-tuned to a new language.
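To illustrate the intermediate-representation idea, the minimal sketch below extracts frame-level features from a pretrained self-supervised speech model and quantizes them into discrete units; a TTS acoustic model can then be trained to predict such unit sequences from text, with a separate unit-to-waveform vocoder producing audio. This assumes the transformers and scikit-learn libraries; the model name, layer index, and cluster count are illustrative, not this project's actual configuration.

# Minimal sketch (assumed libraries: transformers, scikit-learn): derive
# discrete self-supervised speech units as a language-independent
# intermediate representation for TTS. Model name, layer index, and
# cluster count are illustrative assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

wav = torch.randn(1, 32000)  # placeholder for 2 seconds of 16 kHz audio

with torch.no_grad():
    # hidden_states[i] holds frame-level features from the i-th encoder layer
    out = model(wav, output_hidden_states=True)
    feats = out.hidden_states[6].squeeze(0).numpy()  # (num_frames, 768)

# Quantize frames into discrete "pseudo-phone" units. In practice, k-means is
# fit once over features from a large corpus with many more clusters; here a
# small toy clustering is run on a single utterance.
units = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(feats)
print(units[:20])  # first 20 discrete units for this utterance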
Having previously identified automatic evaluation of TTS as a critical issue, especially for low-resource languages, we continued the VoiceMOS Challenge, a shared task for automatic TTS evaluation, with a second edition focusing on zero-shot multi-domain scenarios. The challenge was presented as a special session at ASRU 2023 and attracted ten teams from academia and industry. We also studied contextual effects on listener ratings, the ability of self-supervised speech models to predict speech quality, and a ranking-based approach to quality prediction, resulting in three additional peer-reviewed publications. |
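In the spirit of the self-supervised quality-prediction systems studied in this line of work, the following minimal sketch attaches a small regression head to a pretrained wav2vec 2.0 encoder and fine-tunes it against listener mean opinion scores. The model name, mean pooling, and L1 loss are illustrative assumptions, not a specific challenge system.

# Minimal sketch (assumed library: transformers) of an SSL-based MOS
# predictor: a pretrained wav2vec 2.0 encoder with a linear regression head.
# Model name, pooling, and loss are illustrative assumptions.
import torch
from transformers import Wav2Vec2Model

class MOSPredictor(torch.nn.Module):
    def __init__(self, ssl_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
        self.head = torch.nn.Linear(self.ssl.config.hidden_size, 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples), raw 16 kHz waveforms
        frames = self.ssl(wav).last_hidden_state  # (batch, num_frames, dim)
        utt = frames.mean(dim=1)                  # utterance-level embedding
        return self.head(utt).squeeze(-1)         # one predicted score per clip

model = MOSPredictor()
scores = model(torch.randn(2, 32000))  # untrained predictions for two clips
# Fine-tune against listener mean opinion scores with a regression loss:
loss = torch.nn.functional.l1_loss(scores, torch.tensor([4.0, 2.5]))
loss.backward()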