2021 Fiscal Year Research-status Report
Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation
Project/Area Number | 21K11951
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator (Kenkyū-buntansha) | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | text-to-speech / vocoder / speech synthesis / pruning / efficiency
Outline of Annual Research Achievements
In this first year, we explored the effects of pruning end-to-end neural models for single-speaker text-to-speech (TTS) synthesis, evaluating how different levels of pruning affect the naturalness, intelligibility, and prosody of the synthesized speech. The pruning method follows a "prune, adjust, and re-prune" approach previously found effective for self-supervised speech models used in automatic speech recognition. We found that both neural vocoders and neural text-to-mel synthesis models are highly over-parametrized: up to 90% of the model weights can be pruned without detriment to the quality of the output speech. This was determined from experiments pruning multiple types of neural acoustic models, including Tacotron and Transformer TTS, as well as the Parallel WaveGAN neural vocoder, evaluated using both objective metrics (word error rate from an automatic speech recognizer, and measures of f0 and utterance duration) and subjective Mean Opinion Score and A/B comparison listening tests. Pruned models are smaller and thus more efficient in both computation and storage, and these results address our original proposed aim of finding more efficient neural TTS models. This work resulted in one peer-reviewed conference publication at ICASSP 2022.
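To make the procedure concrete, below is a minimal sketch of such an iterative "prune, adjust, re-prune" loop in PyTorch. The toy model, the per-layer L1 magnitude criterion, and the three-step sparsity schedule are illustrative assumptions for this sketch only, not the exact configuration used in our experiments.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a TTS acoustic model or vocoder; the real models
# (Tacotron, Transformer TTS, Parallel WaveGAN) are far larger.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

def prunable(model):
    # Collect (module, parameter-name) pairs for all Linear layers.
    return [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]

def sparsity(model):
    # Fraction of weights that are exactly zero across prunable layers.
    zeros, total = 0, 0
    for m, name in prunable(model):
        w = getattr(m, name)
        zeros += int((w == 0).sum())
        total += w.numel()
    return zeros / total

# Prune toward 90% sparsity in stages. Because prune.remove() bakes the
# zeros into the weights, a fine-tuning ("adjust") step between rounds is
# free to regrow pruned weights before the next re-pruning pass.
for target in (0.5, 0.7, 0.9):
    for m, name in prunable(model):
        # Zero out the lowest-magnitude fraction of each layer's weights.
        prune.l1_unstructured(m, name=name, amount=target)
        prune.remove(m, name)
    # ... fine-tune ("adjust") on TTS training data here ...
    print(f"sparsity after this round: {sparsity(model):.2%}")

In the actual experiments, synthesis quality was re-checked at each sparsity level using the objective and subjective evaluations described above.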
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
This year's work addresses one of the original aims of our proposal: finding more efficient neural TTS models. Although we originally proposed to achieve this using neural architecture search, we found that a pruning-based approach was more computationally efficient and allowed us to directly examine the tradeoff between model size and synthesis output quality.
Strategy for Future Research Activity
The original research plans were: 1) articulatory-feature-based neural TTS; 2) multimodal speech and text translation; 3) neural architecture search for more efficient neural TTS.
One change of plan was that we spent this year investigating efficient neural TTS, which was originally scheduled for the third year. As our next step, we will investigate articulatory-feature-based input for neural end-to-end speech synthesis in low-resource languages and dialects. Based on this past year's findings, we also plan to investigate how well the pruned models we developed can make use of low-resource data.
Causes of Carryover
The budget for GPUs and CPUs was not spent due to the global semiconductor shortage, and the budget for travel to international conferences was not spent due to the global pandemic. We plan to purchase servers and attend conferences in person in 2022 if the global situation allows.
Remarks
Webpage for ICASSP 2022 paper containing audio samples.
Research Products (3 results)
[Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2022)
Author(s) | Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
Organizer | ICASSP 2022 (Int'l Joint Research)