Research Project Number | 21K11951
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | text-to-speech / vocoder / speech synthesis / pruning / efficiency
Outline of Annual Research Achievements | In this first year, we explored the effects of pruning end-to-end neural network models for single-speaker text-to-speech (TTS) synthesis, and evaluated how different levels of pruning affect the naturalness, intelligibility, and prosody of the synthesized speech. The pruning method follows a "prune, adjust, and re-prune" approach previously found effective for self-supervised speech models used in automatic speech recognition. We found that both neural vocoders and neural text-to-mel synthesis models are highly over-parameterized: up to 90% of the model weights can be pruned without detriment to the quality of the synthesized speech. This was determined from experiments pruning multiple types of neural acoustic models, including Tacotron and Transformer TTS, as well as the Parallel WaveGAN neural vocoder, evaluated using both objective metrics (word error rate from an automatic speech recognizer, and measures of f0 and utterance duration) and subjective listening tests (mean opinion score and A/B comparisons). Pruned models are smaller and thus more efficient in both computation and storage, and these results address our original proposed aim of finding more efficient neural TTS models. This work resulted in one peer-reviewed conference publication at ICASSP 2022.
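The "prune, adjust, and re-prune" idea can be illustrated with a minimal sketch of iterative magnitude pruning: repeatedly zero out the smallest-magnitude weights at a gradually increasing sparsity level, with a fine-tuning ("adjust") step between pruning rounds. This is a hypothetical NumPy illustration of the general technique, not the project's actual implementation; the function names and the `adjust` hook are assumptions for exposition.

```python
# Hypothetical sketch of iterative magnitude pruning ("prune, adjust, re-prune").
# A NumPy array stands in for a model's weight matrix; the `adjust` callback
# stands in for fine-tuning the still-unpruned weights between pruning rounds.
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries until `sparsity` fraction are zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only larger weights
    return weights * mask

def prune_adjust_reprune(weights, target_sparsity, steps=3, adjust=None):
    """Ramp sparsity up over several prune rounds, adjusting in between."""
    w = weights.copy()
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps
        w = magnitude_prune(w, sparsity)
        if adjust is not None:
            w = adjust(w)  # e.g. a few epochs of fine-tuning on the TTS loss
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned = prune_adjust_reprune(w, target_sparsity=0.9)
print(f"sparsity: {np.mean(pruned == 0):.2f}")  # approximately 0.90
```

In a real TTS training loop the `adjust` step would retrain the remaining weights so that later pruning rounds remove weights the model has already learned to do without, which is what makes such high sparsity levels (around 90%) reachable without quality loss.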
Current Status of Research Progress | 2: Progressing rather smoothly
Reason | This year's work addresses one of the original aims of our proposal: finding more efficient neural TTS models. Although we originally proposed to achieve this using neural architecture search, we found that a pruning-based approach was more computationally efficient and allowed us to directly examine the tradeoff between model size and synthesis quality.
Strategy for Future Research Activity | The original research plans were: 1) articulatory-feature-based neural TTS; 2) multimodal speech and text translation; 3) neural architecture search for more efficient neural TTS. One change of plan is that we spent this year investigating efficient neural TTS, which was originally scheduled for the third year. As our next step, we will investigate articulatory-feature-based input for end-to-end neural speech synthesis in low-resource languages and dialects. Based on this year's findings, we also plan to investigate how well the pruned models we developed can make use of low-resource data.
Causes of Carryover | The budget for GPUs and CPUs was not executed due to the global semiconductor shortage, and the budget for travel to international conferences was not executed due to the global pandemic. We plan to purchase servers and attend conferences in person in 2022 if the global situation allows.
Remarks | Webpage for the ICASSP 2022 paper, containing audio samples.