2021 Fiscal Year Research-status Report
Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation
Project/Area Number | 21K11951
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator (Kenkyū-buntansha) | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | text-to-speech / vocoder / speech synthesis / pruning / efficiency
Outline of Annual Research Achievements
In this first year, we explored the effects of pruning end-to-end neural models for single-speaker text-to-speech (TTS) synthesis, evaluating how different levels of pruning affect the naturalness, intelligibility, and prosody of the synthesized speech. The pruning method follows a "prune, adjust, and re-prune" approach previously found effective for self-supervised speech models used in automatic speech recognition. We found that both neural vocoders and neural text-to-mel synthesis models are highly over-parametrized: up to 90% of the model weights can be pruned without detriment to the quality of the output speech. This was determined from experiments pruning multiple types of neural acoustic models, including Tacotron and Transformer TTS, as well as the Parallel WaveGAN neural vocoder, evaluated using both objective metrics (word error rate from an automatic speech recognizer, and measures of f0 and utterance duration) and subjective Mean Opinion Score and A/B comparison listening tests. Pruned models are smaller and thus more efficient in both computation and storage, and these results address our original proposed aim of finding more efficient neural TTS models. This work resulted in one peer-reviewed conference publication at ICASSP 2022.
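To make the procedure concrete, below is a minimal sketch of such an iterative "prune, adjust, re-prune" loop in PyTorch. The toy model, the per-layer L1 magnitude criterion, and the three-step sparsity schedule are illustrative assumptions for this sketch only, not the exact configuration used in our experiments.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a TTS acoustic model or vocoder; the real models
# (Tacotron, Transformer TTS, Parallel WaveGAN) are far larger.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

def prunable(model):
    # Collect (module, parameter-name) pairs for all Linear layers.
    return [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]

def sparsity(model):
    # Fraction of weights that are exactly zero across prunable layers.
    zeros, total = 0, 0
    for m, name in prunable(model):
        w = getattr(m, name)
        zeros += int((w == 0).sum())
        total += w.numel()
    return zeros / total

# Prune toward 90% sparsity in stages. Because prune.remove() bakes the
# zeros into the weights, a fine-tuning ("adjust") step between rounds is
# free to regrow pruned weights before the next re-pruning pass.
for target in (0.5, 0.7, 0.9):
    for m, name in prunable(model):
        # Zero out the lowest-magnitude fraction of each layer's weights.
        prune.l1_unstructured(m, name=name, amount=target)
        prune.remove(m, name)
    # ... fine-tune ("adjust") on TTS training data here ...
    print(f"sparsity after this round: {sparsity(model):.2%}")

In the actual experiments, synthesis quality was re-checked at each sparsity level using the objective and subjective evaluations described above.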
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
This year's work addresses one of the original aims of our proposal: finding more efficient neural TTS models. Although we originally proposed to achieve this using neural architecture search, we found that a pruning-based approach was more computationally efficient and allowed us to directly examine the tradeoff between model size and synthesis output quality.
Strategy for Future Research Activity
The original research plans were: 1) articulatory-feature-based neural TTS; 2) multimodal speech and text translation; 3) neural architecture search for more efficient neural TTS.
One change of plan was that we spent this year investigating efficient neural TTS, which was originally scheduled for the third year. As our next step, we will investigate articulatory-feature-based input for neural end-to-end speech synthesis in low-resource languages and dialects. Based on this past year's findings, we also plan to investigate how well the pruned models we developed can make use of low-resource data.
Causes of Carryover
The budget for GPUs and CPUs was not spent due to the global semiconductor shortage, and the budget for travel to international conferences was not spent due to the global pandemic. We plan to purchase servers and attend conferences in person in 2022 if the global situation allows.
Remarks
Webpage for ICASSP 2022 paper containing audio samples.
Research Products (3 results)
[Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2022)
Author(s) | Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
Organizer | ICASSP 2022 (Int'l Joint Research)