Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation
Project/Area Number | 21K11951
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator (Kenkyū-buntansha) | Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) | 2021-04-01 – 2024-03-31
Project Status | Granted (Fiscal Year 2021)
Budget Amount | ¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords | text-to-speech / vocoder / speech synthesis / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks
Outline of Research at the Start
Language technology has improved thanks to advances in neural-network-based approaches; speech synthesis, for example, has reached the quality of human speech. However, neural models require large quantities of training data. Speech technologies bring the social benefits of accessibility and communication; to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory-class-based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures.
Outline of Annual Research Achievements
In this first year, we explored pruning of end-to-end neural models for single-speaker text-to-speech (TTS) synthesis, evaluating how different levels of pruning affect the naturalness, intelligibility, and prosody of the synthesized speech. The pruning method follows a "prune, adjust, and re-prune" approach previously found effective for self-supervised speech models used in automatic speech recognition. We found that both neural vocoders and neural text-to-mel synthesis models are highly over-parametrized: up to 90% of the model weights can be pruned without detriment to the quality of the synthesized speech. This was determined through experiments pruning multiple types of neural acoustic models, including Tacotron and Transformer TTS, as well as the Parallel WaveGAN neural vocoder. Models were evaluated both with objective metrics, such as the word error rate of an automatic speech recognizer and measures of F0 and utterance duration, and with subjective Mean Opinion Score and A/B comparison listening tests. Pruned models are smaller and thus more efficient in computation and storage, so these results address our original proposed aim of finding more efficient neural TTS models. This work resulted in one peer-reviewed conference publication at ICASSP 2022.
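At its core, the "prune, adjust, and re-prune" recipe is iterative magnitude pruning with fine-tuning between pruning rounds. The following is a minimal NumPy sketch of the magnitude-pruning step and the outer loop; it illustrates the general technique only, not the exact procedure used in our experiments, and the function name `magnitude_prune` and the toy weight dictionary are our own illustrative choices.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of each layer's weights.

    weights: dict mapping layer name -> weight array
    sparsity: fraction of weights to remove (e.g. 0.9)
    Returns (pruned weights, binary masks).
    """
    pruned, masks = {}, {}
    for name, w in weights.items():
        k = int(sparsity * w.size)  # number of weights to prune
        if k == 0:
            threshold = -np.inf     # nothing to prune at this level
        else:
            # k-th smallest magnitude is the cut-off
            threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        mask = (np.abs(w) > threshold).astype(w.dtype)
        masks[name] = mask
        pruned[name] = w * mask
    return pruned, masks

# Illustrative prune -> adjust -> re-prune loop with a toy single-layer "model";
# the fine-tuning ("adjust") step is only indicated by a comment here.
rng = np.random.default_rng(0)
weights = {"layer1": rng.standard_normal((64, 64))}
for target in (0.5, 0.7, 0.9):      # gradually raise sparsity each round
    weights, masks = magnitude_prune(weights, target)
    # ... fine-tune the surviving weights here, keeping the masks fixed ...
sparsity = 1.0 - np.count_nonzero(weights["layer1"]) / weights["layer1"].size
```

In practice each pruning round is followed by fine-tuning the surviving weights with the masks held fixed, raising the sparsity target toward a final level such as the 90% reported above.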
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
This year's work addresses one of the original aims of our proposal: finding more efficient neural TTS models. Although we originally proposed to achieve this with neural architecture search, we found that a pruning-based approach was more computationally efficient and allowed us to examine directly the trade-offs between model size and synthesis output quality.
Strategy for Future Research Activity
The original research plans were: 1) articulatory-feature-based neural TTS; 2) multimodal speech and text translation; and 3) neural architecture search for more efficient neural TTS.
One change of plan is that we spent this year investigating efficient neural TTS, which was originally scheduled for the third year. As our next step, we will investigate articulatory-feature-based input for neural end-to-end speech synthesis in low-resource languages and dialects. Based on this year's findings, we also plan to investigate how well the pruned models we developed can make use of low-resource data.
Report (1 result)
Research Products (3 results)
[Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2022)
Author(s): Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
Organizer: ICASSP 2022
Related Report: Int'l Joint Research