
Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation

Research Project

Project/Area Number 21K11951
Research Category

Grant-in-Aid for Scientific Research (C)

Allocation Type Multi-year Fund
Section General
Review Section Basic Section 61010: Perceptual information processing-related
Research Institution National Institute of Informatics

Principal Investigator

Cooper Erica  National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)

Co-Investigator (Kenkyū-buntansha) Kruengkrai Canasai  National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)
Project Period (FY) 2021-04-01 – 2024-03-31
Project Status Granted (Fiscal Year 2021)
Budget Amount
¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords text-to-speech / vocoder / speech synthesis / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks
Outline of Research at the Start

Language technology has improved due to advances in neural-network-based approaches; for example, speech synthesis has reached the quality of human speech. However, neural models require large quantities of data. Speech technologies bring social benefits of accessibility and communication; to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory-class-based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures.
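As a rough illustration of the first proposed direction, the sketch below shows how phonemes could be mapped into a shared, language-independent articulatory feature space; the phoneme set and feature inventory here are hypothetical placeholders, not the project's actual design.

```python
# Illustrative sketch (hypothetical, not the project's actual design):
# map phonemes to coarse articulatory classes so a TTS front end can
# share a single input space across languages.

ARTICULATORY_FEATURES = {
    "p": {"place": "bilabial", "manner": "stop",      "voiced": False},
    "b": {"place": "bilabial", "manner": "stop",      "voiced": True},
    "s": {"place": "alveolar", "manner": "fricative", "voiced": False},
    "a": {"place": "open",     "manner": "vowel",     "voiced": True},
}

# Fixed attribute vocabularies give every phoneme, in any language,
# a position in the same feature space.
PLACES = ["bilabial", "alveolar", "open"]
MANNERS = ["stop", "fricative", "vowel"]

def phoneme_to_vector(phoneme: str) -> list[int]:
    """One-hot encode place and manner, plus a voicing bit."""
    feats = ARTICULATORY_FEATURES[phoneme]
    vec = [int(feats["place"] == p) for p in PLACES]
    vec += [int(feats["manner"] == m) for m in MANNERS]
    vec.append(int(feats["voiced"]))
    return vec

# The syllable /ba/ gets the same representation regardless of which
# language's text produced the phonemes.
print([phoneme_to_vector(p) for p in ["b", "a"]])
```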

Outline of Annual Research Achievements

In this first year, we explored the effects of pruning end-to-end neural models for single-speaker text-to-speech (TTS) synthesis, evaluating how different levels of pruning affect the naturalness, intelligibility, and prosody of the synthesized speech. The pruning method follows a "prune, adjust, and re-prune" approach previously found effective for self-supervised speech models used in automatic speech recognition. We found that both neural vocoders and neural text-to-mel synthesis models are highly over-parametrized: up to 90% of the model weights can be pruned without detriment to the quality of the synthesized output. This was determined through experiments pruning multiple types of neural acoustic models, including Tacotron and Transformer TTS, as well as the Parallel WaveGAN neural vocoder, evaluated both with objective metrics (word error rate from an automatic speech recognizer, and measures of f0 and utterance duration) and with subjective Mean Opinion Score and A/B comparison listening tests. Pruned models are smaller and thus more computation- and space-efficient, so these results address our original aim of finding more efficient neural TTS models. This work resulted in one peer-reviewed conference publication at ICASSP 2022.
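For readers unfamiliar with iterative pruning, the following is a minimal sketch of a "prune, adjust, re-prune" loop using PyTorch's built-in magnitude-pruning utilities. The model, data, and schedule are toy placeholders, and the method used in the published work differs in its details; this only illustrates the general pattern.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a text-to-mel acoustic model.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

params_to_prune = [
    (mod, "weight") for mod in model.modules() if isinstance(mod, nn.Linear)
]

for _round in range(3):
    # Prune: zero the smallest-magnitude weights globally. Each call
    # removes 50% of the *remaining* weights, so three rounds reach
    # 1 - 0.5**3 = 87.5% overall sparsity.
    prune.global_unstructured(
        params_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=0.5,
    )
    # Adjust: briefly fine-tune so the surviving weights compensate.
    # Masks stay applied during training, preserving the sparsity level.
    for _ in range(100):  # placeholder loop; real training uses TTS data
        features = torch.randn(8, 80)  # stand-in for linguistic features
        target = torch.randn(8, 80)    # stand-in for mel-spectrogram frames
        loss = loss_fn(model(features), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Re-prune: the next loop iteration prunes again at higher sparsity.

# Bake the final masks into the weights before saving or deployment.
for mod, name in params_to_prune:
    prune.remove(mod, name)
```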

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

This year's work addresses one of the original aims of our proposal: finding more efficient neural TTS models. Although we originally proposed to achieve this using neural architecture search, we found that an approach based on neural network pruning was more computationally efficient and let us examine the tradeoffs between model size and synthesis output quality straightforwardly and directly.
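As a small illustration of how the size side of that tradeoff can be quantified, the sketch below counts surviving nonzero weights in a pruned PyTorch model. The model and thresholding here are toy placeholders; in practice each sparsity level would be paired with its listening-test and ASR-based quality scores.

```python
import torch
import torch.nn as nn

def sparsity_report(model: nn.Module) -> None:
    """Count total and surviving (nonzero) weights in a pruned model."""
    total = nonzero = 0
    for param in model.parameters():
        total += param.numel()
        nonzero += int(torch.count_nonzero(param))
    print(f"params: {total:,}  nonzero: {nonzero:,}  "
          f"sparsity: {1 - nonzero / total:.1%}")

# Toy example: zero out roughly half the weights of one linear layer,
# mimicking a model whose pruning masks have been baked in.
model = nn.Linear(256, 256)
with torch.no_grad():
    w = model.weight
    w[w.abs() < w.abs().median()] = 0.0
sparsity_report(model)
```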

Strategy for Future Research Activity

The original research plans were: 1) articulatory-feature-based neural TTS; 2) multimodal speech and text translation; and 3) neural architecture search for more efficient neural TTS.

One change of plan was that we spent this year investigating efficient neural TTS, which was originally scheduled for the third year. As our next step, we will investigate articulatory-feature-based input for end-to-end neural speech synthesis in low-resource languages and dialects. In addition, building on this year's findings, we plan to investigate how well the pruned models we developed can make use of low-resource data.

Report

(1 result)
  • 2021 Research-status Report

Research Products

(3 results)

Int'l Joint Research (1 result) / Presentation (1 result; Int'l Joint Research) / Remarks (1 result)

  • [Int'l Joint Research] Massachusetts Institute of Technology/MIT-IBM Watson AI Lab (USA)

    • Related Report
      2021 Research-status Report
  • [Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2022)

    • Author(s)
      Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
    • Organizer
      ICASSP 2022
    • Related Report
      2021 Research-status Report
    • Int'l Joint Research
  • [Remarks] TTS Pruning

    • URL

      https://people.csail.mit.edu/clai24/prune-tts/

    • Related Report
      2021 Research-status Report

Published: 2021-04-28   Modified: 2022-12-28  
