Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation

Research Project

Project/Area Number	21K11951
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	一般
Review Section	Basic Section 61010:Perceptual information processing-related
Research Institution	National Institute of Informatics
Principal Investigator	COOPER Erica 国立情報学研究所, コンテンツ科学研究系, 特任准教授 (30843156)
Co-Investigator(Kenkyū-buntansha)	Kruengkrai Canasai 国立情報学研究所, コンテンツ科学研究系, 特任助教 (10895907)
Project Period (FY)	2021-04-01 – 2024-03-31
Project Status	Completed (Fiscal Year 2023)
Budget Amount *help	¥4,160,000 (Direct Cost: ¥3,200,000、Indirect Cost: ¥960,000) Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000、Indirect Cost: ¥300,000) Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000、Indirect Cost: ¥330,000) Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000、Indirect Cost: ¥330,000)
Keywords	Text-to-speech synthesis / Low-resource languages / Neural network pruning / Evaluation / text-to-speech synthesis / low-resource languages / speech evaluation / speech synthesis / self-supervised learning / speech assessment / mean opinion score / text-to-speech / vocoder / pruning / efficiency / multi-lingual / machine translation / deep learning / neural networks
Outline of Research at the Start	Language technology has improved due to advances in neural-network-based approaches; for example, speech synthesis has reached the quality of human speech. However, neural models require large quantities of data. Speech technologies bring social benefits of accessibility and communication - to ensure broad access to these benefits, we consider language-independent methods that can make use of less data. We propose 1) articulatory class based end-to-end speech synthesis; 2) multi-modal machine translation with text and speech; and 3) neural architecture search for data-efficient architectures.
Outline of Final Research Achievements	We explored pruning for lightweight text-to-speech synthesis (TTS), developed data-efficient TTS for low-resource languages, and advanced the field of automatic quality prediction for TTS. We found that up to 90% of TTS model weights can be pruned without reducing output quality. We developed a data processing pipeline for building TTS corpora for low-resource languages using podcast data, resulting in a large-scale, high-quality, publicly-available dataset. We also developed a TTS system using this data that can be repurposed for any language with similar data. As self-supervised speech representations have been effective for many downstream tasks, we next investigated these as an intermediate representation for TTS trained on multilingual data, which can be fine-tuned to a new language. Finally, we identified automatic evaluation of TTS as a critical topic. We launched a series of challenges for this task in 2022 and 2023 which attracted many participants and advanced the field.
Academic Significance and Societal Importance of the Research Achievements	We developed TTS trainable on small amounts of data and lightweight TTS models. We also advanced the field of TTS evaluation. This benefits researchers and society by reducing barriers of entry to creating TTS for low-resource languages, expanding accessibility benefits of TTS to a broader audience.

Report

(4 results)

2023 Annual Research Report Final Research Report ( PDF )
2022 Research-status Report
2021 Research-status Report

Research Products
(28 results)

All 2024 2023 2022 Other

All Int'l Joint Research (8 results) Journal Article (1 results) (of which Int'l Joint Research: 1 results, Open Access: 1 results) Presentation (12 results) (of which Int'l Joint Research: 9 results, Invited: 3 results) Remarks (7 results)

[Int'l Joint Research] up.ai(イスラエル)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] University of Edinburgh(英国)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] Academia Sinica(その他の国・地域 Taiwan)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] National Research Council(カナダ)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] Academia Sinica(その他の国・地域 Taiwan)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] National Research Council(カナダ)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] University of Edinburgh(英国)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] Massachusetts Institute of Technology/MIT-IBM Watson AI Lab(米国)
- Related Report
  2021 Research-status Report
[Journal Article] A review on subjective and objective evaluation of synthetic speech2024
- Author(s)
  Cooper Erica、Huang Wen-Chin、Tsao Yu、Wang Hsin-Min、Toda Tomoki、Yamagishi Junichi
- Journal Title
  
  Acoustical Science and Technology
  
  Volume: 45 Issue: 4 Pages: 161-183
- DOI
  10.1250/ast.e24.12
- ISSN
  0369-4232, 1346-3969, 1347-5177
- Year and Date
  2024-07-01
- Related Report
  2023 Annual Research Report
- Open Access / Int'l Joint Research
[Presentation] Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction2024
- Author(s)
  Aditya Ravuri, Erica Cooper, Junichi Yamagishi
- Organizer
  IEEE ICASSP 2024 workshop on Self-supervision in Audio, Speech and Beyond
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] SASPEECH: A Hebrew Single Speaker Dataset for Text to Speech and Voice Conversion2023
- Author(s)
  Orian Sharoni, Roee Shenberg, Erica Cooper
- Organizer
  Interspeech 2023
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech2023
- Author(s)
  Erica Cooper, Junichi Yamagishi
- Organizer
  Interspeech 2023
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting2023
- Author(s)
  Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
- Organizer
  ASRU 2023
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains2023
- Author(s)
  Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
- Organizer
  ASRU 2023
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Generalization Ability of MOS Prediction Networks2022
- Author(s)
  Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi
- Organizer
  ICASSP 2022
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech2022
- Author(s)
  Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda
- Organizer
  ICASSP 2022
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] The VoiceMOS Challenge 20222022
- Author(s)
  Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
- Organizer
  Interspeech 2022
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] The VoiceMOS Challenge: Data-Driven Mean Opinion Score Prediction for Synthesized Speech2022
- Author(s)
  Erica Cooper
- Organizer
  2022 Autumn Meeting of the Acoustical Society of Japan
- Related Report
  2022 Research-status Report
- Invited
[Presentation] Objective Evaluation in TTS2022
- Author(s)
  Erica Cooper
- Organizer
  KTH Seminar on Speech Synthesis Evaluation, KTH Royal Institute of Technology, Department of Speech, Music, and Hearing
- Related Report
  2022 Research-status Report
- Invited
[Presentation] The VoiceMOS Challenge 20222022
- Author(s)
  Erica Cooper, Wen-Chin Huang
- Organizer
  Special Interest Group on Spoken Language Processing, Information Processing Society of Japan
- Related Report
  2022 Research-status Report
- Invited
[Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis2022
- Author(s)
  Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
- Organizer
  ICASSP 2022
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Remarks] SASPEECH: Hebrew speech and transcripts for TTS
- URL
  https://openslr.org/134/
- Related Report
  2023 Annual Research Report
[Remarks] Listening test data for "Range-Equalizing Bias"
- URL
  https://zenodo.org/records/10005796
- Related Report
  2023 Annual Research Report
[Remarks] Implementation of Partial Rank Similarity
- URL
  https://github.com/nii-yamagishilab/partial_rank_similarity
- Related Report
  2023 Annual Research Report
[Remarks] VoiceMOS Challenge 2023 Homepage
- URL
  https://voicemos-challenge-2023.github.io
- Related Report
  2023 Annual Research Report
[Remarks] The VoiceMOS Challenge 2022 website
- URL
  https://voicemos-challenge-2022.github.io
- Related Report
  2022 Research-status Report
[Remarks] Open-source code for SSL-based MOS predictor
- URL
  https://github.com/nii-yamagishilab/mos-finetune-ssl
- Related Report
  2022 Research-status Report
[Remarks] TTS Pruning
- URL
  https://people.csail.mit.edu/clai24/prune-tts/
- Related Report
  2021 Research-status Report

Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation

Principal Investigator

COOPER Erica 国立情報学研究所, コンテンツ科学研究系, 特任准教授 (30843156)

¥4,160,000 (Direct Cost: ¥3,200,000、Indirect Cost: ¥960,000)

Report

Research Products

[Int'l Joint Research] up.ai(イスラエル)

Related Report

[Int'l Joint Research] University of Edinburgh(英国)

Related Report

[Int'l Joint Research] Academia Sinica(その他の国・地域 Taiwan)

Related Report

[Int'l Joint Research] National Research Council(カナダ)

Related Report

[Int'l Joint Research] Academia Sinica(その他の国・地域 Taiwan)

Related Report

[Int'l Joint Research] National Research Council(カナダ)

Related Report

[Int'l Joint Research] University of Edinburgh(英国)

Related Report

[Int'l Joint Research] Massachusetts Institute of Technology/MIT-IBM Watson AI Lab(米国)

Related Report

[Journal Article] A review on subjective and objective evaluation of synthetic speech2024

Author(s)

Journal Title

DOI

ISSN

Year and Date

Related Report

[Presentation] Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction2024

Author(s)

Organizer

Related Report

[Presentation] SASPEECH: A Hebrew Single Speaker Dataset for Text to Speech and Voice Conversion2023

Author(s)

Organizer

Related Report

[Presentation] Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech2023

Author(s)

Organizer

Related Report

[Presentation] Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting2023

Author(s)

Organizer

Related Report

[Presentation] The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains2023

Author(s)

Organizer

Related Report

[Presentation] Generalization Ability of MOS Prediction Networks2022

Author(s)

Organizer

Related Report

[Presentation] LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech2022

Author(s)

Organizer

Related Report

[Presentation] The VoiceMOS Challenge 20222022

Author(s)

Organizer

Related Report

[Presentation] The VoiceMOS Challenge: Data-Driven Mean Opinion Score Prediction for Synthesized Speech2022

Author(s)

Organizer

Related Report

[Presentation] Objective Evaluation in TTS2022

Author(s)

Organizer

Related Report

[Presentation] The VoiceMOS Challenge 20222022

Author(s)

Organizer

Related Report

[Presentation] On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis2022

Author(s)

Organizer

Related Report

[Remarks] SASPEECH: Hebrew speech and transcripts for TTS

URL

Related Report