2022 Fiscal Year Research-status Report

Language-independent, multi-modal, and data-efficient approaches for speech synthesis and translation

Research Project

Project/Area Number	21K11951
Research Institution	National Institute of Informatics
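Principal Investigator	Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Co-Investigator(Kenkyū-buntansha)	Kruengkrai Canasai, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (10895907)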
Project Period (FY)	2021-04-01 – 2024-03-31
Keywords	speech synthesis / self-supervised learning / low-resource languages / speech assessment / mean opinion score
Outline of Annual Research Achievements	In the second year of the project, we addressed two main topics: language-independent, data-efficient text-to-speech synthesis for low-resource languages using self-supervised speech representations, and automatic mean opinion score (MOS) prediction. Self-supervised speech representations have proven useful for many downstream speech tasks and have been shown to encode phonetic information. We therefore chose them as an intermediate representation for a text-to-speech synthesis model trained on data from many languages, which can then be fine-tuned to a new language using only a small amount of data. This work is ongoing, in collaboration with researchers from the National Research Council of Canada and the University of Edinburgh. We have also identified automatic evaluation of synthesized speech as an important topic for low-resource languages, since recruiting listeners for listening tests can be especially difficult for these languages. In collaboration with Nagoya University and Academia Sinica, we co-organized the first VoiceMOS Challenge, a shared task on automatic MOS prediction for synthesized speech. The challenge attracted 22 participating teams from academia and industry, and we ran a special session about it at Interspeech 2022. The challenge has advanced the field by generating a great deal of interest in MOS prediction.
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: For the first year of the project, we had proposed to work on language-independent speech synthesis, but we instead worked on efficient speech synthesis architectures using neural network pruning (originally scheduled for the third year). This year, we therefore worked on language-independent approaches to speech synthesis. We had originally planned to investigate articulatory features for this purpose, but shifted our focus to self-supervised speech representations, which seem very promising and well suited to the task. Although it was outside our original proposal, automatic speech quality assessment has emerged during this project as an important and relevant topic. The ability to automatically predict the quality of synthesized speech, especially for low-resource languages, will facilitate future research in low-resource speech synthesis.
Strategy for Future Research Activity	Although we changed the order of the topics in the original plan somewhat, the topic of multimodal text and speech modeling for low-resource languages remains, so we will focus on it in the third year. We will also continue our ongoing research on language-independent speech synthesis that is adaptable to low-resource languages, and we will run the 2023 edition of the VoiceMOS Challenge, which focuses on zero-shot prediction for out-of-domain synthesized speech.
Causes of Carryover	Travel expenses were not used due to the ongoing coronavirus situation in 2022. The remaining budget will be used to attend international conferences in 2023.
Remarks	Official website for the VoiceMOS Challenge 2022, and open-source code for the SSL-based MOS predictor, which was one of the baseline systems for the challenge.

Research Products
(11 results)


Int'l Joint Research (3 results), Presentation (6 results, of which Int'l Joint Research: 3 results, Invited: 3 results), Remarks (2 results)

[Int'l Joint Research] Academia Sinica (Other countries/regions: Taiwan)
- Country Name
  Other countries/regions
- Counterpart Institution
  Academia Sinica
[Int'l Joint Research] National Research Council (Canada)
- Country Name
  CANADA
- Counterpart Institution
  National Research Council
[Int'l Joint Research] University of Edinburgh (United Kingdom)
- Country Name
  UNITED KINGDOM
- Counterpart Institution
  University of Edinburgh
[Presentation] Generalization Ability of MOS Prediction Networks (2022)
- Author(s)
  Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi
- Organizer
  ICASSP 2022
- Int'l Joint Research
[Presentation] LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech (2022)
- Author(s)
  Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda
- Organizer
  ICASSP 2022
- Int'l Joint Research
[Presentation] The VoiceMOS Challenge 2022 (2022)
- Author(s)
  Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
- Organizer
  Interspeech 2022
- Int'l Joint Research
[Presentation] The VoiceMOS Challenge: Data-Driven Mean Opinion Score Prediction for Synthesized Speech (2022)
- Author(s)
  Erica Cooper
- Organizer
  2022 Autumn Meeting of the Acoustical Society of Japan
- Invited
[Presentation] Objective Evaluation in TTS (2022)
- Author(s)
  Erica Cooper
- Organizer
  KTH Seminar on Speech Synthesis Evaluation, KTH Royal Institute of Technology, Department of Speech, Music, and Hearing
- Invited
[Presentation] The VoiceMOS Challenge 2022 (2022)
- Author(s)
  Erica Cooper, Wen-Chin Huang
- Organizer
  Special Interest Group on Spoken Language Processing, Information Processing Society of Japan
- Invited
[Remarks] The VoiceMOS Challenge 2022 website
- URL
  https://voicemos-challenge-2022.github.io
[Remarks] Open-source code for SSL-based MOS predictor
- URL
  https://github.com/nii-yamagishilab/mos-finetune-ssl
