2020 Fiscal Year Annual Research Report
Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis
Project/Area Number | 19K24372
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Project Period (FY) | 2019-08-30 – 2021-03-31
Keywords | speech synthesis / dialect modeling / speaker similarity / transfer learning / neural networks
Outline of Annual Research Achievements
Synthesizing speech in a variety of voices has long been a goal of speech research. Current approaches to multi-speaker synthesis achieve high speaker similarity but fail to capture characteristics such as dialect, and they also tend to overfit to speakers seen during training. In the past year, we explored dialect modeling to better capture speaker characteristics, as well as data augmentation to improve synthesis for unseen speakers.
Our training data contains relatively few speakers (around 100). We hypothesized that increasing the number of speakers could provide better coverage of the speaker space. We explored two methods of so-called "speaker augmentation": artificial augmentation using vocal tract length perturbation (VTLP), in which the audio is resampled so that the resulting signals take on different speaker characteristics (see the sketch below), and a "found data" approach in which we included lower-quality data containing a large variety of speakers and dialects.
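As a rough illustration of the resampling idea, the following minimal Python sketch generates a VTLP-style artificial speaker by reinterpreting the sample rate before resampling; the warp-factor range, function names, and the use of librosa/soundfile are illustrative assumptions, not the exact pipeline used in our experiments.

```python
# A minimal sketch of resampling-based speaker augmentation (VTLP-style).
# Assumes librosa and soundfile are installed; the warp values are
# illustrative, not the factors used in the actual experiments.
import librosa
import soundfile as sf

def augment_speaker(in_path: str, out_path: str, warp: float = 1.1,
                    sr: int = 22050) -> None:
    """Resample by `warp` while keeping the nominal sample rate, which
    scales formants and pitch to mimic a different vocal tract length."""
    audio, _ = librosa.load(in_path, sr=sr)
    # Pretend the audio was recorded at sr * warp, then resample to sr:
    # played back at sr, the spectrum is stretched by a factor of ~warp.
    warped = librosa.resample(audio, orig_sr=int(sr * warp), target_sr=sr)
    sf.write(out_path, warped, sr)

# Warp factors below 1.0 lower formants (as with a longer vocal tract);
# factors above 1.0 raise them (shorter vocal tract).
augment_speaker("speaker01_utt1.wav", "speaker01_utt1_warp090.wav", warp=0.9)
```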
Mixing in lower-quality data from worse recording conditions can be expected to degrade synthesis quality, so during training we incorporated channel labels indicating which corpus each training utterance comes from. We also incorporated transfer-learned dialect embeddings to better capture information about speaker dialects; a conditioning sketch follows below. Experimental results from a crowdsourced listening test showed that using found data covering many English dialects was an effective augmentation method: with our approach, listeners' ratings of perceived dialect for unseen speakers matched natural speech more closely.
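The sketch below shows one simple way to condition a Tacotron-style encoder on utterance-level channel and dialect labels in PyTorch. The module names and dimensions are hypothetical, and the learned dialect lookup table here merely stands in for the transfer-learned dialect embeddings used in our work; it is not the paper's exact architecture.

```python
# A minimal PyTorch sketch of conditioning encoder states on corpus
# ("channel") and dialect labels. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    def __init__(self, text_dim=512, n_channels=2, n_dialects=8,
                 channel_dim=16, dialect_dim=32):
        super().__init__()
        # Learned lookup tables: one label per training corpus / dialect.
        # (In our work the dialect embeddings are transfer-learned from an
        # external model; a lookup table stands in here for illustration.)
        self.channel_emb = nn.Embedding(n_channels, channel_dim)
        self.dialect_emb = nn.Embedding(n_dialects, dialect_dim)
        self.proj = nn.Linear(text_dim + channel_dim + dialect_dim, text_dim)

    def forward(self, text_states, channel_id, dialect_id):
        # text_states: (batch, time, text_dim) encoder outputs.
        B, T, _ = text_states.shape
        cond = torch.cat([self.channel_emb(channel_id),
                          self.dialect_emb(dialect_id)], dim=-1)
        # Broadcast the utterance-level condition across all time steps.
        cond = cond.unsqueeze(1).expand(B, T, -1)
        return self.proj(torch.cat([text_states, cond], dim=-1))

# At synthesis time the channel label can be fixed to the clean corpus
# while the dialect label selects the target dialect.
enc = ConditionedEncoder()
out = enc(torch.randn(4, 50, 512),
          torch.zeros(4, dtype=torch.long),       # clean-channel label
          torch.full((4,), 3, dtype=torch.long))  # target dialect id
print(out.shape)  # torch.Size([4, 50, 512])
```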
Remarks
Public code for our multi-speaker Tacotron with channel encoding and dialect modeling, and audio samples from our Interspeech paper, "Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?".
Research Products (5 results)