2020 Fiscal Year Annual Research Report
Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis
Project/Area Number | 19K24372
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Project Period (FY) | 2019-08-30 – 2021-03-31
Keywords | speech synthesis / dialect modeling / speaker similarity / transfer learning / neural networks
Outline of Annual Research Achievements
Synthesizing speech in a variety of voices has long been a goal of speech research. Current approaches to multi-speaker synthesis achieve high speaker similarity but fail to capture characteristics such as dialect, and they also tend to overfit to speakers seen during training. In the past year, we explored dialect modeling to better capture speaker characteristics, as well as data augmentation to improve synthesis for unseen speakers.
Our training data contains relatively few speakers (around 100). We hypothesized that increasing the number of speakers could provide better coverage of the speaker space. We explored two methods of so-called "speaker augmentation": artificial augmentation using vocal tract length perturbation (VTLP), in which the audio is resampled so that the resulting signals take on different speaker characteristics (see the sketch below), and a "found data" approach in which we included lower-quality data containing a large variety of speakers and dialects.
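As a rough illustration of the resampling idea, the following minimal Python sketch generates a VTLP-style artificial speaker by reinterpreting the sample rate before resampling; the warp-factor range, function names, and the use of librosa/soundfile are illustrative assumptions, not the exact pipeline used in our experiments.

```python
# A minimal sketch of resampling-based speaker augmentation (VTLP-style).
# Assumes librosa and soundfile are installed; the warp values are
# illustrative, not the factors used in the actual experiments.
import librosa
import soundfile as sf

def augment_speaker(in_path: str, out_path: str, warp: float = 1.1,
                    sr: int = 22050) -> None:
    """Resample by `warp` while keeping the nominal sample rate, which
    scales formants and pitch to mimic a different vocal tract length."""
    audio, _ = librosa.load(in_path, sr=sr)
    # Pretend the audio was recorded at sr * warp, then resample to sr:
    # played back at sr, the spectrum is stretched by a factor of ~warp.
    warped = librosa.resample(audio, orig_sr=int(sr * warp), target_sr=sr)
    sf.write(out_path, warped, sr)

# Warp factors below 1.0 lower formants (as with a longer vocal tract);
# factors above 1.0 raise them (shorter vocal tract).
augment_speaker("speaker01_utt1.wav", "speaker01_utt1_warp090.wav", warp=0.9)
```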
Mixing in lower-quality data from worse recording conditions can be expected to degrade synthesis quality, so during training we incorporated channel labels indicating which corpus each training utterance comes from. We also incorporated transfer-learned dialect embeddings to better capture information about speaker dialects; a conditioning sketch follows below. Experimental results from a crowdsourced listening test showed that using found data covering many English dialects was an effective augmentation method: with our approach, listeners' ratings of perceived dialect for unseen speakers matched natural speech more closely.
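The sketch below shows one simple way to condition a Tacotron-style encoder on utterance-level channel and dialect labels in PyTorch. The module names and dimensions are hypothetical, and the learned dialect lookup table here merely stands in for the transfer-learned dialect embeddings used in our work; it is not the paper's exact architecture.

```python
# A minimal PyTorch sketch of conditioning encoder states on corpus
# ("channel") and dialect labels. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    def __init__(self, text_dim=512, n_channels=2, n_dialects=8,
                 channel_dim=16, dialect_dim=32):
        super().__init__()
        # Learned lookup tables: one label per training corpus / dialect.
        # (In our work the dialect embeddings are transfer-learned from an
        # external model; a lookup table stands in here for illustration.)
        self.channel_emb = nn.Embedding(n_channels, channel_dim)
        self.dialect_emb = nn.Embedding(n_dialects, dialect_dim)
        self.proj = nn.Linear(text_dim + channel_dim + dialect_dim, text_dim)

    def forward(self, text_states, channel_id, dialect_id):
        # text_states: (batch, time, text_dim) encoder outputs.
        B, T, _ = text_states.shape
        cond = torch.cat([self.channel_emb(channel_id),
                          self.dialect_emb(dialect_id)], dim=-1)
        # Broadcast the utterance-level condition across all time steps.
        cond = cond.unsqueeze(1).expand(B, T, -1)
        return self.proj(torch.cat([text_states, cond], dim=-1))

# At synthesis time the channel label can be fixed to the clean corpus
# while the dialect label selects the target dialect.
enc = ConditionedEncoder()
out = enc(torch.randn(4, 50, 512),
          torch.zeros(4, dtype=torch.long),       # clean-channel label
          torch.full((4,), 3, dtype=torch.long))  # target dialect id
print(out.shape)  # torch.Size([4, 50, 512])
```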
Remarks
Public code for our multi-speaker Tacotron with channel encoding and dialect modeling, and audio samples from our Interspeech paper, "Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?".
Research Products (5 results)