2019 Fiscal Year Research-status Report
Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis
Project/Area Number | 19K24372 |
Research Institution | National Institute of Informatics |
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Researcher (30843156) |
Project Period (FY) | 2019-08-30 – 2021-03-31 |
Keywords | speech synthesis / speaker modeling / deep learning / dialect modeling / articulation |
Outline of Annual Research Achievements |
Synthesizing speech in a variety of speaker voices and styles has long been a goal in speech research. Recent advances in speech synthesis have resulted in very natural-sounding synthetic speech. Current approaches to modeling multiple speakers achieve high similarity to the target speakers, but fail to capture characteristics such as dialect and level of articulation. We aim to determine whether including models of dialect and level of articulation in speech synthesis systems can successfully capture these aspects of speech.
In the past year, we implemented multi-speaker capability for a Tacotron-based TTS system using speaker embeddings obtained from a separately trained speaker model. We conducted experiments to determine what kind of speaker embeddings work best for synthesizing the voice of a speaker who was unseen during TTS training, and a large-scale crowdsourced listening test showed that Learnable Dictionary Encoding (LDE)-based speaker representations worked well. We also found that there is a gap in similarity between seen and unseen speakers, so we are currently exploring data augmentation approaches for TTS training, as well as dialect modeling using dialect embeddings analogous to our speaker embeddings.
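To illustrate the general approach described above, here is a minimal sketch, in PyTorch, of how an embedding from a separately trained speaker model can condition a Tacotron-style encoder. The module name, dimensions, and the choice of concatenating the embedding onto each encoder frame are illustrative assumptions, not the exact architecture used in this project.
```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Hypothetical sketch: inject an external speaker embedding into Tacotron encoder outputs."""

    def __init__(self, enc_dim=512, spk_dim=256):
        super().__init__()
        # Project the externally obtained speaker embedding to the encoder width.
        self.spk_proj = nn.Linear(spk_dim, enc_dim)

    def forward(self, encoder_outputs, speaker_embedding):
        # encoder_outputs: (batch, time, enc_dim) from the Tacotron text encoder
        # speaker_embedding: (batch, spk_dim) from the speaker model (e.g. LDE-based)
        spk = self.spk_proj(speaker_embedding).unsqueeze(1)   # (batch, 1, enc_dim)
        spk = spk.expand(-1, encoder_outputs.size(1), -1)     # broadcast over time
        # Concatenate the speaker information onto every encoder frame before it
        # is consumed by the attention-based decoder.
        return torch.cat([encoder_outputs, spk], dim=-1)
```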
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
In accordance with our original research plan, we have created and evaluated TTS systems using many different types of speaker embeddings, and we have also completed experiments training dialect models and using embeddings obtained from these models in TTS. We have conducted large-scale listening tests to evaluate our synthesized speech for quality and similarity to the target speaker. We have also begun experiments to further improve speaker similarity through data augmentation, both by creating artificial training data in ways similar to those shown to improve speech and speaker recognition, and by including other sources of data that are not typically used for TTS (such as speech recognition training data), while compensating for differences in recording conditions by including channel information during training.
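As one concrete example of the augmentation direction mentioned above, the following sketch shows Kaldi-style speed perturbation, a technique commonly used to create artificial training data for speech and speaker recognition. The function name and perturbation factors are hypothetical; this is an illustration of the general technique, not necessarily the exact augmentation used in the project.
```python
import librosa

def speed_perturb(wav_path, factors=(0.9, 1.0, 1.1), sr=22050):
    """Return copies of one utterance resampled to simulate different speaking rates."""
    y, _ = librosa.load(wav_path, sr=sr)
    augmented = []
    for f in factors:
        # Resample from sr * f down/up to sr but keep labelling the result as `sr`:
        # playback is then f times faster or slower, shifting both tempo and pitch
        # (Kaldi-style speed perturbation).
        augmented.append(librosa.resample(y, orig_sr=int(sr * f), target_sr=sr))
    return augmented
```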
Strategy for Future Research Activity |
Since we obtained better speaker similarity in TTS by using embeddings from improved speaker models, we can next explore matching aspects of speaking style such as level of articulation. However, there is still room for improvement in matching the voices of speakers who were not seen during training, so we will continue exploring data augmentation approaches to further improve speaker similarity, and we will continue investigating why overfitting to seen speakers occurs.
Causes of Carryover |
International conferences that we had originally planned to attend were postponed due to COVID-19. We plan instead to use this budget for supercomputer fees and listening tests.
Remarks | Code and audio samples related to our accepted ICASSP paper. |
Research Products (4 results)