Research Project/Area Number | 19K24372
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Researcher (30843156)
Project Period (FY) | 2019-08-30 – 2021-03-31
Keywords | speech synthesis / speaker modeling / deep learning / dialect modeling / articulation
Outline of Annual Research Achievements
Synthesizing speech in a variety of speaker voices and styles has long been a goal of speech research. Recent advances in speech synthesis have produced very natural-sounding synthetic speech, and current approaches to modeling multiple speakers achieve high similarity to the target speakers, but they fail to capture characteristics such as dialect and level of articulation. We aim to determine whether incorporating models of dialect and level of articulation into speech synthesis systems can successfully capture these aspects of speech.
In the past year, we implemented multi-speaker capability for a Tacotron-based TTS system using speaker embeddings obtained from a separately trained speaker model. We conducted experiments to determine what kind of speaker embeddings works best for synthesizing the voice of a speaker who was unseen during TTS training, and found, based on a large-scale crowdsourced listening test, that Learnable Dictionary Encoding (LDE) based speaker representations worked well. We also found that a gap remains between seen- and unseen-speaker similarity, so we are currently exploring data augmentation approaches for TTS training, as well as dialect modeling using dialect embeddings analogous to our speaker embeddings.
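As a minimal sketch of the general approach (assuming a PyTorch implementation; the module and dimension names below are illustrative, not the project's actual code), one common way to condition a Tacotron-style model on an external speaker embedding is to project the fixed embedding from the separately trained speaker model and concatenate it to every text-encoder frame:

```python
# A minimal sketch, assuming PyTorch; names and sizes are illustrative,
# not the project's actual code.
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Projects a fixed speaker embedding (e.g., from an LDE-based speaker
    model) and concatenates it to every frame of the text-encoder output,
    so the Tacotron decoder attends over speaker-conditioned frames."""

    def __init__(self, spk_dim: int = 256, proj_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(spk_dim, proj_dim)

    def forward(self, enc_out: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, enc_dim); spk_emb: (batch, spk_dim)
        s = torch.tanh(self.proj(spk_emb))                  # (batch, proj_dim)
        s = s.unsqueeze(1).expand(-1, enc_out.size(1), -1)  # repeat over time
        return torch.cat([enc_out, s], dim=-1)              # decoder input

# Zero-shot usage: spk_emb is extracted from reference audio of a speaker
# who was never seen during TTS training.
cond = SpeakerConditioning()(torch.randn(2, 100, 512), torch.randn(2, 256))
print(cond.shape)  # torch.Size([2, 100, 576])
```

Because the speaker model is trained separately, the same mechanism serves both seen and unseen speakers at synthesis time, which is what the seen/unseen similarity comparison above evaluates.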
Current Status of Research Progress
2: Research has progressed rather smoothly.
Reason
According to our original research plan, we have created and evaluated TTS systems using many different types of speaker embeddings, and we have also completed experiments training dialect models and using embeddings obtained from these models in TTS. We have conducted large-scale listening tests to evaluate our synthesized speech for quality and for similarity to the target speaker. We have also begun experiments to further improve speaker similarity through data augmentation: creating artificial training data in ways similar to those shown to improve speech and speaker recognition, and including data sources not typically used for TTS (such as speech recognition training data) while compensating for differences in recording conditions by including channel information during training; a sketch of both ideas follows.
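As a hedged sketch of these two ideas (assuming a PyTorch/torchaudio setup; all names and dimensions are our own, not the project's code): speed perturbation is the standard augmentation from speech and speaker recognition referred to above, and a learned channel embedding is one way to include channel information during training so that heterogeneous corpora can be pooled.

```python
# Illustrative sketch, not the project's code.
import torch
import torch.nn as nn
import torchaudio.functional as F

def speed_perturb(wav: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Kaldi-style speed perturbation: treat the audio as if recorded at
    sr * factor and resample back to sr (shifts tempo and pitch together),
    yielding artificial 'new speakers' for training."""
    return F.resample(wav, orig_freq=int(sr * factor), new_freq=sr)

class ChannelEmbedding(nn.Module):
    """One learned vector per source corpus / recording condition, appended
    to the TTS conditioning alongside the speaker embedding; at synthesis
    time, the channel id of the cleanest corpus can be used."""

    def __init__(self, num_channels: int, dim: int = 16):
        super().__init__()
        self.table = nn.Embedding(num_channels, dim)

    def forward(self, channel_id: torch.Tensor) -> torch.Tensor:
        return self.table(channel_id)  # (batch, dim)

wav = torch.randn(1, 16000)            # one second of placeholder audio
fast = speed_perturb(wav, 16000, 1.1)  # shortened waveform, pitch raised
chan = ChannelEmbedding(num_channels=3)(torch.tensor([0]))
print(fast.shape, chan.shape)          # shorter (1, N) wave; (1, 16) vector
```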
Strategy for Future Research Activity
Since we obtained better speaker similarity in TTS by using embeddings from improved speaker models, we can next explore matching aspects of speaking style such as level of articulation. However, there is still room for improvement in matching the voices of speakers who were not seen during training, so we will continue exploring data augmentation approaches to further improve speaker similarity and investigating why overfitting to seen speakers occurs.
Causes of Carryover
International conferences that we originally planned to attend were postponed due to COVID-19. We plan to use this budget instead for supercomputer fees and listening tests.
Remarks
Code and audio samples related to our accepted ICASSP paper.