2020 Fiscal Year Final Research Report

Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis

Research Project

Project/Area Number	19K24372
Research Category	Grant-in-Aid for Research Activity Start-up
Allocation Type	Multi-year Fund
Review Section	1002:Human informatics, applied informatics and related fields
Research Institution	National Institute of Informatics
Principal Investigator	Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Project Period (FY)	2019-08-30 – 2021-03-31
Keywords	Speech synthesis / Speaker modeling / Deep learning / Neural network
Outline of Final Research Achievements	Synthesizing speech in many voices and styles has long been a goal in speech research. While current state-of-the-art synthesizers can produce very natural-sounding speech, matching the voice of a target speaker when only a small amount of that speaker's data is available remains a challenge, especially for characteristics such as dialect. We conducted experiments to determine which kinds of speaker embeddings work best for synthesis in the voice of a new speaker, and found that Learnable Dictionary Encoding (LDE) based speaker representations worked well according to a crowdsourced listening test. We also found that LDE-based dialect representations, obtained in a similar way, helped to improve the dialect of the synthesized speech. Finally, we explored data augmentation techniques using both artificially modified data and real "found" data recorded under non-ideal conditions, and found that including the found data in model training could further improve the naturalness of synthesized speech.
Free Research Field	Text-to-speech synthesis
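Academic Significance and Societal Importance of the Research Achievements	In this project, we investigated methods for controlling encoder factors in order to improve the speaker similarity and dialect reproduction of synthesized speech in end-to-end speech synthesis. Reproducing speakers' individuality and characteristics more faithfully makes it possible to use a larger number of target speakers in speech synthesis systems, which is expected to broaden the range of applications of the technology.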