Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis
Project/Area Number | 19K24372
Research Category | Grant-in-Aid for Research Activity Start-up
Allocation Type | Multi-year Fund
Review Section | 1002: Human informatics, applied informatics and related fields
Research Institution | National Institute of Informatics
Principal Investigator | Cooper Erica, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)
Project Period (FY) | 2019-08-30 – 2021-03-31
Project Status | Completed (Fiscal Year 2020)
Budget Amount | ¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Fiscal Year 2020: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2019: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords | Speech synthesis / Speaker modeling / Deep learning / Neural networks / Dialect modeling / Speaker similarity / Transfer learning / Articulation
Outline of Research at the Start
Synthesizing speech in a variety of speaker voices and styles has long been a goal in speech research. Recent advances in speech synthesis have produced very natural-sounding synthetic speech, and current approaches to modeling multiple speakers achieve high similarity to the target speakers, but they fail to capture characteristics such as dialect and level of articulation. We aim to determine whether incorporating models of dialect and articulation level into speech synthesis systems can successfully capture these aspects of speech.
Outline of Final Research Achievements
Synthesizing speech in many voices and styles has long been a goal in speech research. While current state-of-the-art synthesizers can produce very natural-sounding speech, matching the voice of a target speaker when only a small amount of that speaker's data is available remains a challenge, especially for characteristics such as dialect. We conducted experiments to determine what kind of speaker embeddings work best for synthesis in the voice of a new speaker and found, based on a crowdsourced listening test, that Learnable Dictionary Encoding (LDE) based speaker representations worked well. We also found that dialect representations obtained in the same LDE-based manner helped to improve the dialect of the synthesized speech. Finally, we explored data augmentation techniques using both artificially modified data and real "found" data from non-ideal recording conditions, and observed that including the found data in model training could further improve the naturalness of the synthesized speech.
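For illustration, here is a minimal sketch of a Learnable Dictionary Encoding (LDE) pooling layer of the kind referred to above, written in PyTorch. The class name, dimensions, and the softplus constraint on the scales are assumptions made for this sketch, not details taken from the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LDEPooling(nn.Module):
    """Pools variable-length frame features into a fixed-size utterance
    embedding using a learnable dictionary of components."""

    def __init__(self, feat_dim: int, num_components: int = 64):
        super().__init__()
        # Dictionary means mu_c, one vector per component.
        self.means = nn.Parameter(torch.randn(num_components, feat_dim))
        # Per-component scale s_c (kept positive via softplus) and bias b_c.
        self.scales = nn.Parameter(torch.zeros(num_components))
        self.biases = nn.Parameter(torch.zeros(num_components))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        # Residuals r_tc = x_t - mu_c -> (batch, frames, C, feat_dim)
        residuals = x.unsqueeze(2) - self.means
        # Soft assignment of each frame to each component:
        # w_tc = softmax_c(-s_c * ||r_tc||^2 + b_c)
        dists = residuals.pow(2).sum(-1)  # (batch, frames, C)
        weights = torch.softmax(
            -F.softplus(self.scales) * dists + self.biases, dim=-1)
        # Weighted mean residual per component, flattened into one vector.
        pooled = (weights.unsqueeze(-1) * residuals).mean(dim=1)
        return pooled.flatten(start_dim=1)  # (batch, C * feat_dim)


# Example: embed a batch of four 200-frame utterances of 80-dim features.
frames = torch.randn(4, 200, 80)
embedding = LDEPooling(feat_dim=80, num_components=64)(frames)
print(embedding.shape)  # torch.Size([4, 5120])
```

In a multi-speaker TTS setting, an embedding like this would typically be extracted by a speaker (or dialect) encoder trained on a classification or verification task and then passed to the synthesizer as a conditioning input.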
Academic Significance and Societal Importance of the Research Achievements
In this project, we investigated methods for controlling encoder factors in end-to-end speech synthesis in order to improve the speaker identity and dialect fidelity of the synthesized speech. By reproducing a speaker's individuality and characteristics more faithfully, more target speakers can be supported by speech synthesis systems, which is expected to broaden the range of applications for the technology.
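To make the idea of factorized encoder conditioning concrete, the following sketch shows one common way to inject separate speaker and dialect embeddings into an end-to-end synthesizer: broadcasting each per-utterance embedding over the text-encoder states and concatenating. This fusion strategy and the function name are illustrative assumptions, not the project's documented method.

```python
import torch


def condition_encoder_outputs(encoder_out: torch.Tensor,
                              speaker_emb: torch.Tensor,
                              dialect_emb: torch.Tensor) -> torch.Tensor:
    """Broadcast per-utterance speaker and dialect embeddings over the
    text-encoder states and concatenate, so the decoder attends over
    states that carry both factors."""
    # encoder_out: (batch, text_len, enc_dim)
    # speaker_emb: (batch, spk_dim); dialect_emb: (batch, dia_dim)
    text_len = encoder_out.size(1)
    spk = speaker_emb.unsqueeze(1).expand(-1, text_len, -1)
    dia = dialect_emb.unsqueeze(1).expand(-1, text_len, -1)
    return torch.cat([encoder_out, spk, dia], dim=-1)


# Example shapes for a Tacotron-like encoder.
out = condition_encoder_outputs(torch.randn(2, 50, 512),
                                torch.randn(2, 128),
                                torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 50, 672])
```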
Report | 3 results
Research Products | 9 results