2022 Fiscal Year Final Research Report
Multilingual speech synthesis based on deep learning to reproduce the speaker and emotion of input speech in different languages
Project/Area Number | 20K11862
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 61010: Perceptual information processing-related
Research Institution | Nagoya Institute of Technology
Principal Investigator | HASHIMOTO Kei, Nagoya Institute of Technology, Graduate School of Engineering, Associate Professor (10635907)
Project Period (FY) | 2020-04-01 – 2023-03-31
Keywords | Speech synthesis
Outline of Final Research Achievements |
To realize multilingual speech synthesis that reproduces the speaker and emotion of input speech in different languages, I have been working on deep neural network (DNN)-based multilingual speech synthesis that separates the speech features associated with the language, speaker, and emotion of the input speech. I proposed multilingual speech synthesis based on adversarial learning to separate language and speaker features, and a model structure to separate speaker and emotion features. Additionally, I proposed a speech synthesis model that uses face images as auxiliary features. The proposed methods are expected to enable more natural global communication by generating speech that reproduces the characteristics of the speaker in different languages.
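As a rough illustration of the adversarial separation mentioned above, the sketch below shows the standard gradient-reversal approach to removing language information from a speaker embedding. It is a minimal PyTorch example under assumed settings, not the report's actual model; all module names, feature dimensions, and losses are illustrative.

```python
# Minimal sketch (assumptions, not the report's model): adversarial feature
# separation with a gradient reversal layer (GRL). A language classifier is
# trained on the speaker embedding, but its gradients are reversed before
# reaching the encoder, so the encoder learns to discard language cues.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SpeakerEncoder(nn.Module):
    """Maps acoustic features to a speaker embedding (hypothetical sizes)."""
    def __init__(self, feat_dim=80, emb_dim=128, num_languages=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Adversarial branch: through the GRL, minimizing its loss pushes
        # language information *out* of the speaker embedding.
        self.lang_classifier = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(),
            nn.Linear(64, num_languages),
        )

    def forward(self, feats, lambd=1.0):
        emb = self.encoder(feats)
        lang_logits = self.lang_classifier(GradReverse.apply(emb, lambd))
        return emb, lang_logits

# Usage: in a full system, a synthesis loss would pull the embedding toward
# speaker identity while this adversarial loss removes language information.
model = SpeakerEncoder()
feats = torch.randn(4, 80)                # a batch of frame-level features
lang_labels = torch.tensor([0, 1, 0, 1])  # language IDs for the batch
emb, lang_logits = model(feats)
adv_loss = nn.functional.cross_entropy(lang_logits, lang_labels)
adv_loss.backward()  # reversed gradients make the embedding language-independent
```

The appeal of the gradient reversal layer is that a single backward pass trains the language classifier normally while driving the encoder in the opposite direction, which is what yields a speaker embedding shared across languages.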
Free Research Field | Speech information processing
Academic Significance and Societal Importance of the Research Achievements |
This research focused on three characteristics contained in speech, namely speaker, language, and emotion, and worked to establish multilingual speech synthesis technology that reproduces the voice quality and emotion of input speech in a language different from that of the input. By applying the results of this research to speech translation systems, it is expected that natural communication, including emotional expression, can be realized in one's own voice even in languages one cannot speak.