2019 Fiscal Year Research-status Report
Can we reduce misperceptions of the emotional content of speech in noisy environments?
Project/Area Number | 19K24373
Research Institution | National Institute of Informatics
Principal Investigator | Zhao Yi, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Researcher (10843162)
Project Period (FY) | 2019-08-30 – 2021-03-31
Keywords | emotion enhancement / speaker embedding / neural vocoder / VQVAE / F0 encoder
Outline of Annual Research Achievements
Although many studies have addressed enhancing speech intelligibility in noisy environments, none of them has taken into account the interaction between emotional categories and the Lombard effect in noise at the same time. Because of the complex variations of emotional speech under noisy conditions, traditional enhancement methods are not directly applicable to emotional speech in noise. Our proposed idea aims to reduce misunderstanding of the emotional content of speech produced under noisy conditions.
In the first term, we conducted a mapping from general emotional speech to target emotional speech using data from well-trained speakers. First, we collected sufficient corpora for the experiments. Second, we investigated different neural vocoders and published a paper at ICASSP 2020. Third, we designed the most suitable way to extract speaker embeddings for our work. Last but not least, we significantly improved the quality of speech regenerated by the original waveform-level VQVAE model by adding an F0 module and carefully controlling the loss function. We achieved good evaluation results and are submitting a paper to Interspeech 2020 based on this work.
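As an illustration of the last point, the following is a minimal PyTorch sketch of a VQ-VAE whose decoder is conditioned on a speaker embedding and on features from an added F0 encoder. It operates on mel-spectrogram frames for brevity (the project's model works at the waveform level), and all layer types, sizes, and names are illustrative assumptions rather than the exact architecture used.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=256, dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta  # weight of the commitment term

        def forward(self, z):  # z: (batch, frames, dim)
            w = self.codebook.weight
            # squared Euclidean distance from each frame to each code
            d = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ w.t()
                 + w.pow(2).sum(-1))
            q = self.codebook(d.argmin(-1))  # nearest code per frame
            # codebook loss + commitment loss, straight-through gradient
            loss = (F.mse_loss(q, z.detach())
                    + self.beta * F.mse_loss(z, q.detach()))
            return z + (q - z).detach(), loss

    class EmotionalVQVAE(nn.Module):
        def __init__(self, n_mels=80, dim=64, n_speakers=100):
            super().__init__()
            self.content_enc = nn.GRU(n_mels, dim, batch_first=True)
            self.f0_enc = nn.GRU(1, dim, batch_first=True)  # added F0 module
            self.vq = VectorQuantizer(dim=dim)
            self.spk_emb = nn.Embedding(n_speakers, dim)    # speaker embedding
            self.decoder = nn.GRU(3 * dim, n_mels, batch_first=True)

        def forward(self, mel, f0, spk_id):
            z, _ = self.content_enc(mel)
            q, vq_loss = self.vq(z)
            f, _ = self.f0_enc(f0.unsqueeze(-1))
            s = self.spk_emb(spk_id).unsqueeze(1).expand(-1, mel.size(1), -1)
            out, _ = self.decoder(torch.cat([q, f, s], dim=-1))
            return out, F.l1_loss(out, mel) + vq_loss  # combined training loss

The combined loss balances the reconstruction term against the codebook and commitment terms, which mirrors the careful control of the loss function mentioned above.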
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
According to our current experimental results, the project is progressing smoothly. Our aim in this project is to reduce misunderstanding of the emotional content of speech produced in noise by enhancing the emotion of speech under noisy environments. We have recorded a private emotional corpus with parallel utterances under both clean and noisy conditions. We have also gathered several other multi-emotional or multi-speaker corpora, such as JTES and JVS, to enlarge our training database. We finally decided to use WaveRNN for our work. We compared various kinds of speaker embeddings, including x-vectors and LDE vectors, with and without whitening. We finished experiments on speech reconstruction, voice conversion, and emotion conversion based on the designed VQVAE model.
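For reference, one common recipe for whitening speaker embeddings such as x-vectors or LDE vectors before comparison looks like the following NumPy sketch; the ZCA formulation and length normalization here are assumptions, not necessarily the exact procedure used in our experiments.

    import numpy as np

    def fit_whitener(embs, eps=1e-8):
        """Estimate mean and whitening matrix from training embeddings (N, D)."""
        mu = embs.mean(axis=0)
        cov = np.cov(embs - mu, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)  # eigendecomposition of the covariance
        W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA whitening
        return mu, W

    def whiten(emb, mu, W):
        """Apply whitening, then length-normalize to unit L2 norm."""
        w = (emb - mu) @ W
        return w / np.linalg.norm(w)

    # Usage: x = whiten(xvector, *fit_whitener(train_xvectors))

Whitening decorrelates the embedding dimensions so that cosine or Euclidean comparisons are not dominated by a few high-variance directions.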
Strategy for Future Research Activity
Our next step will focus on improving speech quality, speaker similarity, and emotion intelligibility after adaptation and enhancement. We have proposed to perform speaker and emotion conversion by combining a Vector Quantised-Variational AutoEncoder with characteristic embeddings (speaker identity embeddings and emotion embeddings). So far, this framework has only been tested with well-trained speakers' data and only in a clean environment. As the next step, we will select the less-confusable speech of less-trained speakers according to listeners' judgments and use the selected data for supervised adaptation, as sketched below. We will also move our experiments to noisy environments later.
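As a concrete illustration of the planned data selection, the sketch below keeps only utterances whose intended emotion was recognized by listeners with high agreement; the record format, field names, and the 0.7 threshold are hypothetical.

    from collections import Counter

    def select_less_confusable(judgments, threshold=0.7):
        """judgments: list of (utt_id, intended_emotion, [listener_labels])."""
        selected = []
        for utt_id, intended, labels in judgments:
            counts = Counter(labels)
            agreement = counts[intended] / len(labels)  # fraction matching intent
            if agreement >= threshold:                  # low-confusion utterance
                selected.append(utt_id)
        return selected

    # e.g. select_less_confusable([("u1", "angry", ["angry", "angry", "sad"])])
    # returns [] with threshold 0.7, since agreement 2/3 is below 0.7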
Causes of Carryover
We were planning to attend ICASSP 2020 and other international conferences using the budget, but these conferences have been postponed due to COVID-19. We therefore have to carry this budget over to the next fiscal year. We plan to use it for supercomputer fees, paid proofreading, and listening tests.