2019 Fiscal Year Research-status Report

Can we reduce misperceptions of the emotional content of speech in noisy environments?

Research Project

Project/Area Number 19K24373
Research Institution National Institute of Informatics

Principal Investigator

Zhao Yi  National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Researcher (10843162)

Project Period (FY) 2019-08-30 – 2021-03-31
Keywords emotion enhancement / speaker embedding / neural vocoder / VQVAE / F0 encoder
Outline of Annual Research Achievements

Although many studies have been carried out on enhancing speech intelligibility in noisy environments, none of them simultaneously takes into account the interaction between emotional categories and the Lombard effect in noise. Because of the complex variations of emotional speech under noisy conditions, traditional enhancement methods are no longer applicable to emotional speech in noise. Our proposed approach aims to reduce misunderstanding of the emotional content of speech produced under noisy conditions.

In the first term, we conducted a mapping from general emotional speech to target emotional speech using data from well-trained speakers. First, we collected a sufficient corpus for the experiments. Second, we investigated different neural vocoders and published a paper at ICASSP 2020. Third, we designed the most suitable way to extract speaker embeddings for our work. Last but not least, we significantly improved the quality of speech regenerated by the original waveform-level VQVAE model by adding an F0 module and carefully controlling the loss function. We achieved good evaluation results and are submitting a paper to Interspeech 2020 based on this work.
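
To make the F0-conditioning idea concrete, the following is a minimal sketch of a frame-level VQVAE whose decoder receives both the quantized content codes and an encoded F0 trajectory, with the reconstruction loss combined with the standard codebook and commitment terms. It is a toy PyTorch illustration under our own assumptions (module names, dimensions, and loss weights are all illustrative), not the actual implementation used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):                             # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Codebook loss pulls codes toward encoder outputs; the commitment
        # term (scaled by beta) keeps encoder outputs near their codes.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        return z + (q - z).detach(), vq_loss          # straight-through gradient

class VQVAEWithF0(nn.Module):
    """Toy frame-level VQVAE whose decoder is additionally conditioned on F0."""
    def __init__(self, frame_dim=80, latent_dim=64, f0_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.f0_encoder = nn.Sequential(nn.Linear(1, f0_dim), nn.Tanh())
        self.quantizer = VectorQuantizer(dim=latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + f0_dim, 128), nn.ReLU(),
                                     nn.Linear(128, frame_dim))

    def forward(self, frames, f0):                    # frames: (B, T, 80), f0: (B, T, 1)
        q, vq_loss = self.quantizer(self.encoder(frames))
        recon = self.decoder(torch.cat([q, self.f0_encoder(f0)], dim=-1))
        # Total loss balances reconstruction quality against the VQ terms.
        return recon, F.mse_loss(recon, frames) + vq_loss

frames, f0 = torch.randn(2, 100, 80), torch.rand(2, 100, 1) * 300.0
recon, loss = VQVAEWithF0()(frames, f0)
print(recon.shape, loss.item())
```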

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

According to our current experimental results, the project is progressing smoothly.
Our aim in this project is to reduce misunderstanding of the emotional content of speech produced under noisy conditions by enhancing the emotion of speech in noisy environments. We have recorded a private emotional corpus that is parallel across clean and noisy conditions. We have gathered several other multi-emotional or multi-speaker corpora, such as JTES and JVS, to enlarge our training database. We finally decided to use WaveRNN for our work. We have compared various kinds of speaker embeddings, including x-vectors and LDE vectors, with and without whitening. We have finished experiments on speech reconstruction, voice conversion, and emotion conversion based on the designed VQVAE model.
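
As an illustration of the whitening step mentioned above, the sketch below fits a PCA-whitening transform on a pool of embeddings and applies it with length normalization before cosine scoring. The embeddings here are random stand-ins; producing real x-vectors or LDE vectors requires a pretrained extractor that is not shown, and the function names are ours, not from any particular toolkit.

```python
import numpy as np

def fit_whitener(E, eps=1e-8):
    """Fit a PCA-whitening transform on an (N, D) matrix of embeddings."""
    mu = E.mean(axis=0)
    cov = np.cov(E - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigendecomposition of covariance
    W = vecs / np.sqrt(vals + eps)            # scale each eigenvector column
    return mu, W

def whiten(e, mu, W):
    """Whiten one embedding and length-normalize it for cosine scoring."""
    x = (e - mu) @ W
    return x / (np.linalg.norm(x) + 1e-12)

# Toy usage: 'embeddings' stands in for x-vectors / LDE vectors that would
# come from a pretrained extractor (not shown here).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))
mu, W = fit_whitener(embeddings)
a, b = whiten(embeddings[0], mu, W), whiten(embeddings[1], mu, W)
print("cosine similarity:", float(a @ b))
```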

Strategy for Future Research Activity

Our next step will focus on improving speech quality, speaker similarity, and emotion intelligibility after adaptation and enhancement.
We have proposed to perform speaker and emotion conversion by combining a Vector Quantised Variational Autoencoder (VQVAE) with characteristic embeddings (a speaker identity embedding and an emotion embedding). So far, this framework has been tested only with well-trained speakers' data and only in a clean environment. As a next step, we will select the less confusable speech of less-trained speakers according to listeners' judgments and use the selected data for supervised adaptation. We will also move our experiments to noisy environments later.
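
Schematically, conversion in such a framework amounts to decoding the quantized content codes together with utterance-level characteristic embeddings, so that swapping in a target speaker or emotion embedding at inference time changes the corresponding attribute. The sketch below shows only this conditioning pattern; all shapes and modules are hypothetical, not the project's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Decoder conditioned on quantized content codes plus identity embeddings."""
    def __init__(self, code_dim=64, spk_dim=32, emo_dim=8, frame_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + spk_dim + emo_dim, 128), nn.ReLU(),
            nn.Linear(128, frame_dim))

    def forward(self, codes, spk, emo):          # codes: (B, T, code_dim)
        T = codes.size(1)
        # Broadcast utterance-level speaker/emotion embeddings over all frames.
        cond = torch.cat([codes,
                          spk.unsqueeze(1).expand(-1, T, -1),
                          emo.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        return self.net(cond)

decoder = ConditionalDecoder()
codes = torch.randn(1, 100, 64)                  # quantized content from the encoder
src_spk, tgt_spk = torch.randn(1, 32), torch.randn(1, 32)
emotion = torch.randn(1, 8)
# Reconstruction uses the source embeddings; conversion swaps in the target
# speaker (or emotion) embedding while the content codes stay fixed.
recon = decoder(codes, src_spk, emotion)
converted = decoder(codes, tgt_spk, emotion)
print(recon.shape, converted.shape)
```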

Causes of Carryover

We were planning to attend ICASSP 2020 and other international conferences using the budget, but these conferences were postponed due to COVID-19, so we have carried the budget over to the next fiscal year. We plan to use it for supercomputer fees, paid proofreading, and listening tests.

  • Research Products

    (5 results)


  • [Int'l Joint Research] Aalto University (Finland)

    • Country Name
      FINLAND
    • Counterpart Institution
      Aalto University
  • [Journal Article] Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation (2020)

    • Author(s)
      Yi Zhao; Xin Wang; Lauri Juvela; Junichi Yamagishi
    • Journal Title

      ICASSP 2020

      Volume: - Pages: 6269 - 6273

    • DOI

      10.1109/ICASSP40776.2020

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Waveform loss-based acoustic modeling for text-to-speech synthesis and speech-to-musical sound transferring (2019)

    • Author(s)
      Yi Zhao
    • Organizer
      Seminar at the National University of Singapore
    • Invited
  • [Remarks] Samples for emotional clean/noisy speech

    • URL

      https://nii-yamagishilab.github.io/EmotionaLombardSpeech/

  • [Remarks] Samples for neural waveform vocoders

    • URL

      https://nii-yamagishilab.github.io/samples-nsf/neural-music.html

Published: 2021-01-27  
