2019 Fiscal Year Research-status Report
One model for all sounds: fast and high-quality neural source-filter model for speech and non-speech waveform modeling
Project/Area Number | 19K24371 |
Research Institution | National Institute of Informatics |
Principal Investigator | Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141) |
Project Period (FY) | 2019-08-30 – 2021-03-31 |
Keywords | Speech synthesis / Waveform modeling / Deep learning / Neural network |
Outline of Annual Research Achievements |
How to generate natural-sounding speech waveforms from a digital system is a fundamental question in speech science. The purpose of this project is to combine classical speech science with recent deep-learning techniques and design a neural waveform model that generates high-quality waveforms at a fast speed. Specifically, this project has three goals: 1. fast waveform generation; 2. improved quality of generated waveforms; 3. generation of not only speech but also non-speech waveforms.
In the first year, we proposed a family of neural source-filter waveform models that combine the classical source-filter model of speech production with recent dilated convolutional neural networks. We have achieved all three goals above. For the first goal, we showed that the proposed models generate waveforms in real time, faster than the commonly used WaveNet model, with generated speech quality close to WaveNet's. This result was published as a journal paper in the IEEE/ACM Transactions on Audio, Speech, and Language Processing. For the second goal, we introduced the harmonic-plus-noise model with a trainable maximum voiced frequency into the neural source-filter models, which improved the quality of the generated speech waveforms. This result was published at the ISCA Speech Synthesis Workshop. Finally, we applied the proposed model to generate music signals for multiple instruments, where it outperformed the WaveNet and WaveGlow models. This work was accepted at ICASSP 2020.
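To illustrate the source module of such a model, the following is a minimal NumPy sketch (not the released implementation; the function name, sine amplitude, and noise level are illustrative assumptions) of a sine-based source excitation: voiced samples carry a sine wave whose phase is accumulated from the F0 contour, while unvoiced samples carry low-level Gaussian noise.

```python
import numpy as np

def sine_source(f0, sr=16000, noise_std=0.003):
    """Illustrative sine-based source excitation (values are assumptions).

    f0: sample-level F0 contour in Hz, with 0 marking unvoiced samples.
    """
    # Instantaneous phase: accumulate F0 (Hz) into radians per sample
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    voiced = f0 > 0
    # Sine excitation in voiced regions, zero elsewhere
    src = np.where(voiced, 0.1 * np.sin(phase), 0.0)
    # Additive Gaussian noise everywhere (dominates in unvoiced regions)
    src = src + noise_std * np.random.randn(len(f0))
    return src
```

In the actual models, such a source signal is transformed by dilated-convolution filter blocks into the output waveform; this sketch only shows the excitation stage.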
We open-sourced the code and scripts for the proposed models, and several applications have already been built on them.
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
As the Outline of Annual Research Achievements describes, we have proposed a family of neural source-filter waveform models and achieved the three goals defined in the proposal. For the first goal, we published a journal paper, Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis, in the IEEE/ACM Transactions on Audio, Speech, and Language Processing. We showed that the proposed model is both theoretically and empirically faster than the well-known WaveNet model in waveform generation, with generated speech quality close to WaveNet's on a large-scale Japanese female voice corpus. For the second goal, we published a paper, Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis, at the ISCA Speech Synthesis Workshop. Here, we combined the proposed neural source-filter models with the classical harmonic-plus-noise model and a trainable maximum voiced frequency, and we showed that this new model outperformed previous neural source-filter models. For the third goal, our paper Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation was recently accepted at the IEEE ICASSP conference. In this paper, we showed how the neural source-filter model can generate sounds of multiple musical instruments such as violin and trumpet, and our experiments showed that the proposed models outperformed WaveNet and WaveGlow. Furthermore, we can transfer a model trained on speech data to music data through simple fine-tuning, which achieved the best performance among all experimental models on the corpus.
Strategy for Future Research Activity |
Although the three goals in the proposal have been achieved, we found a few shortcomings of the proposed models and plan to further improve them.
For speech waveform generation, we found that the sine-based source signals in the proposed neural source-filter models may not be the optimal choice for specific voiced sounds such as creaky, low-pitched, and breathy voices. Guided by classical work on speech production and perception, we plan to try different types of source signals and further improve the quality of the generated waveforms for those voiced sound types.
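As one candidate among the alternative source signals mentioned above, the sketch below (an illustrative assumption, not the project's released code) builds a harmonic source by summing sine harmonics that stay below a maximum voiced frequency; in the published harmonic-plus-noise model this frequency is trainable, while here it is fixed for simplicity.

```python
import numpy as np

def harmonic_source(f0, sr=16000, max_voiced_freq=4000.0, num_harmonics=8):
    """Illustrative harmonic source (fixed band edge; trainable in the paper).

    f0: sample-level F0 contour in Hz, with 0 marking unvoiced samples.
    """
    # Fundamental instantaneous phase in radians per sample
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    src = np.zeros(len(f0))
    for k in range(1, num_harmonics + 1):
        # Keep the k-th harmonic only where it lies below the voiced band edge
        mask = (f0 > 0) & (k * f0 < max_voiced_freq)
        # 1/k amplitude roll-off, a common simple choice for glottal-like spectra
        src = src + np.where(mask, np.sin(k * phase) / k, 0.0)
    return src
```

Varying the harmonic amplitudes and the band edge over time is one way such a source could be adapted to creaky or breathy voice qualities.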
For music waveform generation, we found that the proposed model achieved high performance on monophonic instruments, i.e., instruments that play only one note at a time. However, its performance degraded on polyphonic string instruments such as violin and cello. We plan to investigate this issue and introduce ideas from digital signal processing to model polyphonic instrument sounds.
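One simple DSP-inspired direction, sketched below purely as an illustration (the function name and interface are assumptions, not the project's planned design), is to form a polyphonic excitation by summing one sine source per simultaneous note track:

```python
import numpy as np

def polyphonic_source(f0_tracks, sr=16000):
    """Illustrative polyphonic excitation: one sine per simultaneous note.

    f0_tracks: array of shape (num_notes, num_samples); 0 Hz marks silence.
    """
    out = np.zeros(f0_tracks.shape[1])
    for f0 in f0_tracks:
        # Accumulate each note's F0 into its own instantaneous phase
        phase = 2.0 * np.pi * np.cumsum(f0 / sr)
        out = out + np.where(f0 > 0, np.sin(phase), 0.0)
    # Normalize by the number of tracks to keep the amplitude bounded
    return out / max(f0_tracks.shape[0], 1)
```

A filter network conditioned on all note tracks would then shape such a mixed excitation, which is one possible way to extend the single-F0 source used for monophonic instruments.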
Causes of Carryover |
Because of COVID-19, academic conferences and trips abroad were canceled, and the budget for those trips was not used. Instead of academic travel, the budget was used to upgrade computer equipment.
Remarks |
Webpage (1) is the home page of our work on neural source-filter waveform models, including slides and audio samples. Webpages (2) and (3) host the open-sourced code and scripts for the proposed models.
Research Products
(9 results)