2019 Fiscal Year Research-status Report
One model for all sounds: fast and high-quality neural source-filter model for speech and non-speech waveform modeling
Project/Area Number | 19K24371 |
Research Institution | National Institute of Informatics |
Principal Investigator | Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141) |
Project Period (FY) | 2019-08-30 – 2021-03-31 |
Keywords | Speech synthesis / Waveform modeling / Deep learning / Neural network |
Outline of Annual Research Achievements |
How to generate natural-sounding speech waveforms from a digital system is a fundamental question in speech science. The purpose of this project is to combine classical speech science with recent deep-learning techniques and design a neural waveform model that generates high-quality waveforms at a fast speed. Specifically, this project has three goals: 1. fast waveform generation; 2. improved quality of generated waveforms; 3. generation of not only speech but also non-speech waveforms.
In the first year, we proposed a family of neural source-filter waveform models that combine the classical source-filter model of speech production with recent dilated convolutional neural networks. We have achieved all three goals above. For the first goal, we showed that the proposed models generate waveforms in real time, faster than the commonly used WaveNet model, with generated speech quality close to WaveNet's. This result was published as a journal paper in the IEEE/ACM Transactions on Audio, Speech, and Language Processing. For the second goal, we introduced the harmonic-plus-noise model with a trainable maximum voiced frequency into the neural source-filter models, which improved the quality of the generated speech waveforms. This result was published at the ISCA Speech Synthesis Workshop. Finally, we applied the proposed model to generate music signals for multiple instruments, where it outperformed the WaveNet and WaveGlow models. This work was accepted at ICASSP 2020.
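To illustrate the source module of such a model, the following is a minimal NumPy sketch (not the released implementation; the function name, sine amplitude, and noise level are illustrative assumptions) of a sine-based source excitation: voiced samples carry a sine wave whose phase is accumulated from the F0 contour, while unvoiced samples carry low-level Gaussian noise.

```python
import numpy as np

def sine_source(f0, sr=16000, noise_std=0.003):
    """Illustrative sine-based source excitation (values are assumptions).

    f0: sample-level F0 contour in Hz, with 0 marking unvoiced samples.
    """
    # Instantaneous phase: accumulate F0 (Hz) into radians per sample
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    voiced = f0 > 0
    # Sine excitation in voiced regions, zero elsewhere
    src = np.where(voiced, 0.1 * np.sin(phase), 0.0)
    # Additive Gaussian noise everywhere (dominates in unvoiced regions)
    src = src + noise_std * np.random.randn(len(f0))
    return src
```

In the actual models, such a source signal is transformed by dilated-convolution filter blocks into the output waveform; this sketch only shows the excitation stage.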
We open-sourced the code and scripts for the proposed models, and several applications have already been built on them.
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
As the Outline of Annual Research Achievements describes, we have proposed a family of neural source-filter waveform models and achieved the three goals defined in the proposal. For the first goal, we published a journal paper, Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis, in the IEEE/ACM Transactions on Audio, Speech, and Language Processing. We showed that the proposed model is both theoretically and empirically faster than the well-known WaveNet model in waveform generation, with generated speech quality close to WaveNet's on a large-scale Japanese female voice corpus. For the second goal, we published a paper, Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis, at the ISCA Speech Synthesis Workshop. Here, we combined the proposed neural source-filter models with the classical harmonic-plus-noise model and a trainable maximum voiced frequency, and we showed that this new model outperformed previous neural source-filter models. For the third goal, our paper Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation was recently accepted at the IEEE ICASSP conference. In this paper, we showed how the neural source-filter model can generate sounds of multiple musical instruments such as violin and trumpet, and our experiments showed that the proposed models outperformed WaveNet and WaveGlow. Furthermore, we can transfer a model trained on speech data to music data through simple fine-tuning, which achieved the best performance among all experimental models on the corpus.
Strategy for Future Research Activity |
Although the three goals in the proposal have been achieved, we found a few shortcomings of the proposed models and plan to further improve them.
For speech waveform generation, we found that the sine-based source signals in the proposed neural source-filter models may not be the optimal choice for specific voiced sounds such as creaky, low-pitched, and breathy voices. Guided by classical work on speech production and perception, we plan to try different types of source signals and further improve the quality of the generated waveforms for those voiced sound types.
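As one candidate among the alternative source signals mentioned above, the sketch below (an illustrative assumption, not the project's released code) builds a harmonic source by summing sine harmonics that stay below a maximum voiced frequency; in the published harmonic-plus-noise model this frequency is trainable, while here it is fixed for simplicity.

```python
import numpy as np

def harmonic_source(f0, sr=16000, max_voiced_freq=4000.0, num_harmonics=8):
    """Illustrative harmonic source (fixed band edge; trainable in the paper).

    f0: sample-level F0 contour in Hz, with 0 marking unvoiced samples.
    """
    # Fundamental instantaneous phase in radians per sample
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    src = np.zeros(len(f0))
    for k in range(1, num_harmonics + 1):
        # Keep the k-th harmonic only where it lies below the voiced band edge
        mask = (f0 > 0) & (k * f0 < max_voiced_freq)
        # 1/k amplitude roll-off, a common simple choice for glottal-like spectra
        src = src + np.where(mask, np.sin(k * phase) / k, 0.0)
    return src
```

Varying the harmonic amplitudes and the band edge over time is one way such a source could be adapted to creaky or breathy voice qualities.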
For music waveform generation, we found that the proposed model achieved high performance on monophonic instruments, i.e., instruments that play only one note at a time. However, its performance degraded on polyphonic string instruments such as violin and cello. We plan to investigate this issue and introduce ideas from digital signal processing to model polyphonic instrument sounds.
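One simple DSP-inspired direction, sketched below purely as an illustration (the function name and interface are assumptions, not the project's planned design), is to form a polyphonic excitation by summing one sine source per simultaneous note track:

```python
import numpy as np

def polyphonic_source(f0_tracks, sr=16000):
    """Illustrative polyphonic excitation: one sine per simultaneous note.

    f0_tracks: array of shape (num_notes, num_samples); 0 Hz marks silence.
    """
    out = np.zeros(f0_tracks.shape[1])
    for f0 in f0_tracks:
        # Accumulate each note's F0 into its own instantaneous phase
        phase = 2.0 * np.pi * np.cumsum(f0 / sr)
        out = out + np.where(f0 > 0, np.sin(phase), 0.0)
    # Normalize by the number of tracks to keep the amplitude bounded
    return out / max(f0_tracks.shape[0], 1)
```

A filter network conditioned on all note tracks would then shape such a mixed excitation, which is one possible way to extend the single-F0 source used for monophonic instruments.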
Causes of Carryover |
Because of COVID-19, academic conferences and trips abroad were canceled, and the budget for those trips was not used. Instead of academic travel, the budget was used to upgrade computer equipment.
Remarks |
Webpage (1) is the home page of our work on neural source-filter waveform models, including slides and audio samples. Webpages (2) and (3) host the open-sourced code and scripts for the proposed models.
Research Products
(9 results)