Research Project/Area Number |
19K24371
|
Research Institution | National Institute of Informatics |
Principal Investigator |
Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141)
|
Project Period (FY) |
2019-08-30 – 2021-03-31
|
Keywords | Speech synthesis / Waveform modeling / Deep learning / Neural network |
Summary of Research Achievements |
How to generate natural-sounding speech waveforms from a digital system is a fundamental question in speech science. The purpose of this project is to combine classical speech science with recent deep-learning techniques and design a neural waveform model that generates high-quality waveforms at a fast speed. Specifically, this project has three goals: 1. fast waveform generation; 2. improved quality of generated waveforms; 3. generation of not only speech but also non-speech waveforms.
In the first year, we proposed a family of neural source-filter waveform models that combine the classical source-filter model of speech production with recent dilated convolutional neural networks. We have achieved the three goals above. For the first goal, we showed that the proposed models generate waveforms in real time, faster than the commonly used WaveNet model, while the quality of the generated speech is close to that of WaveNet. This work was published as a journal paper in IEEE/ACM Transactions on Audio, Speech, and Language Processing. For the second goal, we introduced the harmonic-plus-noise model with a trainable maximum voiced frequency into the neural source-filter models, which improved the quality of the generated speech waveforms. This work was published at the ISCA Speech Synthesis Workshop. Finally, we applied the proposed model to generate music signals for multiple instruments, where it outperformed the WaveNet and WaveGlow models. This work was accepted at ICASSP 2020.
We open-sourced the code and scripts for the proposed models, and several applications have already been built on top of them.
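To illustrate the source module of these models, the following is a minimal NumPy sketch (our simplified illustration, not the released code) of a sine-based source signal generated from a frame-level F0 contour; in the actual models, this excitation is then shaped by dilated-convolution filter blocks conditioned on acoustic features:

```python
import numpy as np

def sine_source(f0_frames, frame_len=80, sr=16000, noise_std=0.003):
    """Sketch of a sine-based source signal for a neural source-filter
    model. f0_frames: frame-level F0 in Hz (0 marks unvoiced frames).
    Returns a sample-level excitation waveform at sample rate `sr`."""
    # upsample frame-level F0 to the sample level
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), frame_len)
    voiced = f0 > 0
    # instantaneous phase is the cumulative sum of 2*pi*f0/sr
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    source = np.where(voiced, 0.1 * np.sin(phase), 0.0)
    # small Gaussian noise everywhere; it dominates in unvoiced regions
    source += noise_std * np.random.randn(len(f0))
    return source

# excitation for unvoiced, 220 Hz voiced, 220 Hz voiced, unvoiced frames
excitation = sine_source([0, 220.0, 220.0, 0])
```

Because the sine phase follows the F0 contour directly, the model does not need to learn periodicity from scratch, which is one reason generation can be fast and stable.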
|
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
As the Summary of Research Achievements describes, we have proposed a family of neural source-filter waveform models and achieved the three goals defined in the proposal. For the first goal, we published the IEEE/ACM Transactions paper Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. We showed that the proposed model is both theoretically and empirically faster than the well-known WaveNet model in waveform generation, and that the quality of the generated speech is close to that of WaveNet on a large-scale Japanese female voice corpus. For the second goal, we published the ISCA Speech Synthesis Workshop paper Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis. Here, we combined the proposed neural source-filter models with the classical harmonic-plus-noise model and a trainable maximum voiced frequency, and showed that this new model outperformed the previous neural source-filter models. For the third goal, our paper Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation was recently accepted at the IEEE conference ICASSP. In this paper, we showed how the neural source-filter model can be used to generate the sounds of multiple musical instruments, such as violin and trumpet. Our experiments showed that the proposed models outperformed WaveNet and WaveGlow. Furthermore, we can transfer a model trained on speech data to music data by simple fine-tuning, which achieved the best performance among all the experimental models on the corpus.
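The harmonic-plus-noise idea behind the second paper can be sketched as follows. This is a simplified spectral-mask illustration of our own making: in the actual model, the maximum voiced frequency is a trainable parameter and the filtering is performed by learned neural filter blocks rather than fixed FFT masks:

```python
import numpy as np

def hn_mix(harmonic, noise, max_voiced_freq, sr=16000):
    """Combine a harmonic branch (kept below the maximum voiced
    frequency) with a noise branch (kept above it) via FFT masks."""
    n = len(harmonic)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    low = freqs <= max_voiced_freq            # pass-band for harmonics
    h = np.fft.irfft(np.fft.rfft(harmonic) * low, n)
    z = np.fft.irfft(np.fft.rfft(noise) * ~low, n)
    return h + z

t = np.arange(1600) / 16000.0
out = hn_mix(np.sin(2 * np.pi * 100 * t),    # harmonic, below cut-off
             np.sin(2 * np.pi * 6000 * t),   # noise band, above cut-off
             max_voiced_freq=4000)
```

Keeping periodic energy below the cut-off and aperiodic energy above it mirrors how voiced speech loses harmonicity at high frequencies, which is what improved the quality of the generated waveforms.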
|
Strategy for Future Research Activity |
Although the three goals in the proposal have been achieved, we found a few shortcomings in the proposed models and plan to improve them further.
For speech waveform generation, we found that the sine-waveform-based source signals in the proposed neural source-filter models may not be the optimal choice for specific voiced sounds such as creaky, low-pitched, and breathy voices. Based on classical work on speech production and perception, we plan to try different types of source signals and further improve the quality of the generated waveforms for these voiced sound types.
For music waveform generation, we found that the proposed model achieved high performance on monophonic instruments, i.e., instruments that can play only one note at a time. However, its performance degraded on polyphonic string instruments such as violin and cello. We plan to investigate this issue and introduce ideas from digital signal processing to model polyphonic instrument sounds.
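One DSP-inspired direction can be sketched as follows: drive the model with one sine-based excitation per simultaneous note and sum them into a single polyphonic source. This is a hypothetical illustration of the plan, not an implemented or evaluated model:

```python
import numpy as np

def polyphonic_source(f0_tracks, sr=16000):
    """Sum one sine-based source per simultaneous note. Each track is
    a sample-level F0 contour in Hz (0 = note inactive); all tracks
    are assumed to have the same length."""
    mix = np.zeros(len(f0_tracks[0]))
    for f0 in f0_tracks:
        f0 = np.asarray(f0, dtype=float)
        phase = 2.0 * np.pi * np.cumsum(f0 / sr)
        mix += np.where(f0 > 0, 0.1 * np.sin(phase), 0.0)
    return mix

# two simultaneous notes, e.g., a violin double stop
chord = polyphonic_source([np.full(160, 440.0), np.full(160, 660.0)])
```

A single sine excitation cannot represent two concurrent pitches, which may explain the degradation on polyphonic instruments; a summed multi-track source is one candidate remedy to investigate.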
|
Reason for Carryover to the Next Fiscal Year |
Because of COVID-19, academic conferences and trips abroad were canceled, and the budget allocated for those trips was not used. Instead of academic travel, the budget was used to upgrade computer equipment.
|
Remarks |
Webpage (1) is the home page of our work on neural source-filter waveform models, including slides and audio samples. Webpages (2) and (3) host the open-sourced code and scripts for the proposed models.
|