Summary of Research Achievements
How to generate natural-sounding speech waveforms from a digital system is a fundamental question in speech science. The purpose of this project is to combine classical speech science with recent deep-learning techniques and to design a neural waveform model that generates high-quality waveforms quickly. Specifically, this project has three goals: 1. fast waveform generation; 2. improved quality of generated waveforms; 3. generation of not only speech but also non-speech waveforms. In the first year, we proposed a family of neural source-filter (NSF) waveform models that combines the classical source-filter model of speech production with recent dilated convolutional neural networks, and we achieved these three goals. During the second year, we extended the proposed models toward the second and third goals. For the second goal, we enhanced the proposed models with a trainable cyclic-noise-based source module and demonstrated its improved performance when modeling multiple speakers' speech data with a single model; this work was published at Interspeech 2020. We also designed optional trainable digital FIR filters for the proposed models so that they can better model speech data with reverberation; this work was published at Interspeech 2020 and IEEE SLT 2021. For the third goal, we applied the proposed models to polyphonic piano sound modeling and demonstrated that the models work not only with monophonic but also with polyphonic sounds; a paper on this work is in preparation. Finally, we re-implemented and open-sourced the proposed models in PyTorch, a popular deep-learning framework.
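To illustrate the source-filter idea underlying the models described above, the following is a minimal NumPy sketch, not the project's actual implementation: a sine-based source module driven by an F0 contour produces an excitation signal, which is then shaped by an FIR filter stage. All function names, constants, and the fixed filter taps here are illustrative assumptions; in the actual models, the filter (a dilated convolutional network) and, in the extended version, the FIR taps are trainable.

```python
import numpy as np

def sine_source(f0, sr=16000, noise_std=0.003):
    """Generate a sine-based excitation from a sample-level F0 contour.

    Voiced samples (f0 > 0) get a sine wave whose phase is the running
    integral of F0; unvoiced samples fall back to Gaussian noise.
    This is a conceptual sketch, not the paper's exact source module.
    """
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)       # instantaneous phase
    voiced = (f0 > 0).astype(float)
    sine = 0.1 * np.sin(phase)                     # harmonic component
    noise = noise_std * np.random.randn(len(f0))   # additive noise
    return voiced * (sine + noise) + (1.0 - voiced) * noise

def fir_stage(excitation, taps):
    """One FIR filter stage; in the extended models the taps are trainable."""
    return np.convolve(excitation, taps, mode="same")

sr = 16000
f0 = np.full(sr // 10, 220.0)                 # 0.1 s voiced segment at 220 Hz
excitation = sine_source(f0, sr)
waveform = fir_stage(excitation, np.array([0.5, 0.3, 0.2]))  # illustrative taps
```

In the actual models the filter module is conditioned on acoustic features (e.g. mel-spectrograms) and trained end-to-end; this sketch only shows the signal flow from F0-driven source to filtered waveform.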