One model for all sounds: fast and high-quality neural source-filter model for speech and non-speech waveform modeling
Project/Area Number |
19K24371
|
Research Category |
Grant-in-Aid for Research Activity Start-up
|
Allocation Type | Multi-year Fund |
Review Section |
1002:Human informatics, applied informatics and related fields
|
Research Institution | National Institute of Informatics |
Principal Investigator |
Wang Xin National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141)
|
Project Period (FY) |
2019-08-30 – 2021-03-31
|
Project Status |
Completed (Fiscal Year 2020)
|
Budget Amount |
¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Fiscal Year 2020: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2019: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
|
Keywords | Speech synthesis / Waveform modeling / Deep learning / Neural network |
Outline of Research at the Start |
Generating natural-sounding waveforms with a computer is a fundamental topic in speech science. In this research, we plan to combine speech science and deep learning. We propose to combine a classical speech production model, the source-filter model, with a neural network, which yields a neural source-filter waveform model. Our model is expected to generate waveforms faster and with improved quality; it is also expected to be applicable not only to speech but also to singing voice and non-speech sounds. Such a model will be useful in many applications, such as text-to-speech.
|
Outline of Final Research Achievements |
How to generate natural-sounding speech waveforms with a digital system is a fundamental question in speech science. By combining classical speech science, signal processing methods, and recent deep-learning techniques, this research project proposes a family of neural waveform models called neural source-filter (NSF) waveform models. It was demonstrated that the proposed NSF models can produce high-quality waveforms much faster than the commonly used WaveNet models. It was also demonstrated that the NSF models can be extended to incorporate other classical methods from the speech modeling field, including the harmonic-plus-noise speech model. Finally, it was demonstrated that the NSF model can be applied to musical instrument audio, showing its flexibility and potential for modeling both speech and non-speech sounds.
|
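The classical source-filter idea that the outline above builds on can be illustrated with a minimal sketch. This is not the project's NSF implementation: the function names `sine_source` and `fir_filter`, the amplitude and noise values, and the three-tap filter are all assumptions chosen for illustration, and a fixed FIR filter stands in for the neural filter that an NSF model would learn.

```python
import numpy as np

def sine_source(f0, sample_rate=16000, amp=0.1, noise_std=0.003, seed=0):
    """Build an excitation signal from a per-sample F0 contour in Hz.

    Voiced samples (f0 > 0) get a sine whose phase is the running sum of
    the instantaneous frequency, plus a little noise. Unvoiced samples
    (f0 == 0) get noise only. All constants are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    # Accumulate phase so the sine follows a time-varying F0.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    noise = rng.normal(0.0, 1.0, size=f0.shape)
    voiced = f0 > 0
    return np.where(voiced,
                    amp * np.sin(phase) + noise_std * noise,
                    (amp / 3.0) * noise)

def fir_filter(source, taps):
    """Shape the excitation with a fixed FIR filter (a stand-in for a learned filter)."""
    return np.convolve(source, taps)[: len(source)]

# Half a second voiced at 220 Hz, then half a second unvoiced.
f0 = np.concatenate([np.full(8000, 220.0), np.zeros(8000)])
excitation = sine_source(f0)
waveform = fir_filter(excitation, np.array([0.5, 0.3, 0.2]))
```

In this sketch the source carries the pitch information and the filter only shapes the spectrum; replacing the fixed taps with a trainable network conditioned on acoustic features is the step that would turn the classical model into a neural one.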
Academic Significance and Societal Importance of the Research Achievements |
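Speech waveform modeling based on deep learning has been actively studied in recent years. While many models have been proposed that rely on deep-learning methods alone, this research proposed a model called the neural source-filter (NSF) waveform model by combining deep learning with classical signal processing techniques. The proposed model demonstrates one way to combine deep learning with signal processing methods. The proposed model is also used in real-world applications.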
|
Report
(3 results)
Research Products
(18 results)