2020 Fiscal Year Final Research Report

One model for all sounds: fast and high-quality neural source-filter model for speech and non-speech waveform modeling

Project/Area Number	19K24371
Research Category	Grant-in-Aid for Research Activity Start-up
Allocation Type	Multi-year Fund
Review Section	1002:Human informatics, applied informatics and related fields
Research Institution	National Institute of Informatics
Principal Investigator	Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141)
Project Period (FY)	2019-08-30 – 2021-03-31
Keywords	Speech synthesis / Waveform modeling / Deep learning / Neural network
Outline of Final Research Achievements	How to generate natural-sounding speech waveforms with a digital system is a fundamental question in speech science. By combining classical speech science, signal processing methods, and recent deep-learning techniques, this research project proposes a family of neural waveform models called neural source-filter (NSF) waveform models. It was demonstrated that the proposed NSF models can produce high-quality waveforms at a much faster speed than the commonly used WaveNet models. It was also demonstrated that the NSF models can be extended to incorporate other classical methods from the speech modeling field, including the harmonic-plus-noise speech model. Finally, it was demonstrated that the NSF model can be applied to musical instrument audio, showing its flexibility and potential in modeling both speech and non-speech sounds.
Free Research Field	Perceptual information processing
Academic Significance and Societal Importance of the Research Achievements	Speech waveform modeling based on deep learning has been actively studied in recent years. While many of the proposed models rely on deep learning methods alone, this research proposed a model called the neural source-filter (NSF) waveform model by combining deep learning with classical signal processing techniques. The proposed model demonstrates one way of combining deep learning and signal processing methods. In addition, the proposed model is being used in real-world applications.