Research Project/Area Number |
19K24371
|
Research Institution | National Institute of Informatics |
Principal Investigator |
Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141)
|
Project Period (FY) |
2019-08-30 – 2021-03-31
|
Keywords | Speech synthesis / Waveform modeling / Deep learning / Neural network |
Summary of Research Achievements |
How to generate natural-sounding speech waveforms from a digital system is a fundamental question in speech science. The purpose of this project is to combine classical speech science with recent deep-learning techniques and design a neural waveform model that generates high-quality waveforms at a fast speed. Specifically, this project has three goals: 1. fast waveform generation; 2. improved quality of generated waveforms; 3. generation of not only speech but also non-speech waveforms.
In the first year, we proposed a family of neural source-filter waveform models that combine the classical source-filter model of speech production with recent dilated convolutional neural networks. We have achieved the three goals above. For the first goal, we showed that the proposed models generate waveforms in real time, faster than the commonly used WaveNet model, while the quality of the generated speech is close to that of WaveNet. This work was published as a journal paper in IEEE/ACM Transactions on Audio, Speech, and Language Processing. For the second goal, we introduced the harmonic-plus-noise model with a trainable maximum voiced frequency into the neural source-filter models, which improved the quality of the generated speech waveforms. This work was published at the ISCA Speech Synthesis Workshop. Finally, we applied the proposed model to generate music signals for multiple instruments, where it outperformed the WaveNet and WaveGlow models. This work was accepted at ICASSP 2020.
We open-sourced the code and scripts for the proposed models, and several applications have already been built on top of them.
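To illustrate the source module of these models, the following is a minimal NumPy sketch (our simplified illustration, not the released code) of a sine-based source signal generated from a frame-level F0 contour; in the actual models, this excitation is then shaped by dilated-convolution filter blocks conditioned on acoustic features:

```python
import numpy as np

def sine_source(f0_frames, frame_len=80, sr=16000, noise_std=0.003):
    """Sketch of a sine-based source signal for a neural source-filter
    model. f0_frames: frame-level F0 in Hz (0 marks unvoiced frames).
    Returns a sample-level excitation waveform at sample rate `sr`."""
    # upsample frame-level F0 to the sample level
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), frame_len)
    voiced = f0 > 0
    # instantaneous phase is the cumulative sum of 2*pi*f0/sr
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    source = np.where(voiced, 0.1 * np.sin(phase), 0.0)
    # small Gaussian noise everywhere; it dominates in unvoiced regions
    source += noise_std * np.random.randn(len(f0))
    return source

# excitation for unvoiced, 220 Hz voiced, 220 Hz voiced, unvoiced frames
excitation = sine_source([0, 220.0, 220.0, 0])
```

Because the sine phase follows the F0 contour directly, the model does not need to learn periodicity from scratch, which is one reason generation can be fast and stable.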
|
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
As the Summary of Research Achievements describes, we have proposed a family of neural source-filter waveform models and achieved the three goals defined in the proposal. For the first goal, we published the IEEE/ACM Transactions paper Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. We showed that the proposed model is both theoretically and empirically faster than the well-known WaveNet model in waveform generation, and that the quality of the generated speech is close to that of WaveNet on a large-scale Japanese female voice corpus. For the second goal, we published the ISCA Speech Synthesis Workshop paper Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis. Here, we combined the proposed neural source-filter models with the classical harmonic-plus-noise model and a trainable maximum voiced frequency, and showed that this new model outperformed the previous neural source-filter models. For the third goal, our paper Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation was recently accepted at the IEEE conference ICASSP. In this paper, we showed how the neural source-filter model can be used to generate the sounds of multiple musical instruments, such as violin and trumpet. Our experiments showed that the proposed models outperformed WaveNet and WaveGlow. Furthermore, we can transfer a model trained on speech data to music data by simple fine-tuning, which achieved the best performance among all the experimental models on the corpus.
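The harmonic-plus-noise idea behind the second paper can be sketched as follows. This is a simplified spectral-mask illustration of our own making: in the actual model, the maximum voiced frequency is a trainable parameter and the filtering is performed by learned neural filter blocks rather than fixed FFT masks:

```python
import numpy as np

def hn_mix(harmonic, noise, max_voiced_freq, sr=16000):
    """Combine a harmonic branch (kept below the maximum voiced
    frequency) with a noise branch (kept above it) via FFT masks."""
    n = len(harmonic)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    low = freqs <= max_voiced_freq            # pass-band for harmonics
    h = np.fft.irfft(np.fft.rfft(harmonic) * low, n)
    z = np.fft.irfft(np.fft.rfft(noise) * ~low, n)
    return h + z

t = np.arange(1600) / 16000.0
out = hn_mix(np.sin(2 * np.pi * 100 * t),    # harmonic, below cut-off
             np.sin(2 * np.pi * 6000 * t),   # noise band, above cut-off
             max_voiced_freq=4000)
```

Keeping periodic energy below the cut-off and aperiodic energy above it mirrors how voiced speech loses harmonicity at high frequencies, which is what improved the quality of the generated waveforms.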
|
Strategy for Future Research Activity |
Although the three goals in the proposal have been achieved, we found a few shortcomings in the proposed models and plan to improve them further.
For speech waveform generation, we found that the sine-waveform-based source signals in the proposed neural source-filter models may not be the optimal choice for specific voiced sounds such as creaky, low-pitched, and breathy voices. Based on classical work on speech production and perception, we plan to try different types of source signals and further improve the quality of the generated waveforms for these voiced sound types.
For music waveform generation, we found that the proposed model achieved high performance on monophonic instruments, i.e., instruments that can play only one note at a time. However, its performance degraded on polyphonic string instruments such as violin and cello. We plan to investigate this issue and introduce ideas from digital signal processing to model polyphonic instrument sounds.
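One DSP-inspired direction can be sketched as follows: drive the model with one sine-based excitation per simultaneous note and sum them into a single polyphonic source. This is a hypothetical illustration of the plan, not an implemented or evaluated model:

```python
import numpy as np

def polyphonic_source(f0_tracks, sr=16000):
    """Sum one sine-based source per simultaneous note. Each track is
    a sample-level F0 contour in Hz (0 = note inactive); all tracks
    are assumed to have the same length."""
    mix = np.zeros(len(f0_tracks[0]))
    for f0 in f0_tracks:
        f0 = np.asarray(f0, dtype=float)
        phase = 2.0 * np.pi * np.cumsum(f0 / sr)
        mix += np.where(f0 > 0, 0.1 * np.sin(phase), 0.0)
    return mix

# two simultaneous notes, e.g., a violin double stop
chord = polyphonic_source([np.full(160, 440.0), np.full(160, 660.0)])
```

A single sine excitation cannot represent two concurrent pitches, which may explain the degradation on polyphonic instruments; a summed multi-track source is one candidate remedy to investigate.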
|
Reason for Carryover to the Next Fiscal Year |
Because of COVID-19, academic conferences and trips abroad were canceled, and the budget allocated for those trips was not used. Instead of academic travel, the budget was used to upgrade computer equipment.
|
Remarks |
Webpage (1) is the home page of our work on neural source-filter waveform models, including slides and audio samples. Webpages (2) and (3) host the open-sourced code and scripts for the proposed models.
|