Direct modeling of speech waveform using a DNN for text-to-speech synthesis

Research Project

Project/Area Number	16K16096
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Multi-year Fund
Research Field	Perceptual information processing
Research Institution	National Institute of Informatics
Principal Investigator	Takaki Shinji, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (50735090)
Project Period (FY)	2016-04-01 – 2019-03-31
Project Status	Completed (Fiscal Year 2018)
Budget Amount	¥3,900,000 (Direct Cost: ¥3,000,000, Indirect Cost: ¥900,000); Fiscal Year 2018: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2017: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2016: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Keywords	Speech synthesis / DNN / Spectrum / Deep neural network / Signal processing
Outline of Final Research Achievements	This project aimed to realize text-to-speech synthesis by modeling the speech waveform directly with a deep neural network, removing the heuristic processing built into conventional text-to-speech synthesis. We investigated three approaches: modeling amplitude spectra obtained with simple windowing and a Fourier transform, modeling spectra that include phase information, and modeling the speech waveform directly. As a result, we realized a direct waveform modeling method for text-to-speech synthesis.
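Academic Significance and Societal Importance of the Research Achievements	Speech waveform modeling with deep neural networks is being actively studied as a way to improve text-to-speech synthesis, the core technology of speech interfaces. This project addressed this high-profile research topic and improved the performance of text-to-speech synthesis. Expected ripple effects include better performance in existing systems that use text-to-speech synthesis and, with that improvement, wider adoption of applications built on it.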
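
The outline mentions amplitude spectra obtained with simple windowing and a Fourier transform. As a rough illustration only, and not the project's actual pipeline, the following Python sketch frames a signal, applies a Hann window, and takes the magnitude of a naive DFT. The window type, frame length, hop size, sampling rate, test tone, and all function names are assumptions chosen for the example; none of them come from the project reports.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window of length n (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames; trailing samples are dropped."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def amplitude_spectrum(frame):
    """Magnitude of the one-sided DFT of a Hann-windowed frame.

    Uses a naive O(n^2) DFT so the example needs only the standard library.
    """
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical test input: a 1 kHz sine sampled at 16 kHz.
sample_rate = 16000
tone_hz = 1000.0
signal = [math.sin(2.0 * math.pi * tone_hz * t / sample_rate)
          for t in range(1024)]

frames = frame_signal(signal, frame_len=256, hop=128)
spectra = [amplitude_spectrum(f) for f in frames]

# The tone falls exactly on bin 1000 * 256 / 16000 = 16.
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
print(len(frames), len(spectra[0]), peak_bin)  # 7 129 16
```

Each row of `spectra` is one frame's amplitude spectrum. The phase is discarded here, which is why the outline treats spectra that include phase information and direct waveform modeling as separate steps.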

Report

(4 results)

2018 Annual Research Report
Final Research Report (PDF)
2017 Research-status Report
2016 Research-status Report

Research Products
(15 results)

Journal Article (3 results) (of which Int'l Joint Research: 1 result, Peer Reviewed: 3 results, Open Access: 2 results, Acknowledgement Compliant: 1 result)
Presentation (12 results) (of which Int'l Joint Research: 5 results, Invited: 2 results)

[Journal Article] Complex-Valued Restricted Boltzmann Machine for Speaker-Dependent Speech Parameterization From Complex Spectra (2019)
- Author(s)
  Nakashika Toru, Takaki Shinji, Yamagishi Junichi
- Journal Title
  IEEE/ACM Transactions on Audio, Speech, and Language Processing
  Volume: 27 Issue: 2 Pages: 244-254
- DOI
  10.1109/taslp.2018.2877465
- Related Report
  2018 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] Investigating very deep highway networks for parametric speech synthesis (2018)
- Author(s)
  Wang Xin, Takaki Shinji, Yamagishi Junichi
- Journal Title
  Speech Communication
  Volume: 96 Pages: 1-9
- DOI
  10.1016/j.specom.2017.11.002
- Related Report
  2017 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis (2016)
- Author(s)
  Xin Wang, Shinji Takaki, Junichi Yamagishi
- Journal Title
  IEICE Transactions on Information and Systems
  Volume: E99.D Issue: 10 Pages: 2471-2480
- DOI
  10.1587/transinf.2016SLP0011
- NAID
  130005598240
- ISSN
  0916-8532, 1745-1361
- Related Report
  2016 Research-status Report
- Peer Reviewed / Acknowledgement Compliant
[Presentation] Training a DNN speech waveform model based on CWT spectral error (2019)
- Author(s)
  Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi
- Organizer
  Speech Research Meeting
- Related Report
  2018 Annual Research Report
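[Presentation] Training a DNN speech waveform model based on spectral sequence error (2018)
- Author(s)
  Shinji Takaki, Toru Nakashika, Junichi Yamagishi
- Organizer
  Acoustical Society of Japan Autumn Meeting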
- Related Report
  2018 Annual Research Report
[Presentation] Advances in text-to-speech synthesis through deep learning (2018)
- Author(s)
  Shinji Takaki
- Organizer
  Acoustical Society of Japan Spring Meeting
- Related Report
  2017 Research-status Report
- Invited
[Presentation] An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis (2017)
- Author(s)
  Xin Wang, Shinji Takaki, Junichi Yamagishi
- Organizer
  IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Place of Presentation
  Hilton Conference Centre, New Orleans, USA
- Year and Date
  2017-03-07
- Related Report
  2016 Research-status Report
- Int'l Joint Research
[Presentation] Very deep text-to-speech synthesis (2017)
- Author(s)
  Shinji Takaki
- Organizer
  Speech Research Meeting
- Place of Presentation
  The University of Tokyo
- Year and Date
  2017-01-21
- Related Report
  2016 Research-status Report
- Invited
[Presentation] Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis (2017)
- Author(s)
  Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi
- Organizer
  INTERSPEECH
- Related Report
  2017 Research-status Report
- Int'l Joint Research
[Presentation] Complex-valued restricted Boltzmann machine for direct learning of frequency spectra (2017)
- Author(s)
  Toru Nakashika, Shinji Takaki, Junichi Yamagishi
- Organizer
  INTERSPEECH
- Related Report
  2017 Research-status Report
- Int'l Joint Research
[Presentation] Generative Adversarial Network-based Postfilter for STFT Spectrograms (2017)
- Author(s)
  Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi
- Organizer
  INTERSPEECH
- Related Report
  2017 Research-status Report
- Int'l Joint Research
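[Presentation] Speech waveform generation based on phase recovery using FFT spectra for DNN-based text-to-speech synthesis (2016)
- Author(s)
  Shinji Takaki, SangJin Kim, Hirokazu Kameoka, Junichi Yamagishi
- Organizer
  18th Spoken Language Symposium
- Place of Presentation
  NTT Musashino R&D Center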
- Year and Date
  2016-12-20
- Related Report
  2016 Research-status Report
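[Presentation] Investigation of using speaker, gender, and age codes in DNN-based text-to-speech synthesis (2016)
- Author(s)
  Hieu Thi Luong, Shinji Takaki, SangJin Kim, Junichi Yamagishi
- Organizer
  Speech Research Meeting
- Place of Presentation
  Shizuoka University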
- Year and Date
  2016-10-27
- Related Report
  2016 Research-status Report
[Presentation] Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis (2016)
- Author(s)
  Shinji Takaki, SangJin Kim, Junichi Yamagishi
- Organizer
  9th Speech Synthesis Workshop (SSW9)
- Place of Presentation
  Plug and Play Tech Center
- Year and Date
  2016-09-14
- Related Report
  2016 Research-status Report
- Int'l Joint Research
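[Presentation] Performance evaluation of HMM-, DNN-, and RNN-based speech synthesis systems using a very large single-speaker dataset (2016)
- Author(s)
  Wang Xin, Shinji Takaki, Junichi Yamagishi
- Organizer
  112th Spoken Language Information Processing Research
- Place of Presentation
  Hohoemi no Yado Takinoyu, Tendo Onsen, Kamata-honcho, Tendo, Yamagata Prefecture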
- Year and Date
  2016-07-28
- Related Report
  2016 Research-status Report
