2018 Fiscal Year Annual Research Report

Robust voice cloning technologies in noisy environments and its applications

Research Project

Project/Area Number	17H04687
Research Institution	National Institute of Informatics
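Principal Investigator	Junichi Yamagishi (山岸 順一), National Institute of Informatics, Digital Content and Media Sciences Research Division, Associate Professor (70709352)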
Project Period (FY)	2017-04-01 – 2020-03-31
Keywords	Speech synthesis / Deep learning / Digital clone / Speaker adaptation
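Outline of Annual Research Achievements	Speaker adaptation is a "digital voice cloning" technology built on speech synthesis, and it has been notably successful in applications for people with speech disabilities. This research pioneers the component technologies needed to extend digital cloning to speech that was recorded in adverse conditions and was never intended for speech synthesis. Specifically, the goal is to realize deep-learning-based speaker adaptation and unsupervised speaker adaptation methods that improve robustness to noise and reverberation and remove the need for expensive recording equipment. This fiscal year, we advanced research on unsupervised adaptation, which makes digital voice cloning more convenient, and produced results. Speech synthesis normally uses recordings of read "phonetically balanced sentences", which are constructed artificially with the balance and frequency of phonemes and similar units in mind. However, in applications such as recreating the voice of a deceased person as a digital clone, recording new read speech is not an option, and the speech synthesis system must be built from already recorded conversational or dialogue speech that does not necessarily come with text data. To make digital voice cloning easy even from speech data without accompanying text, we proposed a new neural network structure called the Multi-modal architecture and showed that it allows speaker adaptation from speech alone. In addition, because the quality of synthetic speech is strongly limited by the vocoder, the technology that converts acoustic features into a speech waveform signal, we also worked intensively on improving the vocoder. We proposed new neural waveform models, including the Neural source-filter model, and published several papers.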
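Current Status of Research Progress	1: Research has progressed more than it was originally planned. Reason: In addition to developing the unsupervised speaker adaptation technology that was the original goal, we also succeeded in developing new neural waveform models such as the Neural source-filter model, so we judge that the project has progressed beyond the original plan.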
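Strategy for Future Research Activity	In the coming fiscal year, the final year of the project, we aim to evaluate the proposed unsupervised adaptation technology in greater detail and publish it as a journal paper. We will also integrate the component technologies proposed and developed so far, apply the proposed speaker adaptation technology, which is robust to adverse environments, to real data such as data from applications for people with disabilities, and examine its effectiveness. For example, people who have already lost their voice through illness or disease but who hold recordings of their past speech become much more likely to be able to use a personalized speech synthesis system with the proposed technology, so we will evaluate the degree of improvement.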

Research Products
(25 results)


Int'l Joint Research (2 results) / Journal Article (13 results) (of which Int'l Joint Research: 5 results, Peer Reviewed: 13 results, Open Access: 12 results) / Presentation (10 results) (of which Int'l Joint Research: 10 results)

[Int'l Joint Research] Aalto University (Finland)
- Country Name
  FINLAND
- Counterpart Institution
  Aalto University
[Int'l Joint Research] Polytechnic University of Catalonia (Spain)
- Country Name
  SPAIN
- Counterpart Institution
  Polytechnic University of Catalonia
[Journal Article] Complex-Valued Restricted Boltzmann Machine for Speaker-Dependent Speech Parameterization From Complex Spectra (2019)
- Author(s)
  Toru Nakashika, Shinji Takaki, Junichi Yamagishi
- Journal Title
  
  IEEE/ACM Transactions on Audio, Speech, and Language Processing
  
  Volume: 27(2) Pages: 244-254
- DOI
  https://doi.org/10.1109/TASLP.2018.2877465
- Peer Reviewed / Open Access
[Journal Article] STFT spectral loss for training a neural speech waveform model (2019)
- Author(s)
  Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access
[Journal Article] Neural source-filter-based waveform model for statistical parametric speech synthesis (2019)
- Author(s)
  Xin Wang, Shinji Takaki, Junichi Yamagishi
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access
[Journal Article] Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language (2019)
- Author(s)
  Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access
[Journal Article] Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks (2019)
- Author(s)
  Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics (2019)
- Author(s)
  Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access
[Journal Article] Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion (2019)
- Author(s)
  Shreyas Seshadri, Lauri Juvela, Junichi Yamagishi, Okko Rasanen, Paavo Alku
- Journal Title
  
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
  
  Volume: - Pages: in press
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis (2018)
- Author(s)
  Manu Airaksinen, Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku
- Journal Title
  
  IEEE/ACM Transactions on Audio, Speech, and Language Processing
  
  Volume: 26(9) Pages: 1658-1670
- DOI
  https://doi.org/10.1109/TASLP.2018.2835720
- Peer Reviewed / Int'l Joint Research
[Journal Article] Expressive Speech Synthesis Using Sentiment Embeddings (2018)
- Author(s)
  Igor Jauk, Jaime Lorenzo-Trueba, Junichi Yamagishi, Antonio Bonafonte
- Journal Title
  
  Proc. Interspeech 2018
  
  Volume: - Pages: 3062-3066
- DOI
  http://dx.doi.org/10.21437/Interspeech.2018-2467
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Speaker-independent Raw Waveform Model for Glottal Excitation (2018)
- Author(s)
  Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku
- Journal Title
  
  Proc. Interspeech 2018
  
  Volume: - Pages: 2012-2016
- DOI
  http://dx.doi.org/10.21437/Interspeech.2018-1635
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Multimodal Speech Synthesis Architecture for Unsupervised Speaker Adaptation (2018)
- Author(s)
  Hieu-Thi Luong, Junichi Yamagishi
- Journal Title
  
  Proc. Interspeech 2018
  
  Volume: - Pages: 2494-2498
- DOI
  http://dx.doi.org/10.21437/Interspeech.2018-1791
- Peer Reviewed / Open Access
[Journal Article] Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Neural Vocoder (2018)
- Author(s)
  Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu
- Journal Title
  
  IEEE Access
  
  Volume: 6(1) Pages: 60478-60488
- DOI
  https://doi.org/10.1109/ACCESS.2018.2872060
- Peer Reviewed / Open Access
[Journal Article] Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems (2018)
- Author(s)
  Hieu-Thi Luong, Junichi Yamagishi
- Journal Title
  
  2018 IEEE Spoken Language Technology Workshop (SLT)
  
  Volume: - Pages: 610-617
- DOI
  https://doi.org/10.1109/SLT.2018.8639659
- Peer Reviewed / Open Access
[Presentation] Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems (2019)
- Author(s)
  Hieu-Thi Luong, Junichi Yamagishi
- Organizer
  2018 IEEE Spoken Language Technology Workshop (SLT)
- Int'l Joint Research
[Presentation] STFT spectral loss for training a neural speech waveform model (2019)
- Author(s)
  Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Neural source-filter-based waveform model for statistical parametric speech synthesis (2019)
- Author(s)
  Xin Wang, Shinji Takaki, Junichi Yamagishi
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language (2019)
- Author(s)
  Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks (2019)
- Author(s)
  Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics (2019)
- Author(s)
  Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion (2019)
- Author(s)
  Shreyas Seshadri, Lauri Juvela, Junichi Yamagishi, Okko Rasanen, Paavo Alku
- Organizer
  2019 IEEE International Conference on Acoustics, Speech and Signal Processing
- Int'l Joint Research
[Presentation] Expressive Speech Synthesis Using Sentiment Embeddings (2018)
- Author(s)
  Igor Jauk, Jaime Lorenzo-Trueba, Junichi Yamagishi, Antonio Bonafonte
- Organizer
  Interspeech 2018
- Int'l Joint Research
[Presentation] Speaker-independent Raw Waveform Model for Glottal Excitation (2018)
- Author(s)
  Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku
- Organizer
  Interspeech 2018
- Int'l Joint Research
[Presentation] Multimodal Speech Synthesis Architecture for Unsupervised Speaker Adaptation (2018)
- Author(s)
  Hieu-Thi Luong, Junichi Yamagishi
- Organizer
  Interspeech 2018
- Int'l Joint Research
