
Encoder Factorization for Capturing Dialect and Articulation Level in End-to-End Speech Synthesis

Research Project

Project/Area Number 19K24372
Research Category

Grant-in-Aid for Research Activity Start-up

Allocation Type Multi-year Fund
Review Section 1002: Human informatics, applied informatics and related fields
Research Institution National Institute of Informatics

Principal Investigator

Cooper Erica  National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (30843156)

Project Period (FY) 2019-08-30 – 2021-03-31
Project Status Completed (Fiscal Year 2020)
Budget Amount
¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Fiscal Year 2020: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2019: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords Speech synthesis / Speaker modeling / Deep learning / Neural networks / Dialect modeling / Speaker similarity / Transfer learning / Articulation
Outline of Research at the Start

Synthesizing speech in a variety of speaker voices and styles has long been a goal in speech research. Recent advances in speech synthesis have produced very natural-sounding synthetic speech. Current approaches to modeling multiple speakers achieve high similarity to the target speakers but fail to capture characteristics such as dialect and level of articulation. We aim to determine whether explicitly modeling dialect and level of articulation in speech synthesis systems can successfully capture these aspects of speech.

Outline of Final Research Achievements

Synthesizing speech in many voices and styles has long been a goal in speech research. While current state-of-the-art synthesizers can produce very natural-sounding speech, matching the voice of a target speaker when only a small amount of that speaker's data is available remains a challenge, especially for characteristics such as dialect. We conducted experiments to determine what kind of speaker embeddings work best for synthesis in the voice of a new speaker, and found in a crowdsourced listening test that Learnable Dictionary Encoding (LDE) based speaker representations worked well. We also found that obtaining dialect representations in the same LDE-based way helped to improve the dialect of the synthesized speech. Finally, we explored data augmentation techniques using both artificially modified data and real "found" data from non-ideal recording conditions, and found that including the found data in model training could further improve the naturalness of the synthesized speech.
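
The LDE-based speaker representations mentioned above pool a variable-length sequence of frame-level features into a single fixed-size utterance embedding, using soft assignment of each frame to a learned dictionary of components. Below is a minimal PyTorch sketch of such a pooling layer, following the general LDE formulation; the feature dimension, component count, and initialization are illustrative assumptions, not this project's actual configuration (the project's released code is linked under Remarks below).

import torch
import torch.nn as nn
import torch.nn.functional as F

class LDEPooling(nn.Module):
    """Learnable Dictionary Encoding (LDE) pooling: aggregates
    variable-length frame-level features into a fixed-size
    utterance embedding via soft assignment to a learned
    dictionary of components. Sizes here are illustrative."""

    def __init__(self, feat_dim: int, num_components: int = 32):
        super().__init__()
        # Learned dictionary of component centers, one row per component.
        self.centers = nn.Parameter(torch.randn(num_components, feat_dim) * 0.1)
        # Learned per-component scale for the soft-assignment softmax.
        self.scale = nn.Parameter(torch.ones(num_components))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        residuals = frames.unsqueeze(2) - self.centers       # (B, T, C, D)
        sq_dist = residuals.pow(2).sum(dim=-1)               # (B, T, C)
        # Soft assignment of each frame to each dictionary component.
        weights = F.softmax(-self.scale * sq_dist, dim=-1)   # (B, T, C)
        # Weighted mean residual per component, flattened into one vector.
        num = (weights.unsqueeze(-1) * residuals).sum(dim=1) # (B, C, D)
        den = weights.sum(dim=1).unsqueeze(-1) + 1e-8        # (B, C, 1)
        return (num / den).flatten(start_dim=1)              # (B, C * D)

# Toy usage: pool 200 frames of 256-dim features from 4 utterances.
pool = LDEPooling(feat_dim=256, num_components=32)
frames = torch.randn(4, 200, 256)
embedding = pool(frames)   # shape: (4, 32 * 256)

In a multi-speaker TTS system such as those in the papers listed under Research Products, an utterance-level embedding of this kind conditions the synthesizer on the characteristics of the target speaker (or, analogously, the target dialect).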

Academic Significance and Societal Importance of the Research Achievements

In this project, we investigated methods of factorizing the encoder to improve the reproduction of speaker identity and dialect in end-to-end speech synthesis. By reproducing a speaker's individuality and characteristics more faithfully, a larger number of target speakers can be used in speech synthesis systems, which is expected to broaden the range of applications of the technology.

Report

(3 results)
  • 2020 Annual Research Report
  • Final Research Report (PDF)
  • 2019 Research-status Report
  • Research Products

    (9 results)


Int'l Joint Research (2 results) / Journal Article (3 results; of which Int'l Joint Research: 3, Peer Reviewed: 3, Open Access: 3) / Remarks (4 results)

  • [Int'l Joint Research] Massachusetts Institute of Technology (United States)

    • Related Report
      2020 Annual Research Report
  • [Int'l Joint Research] Massachusetts Institute of Technology (United States)

    • Related Report
      2019 Research-status Report
  • [Journal Article] Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS? (2020)

    • Author(s)
      Cooper Erica, Lai Cheng-I, Yasuda Yusuke, Yamagishi Junichi
    • Journal Title

      Proc. Interspeech 2020

      Volume: 2020 Pages: 3979-3983

    • DOI

      10.21437/interspeech.2020-1229

    • Related Report
      2020 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Modeling of rakugo speech and its limitations: toward speech synthesis that entertains audiences (2020)

    • Author(s)
      Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, and Junichi Yamagishi
    • Journal Title

      IEEE Access

      Volume: 8 Pages: 138149-138161

    • DOI

      10.1109/access.2020.3011975

    • Related Report
      2020 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings (2020)

    • Author(s)
      Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi
    • Journal Title

      ICASSP 2020

      Volume: - Pages: 6184-6188

    • DOI

      10.1109/icassp40776.2020.9054535

    • Related Report
      2019 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Remarks] Public code for multi-speaker Tacotron

    • URL

      https://github.com/nii-yamagishilab/multi-speaker-tacotron

    • Related Report
      2020 Annual Research Report
  • [Remarks] Audio sample page for Interspeech 2020 paper

    • URL

      https://nii-yamagishilab.github.io/samples-multi-speaker-tacotron/augment.html

    • Related Report
      2020 Annual Research Report
  • [Remarks] Multi-speaker Tacotron Code

    • URL

      https://github.com/nii-yamagishilab/multi-speaker-tacotron

    • Related Report
      2019 Research-status Report
  • [Remarks] Audio Samples for Multi-Speaker Tacotron

    • URL

      https://nii-yamagishilab.github.io/samples-multi-speaker-tacotron/

    • Related Report
      2019 Research-status Report

Published: 2019-09-03   Modified: 2022-01-27  
