Project/Area Number |
15300055
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Perception information processing/Intelligent robotics
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
KOBAYASHI Takao Tokyo Institute of Technology, Interdisciplinary Graduate School of Science and Engineering, Professor (70153616)
|
Co-Investigator (Kenkyū-buntansha) |
MASUKO Takashi Tokyo Institute of Technology, Interdisciplinary Graduate School of Science and Engineering, Research Associate (90272715)
SUMITA Kazuo Toshiba Corporation, Corporate Research & Development Center, Knowledge Media Laboratory, Laboratory Leader (research position)
|
Project Period (FY) |
2003 – 2005
|
Project Status |
Completed (Fiscal Year 2005)
|
Budget Amount |
¥9,000,000 (Direct Cost: ¥9,000,000)
Fiscal Year 2005: ¥2,500,000 (Direct Cost: ¥2,500,000)
Fiscal Year 2004: ¥3,300,000 (Direct Cost: ¥3,300,000)
Fiscal Year 2003: ¥3,200,000 (Direct Cost: ¥3,200,000)
|
Keywords | text-to-speech synthesis / hidden Markov model (HMM) / average voice / speaker adaptation / style interpolation / style adaptation / style control / hidden semi-Markov model (HSMM) / emotional speech and speaking style / HMM-based speech synthesis / speech synthesis / speaking style / emotional speech |
Research Abstract |
The purpose of this research is to realize text-to-speech synthesis that can generate speech in an arbitrarily given speaker's voice with diverse speaking styles and/or emotional expressions. We have obtained the following results.
1. Speech synthesis with an arbitrary speaker's voice based on an average voice model. We have proposed a new method for training the average voice model used in speech synthesis, in which an arbitrary speaker's voice is generated by speaker adaptation. We have also proposed new speaker adaptation techniques based on the hidden semi-Markov model (HSMM), which can model phone duration more precisely than the conventional hidden Markov model (HMM). Objective and subjective evaluation tests have shown that average-voice-model-based speech synthesis can generate natural-sounding speech of the target speaker.
2. Speech synthesis with various speaking styles and emotional expressions. We have proposed several approaches to realizing emotional expressivity and speaking style variability in text-to-speech synthesis. We investigated two methods for modeling speaking styles and/or emotional expressions within an HMM-based speech synthesis framework, and then proposed approaches for adding various styles to synthetic speech: style interpolation, style morphing, style adaptation, and style control. Subjective experiments have shown the effectiveness of the proposed approaches.
3. Prosody. We have developed a robust technique for fundamental frequency estimation and voiced/unvoiced determination based on the instantaneous frequency amplitude spectrum. We have also proposed modeling techniques for phone duration and pauses for high-quality text-to-speech synthesis.
|