Synthesis of speech in any speaking styles based on corpus-based generation of prosodic features using the generation process model
Project/Area Number |
17300055
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Perception information processing/Intelligent robotics
|
Research Institution | The University of Tokyo |
Principal Investigator |
HIROSE Keikichi The University of Tokyo, Graduate School of Information Science and Technology, Professor (50111472)
|
Co-Investigator(Kenkyū-buntansha) |
MINEMATSU Nobuaki The University of Tokyo, Graduate School of Frontier Sciences, Associate Professor (90273333)
|
Project Period (FY) |
2005 – 2007
|
Project Status |
Completed (Fiscal Year 2007)
|
Budget Amount |
¥16,860,000 (Direct Cost: ¥15,300,000, Indirect Cost: ¥1,560,000)
Fiscal Year 2007: ¥6,760,000 (Direct Cost: ¥5,200,000, Indirect Cost: ¥1,560,000)
Fiscal Year 2006: ¥5,100,000 (Direct Cost: ¥5,100,000)
Fiscal Year 2005: ¥5,000,000 (Direct Cost: ¥5,000,000)
|
Keywords | Generation process model / Fundamental frequency contour / Corpus-based method / Prosodic control / Speaking style / HMM speech synthesis / Focus control / Spoken dialogue system / Corpus-based prosodic control / Utterance focus / Two-step processing / Degree of emotion / Statistical method / Two-step method / Speech corpus / Various speaking tones / Automatic estimation / Emotion / Accent attributes |
Research Abstract |
Research was conducted to establish a corpus-based speech synthesis method that is based on the generation process model of fundamental frequency contours and can generate high-quality speech in any speaking style. The original research plan was fulfilled, with the following results:
1. A method was developed to predict the command parameters of the generation process model using binary decision trees, with inputs such as the linguistic information obtained by parsing the text, and thus to synthesize fundamental frequency contours. An integrated method of prosodic control was realized by combining this method with other binary-decision-tree methods that predict pause positions, pause lengths, and phoneme durations. The validity of the method was shown through experiments on synthesizing speech in various styles, including emotional speech. A method was also developed to automatically extract the command parameters from observed fundamental frequency contours using binary decision trees. It was shown that extraction accuracy increased when linguistic information about the text was included in the inputs to the trees.
2. Binary decision trees were constructed to predict how the phrase and accent commands of utterances with a specific focus deviate from those of utterances without one. Their inputs are the accent types and sentence positions of the focused words, and the command values of the corresponding parts of the utterances without a specific focus. Appropriate focus control was realized by modifying the phrase and accent commands predicted by the method in item 1 according to the predicted deviations.
3. A two-step method was developed for generating fundamental frequency contours of Standard Chinese. It first generates phrase components in a corpus-based way, and then generates tone components in a corpus-based way. The method is highly flexible in synthesizing fundamental frequency contours. As an example of flexible control, it was shown that proper focus control could be realized with a simple set of rules.
4. Speech synthesis systems were constructed for Japanese and Chinese by integrating the methods developed in items 1 and 2 above with HMM speech synthesis. It was shown that our system produces synthetic speech with higher naturalness than a "full" HMM synthesizer, in which prosodic control is also done within the HMM framework. It was also shown that our system can produce synthetic speech in various styles.
5. Spoken dialogue systems for road guidance and TV program guidance were constructed using the above speech synthesis systems. The validity of the developed speech synthesis method was demonstrated through experiments on controlling the speaking style of reply speech depending on the user's characteristics and situation.
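The generation process model named in the abstract represents the logarithm of the fundamental frequency as a baseline plus phrase components and accent components, each driven by phrase and accent commands. The following is a minimal Python sketch of that superposition, assuming the commonly published formulation of the model. The constants (alpha = 3.0, beta = 20.0, gamma = 0.9), the function names, and the example command values are illustrative assumptions and are not taken from this project's reports.

```python
import math

# Assumed, commonly quoted constants; not values reported by this project.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Return F0 values (Hz) at the given times (s).

    phrase_cmds: list of (onset_time, magnitude) impulses.
    accent_cmds: list of (onset_time, offset_time, amplitude) pedestals.
    """
    contour = []
    for t in times:
        # Components are summed in the log-frequency domain.
        log_f0 = math.log(base_f0)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # Hypothetical commands: one phrase command and one accent command.
    times = [i * 0.01 for i in range(200)]
    f0 = f0_contour(times, 100.0, [(0.0, 0.5)], [(0.3, 0.7, 0.4)])
    print(f"F0 ranges from {min(f0):.1f} Hz to {max(f0):.1f} Hz")
```

In the method described in result 1, the command lists passed to `f0_contour` would be predicted by the binary decision trees rather than written by hand; the sketch covers only the contour generation step.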
|
Report
(4 results)
Research Products
(57 results)