2002 Fiscal Year Final Research Report Summary
High-quality Speech Synthesis based on Accurate Analysis Method and Statistical Method
Project/Area Number |
12480079
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | The University of Tokyo |
Principal Investigator |
HIROSE Keikichi Graduate School of Frontier Sciences, Professor (50111472)
|
Co-Investigator(Kenkyū-buntansha) |
MINEMATSU Nobuaki Graduate School of Information Science and Technology, Associate Professor (90273333)
|
Project Period (FY) |
2000 – 2002
|
Keywords | Statistical Speech Synthesis / Terminal Analogue Synthesis / Waveform Concatenative Synthesis / HMM Speech Synthesis / AR-HMM Model / Fundamental Frequency Contour / Generation Process Model / Emotional Speech Synthesis |
Research Abstract |
The original research plan, which aimed at realizing high-quality speech synthesis by using accurate pole-zero information of the vocal tract transfer function for segmental feature generation and by applying functional model constraints for prosodic feature generation, was accomplished with the following results:
1. Successive approximation was applied to ARX analysis, enabling accurate pole-zero estimation. The method was combined with our previously developed terminal analogue synthesizer to construct an analysis-synthesis workbench. Using this workbench, we succeeded in improving the quality of liquid sounds.
2. A hybrid speech synthesizer combining terminal analogue synthesis and waveform concatenation was developed, realizing high-quality speech synthesis.
3. A method for stable formant extraction was developed, based on AR-HMM modeling in which the source waveform is represented by an HMM. Speech synthesis experiments showed that the method could generate high-quality speech even for large changes in F0 (fundamental frequency).
4. By adding the natural waveform of junction periods to the concatenated speech in the spectral domain with appropriate weighting, we realized smooth spectral transitions. We also developed a method to effectively reduce the corpus size for concatenative synthesis by means of VQ weighted according to frequency.
5. After developing an HMM speech synthesizer, the data size necessary for speaker adaptation was investigated from the viewpoint of speech quality. Good quality was shown to be obtainable with 10 or more sentences.
6. F0 contour generation was realized by estimating the parameters of the generation process model using statistical methods. High speech quality was obtained from only a small speech corpus by using linguistic information such as direct modification relations between words. We also succeeded in estimating accent phrase boundaries from text using the same statistical framework. Furthermore, F0 contour generation and phoneme length estimation were realized for emotional speech with good results.
7. A method for automatically estimating the commands of the F0 contour generation process model was realized, and a prosodic corpus was built using it. This corpus is indispensable for the F0 contour generation described in item 6.
8. A rule for controlling mora duration in dialogue-like speech synthesis was constructed. Speech synthesis experiments showed the validity of the rule.
|
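Items 6 and 7 of the abstract refer to a generation process model for F0 contours driven by "commands", but the summary gives no equations. The sketch below assumes this means the commonly cited form in which log F0 is a baseline plus the responses to phrase commands (impulses) and accent commands (steps). The function names, the command tuples, and the default constants `alpha`, `beta`, and `gamma` are illustrative assumptions, not values taken from this project.

```python
import math

# Assumed model form (not stated in the summary):
#   ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
#                    + sum_j Aa_j * (Ga(t - T1_j) - Ga(t - T2_j))

def phrase_response(t, alpha=3.0):
    """Response to a phrase command (impulse); zero before the command."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Response to an accent command (step), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """Return F0 values in Hz at the given times (seconds).

    phrase_cmds: list of (onset, magnitude) pairs        (hypothetical layout)
    accent_cmds: list of (onset, offset, amplitude) triples (hypothetical layout)
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)  # baseline frequency Fb
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour

if __name__ == "__main__":
    times = [i * 0.05 for i in range(30)]
    for t, f0 in zip(times, f0_contour(times, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)])):
        print(f"{t:.2f} s  {f0:.1f} Hz")
```

Under this assumption, the statistical methods in item 6 would predict the command timings and magnitudes from linguistic information, and the automatic estimation in item 7 would recover them from observed F0 contours to build the prosodic corpus.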