Summary of Research Achievements
Synthesizing speech in a variety of voices has long been a goal of speech research. Current approaches to multi-speaker synthesis achieve high speaker similarity but fail to capture characteristics such as dialect, and they tend to overfit to speakers seen during training. In the past year, we explored dialect modeling to better capture speaker characteristics, as well as data augmentation to improve synthesis for unseen speakers.
There are relatively few speakers in our training data (around 100). We hypothesized that increasing the number of speakers could provide better coverage of the speaker space. We explored two methods of so-called "speaker augmentation": artificial augmentation using vocal tract length perturbation (VTLP), in which the data is resampled so that the resulting signals take on different speaker characteristics, and a "found data" approach, in which we included lower-quality data containing a wide variety of speakers and dialects.
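As a rough illustration of the resampling idea behind this kind of VTLP-style augmentation, the sketch below warps a waveform by a factor alpha and reinterprets it at the original sampling rate, which rescales formants (and pitch) so the result sounds like a different speaker. This is a minimal sketch, not our actual pipeline; the function name, parameter values, and use of librosa/soundfile are illustrative assumptions.

```python
# Minimal sketch of resampling-based speaker perturbation (VTLP-style).
# Assumption: librosa and soundfile are available; names are illustrative.
import librosa
import soundfile as sf

def perturb_speaker(in_path, out_path, alpha=0.9, sr=22050):
    """Warp apparent vocal tract length by a factor alpha (alpha != 1.0)."""
    wav, _ = librosa.load(in_path, sr=sr)
    # Tell the resampler the input is at sr * alpha, then convert to sr:
    # the output is x[alpha * n], so all frequencies scale by alpha and
    # duration scales by 1 / alpha.
    warped = librosa.resample(wav, orig_sr=int(sr * alpha), target_sr=sr)
    # alpha < 1 lowers formants and pitch (longer apparent vocal tract);
    # alpha > 1 raises them.
    sf.write(out_path, warped, sr)

# Each utterance can be warped with several alphas (e.g., 0.9 and 1.1)
# to create additional synthetic "speakers" for training.
```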
Mixing in lower-quality data from worse recording conditions can be expected to degrade synthesis quality, so during training we incorporated channel labels indicating which corpus each training utterance comes from. We also incorporated transfer-learned dialect embeddings to better capture information about speaker dialects. Experimental results from a crowdsourced listening test revealed that using found data covering many English dialects was an effective augmentation method: with our approach, listeners' ratings of perceived dialect for unseen speakers were better matched to those for natural speech.
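One way to realize this kind of conditioning is to learn an embedding per corpus (channel) and per dialect and concatenate them with the speaker embedding before they enter the acoustic model. The PyTorch sketch below is an assumed, simplified version of that idea; the class name, dimensions, and the choice to learn (rather than freeze) the dialect embeddings are illustrative, not a description of our exact model.

```python
# Minimal PyTorch sketch of channel-label and dialect conditioning.
# Assumption: names and dimensions are illustrative, not the real model.
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    def __init__(self, n_channels, n_dialects, emb_dim=16):
        super().__init__()
        # One learned vector per corpus / recording condition.
        self.channel_emb = nn.Embedding(n_channels, emb_dim)
        # Dialect vectors; these could instead be transfer-learned
        # from a dialect classifier and kept frozen.
        self.dialect_emb = nn.Embedding(n_dialects, emb_dim)

    def forward(self, spk_emb, channel_id, dialect_id):
        # Concatenate speaker, channel, and dialect vectors; the result
        # conditions the acoustic model (where it is injected is
        # model-dependent).
        return torch.cat(
            [spk_emb,
             self.channel_emb(channel_id),
             self.dialect_emb(dialect_id)],
            dim=-1)
```

A practical benefit of explicit channel labels is that, at synthesis time, one can select the label of the clean, high-quality corpus so that recording artifacts learned from the found data are not reproduced in the output.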