2021 Fiscal Year Annual Research Report
Augmented speech communication using multi-modal signals with real-time, low-latency voice conversion
Project/Area Number | 21J20920 |
Allocation Type | Single-year Grants |
Research Institution | Nagoya University |
Principal Investigator | HUANG WENCHIN, Nagoya University, Graduate School of Informatics, JSPS Research Fellow (DC1) |
Project Period (FY) | 2021-04-28 – 2024-03-31 |
Keywords | voice conversion / self-supervised learning / dysarthria / electrolaryngeal speech |
Outline of Annual Research Achievements |
The purpose of this research is to apply voice conversion (VC) to realize an interactive speech production paradigm for real-world applications, with the help of multimodal signals and real-time processing techniques. In the first year, the applicant focused on three aspects. (1) Continued improvement of fundamental VC techniques, specifically self-supervised speech representation (S3R)-based VC, an emerging approach that reduces training data requirements. The applicant released S3PRL-VC, an open-source toolkit that allows researchers to evaluate S3R models for VC. This work was carried out in collaboration with research institutes including Carnegie Mellon University and National Taiwan University, and the results were published at ICASSP 2021 and 2022, a top conference in signal processing. (2) Medical applications of VC, specifically dysarthric VC, a task that helps patients with dysarthria speak and communicate normally again. Thanks to the collaboration with Academia Sinica, Taiwan, data collection went smoothly, and the results were published at INTERSPEECH 2021, a top conference in speech processing. (3) Initial investigations into applying multimodal signals to VC, specifically electrolaryngeal (EL) VC, a task that aims to make robotic-sounding EL speech more natural. Again, thanks to the collaboration with Academia Sinica, a new dataset containing both the visual and audio signals of patients and unimpaired speakers was recorded, and incorporating the lip video improved EL VC performance. The results were published at APSIPA ASC, a conference in signal processing.
|
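To illustrate the S3R-based VC approach mentioned in the outline above, the following is a minimal sketch, not the released S3PRL-VC code: a pretrained self-supervised upstream extracts frame-level speech representations, and a downstream model would map them to the target speaker's acoustic features. The torchaudio wav2vec 2.0 bundle stands in for an S3PRL upstream, and the file name, vc_decoder, and vocoder below are illustrative assumptions.

```python
# Minimal sketch (assumption-based illustration, not the S3PRL-VC toolkit itself)
# of extracting self-supervised speech representations (S3Rs) as VC model inputs.
import torch
import torchaudio

# torchaudio's wav2vec 2.0 bundle stands in for an S3PRL upstream model.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
upstream = bundle.get_model().eval()

# "source_speaker.wav" is a hypothetical source utterance.
waveform, sr = torchaudio.load("source_speaker.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns per-layer features, each of shape (batch, frames, dim).
    features, _ = upstream.extract_features(waveform)

# Use the last layer here for simplicity; S3PRL-VC instead learns a weighted
# sum over all layers before feeding the representation to its downstream model.
s3r_input = features[-1]

# A downstream VC decoder and vocoder (hypothetical names) would complete the pipeline:
# mel = vc_decoder(s3r_input)   # predict target-speaker acoustic features
# wav = vocoder(mel)            # synthesize the converted waveform
print(s3r_input.shape)
```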
Current Status of Research Progress |
1: Research has progressed more than it was originally planned.
Reason
The original plan for the first year was to record a multimodal VC dataset and verify it with initial experiments. This year, thanks to the collaboration with Academia Sinica, Taiwan, an audio-visual dataset containing speech samples and facial video recorded with and without the electrolarynx was successfully collected. In addition, to bring augmented communication to more patients, a new collaboration with TU Delft, Netherlands, was started; they helped connect the applicant with speech pathology specialists to evaluate the applicant's system, and the collection of speech from oral cancer patients is also ongoing. Furthermore, the applicant's efforts in improving fundamental VC techniques will serve as the foundation for future research.
|
Strategy for Future Research Activity |
The original plan for the second year was to start collecting new multimodal datasets, with a special focus on human body signals such as hand gestures. However, due to COVID-19, the applicant has not been able to return to Japan, so further data collection will be difficult. The other objective of the project, real-time, low-latency modeling, is also important for meeting practical application needs, but it is likewise hard to accomplish while the applicant is not in Japan. Therefore, the applicant plans to focus on the following. (1) Continuing to improve fundamental VC techniques. Real-world applications such as medical devices often have far less data available than standard VC tasks, so the applicant needs to tackle situations where data from the target patient is even scarcer. The effectiveness of such methods can be verified on the datasets already collected. (2) Addressing problems that exist only in medical applications. Specifically, to develop a VC-based speaking-aid device, the patient's normal, natural voice is needed as the conversion target, which is impossible to collect. The applicant previously proposed a method to preserve the speaker identity of the patient, but the performance was not satisfying, and further improvement of this aspect will be pursued.
|