Project/Area Number |
21K17775
|
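Research Institution | National Institute of Informatics |
Principal Investigator |
Wang Xin, National Institute of Informatics, Digital Content and Media Sciences Research Division, Project Assistant Professor (60843141)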
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Keywords | speech privacy / speaker anonymization / speech waveform modeling / neural network / deep learning |
Outline of Research Achievements |
The first year's work on speaker anonymization comprised three parts. Part 1) Following the research plan, the flow-based invertible anonymization system was implemented, and experiments were conducted on the VoicePrivacy 2020 evaluation platforms. As expected, anonymized speech can be de-anonymized (i.e., inverted back to the original waveform), and the de-anonymized waveforms were recognized by the speaker verification system with accuracy similar to that of the original waveforms. The word error rate was also similar. However, the anonymized speech still contained speaker information and performed worse than the baseline. Furthermore, the quality of the anonymized speech was degraded. Thus, the first version of the flow-based anonymization system needs improvement.
Part 2) Although not included in the research plan, I contributed to the VoicePrivacy 2022 Challenge and built new baseline speaker anonymization models. These models differ from the flow-based model above: they combine the neural waveform model (KAKENHI 19K24371) with the latest generative adversarial network (GAN)-based approach to speech modeling. The baseline models are freely available (see https://www.voiceprivacychallenge.org).
Part 3) A new language-independent speaker anonymization system was proposed, and the paper was accepted to the Odyssey 2022 workshop. Although this system is not designed to be reversible, its advantage is that, unlike the systems built in Part 2, it does not require a language-dependent speech recognizer. It can therefore be applied directly to other languages such as Mandarin.
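The invertibility property described in Part 1 can be illustrated with a toy affine coupling layer, a common building block of flow-based models. This is only a minimal sketch under stated assumptions, not the project's actual system: the function names and the fixed scale/shift function `_scale_and_shift` are hypothetical (a real flow would use a trained neural network there), and the input is a short list of numbers rather than a speech waveform.

```python
import math


def _scale_and_shift(x_a):
    # Hypothetical conditioning function (an assumption for illustration).
    # A real flow-based model would compute these with a neural network.
    log_scale = [math.tanh(0.5 * v) for v in x_a]
    shift = [0.1 * v for v in x_a]
    return log_scale, shift


def coupling_forward(x):
    """Keep the first half of x unchanged; affinely transform the second half."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = _scale_and_shift(x_a)
    y_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_scale, shift)]
    return x_a + y_b


def coupling_inverse(y):
    """Undo coupling_forward exactly, using the untouched first half."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = _scale_and_shift(y_a)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 0.05]
    y = coupling_forward(x)
    x_rec = coupling_inverse(y)
    print("transformed:", y)
    print("max reconstruction error:", max(abs(a - b) for a, b in zip(x, x_rec)))
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift and recover the input up to floating-point error. The sketch shows only why such a transform is reversible; it says nothing about how well speaker information is hidden.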
|
Current Status of Research Progress (Category) |
2: Progressing mostly smoothly
Reason
As planned for the 1st year, the flow-based invertible anonymization model was implemented and tested. An input waveform can be anonymized and then de-anonymized. The de-anonymized waveform retains the original speaker information and has high quality (as indicated by a low word error rate). Thus, the goal of invertibility was partially achieved. However, the anonymized speech has degraded quality and still contains much speaker information. In short, while the de-anonymization performance is satisfactory, the anonymization performance is limited.
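For reference, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the example sentences are made up for illustration, and this is not the scoring tool used in the VoicePrivacy evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative sentences only; not data from the project.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In an evaluation like the one described above, the hypothesis would be the transcript that a speech recognizer produces for the anonymized or de-anonymized audio, so a lower value suggests that the speech content remained intelligible to the recognizer.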
Most of the effort went into organizing the VoicePrivacy Challenge 2022 (https://www.voiceprivacychallenge.org). Supported by this KAKEN project, new baseline models were built and released on GitHub for free access. Unlike the baseline models of the previous challenge, the new baseline models are based on PyTorch, a popular deep learning framework, which makes them easier for users to understand and modify. Furthermore, the new baselines incorporate advanced generative adversarial network (GAN)-based neural vocoders, and the perceptual quality of the anonymized audio was improved.
Finally, a new language-independent speaker anonymization system was proposed. It uses a language-independent self-supervised learning (SSL) speech model in place of the language-dependent speech recognizer for speech content extraction. This is a new direction for speaker anonymization. The paper was accepted to the ISCA Speaker Odyssey 2022 workshop.
|
Strategy for Future Research Activity |
The original research plans were: 1) 2nd year: anonymization of accent and other speaker-related information; 2) 3rd year: joint optimization of the speaker anonymization system with the automatic speech recognition (ASR) system, the automatic speaker verification (ASV) system, and other components that recognize speaker-related information.
Based on the findings of the 1st year, we plan to focus on the language-independent anonymization framework in the 2nd year, following the paper accepted to the Odyssey 2022 workshop. This new framework requires no language-dependent components (such as ASR), and it is relatively easy to extend to other speaker attributes such as accent and ethnicity.
The 3rd year's plan was slightly revised because ASR is not necessary for the new language-independent speaker anonymization framework. Instead, the framework uses a self-supervised learning (SSL) speech model to extract speech content from the input waveform. Thus, joint optimization will be conducted on the SSL model and the rest of the anonymization system.
|
Reason for Carrying Funds Over to the Next Fiscal Year |
The budget for purchasing a GPU card was not spent because of the global semiconductor shortage. We plan to purchase this hardware, or other CPU/GPU servers, in the next fiscal year if possible.
The budget for traveling to international conferences was not spent because of the pandemic. We plan to attend international conferences in person from September 2022, provided the situation improves.
|