Phantom in the Opera: the Vulnerabilities of Speech Interface for Robotic Dialogue System

Research Project

Project/Area Number	21K17837
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 61050:Intelligent robotics-related
Research Institution	National Institute of Information and Communications Technology
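Principal Investigator	Li Sheng, National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Researcher (70840940)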
Project Period (FY)	2021-04-01 – 2023-03-31
Project Status	Completed (Fiscal Year 2022)
Budget Amount	¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000); Fiscal Year 2022: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2021: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Keywords	speech recognition / adversarial attack / privacy preserving / deepfake detection / spoken dialogue / federated learning / security / quality estimation / spoken dialogue system / speech enhancement / Dialogue robotic system / Deep neural network
Outline of Research at the Start	As the most natural human-machine interface, the automatic speech recognition (ASR) module plays a crucial role in recent robot dialogue systems. However, deep neural networks (DNNs) are known to be vulnerable to adversarial examples (or attacks), which is a severe problem. This study investigates in depth the robustness of the ASR modules of a robot dialogue system.
Outline of Final Research Achievements	In this project, we carefully studied the principles of speech recognition systems and examined the possible attacks against them in detail. We summarized our findings in a review and proposed methods for improving the front-end and back-end of speech recognition systems. We then broadened our research scope: similar attacks can affect speech-related systems in general, not just speech recognition systems. We also treated adversarial attacks as a particular kind of noise; from this viewpoint, combining traditional speech enhancement, modeling, and post-processing methods during system development can sufficiently deal with such attacks. Our results were accepted by top journals and conferences in the speech field, such as Interspeech and ICASSP. The two years of research results have been compiled into two books (ISBN: 978-4-904020-26-5, ISBN: 978-4-904020-28-9) published by NICT and held by the national library in Kansai. These efforts are our contribution to ensuring the security and reliability of AI systems.
Academic Significance and Societal Importance of the Research Achievements	Deep neural networks have developed rapidly, and speech recognition systems have evolved just as quickly. In light of increasingly severe security issues, this study aims to give researchers ideas for improving system security.

Report

(3 results)

2022 Annual Research Report / Final Research Report (PDF)
2021 Research-status Report

Research Products
(40 results)

All Int'l Joint Research (2 results) Journal Article (4 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 4 results, Open Access: 4 results) Presentation (28 results) (of which Int'l Joint Research: 28 results) Book (2 results) Remarks (4 results)

[Int'l Joint Research] Tianjin University/Xinjiang University/Royal Flush AI Research Inc. (China)
- Related Report
  2021 Research-status Report
[Int'l Joint Research] Nanyang Technological University (Singapore)
- Related Report
  2021 Research-status Report
[Journal Article] Cross-Lingual Transfer Learning for End-to-End Speech Translation (2022)
- Author(s)
  Shimizu Shuichiro, Chu Chenhui, Li Sheng, Kurohashi Sadao
- Journal Title
  Journal of Natural Language Processing
  Volume: 29 Issue: 2 Pages: 611-637
- DOI
  10.5715/jnlp.29.611
- ISSN
  1340-7619, 2185-8314
- Related Report
  2022 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies (2022)
- Author(s)
  Soky Kak, Mimura Masato, Kawahara Tatsuya, Chu Chenhui, Li Sheng, Ding Chenchen, Sam Sethserey
- Journal Title
  International Journal of Asian Language Processing
  Volume: 31 Issue: 03n04 Pages: 1-21
- DOI
  10.1142/s2717554522500072
- Related Report
  2022 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling (2022)
- Author(s)
  Qin Siqing, Wang Longbiao, Li Sheng, Dang Jianwu, Pan Lixin
- Journal Title
  EURASIP Journal on Audio, Speech, and Music Processing
  Volume: 2022 Issue: 1 Pages: 1-10
- DOI
  10.1186/s13636-021-00233-4
- Related Report
  2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview (2021)
- Author(s)
  Chen Xiaojiao, Li Sheng, Huang Hao
- Journal Title
  Applied Sciences
  Volume: 11 Issue: 18 Pages: 8450-8450
- DOI
  10.3390/app11188450
- Related Report
  2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] General or Specific? Investigating Effective Privacy Protection in Federated Learning for Speech Emotion Recognition (2023)
- Author(s)
  Chao Tan, Yang Cao, Sheng Li and Masatoshi Yoshikawa
- Organizer
  ICASSP
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language (2023)
- Author(s)
  Kak Soky, Sheng Li, Chenhui Chu, Tatsuya Kawahara
- Organizer
  ICASSP
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Relationship Between Speakers' Physiological Structure and Acoustic Speech Signals: Data-Driven Study Based on Frequency-Wise Attentional Neural Network (2022)
- Author(s)
  Kai Li, Xugang Lu, Masato Akagi, Jianwu Dang, Sheng Li, Masashi Unoki
- Organizer
  30th European Signal Processing Conference (EUSIPCO)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism (2022)
- Author(s)
  Kak Soky, Sheng Li, Masato Mimura, Chenhui Chu, Tatsuya Kawahara
- Organizer
  INTERSPEECH 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection (2022)
- Author(s)
  Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki
- Organizer
  INTERSPEECH 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection (2022)
- Author(s)
  Kai Li, Sheng Li, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, Masashi Unoki
- Organizer
  INTERSPEECH 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Fusion of Self-supervised Learned Models for MOS Prediction (2022)
- Author(s)
  Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao
- Organizer
  INTERSPEECH 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction (2022)
- Author(s)
  Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara
- Organizer
  INTERSPEECH 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Multi-Domain Dialogue State Tracking with Top-k Slot Self Attention (2022)
- Author(s)
  Longfei Yang, Jiyi Li, Sheng Li, Takahiro Shinozaki
- Organizer
  SIGdial Meeting Discourse & Dialogue 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] NICT-Tib1: A Public Speech Corpus of Lhasa Dialect for Benchmarking Tibetan Language Speech Recognition Systems (2022)
- Author(s)
  Kak Soky, Zhuo Gong, Sheng Li
- Organizer
  25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Subband-based Spectrogram Fusion for Speech Enhancement by Combining Mapping and Masking Approaches (2022)
- Author(s)
  Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara
- Organizer
  Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Can We Train a Language Model Inside an End-to-End ASR Model? - Investigating Effective Implicit Language Modeling (2022)
- Author(s)
  Zhuo Gong, Saito Daisuke, Sheng Li, Hisashi Kawai, Minematsu Nobuaki
- Organizer
  Proceedings of the Second Workshop on When Creative AI Meets Conversational AI
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Self-Adaptive Multilingual ASR Rescoring with Language Identification and Unified Language Model (2022)
- Author(s)
  Z. Gong, D. Saito, L. Yang, T. Shinozaki, S. Li, H. Kawai and N. Minematsu
- Organizer
  ISCA-Odyssey (The Speaker and Language Recognition Workshop)
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Adversarial Speech Generation and Natural Speech Recovery for Speech Content Protection (2022)
- Author(s)
  S. Li, J. Li, Q. Liu and Z. Gong
- Organizer
  LREC (Language Resources and Evaluation Conference)
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Compressing Transformer-based ASR Model by Task-driven Loss and Attention-based Multi-level Feature Distillation (2022)
- Author(s)
  Y. Lv, L. Wang, M. Ge, S. Li, C. Ding, L. Pan, Y. Wang, J. Dang, K. Honda
- Organizer
  in Proc. IEEE-ICASSP, pp. 7992-7996, 2022.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Mining Hard Samples Locally and Globally for Improved Speech Separation (2022)
- Author(s)
  K. Wang, Y. Peng, H. Huang, Y. Hu, and S. Li
- Organizer
  in Proc. IEEE-ICASSP, pp. 6037-6041, 2022.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] The System Description for VoiceMOS Challenge 2022 (KK team, main/ood tasks) (2022)
- Author(s)
  S. Li, R. Dabre, R. Raphael, W. Zhou, Z. Yang, C. Chu, Y. Zhao
- Organizer
  VoiceMOS Challenge 2022
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Spectrograms Fusion-based End-to-End Robust Automatic Speech Recognition (2021)
- Author(s)
  H. Shi, L. Wang, S. Li, C. Fan, J. Dang, and T. Kawahara
- Organizer
  In Proc. APSIPA ASC, pp. 438-442, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework (2021)
- Author(s)
  Y. Peng, J. Zhang, H. Zhang, H. Xu, H. Huang, S. Li, and E.S. Chng
- Organizer
  In Proc. APSIPA ASC, pp. 1043-1048, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] On the Use of Speaker Information for Automatic Speech Recognition in Speaker-imbalanced Corpora (2021)
- Author(s)
  K. Soky, S. Li, M. Mimura, C. Chu, and T. Kawahara
- Organizer
  In Proc. APSIPA ASC, pp. 433-437, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model (2021)
- Author(s)
  D. Wang, S. Ye, X. Hu, S. Li, and X. Xu
- Organizer
  in Proc. INTERSPEECH, pp. 3266-3270, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain (2021)
- Author(s)
  K. Wang, H. Huang, Y. Hu, Z. Huang, and S. Li
- Organizer
  in Proc. INTERSPEECH, pp. 3046-3050, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] The RoyalFlush-NICT System Description for AP21-OLR Challenge (Silk-road team, full tasks) (2021)
- Author(s)
  D. Wang, S. Ye, X. Hu, S. Li
- Organizer
  OLR2021 (oriental language recognition challenge)
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] System description of Alzheimer's disease early detection (Silk-road team, short speech track) (2021)
- Author(s)
  W. Wei, R. Wong, S. Li, Y. Guo and H. Huang
- Organizer
  In special session of NCMMSC2021 (Alzheimer's disease detection challenge), 2021
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Adversarial Attack and Defense on Deep Neural Network-based Voice Processing Systems: An Overview (2021)
- Author(s)
  X. Chen, H. Huang, and S. Li
- Organizer
  National Conference on Man-Machine Speech Communication (NCMMSC), 2021. (report is selected to publish in Applied Sciences, Special Issues of Machine Speech Communication)
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Speech Dereverberation Based on Scale-aware Mean Square Error Loss (2021)
- Author(s)
  L. Qiang, H. Shi, M. Ge, H. Yin, N. Li, L. Wang, S. Li and J. Dang
- Organizer
  International Conference on Neural Information Processing (ICONIP2021), pp 55-63, Springer, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Simultaneous Progressive Filtering-based Monaural Speech Enhancement (2021)
- Author(s)
  H. Yin, L. Qiang, H. Shi, L. Wang, S. Li, M. Ge, G. Zhang and J. Dang
- Organizer
  International Conference on Neural Information Processing (ICONIP2021), pp 213-221, Springer, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Exploring Effective Speech Representation via ASR for High-Quality End-to-End Multispeaker TTS (2021)
- Author(s)
  D. Liu, L. Wang, S. Li, H. Li, C. Ding, J. Zhang and J. Dang
- Organizer
  International Conference on Neural Information Processing (ICONIP2021), pp 110-118, Springer, 2021.
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Book] Voices of the Himalayas: Investigation of Speech Recognition Technology for the Tibetan Language (2022)
- Author(s)
  Sheng Li
- Total Pages
  112
- Publisher
  NICT
- ISBN
  9784904020289
- Related Report
  2022 Annual Research Report
[Book] Phantom in the Opera: The Vulnerabilities of Speech-based Artificial Intelligence Systems (2022)
- Author(s)
  Sheng Li
- Total Pages
  110
- Publisher
  NICT
- ISBN
  9784904020265
- Related Report
  2022 Annual Research Report
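[Remarks] Papers published each year as research outcomes of the National Institute of Information and Communications Technology, listed in date order.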
- URL
  https://www.nict.go.jp/outcome/journals/journals_2021_j.html
- Related Report
  2021 Research-status Report
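[Remarks] Papers published each year as research outcomes of the National Institute of Information and Communications Technology, listed in date order.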
- URL
  https://www.nict.go.jp/outcome/proceedings/proceedings_2021_j.html
- Related Report
  2021 Research-status Report
[Remarks] Google Scholar profile of Sheng Li
- URL
  https://scholar.google.com/citations?user=zHAhs0IAAAAJ&hl=en
- Related Report
  2021 Research-status Report
[Remarks] Lab homepage of Sheng Li
- URL
  https://ast-astrec.nict.go.jp/member/sheng-li/index.html
- Related Report
  2021 Research-status Report
