
End-to-End Model for Task-Independent Speech Understanding and Dialogue

Research Project

Project/Area Number 20H00602
Research Category

Grant-in-Aid for Scientific Research (A)

Allocation Type Single-year Grants
Section General
Review Section Medium-sized Section 61: Human informatics and related fields
Research Institution Kyoto University

Principal Investigator

Kawahara Tatsuya  Kyoto University, Graduate School of Informatics, Professor (00234104)

Co-Investigator(Kenkyū-buntansha) Inoue Koji  Kyoto University, Graduate School of Informatics, Assistant Professor (10838684)
Yoshii Kazuyoshi  Kyoto University, Graduate School of Informatics, Associate Professor (20510001)
Project Period (FY) 2020-04-01 – 2024-03-31
Project Status Completed (Fiscal Year 2023)
Budget Amount
¥44,720,000 (Direct Cost: ¥34,400,000, Indirect Cost: ¥10,320,000)
Fiscal Year 2023: ¥9,620,000 (Direct Cost: ¥7,400,000, Indirect Cost: ¥2,220,000)
Fiscal Year 2022: ¥12,220,000 (Direct Cost: ¥9,400,000, Indirect Cost: ¥2,820,000)
Fiscal Year 2021: ¥12,220,000 (Direct Cost: ¥9,400,000, Indirect Cost: ¥2,820,000)
Fiscal Year 2020: ¥10,660,000 (Direct Cost: ¥8,200,000, Indirect Cost: ¥2,460,000)
Keywords Speech understanding / Spoken dialogue / Speech recognition / End-to-end model
Outline of Research at the Start

We study models for understanding and responding to the other party's intentions, concepts, and emotions in spoken communication of the kind humans conduct with each other. We build end-to-end models of two subsystems: one that performs understanding and backchannel generation directly from speech, and one that generates responses using appropriate knowledge and models according to the understanding results. This realizes a system that interacts while accounting for the effects of speech recognition errors and for the nuance and emotion conveyed in speech. Targeting attentive listening, counseling, and job interviews, we carry out this modeling and implement the dialogue system on a robot. Through this, we aim to elucidate and realize human communication skills.

Outline of Final Research Achievements

For general-purpose speech understanding and dialogue based on end-to-end models, we conducted studies from the perspectives of advanced speech recognition and dialogue generation. First, we designed and implemented an end-to-end system that recognizes dialogue acts and emotions directly from speech. Next, we proposed an effective training method for speech recognition of low-resource languages by integrating speaker, language, and domain recognition. We also built a model that generates punctuated, cleaned text directly from speech. Furthermore, we studied how to integrate emotion recognition with speech and gender recognition for effective training. With regard to dialogue generation, where end-to-end models represented by large language models have become mainstream, we proposed a mechanism that reasons about the user's and the system's intentions and emotions before response generation.
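The multi-task design mentioned above (recognizing dialogue acts and emotions directly from speech through one model) can be sketched as a shared encoder feeding separate classification heads. The sketch below is purely illustrative: the mean-pooling "encoder", random weights, and all dimensions are stand-in assumptions, not the project's actual architecture.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random weight matrix, standing in for trained parameters."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

# Toy dimensions: 40-dim acoustic features, 32-dim hidden,
# 5 dialogue-act classes, 4 emotion classes.
n_feat, n_hidden, n_acts, n_emotions = 40, 32, 5, 4
W_enc = rand_matrix(n_hidden, n_feat)   # shared speech encoder
W_act = rand_matrix(n_acts, n_hidden)   # dialogue-act head
W_emo = rand_matrix(n_emotions, n_hidden)  # emotion head

# 120 frames of fake acoustic features; mean-pool into one utterance vector.
frames = [[random.gauss(0, 1) for _ in range(n_feat)] for _ in range(120)]
pooled = [sum(col) / len(frames) for col in zip(*frames)]
h = [math.tanh(x) for x in matvec(W_enc, pooled)]

# Both tasks are predicted from the same encoder output.
act_probs = softmax(matvec(W_act, h))
emo_probs = softmax(matvec(W_emo, h))
print(len(act_probs), len(emo_probs))  # 5 4
```

The point of the shared encoder is that both heads are trained jointly, so representations useful for one task (e.g. prosody for emotion) can benefit the other.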

Academic Significance and Societal Importance of the Research Achievements

Speech recognition has achieved large performance gains by training end-to-end models on large-scale data, but performance on low-resource languages and on emotion recognition is still insufficient. In response, we showed that integrating various speech attributes yields substantial improvements.
In dialogue generation as well, large language models are flourishing, but when implementing them on robots and similar platforms, building and training models of internal states such as intention and emotion is expected to lead to empathetic, symbiotic systems.
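The "reason about internal states before responding" mechanism described above can be caricatured as a two-stage pipeline. Everything below is a placeholder assumption: the keyword-based state inference and template responses stand in for trained models, purely to show the control flow of inferring intention and emotion first and conditioning generation on them.

```python
def infer_states(user_utterance):
    # Stage 1: infer the user's internal states before generating anything.
    # A real system would use a trained classifier or an LLM reasoning step;
    # this keyword rule is a hypothetical stand-in.
    emotion = "sad" if "lost" in user_utterance else "neutral"
    intention = "seek empathy" if emotion == "sad" else "share information"
    return {"user_emotion": emotion, "user_intention": intention}

def generate_response(user_utterance, states):
    # Stage 2: generation is conditioned on the inferred states,
    # here reduced to fixed templates for illustration.
    if states["user_emotion"] == "sad":
        return "I'm sorry to hear that. That must be hard."
    return "I see. Tell me more."

utterance = "I lost my job last week"
states = infer_states(utterance)
print(states["user_emotion"])            # sad
print(generate_response(utterance, states))
```

Making the intermediate states explicit, rather than mapping input to output in one opaque step, is what allows the response to be steered toward empathy.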

Report

(6 results)
  • 2023 Annual Research Report   Final Research Report (PDF)
  • 2022 Annual Research Report
  • 2021 Annual Research Report
  • 2020 Comments on the Screening Results   Annual Research Report
  • Research Products

    (34 results)


Journal Article (5 results) (of which Peer Reviewed: 4 results, Open Access: 4 results)   Presentation (27 results) (of which Int'l Joint Research: 27 results)   Book (2 results)

  • [Journal Article] Automatic speech recognition based on large pretrained models2023

    • Author(s)
Kawahara Tatsuya, Mimura Masato
    • Journal Title

      The Journal of the Acoustical Society of Japan

      Volume: 79 Issue: 9 Pages: 455-460

    • DOI

      10.20697/jasj.79.9_455

    • ISSN
      0369-4232, 2432-2040
    • Year and Date
      2023-09-01
    • Related Report
      2023 Annual Research Report
  • [Journal Article] End-to-End Generation of Written-style Transcript of Speech from Parliamentary Meetings2023

    • Author(s)
      Mimura Masato、Kawahara Tatsuya
    • Journal Title

      Journal of Natural Language Processing

      Volume: 30 Issue: 1 Pages: 88-124

    • DOI

      10.5715/jnlp.30.88

    • ISSN
      1340-7619, 2185-8314
    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies2022

    • Author(s)
      Soky Kak、Mimura Masato、Kawahara Tatsuya、Chu Chenhui、Li Sheng、Ding Chenchen、Sam Sethserey
    • Journal Title

      International Journal of Asian Language Processing

      Volume: 31 Issue: 03n04 Pages: 1-21

    • DOI

      10.1142/s2717554522500072

    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition2021

    • Author(s)
      S.Ueno, M.Mimura, S.Sakai, and T.Kawahara
    • Journal Title

      Acoustical Science and Technology

      Volume: 42 Issue: 6 Pages: 333-343

    • DOI

      10.1250/ast.42.333

    • NAID

      130008110355

    • ISSN
      0369-4232, 1346-3969, 1347-5177
    • Year and Date
      2021-11-01
    • Related Report
      2021 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Alignment knowledge distillation for online streaming attention-based speech recognition2021

    • Author(s)
      H.Inaguma and T.Kawahara
    • Journal Title

      IEEE/ACM Transactions on Audio, Speech, and Language Processing

      Volume: 29 Pages: 1-15

    • DOI

      10.1109/taslp.2021.3133217

    • Related Report
      2021 Annual Research Report
    • Peer Reviewed / Open Access
  • [Presentation] Enhancing two-stage finetuning for speech emotion recognition using adapters.2024

    • Author(s)
      Y.Gao, H.Shi, C.Chu, and T.Kawahara.
    • Organizer
      IEEE-ICASSP
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Reasoning before responding: Integrating commonsense-based causality explanation for empathetic response generation.2023

    • Author(s)
      Y.Fu, K.Inoue, C.Chu, and T.Kawahara.
    • Organizer
      SIGDIAL
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with ASR and gender pretraining.2023

    • Author(s)
      Y.Gao, C.Chu, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Embedding articulatory constraints for low-resource speech recognition based on large pre-trained model.2023

    • Author(s)
      J.Lee, M.Mimura, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Time-domain speech enhancement assisted by multi-resolution frequency encoder and decoder.2023

    • Author(s)
      H.Shi, M.Mimura, L.Wang, J.Dang, and T.Kawahara.
    • Organizer
      IEEE-ICASSP
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Domain and language adaptation using heterogeneous datasets for wav2vec2.0-based speech recognition of low-resource language.2023

    • Author(s)
      K.Soky, S.Li, C.Chu, and T.Kawahara.
    • Organizer
      IEEE-ICASSP
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Fusing multiple bandwidth spectrograms for improving speech enhancement.2022

    • Author(s)
      H.Shi, Y.Shu, L.Wang, J.Dang, and T.Kawahara.
    • Organizer
      APSIPA ASC
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Subband-based spectrogram fusion for speech enhancement by combining mapping and masking approaches.2022

    • Author(s)
      H.Shi, L.Wang, S.Li, J.Dang, and T.Kawahara.
    • Organizer
      APSIPA ASC
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Non-autoregressive error correction for CTC-based ASR with phone-conditioned masked LM.2022

    • Author(s)
      H.Futami, H.Inaguma, S.Ueno, M.Mimura, S.Sakai, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] End-to-end speech-to-punctuated-text recognition.2022

    • Author(s)
      J.Nozaki, T.Kawahara, K.Ishizuka, and T.Hashimoto.
    • Organizer
      INTERSPEECH
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Leveraging simultaneous translation for enhancing transcription of low-resource language via cross attention mechanism.2022

    • Author(s)
      K.Soky, S.Li, M.Mimura, C.Chu, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Monaural speech enhancement based on spectrogram decomposition for convolutional neural network-sensitive feature extraction.2022

    • Author(s)
      H.Shi, L.Wang, S.Li, J.Dang, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Selective multi-task learning for speech emotion recognition using corpora of different styles.2022

    • Author(s)
      H.Zhang, M.Mimura, T.Kawahara, and K.Ishizuka.
    • Organizer
      IEEE-ICASSP
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Phone-informed refinement of synthesized mel spectrogram for data augmentation in speech recognition.2022

    • Author(s)
      S.Ueno and T.Kawahara.
    • Organizer
      IEEE-ICASSP
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] An end-to-end model from speech to clean transcript for parliamentary meetings2021

    • Author(s)
      M.Mimura, S.Sakai, and T.Kawahara
    • Organizer
      APSIPA ASC
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] ASR rescoring and confidence estimation with ELECTRA2021

    • Author(s)
      H.Futami, H.Inaguma, M.Mimura, S.Sakai, and T.Kawahara
    • Organizer
      IEEE Workshop Automatic Speech Recognition & Understanding (ASRU)
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Data augmentation for ASR using TTS via a discrete representation2021

    • Author(s)
      S.Ueno, M.Mimura, S.Sakai, and T.Kawahara
    • Organizer
      IEEE Workshop Automatic Speech Recognition & Understanding (ASRU)
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] VAD-free streaming hybrid CTC/Attention ASR for unsegmented recording2021

    • Author(s)
      H.Inaguma, M.Mimura, and T.Kawahara
    • Organizer
      INTERSPEECH
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] StableEmit: Selection probability discount for reducing emission latency of streaming monotonic attention ASR2021

    • Author(s)
      H.Inaguma, M.Mimura, and T.Kawahara
    • Organizer
      INTERSPEECH
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Multi-referenced training for dialogue response generation2021

    • Author(s)
      T.Zhao and T.Kawahara
    • Organizer
      SIGdial Meeting Discourse & Dialogue
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Response generation to out-of-database questions for example-based dialogue systems.2020

    • Author(s)
      S.Isonishi, K.Inoue, D.Lala, K.Takanashi, and T.Kawahara.
    • Organizer
      Int'l Workshop Spoken Dialogue Systems (IWSDS)
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] End-to-end speech emotion recognition combined with acoustic-to-word ASR model.2020

    • Author(s)
      H.Feng, S.Ueno, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] End-to-end speech-to-dialog-act recognition.2020

    • Author(s)
      T.V.Dang, T.Zhao, S.Ueno, H.Inaguma, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Topic-relevant response generation using optimal transport for an open-domain dialog system.2020

    • Author(s)
      S.Zhang, T.Zhao, and T.Kawahara.
    • Organizer
      COLING
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Distilling the knowledge of BERT for sequence-to-sequence ASR.2020

    • Author(s)
      H.Futami, H.Inaguma, S.Ueno, M.Mimura, S.Sakai, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] CTC-synchronous training for monotonic attention model.2020

    • Author(s)
      H.Inaguma, M.Mimura, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Enhancing monotonic multihead attention for streaming ASR.2020

    • Author(s)
      H.Inaguma, M.Mimura, and T.Kawahara.
    • Organizer
      INTERSPEECH
    • Related Report
      2020 Annual Research Report
    • Int'l Joint Research
  • [Book] 音声(下) [Speech, Part 2]2022

    • Author(s)
      Acoustical Society of Japan (ed.), Iwano Koji, Kawahara Tatsuya, Shinoda Koichi, Ito Akinori, Masumura Ryo, Ogawa Tetsuji, Komatani Kazunori
    • Total Pages
      208
    • Publisher
      Corona Publishing (コロナ社)
    • ISBN
      9784339013672
    • Related Report
      2022 Annual Research Report
  • [Book] 音声対話システム [Spoken Dialogue Systems]2022

    • Author(s)
      Inoue Koji, Kawahara Tatsuya
    • Total Pages
      272
    • Publisher
      Ohmsha (オーム社)
    • ISBN
      9784274229541
    • Related Report
      2022 Annual Research Report

Published: 2020-04-28   Modified: 2025-01-30  
