FY2020 Research Status Report

A Unified Computational Model for Audio-Visual Recognition of Human Social Interaction

Research Project

Project/Area Number	20K19833
Research Institution	RIKEN
Principal Investigator	Nugraha Aditya (Arie) RIKEN, Center for Advanced Intelligence Project, Postdoctoral Researcher (60858025)
Project Period (FY)	2020-04-01 – 2023-03-31
Keywords	deep speech model / deep generative model / latent variable model / variational autoencoder / normalizing flow
Outline of Research Achievements	This study aims to formulate a probabilistic computational model of audio-visual scene understanding. In FY2020, as planned, we developed a model that generates time-varying speech from three kinds of latent variables: (1) a time-invariant speaker label, which encodes voice characteristics; (2) a time-invariant phone label, which conveys the message; and (3) additional time-varying latent variables, which capture other unspecified aspects of speech. The generated speech sounded natural given a good set of latent variables, and voice conversion could be performed by modifying the speaker label. The model also worked well for speech separation, outperforming the classical model that uses no phone or speaker information. In FY2021, we will focus on the visual part by developing a generative model of lip movement.
Current Status of Research Progress (Category)	3: Slightly delayed
Reason	For FY2020, we originally planned to work on speech generative modeling and to start working on lip-movement generative modeling. However, we concentrated on developing and improving the speech model, which put the development of the lip-movement model behind schedule. In addition to the phone- and speaker-aware speech model based on the widely used variational autoencoder (VAE), we developed another model that combines a VAE with a normalizing flow (NF). We showed that this novel model represents and produces better speech harmonics and improves a speech enhancement system that uses it. We also developed a novel NF-based independent vector analysis method for source separation. Our knowledge of NF should be valuable for developing the lip-movement model.
Strategy for Future Research Activity	In FY2021, we will focus on developing a generative model of lip movement. Since the research proposal was written, other researchers have published work on audio-visual speech enhancement systems that exploit lip-movement images. We believe that these recent works will accelerate the progress of our study. However, because this research area evolves very quickly, we might need to adjust our research objectives so that our work remains novel and original.
Reason for Incurring Amount to Be Used Next Fiscal Year	Because of the global COVID-19 pandemic, international and domestic conferences were held online, so most of the travel budget remained unused. We decided not to spend it hastily on other purposes and to preserve it for possible critical needs in the future.

Research Products
(6 results)

All: Journal Article (3 results) (of which Int'l Joint Research: 3, Peer Reviewed: 3, Open Access: 2); Presentation (3 results) (of which Int'l Conference: 3)

[Journal Article] A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement (2020)
- Author(s)/Presenter(s)
  Nugraha Aditya Arie, Sekiguchi Kouhei, Yoshii Kazuyoshi
- Journal Title
  IEEE/ACM Transactions on Audio, Speech, and Language Processing
  Volume: 28 Pages: 1104-1117
- DOI
  10.1109/TASLP.2020.2979603
- Peer Reviewed / Int'l Joint Research
[Journal Article] Fast Multichannel Nonnegative Matrix Factorization With Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation (2020)
- Author(s)/Presenter(s)
  Sekiguchi Kouhei, Bando Yoshiaki, Nugraha Aditya Arie, Yoshii Kazuyoshi, Kawahara Tatsuya
- Journal Title
  IEEE/ACM Transactions on Audio, Speech, and Language Processing
  Volume: 28 Pages: 2610-2625
- DOI
  10.1109/TASLP.2020.3019181
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Flow-Based Independent Vector Analysis for Blind Source Separation (2020)
- Author(s)/Presenter(s)
  Nugraha Aditya Arie, Sekiguchi Kouhei, Fontaine Mathieu, Bando Yoshiaki, Yoshii Kazuyoshi
- Journal Title
  IEEE Signal Processing Letters
  Volume: 27 Pages: 2173-2177
- DOI
  10.1109/LSP.2020.3039944
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization (2020)
- Author(s)/Presenter(s)
  Fontaine Mathieu, Sekiguchi Kouhei, Nugraha Aditya Arie, Yoshii Kazuyoshi
- Conference
  INTERSPEECH
- Int'l Conference
[Presentation] Fast Multichannel Correlated Tensor Factorization for Blind Source Separation (2020)
- Author(s)/Presenter(s)
  Yoshii Kazuyoshi, Sekiguchi Kouhei, Bando Yoshiaki, Fontaine Mathieu, Nugraha Aditya Arie
- Conference
  EUSIPCO
- Int'l Conference
[Presentation] Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms (2020)
- Author(s)/Presenter(s)
  Du Yicheng, Sekiguchi Kouhei, Bando Yoshiaki, Nugraha Aditya Arie, Fontaine Mathieu, Yoshii Kazuyoshi, Kawahara Tatsuya
- Conference
  EUSIPCO
- Int'l Conference
