Research Project/Area Number | 20K19833
Research Institution | RIKEN
Principal Investigator | Nugraha Aditya (Arie), RIKEN, Center for Advanced Intelligence Project, Postdoctoral Researcher (60858025)
Project Period (FY) | 2020-04-01 – 2023-03-31
Keywords | Audio-visual processing / Blind source separation / Speech enhancement / Dereverberation / Deep spatial model / Normalizing flow
Outline of Annual Research Achievements |
This study aims to formulate a probabilistic computational model of audio-visual information processing. We build upon the probabilistic local Gaussian model of a multichannel audio signal, which is parameterized by spectral parameters portraying the source characteristics and spatial parameters representing the source and microphone locations in an environment. In FY2020, we developed multiple deep spectral models, including one based on speaker and phone disentanglement. In FY2021, we continued developing deep spatial models based on the normalizing flow while integrating our state-of-the-art joint diagonalization techniques for spatial covariance matrices. Most importantly, we started to incorporate visual aspects into our audio-visual information processing. Although we originally planned to model the lip movement specifically, we decided to shift our focus to tackling issues in real-world scenarios, where lip movement detection is often unreliable, e.g., due to low-quality images, so the typical lip-movement-informed speech enhancement is not possible. In contrast, human faces or bodies can be detected relatively easily. We regarded the detected humans as possible speaker locations that govern the spatial parameter optimization of an audio source separation or speech enhancement technique. Building upon this visually informed audio signal processing, in FY2022, we will explore more visual aspects and develop techniques that exploit both audio and visual data in a mutually dependent manner.
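For reference, the local Gaussian model mentioned above can be written down as follows. This is a minimal sketch in standard notation, not necessarily the exact notation of our papers: the spectral parameter is the power spectral density λ_nft, the spatial parameter is the spatial covariance matrix (SCM) H_nf, and the joint diagonalization constrains the SCMs as in FastMNMF-style techniques.

```latex
% Local Gaussian model: the observed M-channel mixture x_{ft} at
% frequency f and time t is the sum of N source images, each of which
% is zero-mean complex Gaussian with a covariance factorized into the
% spectral parameter \lambda_{nft} and the spatial parameter H_{nf}.
\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{s}_{nft},
\qquad
\mathbf{s}_{nft} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0},\, \lambda_{nft} \mathbf{H}_{nf}\right)

% Joint diagonalization: all sources share a frequency-wise
% diagonalizer Q_f, reducing each SCM to nonnegative weights g_{nfm}
% and enabling fast parameter updates.
\mathbf{H}_{nf} = \mathbf{Q}_f^{-1}\, \operatorname{Diag}(g_{nf1}, \ldots, g_{nfM})\, \mathbf{Q}_f^{-\mathsf{H}}
```

The deep spectral and spatial models mentioned above replace or regularize parts of this parameterization with neural networks, e.g., a normalizing flow on the spatial side.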
Current Status of Research Progress (Category) |
2: Progressing rather smoothly
Reason
In FY2021, following the research implementation plan, we worked on the visual aspect of audio-visual information processing. We originally planned to model the lip movement, one of the prominent visual aspects in documenting human social interaction. However, learning from multiple recently published works on lip-movement-informed speech enhancement, we decided to shift our focus to tackling issues in real-world scenarios where lip movement detection is not reliable, e.g., because the target speaker is too far away from the camera or the image resolution is too low due to camera hardware limitations. In such cases, instead of lip movements, human faces or bodies can still be detected from the camera images to inform an audio source separation technique about the possible locations of the target speakers. The multichannel audio signals are assumed to follow the probabilistic local Gaussian model parameterized by spectral parameters portraying the source characteristics and spatial parameters representing the relative source and microphone locations in an environment. The detected possible speaker locations can then be exploited to govern the spatial parameter optimization. Unfortunately, the publication list does not yet reflect this study on visually informed audio source separation because multiple papers are currently under review. Additionally, we continued modeling the spectral and spatial parameters with deep neural networks while integrating our state-of-the-art joint diagonalization techniques for spatial covariance matrices.
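As a rough illustration of how a detected speaker location could govern the spatial parameter optimization, the sketch below initializes per-frequency spatial covariance matrices from a visually estimated direction of arrival. All function names, the array geometry, and the far-field plane-wave assumption are hypothetical choices for this sketch, not the method in the papers under review.

```python
# Illustrative sketch (not the authors' implementation): initializing
# the spatial covariance matrix (SCM) of a target speaker from a
# visually detected direction of arrival (DOA), e.g., a face bounding
# box mapped to an azimuth angle, assuming a far-field plane-wave
# model and a known linear microphone array on the x-axis.
import numpy as np

def steering_vector(doa_rad, mic_positions, freq_hz, c=343.0):
    """Far-field steering vector for one frequency bin.

    doa_rad: detected speaker azimuth in radians.
    mic_positions: (M,) microphone x-coordinates in meters.
    """
    delays = mic_positions * np.cos(doa_rad) / c  # per-mic time delays
    return np.exp(-2j * np.pi * freq_hz * delays)  # shape (M,)

def init_spatial_covariance(doa_rad, mic_positions, freqs_hz, eps=1e-2):
    """Rank-1 SCM per frequency bin, H_f = a_f a_f^H + eps * I.

    The small diagonal loading keeps each SCM full rank so that it can
    be refined later by EM/MM-style parameter updates.
    """
    M = len(mic_positions)
    scms = np.empty((len(freqs_hz), M, M), dtype=complex)
    for i, f in enumerate(freqs_hz):
        a = steering_vector(doa_rad, mic_positions, f)
        scms[i] = np.outer(a, a.conj()) + eps * np.eye(M)
    return scms

# Example: 4-mic linear array with 5 cm spacing, speaker detected at
# 60 degrees, 257 STFT bin center frequencies up to 8 kHz.
mics = np.arange(4) * 0.05
freqs = np.linspace(0.0, 8000.0, 257)
H_init = init_spatial_covariance(np.deg2rad(60.0), mics, freqs)
print(H_init.shape)  # (257, 4, 4)
```

The initialized SCMs would then serve as the starting point (or a prior) for the spatial parameter optimization of the separation technique, rather than as its final output.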
Strategy for Future Research Activity |
In FY2022, we will explore more visual aspects, possibly including the lip movement as planned earlier, to be incorporated into our audio-visual information processing. Although lip movements might not be usable for lip reading given low-resolution images, they might still help speech activity detection. Building upon our visually informed audio signal processing, we will attempt to develop signal processing techniques that exploit both audio and visual data in a mutually dependent manner. It is also important that the techniques be applicable to real environments, where the noise signals are non-stationary most of the time and the target speakers may move; to be effective, the techniques must therefore be versatile. As always, since the research field evolves rapidly, we might need to adjust our research objectives to keep working on something novel and original.
Causes of Carryover |
Because of the global COVID-19 pandemic, international and domestic conferences are still mostly held online, so most of the travel budget remains unused. We decided not to spend it on other purposes in a rush but to preserve it for possible critical needs in the future. Considering the improving pandemic situation and the easing of travel restrictions, we expect more participation in on-site conferences soon.