-
- TAMURA Satoshi
- Gifu University
-
- NINOMIYA Hiroshi
- Nagoya University
-
- KITAOKA Norihide
- Tokushima University
-
- OSUGA Shin
- Aisin Seiki Co., Ltd.
-
- IRIBE Yurie
- Aichi Prefectural University
-
- TAKEDA Kazuya
- Nagoya University
-
- HAYAMIZU Satoru
- Gifu University
Abstract
<p>Audio-Visual Speech Recognition (AVSR) is one technique for enhancing the robustness of speech recognizers in noisy or real-world environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted much attention from researchers in the speech recognition field, because they can drastically improve recognition performance. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach. In the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be incorporated using DNNs. We carried out recognition experiments using the CENSREC-1-AV corpus, and we discuss the results to identify the best DNN-based AVSR modeling. It turns out that a tandem-based method combining audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.</p>
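The multi-stream HMM combination mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the example log-likelihood values, and the stream weight of 0.7 are all hypothetical; it only shows the standard weighted log-likelihood fusion on which multi-stream HMMs are based.

```python
import numpy as np

def multistream_log_likelihood(log_b_audio, log_b_visual, lambda_audio=0.7):
    """Combine per-state emission log-likelihoods from audio and visual
    streams with exponential stream weights, as in multi-stream HMMs:
    log b(o) = lambda_A * log b_A(o_A) + (1 - lambda_A) * log b_V(o_V).
    """
    lambda_visual = 1.0 - lambda_audio
    return (lambda_audio * np.asarray(log_b_audio)
            + lambda_visual * np.asarray(log_b_visual))

# Hypothetical example: two HMM states scored on one audio-visual frame.
log_audio = np.array([-4.0, -6.0])   # audio-stream log-likelihoods per state
log_visual = np.array([-5.0, -3.0])  # visual-stream log-likelihoods per state
combined = multistream_log_likelihood(log_audio, log_visual, lambda_audio=0.7)
```

In practice the stream weight is tuned to the noise condition: heavier audio weighting in clean speech, heavier visual weighting as acoustic noise increases.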
Journal
-
- IEICE Transactions on Information and Systems
-
IEICE Transactions on Information and Systems E99.D (10), 2444-2451, 2016
The Institute of Electronics, Information and Communication Engineers (IEICE)
Details
-
- CRID
- 1390001204379815040
-
- NII Article ID
- 130005598220
-
- ISSN
- 1745-1361
- 0916-8532
-
- Text language code
- en
-
- Data source type
-
- JaLC
- Crossref
- CiNii Articles
- IDR
- KAKEN
-
- Abstract license flag
- Unavailable