2016 Fiscal Year Annual Research Report

Construction of a Recognition Platform for Multimodal Time-Series Data Using Deep Learning

Research Project

Project/Area Number	16H02845
Research Institution	Tokyo Institute of Technology
Principal Investigator	Koichi Shinoda, Tokyo Institute of Technology, School of Computing, Professor (10343097)
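Co-Investigator (Kenkyū-buntansha)	Nakamasa Inoue, Tokyo Institute of Technology, School of Computing, Assistant Professor (10733397); Koji Iwano, Tokyo City University, Faculty of Media Information, Professor (90323823)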
Project Period (FY)	2016-04-01 – 2019-03-31
Keywords	Perceptual information processing / Speech information processing / Video information processing / Deep learning
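Outline of Annual Research Achievements	The goal of this research is high-accuracy recognition of multimedia time-series data. Using recurrent neural networks (RNNs) as the recognition method, we build a recognizer for each individual modality, such as speech and video, and then integrate them into a multimodal recognition system based on end-to-end learning. Model size reduction and transfer learning play important roles in this effort. In this first fiscal year, we focused on building baselines by implementing the basic methods. In acoustic processing, there were two results. First, we implemented speech recognition with a feed-forward deep neural network (DNN) and applied distillation to it, succeeding in building a smaller DNN without degrading recognition performance. Second, we built an end-to-end learning framework that jointly trains a DNN for separating the speech of multiple speakers and a DNN for speech recognition, and confirmed that it achieves higher performance than training the two separately. In video processing, for TRECVID multimedia event detection (MED), we built an RNN that takes as input features extracted by a convolutional neural network (CNN). To model correlations along the time axis more accurately, we used long short-term memory (LSTM) and confirmed higher performance than conventional methods.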
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned.
Reason	Progress is largely on schedule. The RNN baseline for speech recognition is still under development, but its implementation is already complete, and we do not foresee any major obstacles.
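Strategy for Future Research Activity	As originally planned, we will develop methods aimed at improving performance on each of the following tasks: speech recognition, multimedia event recognition, and automatic music transcription.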
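
The report states that distillation was applied to the feed-forward DNN for speech recognition, but it does not give the formulation used. The following is a minimal sketch of one commonly used soft-target distillation loss, assuming a single frame with teacher and student logits over the same output classes. The function names, the temperature of 2.0, and the weight alpha of 0.5 are illustrative assumptions, not values from the project.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions.
    soft_term = -sum(t * math.log(s)
                     for t, s in zip(p_teacher, p_student_soft))
    # Ordinary cross-entropy against the ground-truth class index.
    p_student = softmax(student_logits)
    hard_term = -math.log(p_student[hard_label])
    # Scaling by T^2 keeps the soft-term gradients on a comparable scale.
    return alpha * temperature ** 2 * soft_term + (1.0 - alpha) * hard_term


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.0]
    print(distillation_loss([4.0, 1.0, 0.0], teacher, hard_label=0))
    print(distillation_loss([0.0, 1.0, 4.0], teacher, hard_label=0))
```

In this sketch, a student whose logits agree with the teacher receives a lower loss than one that disagrees, which is the signal a smaller network would be trained on.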

Research Products
(10 results)


Journal Article (3 results; of which Peer Reviewed: 3, Open Access: 2, Acknowledgement Compliant: 1)
Presentation (7 results; of which Int'l Joint Research: 4, Invited: 4)

[Journal Article] Deep Learning in Spoken Language Processing: A Review (2017)
- Author(s)
  Koichi Shinoda
- Journal Title
  日本音響学会誌 (Journal of the Acoustical Society of Japan)
  Volume: 73 Pages: 25-30
- Peer Reviewed / Acknowledgement Compliant
[Journal Article] [Invited Paper] Semantic Indexing for Large-Scale Video Retrieval (2016)
- Author(s)
  Nakamasa Inoue, Koichi Shinoda
- Journal Title
  ITE Transactions on Media Technology and Applications
  Volume: 4 Pages: 209-217
- DOI
  10.3169/mta.4.209
- Peer Reviewed / Open Access
[Journal Article] Wise teachers train better DNN acoustic models (2016)
- Author(s)
  Ryan Price, Ken-ichi Iso, Koichi Shinoda
- Journal Title
  EURASIP Journal on Audio, Speech, and Music Processing
  Volume: 2016 Pages: 1-19
- DOI
  10.1186/s13636-016-0088-7
- Peer Reviewed / Open Access
[Presentation] Speaker Separation in Multi-Channel Environment Using Deep Learning (2017)
- Author(s)
  Conggui Liu, Nakamasa Inoue, Koichi Shinoda
- Organizer
  IPSJ Special Interest Group on Spoken Language Processing (情報処理学会音声言語情報処理研究会)
- Place of Presentation
  Kotohira Grand Hotel Sakuranosho, Kotohira, Kagawa Prefecture
- Year and Date
  2017-02-17 – 2017-02-18
[Presentation] Video Semantic Indexing and Localization (2016)
- Author(s)
  Koichi Shinoda
- Organizer
  5th Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan
- Place of Presentation
  Hilton Hawaiian Village, Honolulu, USA
- Year and Date
  2016-11-28 – 2016-12-02
- Int'l Joint Research / Invited
[Presentation] TokyoTech at TRECVID 2016 (2016)
- Author(s)
  Nakamasa Inoue, Ryosuke Yamamoto, Na Rong, Koichi Shinoda
- Organizer
  NIST TRECVID workshop
- Place of Presentation
  NIST, Gaithersburg, MD, USA
- Year and Date
  2016-11-14 – 2016-11-16
- Int'l Joint Research / Invited
[Presentation] Adaptation of Word Vectors using Tree Structure for Visual Semantics (2016)
- Author(s)
  Nakamasa Inoue, Koichi Shinoda
- Organizer
  ACM Multimedia 2016
- Place of Presentation
  Theater Tuschinski, Amsterdam
- Year and Date
  2016-10-15 – 2016-10-19
- Int'l Joint Research
[Presentation] Use Cases of Tokyo Tech TSUBAME: Deep Learning for Multimedia Recognition (2016)
- Author(s)
  Koichi Shinoda
- Organizer
  GTC Japan 2016
- Place of Presentation
  Hilton Tokyo Odaiba, Minato-ku, Tokyo
- Year and Date
  2016-10-05 – 2016-10-05
- Invited
[Presentation] Deep Learning for Speech, Image, and Video (2016)
- Author(s)
  Koichi Shinoda
- Organizer
  International Conference on Computer, Control, Informatics, and Its Applications (IC3INA)
- Place of Presentation
  Indonesia Convention Exhibition (ICE), Tangerang, Indonesia
- Year and Date
  2016-10-03 – 2016-10-03
- Int'l Joint Research / Invited
[Presentation] Concept Elimination for Zero-Shot Event Detection (2016)
- Author(s)
  Tran Hai Dang, Nakamasa Inoue, Koichi Shinoda
- Organizer
  The 22nd Symposium on Sensing via Image Information (SSII)
- Place of Presentation
  Pacifico Yokohama Annex, Yokohama
- Year and Date
  2016-06-08 – 2016-06-10
