2015 Fiscal Year Research-status Report

High-accuracy speech recognition in real environments using deep neural networks that incorporate human auditory characteristics

Research Project

Project/Area Number	15K00233
Research Institution	Toyohashi University of Technology
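Principal Investigator	Kazumasa Yamamoto (山本一公), Toyohashi University of Technology, Graduate School of Engineering, Associate Professor (40324230)
Co-Investigator(Kenkyū-buntansha)	Seiichi Nakagawa (中川聖一), Toyohashi University of Technology, リーディング大学院教育推進機構, Specially Appointed Professor (20115893)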
Project Period (FY)	2015-04-01 – 2018-03-31
Keywords	Speech recognition / Deep learning / Deep Neural Network / Auditory characteristics / Acoustic features
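Outline of Annual Research Achievements	Acoustic models based on deep learning (Deep Neural Network; DNN) are becoming standard in speech recognition. However, recognition performance in noisy environments and under distant-talking conditions is still insufficient. The aim of this research is to improve recognition accuracy, particularly in noisy environments, by incorporating human auditory characteristics into DNN-based acoustic models (especially the feature extraction part). This fiscal year, rather than extracting features directly with a DNN, we examined the effect of introducing auditory characteristics into the acoustic features. First, we sought to improve the recognition rate by introducing the equal-loudness characteristic of human hearing into FBANK features (sounds at roughly 1000 to 4000 Hz are heard well even at small amplitudes, whereas low-frequency and high-frequency sounds are hard to hear even at large amplitudes). Because the equal-loudness characteristic is expressed as a weight for each frequency, it could also be represented as weights between DNN layers; our approach differs in that the weighting is applied, as part of the feature, to the FFT spectrum before it passes through the filterbank. We also introduced PNS (Power Normalized Spectrum) features, which have been proposed as features that enable speech recognition robust to ambient noise. PNS features incorporate two human auditory characteristics: forward masking (a past sound acts as a masker and makes the current sound harder to hear) and medium-time power normalization (the current sound is masked by the average power of roughly the preceding 100 to 200 ms of speech). These auditory characteristics in PNS deal with temporal change and are considered difficult to represent directly in a DNN. Introducing these features improved recognition accuracy on a noisy speech recognition task. In particular, PNS yielded a large accuracy improvement in car-noise environments.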
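The equal-loudness weighting described above (a per-frequency weight applied to the FFT spectrum before the filterbank) can be sketched as follows. This is a minimal illustration, not the project's implementation: the report does not say which equal-loudness curve was used, so the standard A-weighting curve is substituted here as an assumption, and the function names, the 40-filter mel filterbank, and the Hamming window are likewise illustrative choices.

```python
import numpy as np

def a_weighting_power_gain(freqs_hz):
    """A-weighting curve as a power gain, normalized to 1.0 at 1 kHz.

    Assumption: A-weighting stands in for the equal-loudness curve,
    which the report does not specify.
    """
    f2 = np.asarray(freqs_hz, dtype=float) ** 2

    def amplitude(x):
        return (12194.0 ** 2 * x ** 2) / (
            (x + 20.6 ** 2)
            * np.sqrt((x + 107.7 ** 2) * (x + 737.9 ** 2))
            * (x + 12194.0 ** 2)
        )

    # Square the amplitude gain because it is applied to a power spectrum.
    return (amplitude(f2) / amplitude(1000.0 ** 2)) ** 2

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def fbank_with_equal_loudness(frame, sample_rate, n_filters=40):
    """Log mel filterbank (FBANK) features for one frame, with the
    loudness weighting applied to the FFT power spectrum first."""
    frame = np.asarray(frame, dtype=float)
    n_fft = frame.size
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    weighted = power * a_weighting_power_gain(freqs)
    # Small floor avoids log(0) for empty or silent channels.
    return np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ weighted + 1e-10)
```

For a 25 ms frame at 16 kHz (400 samples), `fbank_with_equal_loudness(frame, 16000)` returns a 40-dimensional vector that could be fed to a DNN in place of plain FBANK features.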
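Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: The original plan was to begin by examining methods for representing auditory characteristics within the DNN. However, research results from other institutions showed that this has almost no effect (there is no difference between setting the parameters by hand and learning them automatically). This fiscal year we therefore carried out the investigation planned for the following stage: how much recognition accuracy can be improved by using acoustic features that incorporate auditory characteristics (that is, introducing the auditory characteristics as a filter in front of the DNN). Because the planned investigation was carried out, we judge that the research is progressing smoothly.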
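Strategy for Future Research Activity	In human hearing, the basilar membrane in the cochlea is known to act as a filterbank: input sound makes the basilar membrane vibrate, which triggers neural firing in the hair cells on the membrane, and this is transmitted through the auditory nerve to the brain. The motion of the basilar membrane is continuous, but current speech recognition technology uses discrete-time features extracted from short-time frames of the speech signal, so temporal continuity is broken. Because human hearing is sensitive to change, phoneme onsets are said to be important in speech perception, yet the temporal resolution of current acoustic feature extraction is considered insufficient for handling onsets. In future work, we therefore plan to raise the temporal resolution by using the time-domain signal of each filterbank channel and to feed these signals into a convolutional neural network, which can handle time waveforms directly, with the aim of improving speech recognition accuracy.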
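The idea of keeping a time-domain signal for each filterbank channel, instead of one value per frame, can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the report does not describe the filter design, so ideal FFT-domain band-pass masks with mel-spaced edges are used, and the function name and parameters are hypothetical.

```python
import numpy as np

def band_time_signals(signal, sample_rate, n_bands=8, f_min=100.0, f_max=None):
    """Split a waveform into per-band time signals at the original sample rate.

    Returns (bands, edges): bands has shape (n_bands, len(signal)) and
    edges holds the n_bands + 1 band edges in Hz.
    """
    signal = np.asarray(signal, dtype=float)
    if f_max is None:
        f_max = sample_rate / 2.0
    # Band edges spaced evenly on the mel scale (an assumed choice).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 1))

    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    bands = np.empty((n_bands, signal.size))
    for i in range(n_bands):
        # Ideal band-pass mask: keep bins in [edges[i], edges[i + 1]).
        mask = (freqs >= edges[i]) & (freqs < edges[i + 1])
        bands[i] = np.fft.irfft(spectrum * mask, n=signal.size)
    return bands, edges
```

The resulting `(n_bands, n_samples)` array keeps sample-level time resolution in every channel, so it could serve as a multi-channel input to a 1-D convolutional network. A practical system would more likely use causal or gammatone-style filters than ideal masks.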
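Causes of Carryover	At the initial budget planning stage we intended to purchase a GPGPU-equipped workstation, but we were able to purchase a higher-performance one with other funds, so spending on equipment this fiscal year was limited to storage (hard disks) and similar items. This greatly reduced the amount spent and left a balance.
Expenditure Plan for Carryover Budget	The initial budget plan compressed the equipment and travel expenses for the second year onward, so the carryover will be used for equipment and travel expenses.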

Research Products
(8 results)

Journal Article (1 result, of which Peer Reviewed: 1) / Presentation (7 results, of which Int'l Joint Research: 4)

[Journal Article] A chat-oriented spoken dialogue system using multiple dialogue agents (2016)
- Author(s)
  藤堂祐樹, 西村良太, 山本一公, 中川聖一
- Journal Title
  IEICE Transactions on Information and Systems (Japanese Edition) (電子情報通信学会論文誌)
  Volume: J99-D Pages: 188-200
- DOI
  10.14923/transinfj.2015JDP7010
- Peer Reviewed
[Presentation] Speech analysis of sung-speech and lyric recognition in monophonic singing (2016)
- Author(s)
  Dairoku Kawai, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  IEEE International Conference on Acoustics, Speech, and Signal Processing
- Place of Presentation
  Shanghai, China
- Year and Date
  2016-03-20 – 2016-03-25
- Int'l Joint Research
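[Presentation] A study of unsupervised incremental adaptation of convolutional neural networks (2016)
- Author(s)
  Hiroshi Seki, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  Acoustical Society of Japan
- Place of Presentation
  Toin University of Yokohama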
- Year and Date
  2016-03-09 – 2016-03-11
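[Presentation] Recognition of speech with arbitrary superimposed music using NMF (2016)
- Author(s)
  Naoaki Hashimoto, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  Acoustical Society of Japan
- Place of Presentation
  Toin University of Yokohama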
- Year and Date
  2016-03-09 – 2016-03-11
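[Presentation] Feature analysis of singing voice and a study of lyric recognition considering pitch features (2016)
- Author(s)
  Dairoku Kawai, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  Acoustical Society of Japan
- Place of Presentation
  Toin University of Yokohama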
- Year and Date
  2016-03-09 – 2016-03-11
[Presentation] Speech recognition based on Itakura-Saito divergence and dynamics / sparseness constraints from mixed sound of speech and music by non-negative matrix factorization (2015)
- Author(s)
  Naoaki Hashimoto, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
- Place of Presentation
  Hong Kong
- Year and Date
  2015-12-16 – 2015-12-19
- Int'l Joint Research
[Presentation] Deep neural network based acoustic model using speaker-class information for short time utterance (2015)
- Author(s)
  Hiroshi Seki, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
- Place of Presentation
  Hong Kong
- Year and Date
  2015-12-16 – 2015-12-19
- Int'l Joint Research
[Presentation] Robust speech recognition using DNN-HMM acoustic model combining noise-aware training with spectral subtraction (2015)
- Author(s)
  Akihiro Abe, Kazumasa Yamamoto, Seiichi Nakagawa
- Organizer
  INTERSPEECH
- Place of Presentation
  Dresden, Germany
- Year and Date
  2015-09-06 – 2015-09-10
- Int'l Joint Research
