2016 Fiscal Year Research-status Report

Development of a video summarization method that generates a video following the content of a text from a large collection of videos

Research Project

Project/Area Number	16K16086
Research Institution	Osaka University
Principal Investigator	Yuta Nakashima (中島 悠太), Osaka University, Institute for Datability Science, Associate Professor (70633551)
Project Period (FY)	2016-04-01 – 2019-03-31
Keywords	Video summarization / Natural language processing / Deep neural network / Convolutional neural network
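Outline of Annual Research Achievements	This project carried out research and development on video summarization methods that generate a short video from a large volume of long videos according to the content of text entered by the user. This fiscal year we addressed the following tasks. Task (A): Which regions of a video the videographer or viewers attend to is considered useful for, among other things, understanding the video's content. With application to video summarization in mind, this task developed a method for estimating important regions in video (identifying the important persons in the video), focusing on persons as the principal objects in video. The proposed method aims to identify the persons who are important to the videographer, and trains a classifier on features describing the motion of the persons in the video and the motion of the camera. Task (B): One possible approach to generating a summary video that follows the content of a text is to compute the similarity between video segments and the text and to include the segments with high similarity in the summary. This task proposed a method that encodes the text with a Recurrent Neural Network (RNN), encodes the video with a convolutional neural network (CNN), and learns functions that map both codes into the same space, so that the similarity between a video segment and a text can be computed. Because encoding text with an RNN can lose information about the original words, this task drew on image search results to mitigate the problem. Task (C): This task developed a video summarization method that does not use text input and instead reduces the redundancy of the video. The method uses the technique established in Task (B) for mapping video and text into the same space, represents each video segment as a vector considered to express meaning at the same level as a sentence, and clusters these vectors to generate a summary video that has low redundancy while covering the content of the original video.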
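The clustering-based summarization of Task (C) can be illustrated with a minimal sketch. The report only states that segment vectors are clustered; the use of k-means, the deterministic initialization, the choice of the segment nearest each cluster center, and all function names below are assumptions made for illustration, and the toy 2-D vectors stand in for embeddings that would come from the learned video-text mapping.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, iterations=20):
    """Plain k-means (an assumed clustering choice), initialized with the first k points."""
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_vector(members)
    return centers, assignment


def summarize_by_clustering(segment_embeddings, k):
    """Return indices of one representative segment per cluster, in temporal order."""
    centers, assignment = kmeans(segment_embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen.append(
                min(members, key=lambda i: squared_distance(segment_embeddings[i], centers[c]))
            )
    return sorted(chosen)


# Toy example: six segments forming two groups of near-duplicate content.
toy_embeddings = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
print(summarize_by_clustering(toy_embeddings, 2))  # prints [0, 3]
```

Picking one segment per cluster keeps the summary short (low redundancy) while every cluster, and hence every group of similar content, is still represented (coverage).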
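Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: In Task (A), we first built an important-person identification model that, in addition to the output of a classifier based on a support vector machine (SVM), uses a conditional random field (CRF) to account for correlations between persons. As a result, although the model falls short of identification by humans, it improved the Area Under Curve (AUC) over the SVM alone. Furthermore, when the SVM was replaced with a deep neural network (DNN) and the network was trained together with the CRF parameters, accuracy was 79.3% (F1 score 84.3%) with the same features, and 82.0% (F1 score 86.5%) when color histograms extracted from the persons' face regions were used in addition. Both are higher than the 73.5% accuracy (F1 score 79.5%) obtained when the SVM alone was evaluated on the same dataset. Task (B) was evaluated on text-based video retrieval, one of its applications. In addition to the proposed method, we evaluated two baselines: retrieval using only the text as the query and retrieval using only web images as the query. R@5 (higher is better) was 23.4, 22.8, and 16.4, respectively, and aR (lower is better) was 49.08, 54.1, and 84.5, showing improved performance on both metrics. In Task (C), the methods compared were uniform sampling of the video, a method based on visual attention, and supervised learning of which video segments tend to be included in summaries (evaluated by leave-one-out). The resulting F1 scores were 12.4% for uniform sampling, 16.7% for the visual-attention method, and 23.4% for supervised learning. The proposed method scored 18.3%. Although it underperforms supervised learning, it requires no training, so comparable performance can be expected on videos not contained in the dataset.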
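To make the two retrieval metrics concrete, the sketch below computes them from query-to-video similarity scores. It assumes the usual definitions, which the report does not spell out: R@K is the percentage of queries whose correct video appears among the top K results, and aR is taken here to be the average rank of the correct video. The function names and the toy scores are hypothetical.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the target video when videos are sorted by descending score."""
    target_score = scores[target_index]
    # Every video scoring strictly higher than the target pushes it down one place.
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Percentage of queries whose target is ranked within the top k (higher is better)."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def average_rank(ranks):
    """Mean rank of the target over all queries (lower is better)."""
    return sum(ranks) / len(ranks)


# Toy example: row q holds the similarity of query q to videos 0..2,
# and the correct video for query q is video q.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.1, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = [rank_of_target(row, q) for q, row in enumerate(toy_scores)]
print(ranks)  # prints [1, 3, 1]
print(recall_at_k(ranks, 2), average_rank(ranks))
```

Because R@K only checks whether the target clears a cutoff while the average rank is sensitive to how far down badly ranked targets fall, the two metrics complement each other, which is presumably why the report quotes both.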
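Strategy for Future Research Activity	In the next fiscal year, we will work on developing a video summarization method in which the content of the summary video can be controlled by text input. Specifically, using the text-to-segment similarity measure from Task (B) addressed this year, we will select from the original video the segments that match the content of a text consisting of multiple sentences, and arrange them in the order of the sentences in the text to produce the summary. Because identical or redundant segments may be selected, for example when a single segment is related to multiple sentences, we will also take the similarity between segments into account when generating the summary video. We will also consider modeling the relationship between text and video with, for example, a Markov random field (MRF) and selecting the optimal segments by, for example, dynamic programming. As for the video-text similarity measure, we will also consider an approach that enables similarity computation at the level of video frames rather than the current segment-level computation, so that the original video need not be divided into segments in advance.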
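One way the planned selection step could look is sketched below, purely as an illustration of the idea and not as the project's method. It assumes a chain-structured model in which each sentence picks exactly one segment, the unary term is the sentence-segment similarity, and a pairwise penalty discourages consecutive sentences from picking similar segments. The best assignment is then found by dynamic programming (Viterbi). The function name, the `penalty` weight, and the toy matrices are hypothetical.

```python
def select_segments(similarity, redundancy, penalty=1.0):
    """Choose one segment per sentence by dynamic programming on a chain.

    similarity[t][s]: score between sentence t and segment s.
    redundancy[p][s]: similarity between segments p and s.
    Returns a list with the chosen segment index for each sentence, in sentence order.
    """
    num_sentences = len(similarity)
    num_segments = len(similarity[0])
    best = [list(similarity[0])]  # best[t][s]: best total score ending in segment s
    back = []                     # back[t-1][s]: best predecessor of s at sentence t
    for t in range(1, num_sentences):
        row, pointers = [], []
        for s in range(num_segments):
            prev = max(
                range(num_segments),
                key=lambda p: best[t - 1][p] - penalty * redundancy[p][s],
            )
            row.append(best[t - 1][prev] - penalty * redundancy[prev][s] + similarity[t][s])
            pointers.append(prev)
        best.append(row)
        back.append(pointers)
    # Trace the best path backwards from the best final segment.
    path = [max(range(num_segments), key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: both sentences match segment 0 best, so without a penalty
# the same segment would be chosen twice.
toy_similarity = [[0.9, 0.8, 0.1],
                  [0.9, 0.7, 0.2]]
toy_redundancy = [[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]]
print(select_segments(toy_similarity, toy_redundancy, penalty=0.0))  # prints [0, 0]
print(select_segments(toy_similarity, toy_redundancy, penalty=1.0))  # prints [1, 0]
```

This sketch only penalizes redundancy between consecutive sentences. Penalizing it between all pairs of chosen segments would no longer form a chain and would need a different optimizer.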

Research Products
(7 results)

All Int'l Joint Research (2 results) Journal Article (2 results) (of which Int'l Joint Research: 1 results, Peer Reviewed: 2 results, Acknowledgement Compliant: 2 results) Presentation (3 results) (of which Int'l Joint Research: 3 results)

[Int'l Joint Research] University of Oulu (Finland)
- Country Name
  FINLAND
- Counterpart Institution
  University of Oulu
[Int'l Joint Research] University of Valladolid (Spain)
- Country Name
  SPAIN
- Counterpart Institution
  University of Valladolid
[Journal Article] Video summarization using textual descriptions for authoring video blogs (2016)
- Author(s)
  Mayu Otani, Yuta Nakashima, Tomokazu Sato, and Naokazu Yokoya
- Journal Title
  Multimedia Tools and Applications
  Volume: - Pages: -
- DOI
  10.1007/s11042-016-4061-3
- Peer Reviewed / Acknowledgement Compliant
[Journal Article] Flexible human action recognition in depth video sequences using masked joint trajectories (2016)
- Author(s)
  Antonio Tejero-de-Pablos, Yuta Nakashima, Naokazu Yokoya, Francisco-Javier Diaz-Pernas, and Mario Martinez-Zarzuela
- Journal Title
  EURASIP Journal on Image and Video Processing
  Volume: 2016 Pages: -
- DOI
  10.1186/s13640-016-0120-y
- Peer Reviewed / Int'l Joint Research / Acknowledgement Compliant
[Presentation] Video summarization using deep semantic features (2016)
- Author(s)
  Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila, and Naokazu Yokoya
- Organizer
  13th Asian Conference on Computer Vision
- Place of Presentation
  Taipei
- Year and Date
  2016-11-20 – 2016-11-24
- Int'l Joint Research
[Presentation] Learning joint representations of videos and sentences with web image search (2016)
- Author(s)
  Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila, and Naokazu Yokoya
- Organizer
  4th Workshop on Web-scale Vision and Social Media (VSM) in conjunction with ECCV 2016
- Place of Presentation
  Amsterdam
- Year and Date
  2016-10-10 – 2016-10-10
- Int'l Joint Research
[Presentation] Human action recognition-based video summarization for RGB-D personal sports video (2016)
- Author(s)
  Antonio Tejero-de-Pablos, Yuta Nakashima, Tomokazu Sato, and Naokazu Yokoya
- Organizer
  2016 IEEE International Conference on Multimedia and Expo
- Place of Presentation
  Seattle
- Year and Date
  2016-07-11 – 2016-07-15
- Int'l Joint Research
