Object State Recognition via Multi-Modal Analysis of Videos and Video Caption Sequences

Research Project

Project/Area Number	22K21296
Research Category	Grant-in-Aid for Research Activity Start-up
Allocation Type	Multi-year Fund
Review Section	1002:Human informatics, applied informatics and related fields
Research Institution	National Institute of Advanced Industrial Science and Technology (2023); The University of Tokyo (2022)
Principal Investigator	Yagi Takuma National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (50964277)
Project Period (FY)	2022-08-31 – 2024-03-31
Project Status	Completed (Fiscal Year 2023)
Budget Amount	¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000); Fiscal Year 2023: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000); Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords	Object state recognition / Large language models / Learning from video subtitles / State-description captions / Vision-language models
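Outline of Research at the Start	This project aims to achieve flexible, object-level state recognition from caption sequences that explicitly describe the states of objects in videos and how those states change (state-description captions). Specifically, videos containing object state changes are newly annotated with captions that describe the location and state of each object that appears, together with the actions or phenomena that caused the changes; object-level feature representations are then learned by associating these captions with changes in the appearance of the target object and its surroundings. During the research period, the project addresses three items: (a) building a state-description caption corpus, (b) developing a method for automatically tracking state-change regions from limited supervision, and (c) building an object-level model for representing states and state changes.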
Outline of Final Research Achievements	We developed a computational model that recognizes the states of objects appearing in a video (e.g., whether an egg is cracked or boiled). Training such a model requires annotations of object states aligned with the video, but collecting this training information for a wide variety of object states is costly and impractical. In this study, we proposed a new framework that automatically generates training information for diverse object states by applying large language models (LLMs) to the narrations included in Internet videos.
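Academic Significance and Societal Importance of the Research Achievements	Previous work on understanding human behavior and its surrounding environment has focused mainly on recognizing what a person is doing (actions) and what is present (objects); recognition of finer scene details, such as what state an object ends up in as a result of a person's action, has not been sufficiently addressed. Automatically recognizing diverse object states from video makes it possible, for example, for a robot to judge whether an action was carried out as intended by checking whether the object's state actually changed, which is expected to lead to more reliable task execution. In addition, because LLMs can handle arbitrary state descriptions, the vocabulary is easy to change, making it possible to provide recognition results tailored to user requirements.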