Unifying Object Detection and Image Captioning using Vision-Language Knowledge Base for Open-World Comprehension

Research Project

Project/Area Number	24K20830
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 61030: Intelligent informatics-related
Research Institution	The University of Tokyo
Principal Investigator	Vo Minh Duc, The University of Tokyo, Graduate School of Information Science and Technology, Project Assistant Professor (40939906)
Project Period (FY)	2024-04-01 – 2026-03-31
Project Status	Granted (Fiscal Year 2024)
Budget Amount	¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
	Fiscal Year 2025: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
	Fiscal Year 2024: ¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Keywords	vision-language / image captioning / object recognition
Outline of Research at the Start	Object detection and image captioning are closely related tasks, yet each can recognize and describe objects that lie beyond the scope of the other. This research pursues a more comprehensive and cohesive understanding of visual content by unifying the two tasks as a single generative task. We aim to develop a method based on a vision-language knowledge base that detects and describes not only the objects in the training dataset but also novel objects not seen during training.