Project/Area Number | 23K16945
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | Kyoto University
Principal Investigator |
Project Period (FY) | 2023-04-01 – 2025-03-31
Project Status | Granted (Fiscal Year 2023)
Budget Amount | ¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
Fiscal Year 2024: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2023: ¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Keywords | fine-grained vision / image captioning / vision and language / summarization / Computer Vision / Vision & Language / Fine-grained Data / Discrimination Learning |
Outline of Research at the Start |
Can a machine explain fine visual differences between two birds to a human? In this research, I develop a system to analyze fine-grained classes using visual discrimination learning. A main goal is to summarize visual characteristics specific to a class, generating a standalone textual description.
Outline of Annual Research Achievements |
The initial motivation of the research project was to analyze and summarize visual differences between groups of images, e.g., images of two different bird species, and to describe all characterizing visual differences as a human would. To build such a model, I first investigated techniques for summarizing an image collection into a combined representation, i.e., generating a text that describes the contents shared across all images. With this, a collection of images of the same bird species can be summarized to highlight its differentiating features. Concretely, I investigated two approaches that extract the contents of related images as scene graph representations and then combine them using graph theory. The combined information is further enriched with external knowledge, which allows for better generalization across multiple images. The resulting combined scene graphs serve as an intermediate representation of a concept, e.g., a bird species, and contain complementary details detected across all images. They can be used for text generation or for further graph-based analysis. The results of the first year have been published as two IEEE Access journal papers and discussed at domestic symposia.
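As a rough illustration of the combination step, the sketch below merges per-image scene graphs and marks which objects and relations are shared across all images of a collection. The graph format, the `relation` attribute, and the `combine_scene_graphs` helper are illustrative assumptions for this sketch, not the published models.

```python
import networkx as nx

def combine_scene_graphs(graphs):
    """Merge per-image scene graphs into one combined graph.

    Nodes are object labels; edges carry a 'relation' label. Components
    present in every image are marked 'shared'; the rest are kept as
    complementary details seen in only some images.
    """
    node_counts, edge_counts = {}, {}
    for g in graphs:
        for n in g.nodes:
            node_counts[n] = node_counts.get(n, 0) + 1
        for u, v, data in g.edges(data=True):
            key = (u, data["relation"], v)
            edge_counts[key] = edge_counts.get(key, 0) + 1

    combined = nx.MultiDiGraph()
    n_images = len(graphs)
    for n, c in node_counts.items():
        combined.add_node(n, shared=(c == n_images), support=c)
    for (u, r, v), c in edge_counts.items():
        combined.add_edge(u, v, relation=r, shared=(c == n_images), support=c)
    return combined

# Toy example: two images of the same bird species.
g1 = nx.MultiDiGraph()
g1.add_edge("bird", "branch", relation="perched on")
g1.add_edge("bird", "red crest", relation="has")

g2 = nx.MultiDiGraph()
g2.add_edge("bird", "branch", relation="perched on")
g2.add_edge("bird", "black wings", relation="has")

merged = combine_scene_graphs([g1, g2])
shared = [(u, d["relation"], v) for u, v, d in merged.edges(data=True) if d["shared"]]
print(shared)  # [('bird', 'perched on', 'branch')]
```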
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
To build the target model, it was necessary to first analyze groups of images and construct a comparable intermediate representation that combines the visual details detected across all images. Such an approach can complement and combine visual details that are only visible in some of the images, e.g., when looking at the same bird species in different environments, at different times of day, or from different angles. In the first work (published in IEEE Access, vol. 11), I built a model that generates a combined text representation from multiple related images. The approach uses intermediate scene graph representations and combines them to extract all shared components. Finally, a short text summary of the objects shared across all images can be generated. In the second work (published in IEEE Access, vol. 12), I improved the generalization of cross-image relationships by incorporating additional external knowledge. This generalizes concepts detected in separate images and thus yields a more sophisticated and complete scene graph representation. The resulting scene graphs are planned to be used as intermediate representations of, e.g., two bird species. While the first steps only considered a single combined scene graph per image collection, the next step is to generate such representations for a large number of collections (i.e., different species) and compare the graphs using graph-theoretic methods.
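The knowledge-based generalization can be pictured with a small WordNet example: object labels detected under different names in different images are mapped to their most specific shared concept before merging. WordNet here is only a stand-in for the external knowledge source, and `generalize_labels` is a hypothetical helper, not the method from the papers.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def generalize_labels(label_a, label_b):
    """Return the lowest common hypernym of two noun labels, so nodes
    named differently across images can be merged under one concept."""
    syn_a = wn.synsets(label_a, pos=wn.NOUN)
    syn_b = wn.synsets(label_b, pos=wn.NOUN)
    if not syn_a or not syn_b:
        return None
    common = syn_a[0].lowest_common_hypernyms(syn_b[0])
    return common[0].lemmas()[0].name() if common else None

# Prints a shared ancestor concept such as 'passerine' or 'bird',
# depending on the WordNet hierarchy.
print(generalize_labels("sparrow", "finch"))
```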
Strategy for Future Research Activity |
The first year's progress moved toward generating an intermediate scene graph representation for a set of images, e.g., using many images of the same bird species to consolidate their information into a combined scene graph representation. Such a representation contains all visual details of a species, even those that cannot be obtained from a single image. As the current pipeline relies on off-the-shelf object detection models, the number of fine-grained details it detects is still limited. The next step is to increase the amount of fine-grained detail incorporated into the scene graph representation. For this, an approach combining data augmentation and diffusion models will be used to generate a larger corpus of fine-grained training data. Contrastive learning techniques will then be used to train the detection of fine-grained details, such as wing types, colors, and head shapes. Combined with the previous work, the scene graphs can be enriched with more fine-grained visual details. Lastly, the resulting scene graph representations will be compared using graph theory to extract the most differentiating details of each species, and a textual description will be generated that helps humans distinguish pairs of species.
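As a sketch of the final comparison step, under the same assumed graph format as above, the differentiating details of a species could be read off as the relation triples present in its combined graph but absent from the other's; an actual comparison would likely also weight triples by how many images support them.

```python
import networkx as nx

def differentiating_triples(g_a, g_b):
    """Relation triples in species A's combined scene graph that are
    absent from species B's graph: candidate distinguishing cues."""
    def as_triples(g):
        return {(u, d["relation"], v) for u, v, d in g.edges(data=True)}
    return as_triples(g_a) - as_triples(g_b)

# Toy combined graphs for two species.
a, b = nx.MultiDiGraph(), nx.MultiDiGraph()
a.add_edge("bird", "red crest", relation="has")
b.add_edge("bird", "blue crest", relation="has")
print(differentiating_triples(a, b))  # {('bird', 'has', 'red crest')}
```

A description generator could then verbalize such triples, e.g., "this species has a red crest, whereas the other has a blue crest."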