Project/Area Number | 23K16945
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | Kyoto University
Principal Investigator |
Project Period (FY) | 2023-04-01 – 2025-03-31
Project Status | Granted (Fiscal Year 2023)
Budget Amount | ¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
Fiscal Year 2024: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2023: ¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000)
Keywords | fine-grained vision / image captioning / vision and language / summarization / Computer Vision / Vision & Language / Fine-grained Data / Discrimination Learning |
Outline of Research at the Start |
Can a machine explain fine visual differences between two birds to a human? In this research, I develop a system to analyze fine-grained classes using visual discrimination learning. A main goal is to summarize visual characteristics specific to a class, generating a standalone textual description.
Outline of Annual Research Achievements |
The initial motivation of the research project was to analyze and summarize visual differences between groups of images, e.g., images of two different bird species, and to describe all characterizing visual differences as a human would. To build such a model, I first investigated techniques for summarizing an image collection into a combined representation, i.e., generating a text that describes the contents shared across all images. With this, a collection of images of the same bird species can be summarized to highlight its differentiating features. Concretely, I investigated two approaches that extract the contents of related images as scene graph representations and then combine them using graph theory. The combined information is further enriched with external knowledge, which allows for better generalization across multiple images. The resulting combined scene graphs serve as an intermediate representation of a concept, e.g., a bird species, and contain complementary details detected across all images. They can be used for text generation or for further graph-based analysis. The results of the first year have been published as two IEEE Access journal papers and discussed at domestic symposia.
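As a rough illustration of the combination step, the sketch below merges per-image scene graphs and marks which objects and relations are shared across all images of a collection. The graph format, the `relation` attribute, and the `combine_scene_graphs` helper are illustrative assumptions for this sketch, not the published models.

```python
import networkx as nx

def combine_scene_graphs(graphs):
    """Merge per-image scene graphs into one combined graph.

    Nodes are object labels; edges carry a 'relation' label. Components
    present in every image are marked 'shared'; the rest are kept as
    complementary details seen in only some images.
    """
    node_counts, edge_counts = {}, {}
    for g in graphs:
        for n in g.nodes:
            node_counts[n] = node_counts.get(n, 0) + 1
        for u, v, data in g.edges(data=True):
            key = (u, data["relation"], v)
            edge_counts[key] = edge_counts.get(key, 0) + 1

    combined = nx.MultiDiGraph()
    n_images = len(graphs)
    for n, c in node_counts.items():
        combined.add_node(n, shared=(c == n_images), support=c)
    for (u, r, v), c in edge_counts.items():
        combined.add_edge(u, v, relation=r, shared=(c == n_images), support=c)
    return combined

# Toy example: two images of the same bird species.
g1 = nx.MultiDiGraph()
g1.add_edge("bird", "branch", relation="perched on")
g1.add_edge("bird", "red crest", relation="has")

g2 = nx.MultiDiGraph()
g2.add_edge("bird", "branch", relation="perched on")
g2.add_edge("bird", "black wings", relation="has")

merged = combine_scene_graphs([g1, g2])
shared = [(u, d["relation"], v) for u, v, d in merged.edges(data=True) if d["shared"]]
print(shared)  # [('bird', 'perched on', 'branch')]
```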
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
To build the target model, it was necessary to first analyze groups of images and construct a comparable intermediate representation that combines the visual details detected across all images. Such an approach can complement and combine visual details that are only visible in some of the images, e.g., when looking at the same bird species in different environments, at different times of day, or from different angles. In the first work (published in IEEE Access, vol. 11), I built a model that generates a combined text representation from multiple related images. The approach uses intermediate scene graph representations and combines them to extract all shared components. Finally, a short text summary of the objects shared across all images can be generated. In the second work (published in IEEE Access, vol. 12), I improved the generalization of cross-image relationships by incorporating additional external knowledge. This generalizes concepts detected in separate images and thus yields a more sophisticated and complete scene graph representation. The resulting scene graphs are planned to be used as intermediate representations of, e.g., two bird species. While the first steps only considered a single combined scene graph per image collection, the next step is to generate such representations for a large number of collections (i.e., different species) and compare the graphs using graph-theoretic methods.
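The knowledge-based generalization can be pictured with a small WordNet example: object labels detected under different names in different images are mapped to their most specific shared concept before merging. WordNet here is only a stand-in for the external knowledge source, and `generalize_labels` is a hypothetical helper, not the method from the papers.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def generalize_labels(label_a, label_b):
    """Return the lowest common hypernym of two noun labels, so nodes
    named differently across images can be merged under one concept."""
    syn_a = wn.synsets(label_a, pos=wn.NOUN)
    syn_b = wn.synsets(label_b, pos=wn.NOUN)
    if not syn_a or not syn_b:
        return None
    common = syn_a[0].lowest_common_hypernyms(syn_b[0])
    return common[0].lemmas()[0].name() if common else None

# Prints a shared ancestor concept such as 'passerine' or 'bird',
# depending on the WordNet hierarchy.
print(generalize_labels("sparrow", "finch"))
```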
Strategy for Future Research Activity |
The first year's progress moved toward generating an intermediate scene graph representation for a set of images, e.g., using many images of the same bird species to consolidate their information into a combined scene graph representation. Such a representation contains all visual details of a species, even those that cannot be obtained from a single image. As the current pipeline relies on off-the-shelf object detection models, the number of fine-grained details it detects is still limited. The next step is to increase the amount of fine-grained detail incorporated into the scene graph representation. For this, an approach combining data augmentation and diffusion models will be used to generate a larger corpus of fine-grained training data. Contrastive learning techniques will then be used to train the detection of fine-grained details, such as wing types, colors, and head shapes. Combined with the previous work, the scene graphs can be enriched with more fine-grained visual details. Lastly, the resulting scene graph representations will be compared using graph theory to extract the most differentiating details of each species, and a textual description will be generated that helps humans distinguish pairs of species.
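As a sketch of the final comparison step, under the same assumed graph format as above, the differentiating details of a species could be read off as the relation triples present in its combined graph but absent from the other's; an actual comparison would likely also weight triples by how many images support them.

```python
import networkx as nx

def differentiating_triples(g_a, g_b):
    """Relation triples in species A's combined scene graph that are
    absent from species B's graph: candidate distinguishing cues."""
    def as_triples(g):
        return {(u, d["relation"], v) for u, v, d in g.edges(data=True)}
    return as_triples(g_a) - as_triples(g_b)

# Toy combined graphs for two species.
a, b = nx.MultiDiGraph(), nx.MultiDiGraph()
a.add_edge("bird", "red crest", relation="has")
b.add_edge("bird", "blue crest", relation="has")
print(differentiating_triples(a, b))  # {('bird', 'has', 'red crest')}
```

A description generator could then verbalize such triples, e.g., "this species has a red crest, whereas the other has a blue crest."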