2023 Fiscal Year Research-status Report
Toward describing fine-grained details in computer vision through visual discrimination learning
Project/Area Number | 23K16945
Research Institution | Kyoto University
Principal Investigator |
Project Period (FY) | 2023-04-01 – 2025-03-31
Keywords | fine-grained vision / image captioning / vision and language / summarization
Outline of Annual Research Achievements
The initial motivation of this research project was to analyze and summarize visual differences between groups of images, e.g., images of two different bird species, and to describe all characterizing visual differences as a human would. To build such a model, I first investigated techniques for summarizing an image collection into a combined representation, i.e., generating a text that describes the contents shared across all images. With this, a collection of images of the same bird species can be summarized so as to highlight its characteristic features. Specifically, I investigated two approaches that extract the contents of related images as scene graph representations and then combine them using graph theory. The combined information is further enriched with external knowledge, which allows for better generalization across multiple images. The resulting combined scene graph is an intermediate representation of a concept, e.g., a bird species, and contains the complementary details detected across all images. It can be used for text generation or for further graph-based analysis. The results of the first year have been published as two IEEE Access journal papers and discussed at domestic symposiums.
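As an illustration of this combination step, the following is a minimal sketch in which each per-image scene graph is reduced to a set of (subject, relation, object) triples: the intersection recovers the contents shared across all images, and the union collects the complementary details. All labels and triples are hypothetical, and the published models operate on richer scene graph structures than plain triple sets.

# Minimal sketch: per-image scene graphs as sets of
# (subject, relation, object) triples; labels are hypothetical.

def combine_scene_graphs(graphs):
    """Combine per-image scene graphs given as triple sets."""
    shared = set.intersection(*graphs)   # contents shown in every image
    complemented = set.union(*graphs)    # all details seen in any image
    return shared, complemented

# Three hypothetical photos of the same bird species.
g1 = {("bird", "has", "red crown"), ("bird", "perched on", "branch")}
g2 = {("bird", "has", "red crown"), ("bird", "flying over", "water")}
g3 = {("bird", "has", "red crown"), ("bird", "perched on", "fence")}

shared, complemented = combine_scene_graphs([g1, g2, g3])
print(shared)        # {('bird', 'has', 'red crown')}
print(complemented)  # every detail observed in at least one image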
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
To build the target model, it was first necessary to analyze groups of images and construct a comparable intermediate representation that combines the visual details detected across all images. Such an approach can complement and merge visual details that appear in only some of the images, e.g., when the same bird species is photographed in different environments, at different times of day, or from different angles. In the first work (published in IEEE Access, vol. 11), I built a model that generates a combined text representation from multiple related images. The approach uses intermediate scene graph representations and combines them to extract all shared components; from these, a short text summary of the objects shared across all images is generated. In the second work (published in IEEE Access, vol. 12), I improved the generalization of cross-image relationships by incorporating additional external knowledge. This generalizes concepts detected in separate images and thus yields a more sophisticated and complete scene graph representation. The resulting scene graphs are planned to serve as intermediate representations of, e.g., two bird species. While the first steps considered only a single combined scene graph per image collection, the next step is to generate such representations for a large number of collections (i.e., different species) and to compare the resulting graphs using graph theory.
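To make the knowledge-enrichment step concrete, the following minimal sketch lifts detected labels to more general concepts via WordNet hypernyms through NLTK. WordNet is used here only as a stand-in assumption; the report does not commit to this particular external knowledge source.

# Minimal sketch of label generalization with external knowledge,
# using WordNet via NLTK as a stand-in knowledge base (an assumption;
# the actual knowledge source may differ).
# Setup: pip install nltk; then run nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def generalize(label, depth=1):
    """Replace a label by a more general WordNet concept, if any."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    if not synsets:
        return label  # label unknown to the knowledge base: keep as-is
    synset = synsets[0]
    for _ in range(depth):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return synset.lemmas()[0].name()

# Labels detected in separate images can be lifted to a shared,
# more general concept before the per-image scene graphs are merged.
for label in ("sparrow", "finch"):
    print(label, "->", generalize(label, depth=2))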
Strategy for Future Research Activity
The first-year progress moved toward generating an intermediate scene graph representation for a set of images, e.g., using many images of the same bird species to consolidate their complementary details into a combined scene graph. Such a representation contains all visual details of a species, including those which cannot be obtained from a single image. As the current pipeline builds on off-the-shelf object detection models, the amount of fine-grained detail captured is still limited. The next step is therefore to increase the amount of fine-grained detail incorporated into the scene graph representation. For this, data augmentation and diffusion models will be used to generate a larger corpus of fine-grained training data. Contrastive learning techniques will then be used to train the detection of fine-grained details, such as wing types, colors, and head shapes. Combining this with the previous work, the scene graphs can be enriched with more fine-grained visual details. Lastly, the resulting scene graph representations will be compared using graph theory to extract the most differentiating details of each species, and a textual description will be generated that helps humans distinguish pairs of species.
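As a sketch of the planned graph-theoretic comparison, the combined scene graphs of two species can again be reduced to triple sets, whose set differences expose candidate differentiating details; a simple template then turns them into a contrastive description. Species names, triples, and the templating are hypothetical placeholders for the planned, richer comparison.

# Minimal sketch of the planned comparison: set differences between
# two species' combined scene graphs (as triple sets) expose the
# differentiating details; all names and triples are hypothetical.

def differentiating_details(graph_a, graph_b):
    """Return the triples unique to each species' combined graph."""
    return graph_a - graph_b, graph_b - graph_a

species_a = {("bird", "has", "red crown"), ("bird", "has", "short beak")}
species_b = {("bird", "has", "black crown"), ("bird", "has", "short beak")}

only_a, only_b = differentiating_details(species_a, species_b)
for subject, relation, detail in sorted(only_a):
    print(f"Species A {relation} a {detail}, unlike species B.")
for subject, relation, detail in sorted(only_b):
    print(f"Species B {relation} a {detail}, unlike species A.")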
Causes of Carryover
Travel to CVPR 2023 and ICCV 2023 was planned, but due to high travel costs after the COVID-19 pandemic and a shift of the publication plans toward the IEEE Access journal, international travel was postponed. The funding will instead be used in FY2024 for a trip to a related top international conference such as CVPR 2024 or ACM MM 2024. Furthermore, a crowd-sourced user study to evaluate the work was planned; in 2023, however, the research proceeded using existing and augmented datasets, so no human annotation was needed at this stage. As the research planned for FY2024 includes a user study, part of the remaining funding is reserved for this purpose. The rest will be used for the necessary computing equipment, such as GPU servers for deep learning training.