Research Project/Area Number | 23K16945 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 61030: Intelligent informatics-related |
Research Institution | Kyoto University |
Principal Investigator | |
Project Period (FY) | 2023-04-01 – 2025-03-31 |
Project Status | Granted (FY2023) |
Budget Amount *Note | 4,680 thousand yen (Direct Cost: 3,600 thousand yen, Indirect Cost: 1,080 thousand yen)
FY2024: 1,820 thousand yen (Direct Cost: 1,400 thousand yen, Indirect Cost: 420 thousand yen)
FY2023: 2,860 thousand yen (Direct Cost: 2,200 thousand yen, Indirect Cost: 660 thousand yen) |
Keywords | fine-grained vision / image captioning / vision and language / summarization / Computer Vision / Vision & Language / Fine-grained Data / Discrimination Learning |
Outline of Research at the Start |
Can a machine explain fine visual differences between two birds to a human? In this research, I develop a system to analyze fine-grained classes using visual discrimination learning. The main goal is to summarize the visual characteristics specific to a class and generate a standalone textual description.
|
Outline of Annual Research Achievements |
The initial motivation of the research project was to analyze and summarize visual differences between groups of images, e.g., images of two different bird species, and to describe all characterizing visual differences as a human would. To build such a model, I first investigated techniques for summarizing an image collection into a combined representation, i.e., generating a text that describes the contents shared across all images. With this, a collection of images of the same bird species can be summarized to highlight its distinguishing features. Concretely, I investigated two approaches that extract the contents of related images as scene graph representations and then combine them using graph theory. The combined information is further enriched with external knowledge, which allows for better generalization across multiple images. The resulting combined scene graphs form an intermediate representation of a concept, e.g., a bird species, and contain complementary details detected across all images. They can be used for text generation or for further graph-based analysis. The results of the first year have been published as two IEEE Access journal papers and discussed at domestic symposiums.
|
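As a rough, hypothetical illustration of the combination step described above (not the method from the published papers), the Python sketch below merges per-image scene graphs represented as (subject, relation, object) triples and keeps only the triples supported by enough of the images; the names `combine_scene_graphs` and `min_support` are illustrative only.

```python
# Illustrative sketch only (not the published method): combining per-image
# scene graphs, represented as (subject, relation, object) triples, into one
# graph that keeps the contents shared across an image collection.
from collections import Counter
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def combine_scene_graphs(graphs: Iterable[Set[Triple]],
                         min_support: float = 1.0) -> Set[Triple]:
    """Keep triples that occur in at least `min_support` (fraction) of the graphs."""
    graph_list = list(graphs)
    counts = Counter(t for g in graph_list for t in g)
    threshold = min_support * len(graph_list)
    return {t for t, c in counts.items() if c >= threshold}

# Toy usage: two images of the same species seen in different situations.
img1 = {("bird", "has", "red wing"), ("bird", "perched on", "branch")}
img2 = {("bird", "has", "red wing"), ("bird", "has", "short beak")}
print(combine_scene_graphs([img1, img2]))  # {('bird', 'has', 'red wing')}
```

Lowering `min_support` would tolerate missed detections in individual images, at the cost of keeping less reliable triples.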
Current Status of Research Progress |
2: Progressing rather smoothly
Reason
To build the target model, it was first necessary to analyze groups of images and construct a comparable intermediate representation that combines the visual details detected across all images. Such an approach can complement and merge visual details that appear only in some of the images, e.g., when the same bird species is seen in different environments, at different times of day, or from different angles. In the first work (published in IEEE Access, vol. 11, 2023), I built a model that generates a combined text representation from multiple related images. The approach uses intermediate scene graph representations and combines them to extract all shared components; finally, a short text summary of the objects shared across all images is generated. In the second work (published in IEEE Access, vol. 12), I improved the generalization of cross-image relationships by incorporating additional external knowledge. This generalizes concepts detected in separate images and thus yields a more complete scene graph representation. The resulting scene graphs are planned to be used as intermediate representations of, e.g., two species of birds. While the first steps produced only a single combined scene graph per image collection, the next step is to generate such representations for a large number of collections (i.e., different species) and apply graph theory to compare the graphs.
|
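As a hypothetical sketch of the external-knowledge idea from the second work, the snippet below uses WordNet to map object labels detected in separate images onto a shared, more general concept. WordNet is only a stand-in knowledge source here, and `generalize_labels` is a made-up name; the published approach may use a different resource.

```python
# Hypothetical sketch of the external-knowledge step: mapping object labels
# detected in separate images to a shared, more general concept. WordNet is
# used as a stand-in knowledge source; `generalize_labels` is illustrative.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def generalize_labels(label_a: str, label_b: str) -> str:
    """Return a common hypernym for two detected labels, or label_a as fallback."""
    syns_a = wn.synsets(label_a, pos=wn.NOUN)
    syns_b = wn.synsets(label_b, pos=wn.NOUN)
    if not syns_a or not syns_b:
        return label_a
    common = syns_a[0].lowest_common_hypernyms(syns_b[0])
    return common[0].lemma_names()[0] if common else label_a

# e.g., "sparrow" detected in one image and "finch" in another can be merged
# under a shared ancestor concept, so the combined graph links both relations.
print(generalize_labels("sparrow", "finch"))
```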
Strategy for Future Research Activity |
Progress in the first year moved toward generating an intermediate scene graph representation for a set of images, e.g., using many images of the same bird species to consolidate their knowledge into a combined scene graph. Such a representation contains the visual details of a species, including those that cannot be obtained from a single image. Because the current pipeline relies on off-the-shelf object detection models, the number of fine-grained details detected is still limited. The next step is therefore to increase the amount of fine-grained detail incorporated into the scene graph representation. To this end, data augmentation with diffusion models will be used to generate a larger corpus of fine-grained training data, and contrastive learning techniques will then be used to train the detection of fine-grained details such as wing types, colors, and head shapes. Combined with the previous work, the scene graphs can thus be enriched with more fine-grained visual details. Lastly, the resulting scene graph representations will be compared using graph theory to extract the most differentiating details of each species, and a textual description that helps humans distinguish pairs of species will be generated.
|
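As one possible instantiation of the planned contrastive training step (an assumption, not the project's settled design), the sketch below shows a generic InfoNCE-style loss over pairs of augmented views, such as those that the planned diffusion-based augmentation could produce.

```python
# Hedged sketch: a generic InfoNCE-style contrastive loss as one possible way
# to train fine-grained detail recognition (wing type, head shape, ...) from
# pairs of augmented views. The encoder, the diffusion-based augmentation, and
# the exact training setup are assumptions, not the project's final design.
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: (N, D) embeddings of two augmented views of the same N images."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # The diagonal entries (matching views) are positives; all others negatives.
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```

In an actual training setup, the positive and negative pairs would be formed from the fine-grained attribute annotations or augmented views of the same instance rather than random tensors.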