2022 Fiscal Year Research-status Report
Vision and language cross-modal for training conditional GANs with long-tail data.
Project/Area Number | 22K17947 |
Research Institution | The University of Tokyo |
Principal Investigator | Vo Minh Duc, The University of Tokyo, Graduate School of Information Science and Technology, Project Assistant Professor (40939906) |
Project Period (FY) | 2022-04-01 – 2024-03-31 |
Keywords | Vision and language / Novel object captioning / GANs / External knowledge / Story evaluation / Dataset |
Outline of Annual Research Achievements |
We study the cross-modality between the vision and language spaces. We obtained three achievements:
1. We collected a set of object names and definitions from the open dictionary Wiktionary and embedded the definitions with a pre-trained BERT model (a sketch of this step follows this list). Incorporating this external knowledge into an image captioning model outperformed other methods on the novel object captioning task. This work was published at CVPR 2022.
2. We proposed a new training scheme for GANs that uses both flipped and non-flipped non-saturating losses (a hedged sketch also follows this list). This work was published in the IEEE Access journal (IF 3.476).
3. We created a new dataset for story evaluation, consisting of 100k story-ranking records and 46k aspect ratings with reasoning, collected from the Reddit website and through a crowd-sourced annotation process. This work was published at EMNLP 2022.
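The definition-embedding step in achievement 1 can be illustrated with a minimal sketch using the Hugging Face transformers library. The checkpoint name (bert-base-uncased), the [CLS]-token pooling, and the example entry are assumptions for illustration; the published method may differ in these details.

```python
# Minimal sketch: embedding a Wiktionary definition with pre-trained
# BERT (Hugging Face transformers). The checkpoint and [CLS] pooling
# are assumptions; the published model may differ.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def embed_definition(definition: str) -> torch.Tensor:
    """Return a fixed-size embedding for one object definition."""
    inputs = tokenizer(definition, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's hidden state as the sentence embedding.
    return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)

# Example (name, definition) pair as collected from Wiktionary.
vec = embed_definition("okapi: a ruminant mammal related to the giraffe")
print(vec.shape)  # torch.Size([1, 768])
```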
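For achievement 2, this report does not spell out the exact formulation, so the following is only a hedged sketch of one plausible reading: a non-saturating GAN objective in which the discriminator's real/fake targets are randomly flipped with some probability. The function gan_losses and the hyperparameter p_flip are hypothetical names introduced here, not taken from the paper.

```python
# Hedged sketch of flipped + non-flipped non-saturating GAN losses.
# The exact scheme is not given in this report; flipping the
# discriminator's real/fake targets with probability `p_flip` is an
# illustrative assumption, as are all names below.
import torch
import torch.nn.functional as F

def gan_losses(d_real_logits: torch.Tensor,
               d_fake_logits: torch.Tensor,
               p_flip: float = 0.1):
    """Return (discriminator loss, generator loss).

    In a real training loop, `d_fake_logits` for the discriminator
    update would come from detached generator samples, G(z).detach().
    """
    real_targets = torch.ones_like(d_real_logits)
    fake_targets = torch.zeros_like(d_fake_logits)

    # Randomly flip a subset of the discriminator's targets.
    real_targets[torch.rand_like(real_targets) < p_flip] = 0.0
    fake_targets[torch.rand_like(fake_targets) < p_flip] = 1.0

    d_loss = (F.binary_cross_entropy_with_logits(d_real_logits, real_targets)
              + F.binary_cross_entropy_with_logits(d_fake_logits, fake_targets))

    # Non-flipped non-saturating generator loss: maximize log D(G(z)).
    g_loss = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    return d_loss, g_loss

# Example with dummy discriminator logits.
d_loss, g_loss = gan_losses(torch.randn(8, 1), torch.randn(8, 1))
print(d_loss.item(), g_loss.item())
```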
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
The research has been progressing according to the initial plan without any major obstacles.
Strategy for Future Research Activity |
In our ongoing research, we will delve further into the direction we have pursued so far. Our focus this year is on two objectives:
1. We aim to broaden the scope of our dataset by gathering not only the definitions of objects but also their appearances, drawn from both real and synthesized images (one possible collection step is sketched after this list). This will give us a more comprehensive understanding of the cross-modality between the vision and language spaces we are studying.
2. We plan to use the new dataset to train an image generation model as well as other vision-and-language models. By incorporating this new data, we hope to improve the accuracy and robustness of our models.
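As one possible realization of objective 1, the sketch below synthesizes candidate appearance images for an object from its name and definition with a public text-to-image model via the diffusers library. The checkpoint, the prompt template, and the helper synthesize_appearances are illustrative assumptions; this report does not commit to any specific generator.

```python
# Illustrative sketch only: synthesizing "appearance" images for an
# object from its name and definition with a public text-to-image
# model. Checkpoint, prompt template, and GPU use are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU

def synthesize_appearances(name: str, definition: str, n: int = 4):
    """Generate n synthetic images depicting the named object."""
    prompt = f"a photo of a {name}, {definition}"
    return pipe(prompt, num_images_per_prompt=n).images

images = synthesize_appearances(
    "okapi", "a ruminant mammal related to the giraffe")
images[0].save("okapi_0.png")
```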
Research Products (5 results)