Research Project/Area Number | 22K17947
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | The University of Tokyo
Principal Investigator | Vo Minh Duc, The University of Tokyo, Graduate School of Information Science and Technology, Project Assistant Professor (40939906)
Project Period (FY) | 2022-04-01 – 2024-03-31
Project Status | Granted (FY2022)
Budget Amount *Note | 2,600 thousand yen (Direct cost: 2,000 thousand yen; Indirect cost: 600 thousand yen)
  FY2023: 1,170 thousand yen (Direct cost: 900 thousand yen; Indirect cost: 270 thousand yen)
  FY2022: 1,430 thousand yen (Direct cost: 1,100 thousand yen; Indirect cost: 330 thousand yen)
Keywords | Vision and language / Novel object captioning / GANs / External knowledge / Story evaluation / Dataset / Conditional GANs / Long-tail data
Outline of Research at the Start | 1) Creating a dataset for our study, because existing datasets are insufficient. 2) Constructing a vision-language cross-modal representation by learning cross-modal similarity. 3) Learning data augmentation using the vision-language cross-modal representation. 4) Incorporating the vision-language cross-modal representation into conditional GANs.
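For illustration only, step 2 (learning cross-modal similarity) can be sketched as a symmetric contrastive objective over paired image and text embeddings. The function below is a minimal PyTorch sketch under that assumption; the encoders, batch construction, and temperature value are placeholders, not the project's actual design.

import torch
import torch.nn.functional as F

def contrastive_similarity_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings from unspecified
    # vision and language encoders; rows with the same index are matching pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine-similarity matrix; diagonal entries correspond to the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2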
Outline of Annual Research Achievements | We studied the cross-modality between the vision and language spaces and obtained three achievements:
1. We collected a set of object names and definitions from the open dictionary Wiktionary and used a pre-trained BERT model to embed the definitions. Incorporating this external knowledge into an image captioning model outperformed other methods on the novel object captioning task. This work was published at CVPR 2022.
2. We proposed a new training scheme for GANs that combines flipped and non-flipped non-saturating losses. This work was published in the IEEE Access journal (IF 3.476).
3. We created a new dataset for story evaluation, consisting of 100k story-ranking examples and 46k aspect ratings with reasoning, collected from the Reddit website and annotated through crowd-sourcing. This work was published at EMNLP 2022.
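As a rough sketch of how object definitions can be embedded with a pre-trained BERT model (achievement 1 above), the snippet below uses the Hugging Face transformers library with mean pooling; the model choice, pooling strategy, and example definition are illustrative assumptions, not the exact pipeline of the CVPR 2022 paper.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def embed_definition(definition):
    # Encode one dictionary definition into a fixed-size vector.
    inputs = tokenizer(definition, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings over the attention mask (assumed pooling choice).
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

# Hypothetical Wiktionary-style definition used only as an example.
vector = embed_definition("A neotenic salamander native to lakes near Mexico City.")
print(vector.shape)  # torch.Size([1, 768])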
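For achievement 2, one simple way to express a non-saturating generator loss with and without flipped targets is shown below; the mixing weight and the exact combination are assumptions for illustration and may differ from the scheme in the IEEE Access paper.

import torch
import torch.nn.functional as F

def non_saturating_g_loss(d_fake_logits, flip=False):
    # Standard non-saturating loss pushes D(G(z)) toward the "real" label;
    # flip=True evaluates the same loss against flipped (fake) targets.
    targets = torch.zeros_like(d_fake_logits) if flip else torch.ones_like(d_fake_logits)
    return F.binary_cross_entropy_with_logits(d_fake_logits, targets)

def combined_g_loss(d_fake_logits, alpha=0.5):
    # alpha is a hypothetical weight balancing the flipped and non-flipped terms.
    return ((1 - alpha) * non_saturating_g_loss(d_fake_logits, flip=False)
            + alpha * non_saturating_g_loss(d_fake_logits, flip=True))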
Current Status of Research Progress | 2: Progressing generally smoothly
Reason | The research has been progressing according to the initial plan without any major obstacles.
Strategy for Future Research Activity | In our ongoing research, we will delve further into the direction we have pursued so far. Our focus this year is on the following objectives:
1. We aim to broaden the scope of our dataset by gathering not only the definitions of objects but also their appearances, taken from real and synthesized images. This will give us a more comprehensive understanding of the cross-modal relation between the vision and language spaces we are studying.
2. We plan to use the new dataset to train an image generation model as well as other vision-language models, as sketched below. By incorporating this new data, we expect to improve the accuracy and robustness of our models.
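To illustrate objective 2, a conditional generator can consume a knowledge embedding (for example, a definition vector concatenated with pooled appearance features) alongside the noise vector. The toy module below is a minimal sketch with placeholder dimensions and architecture, not a description of the planned model.

import torch
import torch.nn as nn

class KnowledgeConditionedGenerator(nn.Module):
    # Toy conditional generator: noise + knowledge embedding -> 32x32 image.
    def __init__(self, noise_dim=128, knowledge_dim=768, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + knowledge_dim, 4 * 4 * 256),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, noise, knowledge):
        # Concatenate noise with the knowledge vector as the conditioning input.
        return self.net(torch.cat([noise, knowledge], dim=1))

generator = KnowledgeConditionedGenerator()
images = generator(torch.randn(8, 128), torch.randn(8, 768))
print(images.shape)  # torch.Size([8, 3, 32, 32])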