Project/Area Number |
17K00295
|
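Research Institution | Hokkaido University |
Principal Investigator |
RZEPKA Rafal Hokkaido University, Faculty of Information Science and Technology, Assistant Professor (80396316)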
|
Project Period (FY) |
2017-04-01 – 2022-03-31
|
Keywords | common sense / causal relations / story generation / story understanding / crowdsourcing / context understanding |
Outline of Research Achievements |
The final year of the project was meant to complete my preparations for crowdsourcing and to extend the dataset of commonsense-related stories in Japanese. Because of the COVID-19 outbreak I had to change my plans, and I decided to use existing English datasets (ROCStories, ATOMIC, GLUCOSE, etc.) and translate them automatically into other languages. To address the problem of imperfect translations, I tried to develop novel methods that use paraphrases in automatic post-editing to obtain more natural-sounding knowledge, but the results are not yet satisfactory. As this grant is funded by Japanese taxpayers, I did not want to develop story-related context datasets for English any further, so I decided to build smaller datasets from scratch to serve as seeds for further expansion. I chose two topics in which context, or a change in context, is crucial: moral judgement and figurative speech. I have created both datasets and extended the grant so that I can prepare them for expansion and validity testing via crowdsourcing. The first dataset consists of acts (verbs) inspired by related research on detecting online aggression, together with the subjects (actors) and objects (including patients) most often associated with those acts (Wikipedia embeddings were used to obtain representative words). Eighteen annotators created almost 4,500 sentences by filling gaps in automatically generated templates. After duplicates were removed, 3,165 samples with danger-level labels remained. The second dataset, which covers figurative and non-figurative sentences, contains over 2,500 manually annotated sentences.
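The template-and-deduplication step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the slot values, the gap marker, the normalization rule, and the numeric danger-level labels are all assumptions made for illustration, and English placeholders stand in for the real vocabulary.

```python
from itertools import product

# Hypothetical slot values; the real dataset's vocabulary is not reproduced here.
ACTORS = ["a child", "a robot"]
ACTS = ["pushes", "hugs"]
OBJECTS = ["a stranger", "a dog"]


def make_templates(actors, acts, objects):
    """Build one gapped template per (actor, act, object) combination.

    The "___" marker is an assumed placeholder for the part an annotator fills in.
    """
    return [f"{a} {v} {o} ___." for a, v, o in product(actors, acts, objects)]


def normalize(sentence):
    """Case-fold and collapse whitespace so trivially different copies match."""
    return " ".join(sentence.lower().split())


def deduplicate(samples):
    """Keep the first occurrence of each normalized sentence, preserving order.

    `samples` is a list of (sentence, danger_level) pairs; the label scale is assumed.
    """
    seen = set()
    unique = []
    for sentence, label in samples:
        key = normalize(sentence)
        if key not in seen:
            seen.add(key)
            unique.append((sentence, label))
    return unique
```

Here `make_templates` enumerates every slot combination, annotators would complete the gap by hand, and `deduplicate` drops completed sentences that differ only in case or spacing. A stricter or looser notion of "duplicate" would change how many samples survive.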
|
Current Status of Research Progress (Category) |
3: Slightly delayed
Reason
This project aims to provide a novel narrative dataset that will be publicly available. During the first years I approached the topic of context from many angles and touched on many layers of meaning, from tacit knowledge that can be discovered automatically from corpora to nuanced changes of meaning caused, for example, by the use of emoticons. With my students I have worked with English, Japanese and Chinese to observe differences in causal knowledge retrieval and processing across languages and cultures. Unfortunately, because of ongoing problems with international collaboration, I decided to concentrate on Japanese. As mentioned above, I have created and validated two seed databases of different character: one contains a set of acts with different actors, objects and places, and the other contains figurative and non-figurative sentences. The first was built to show how small contextual changes influence a human evaluator's opinion; the second is meant to help distinguish descriptions of the physical world from abstract expressions. Both datasets were validated with simple machine learning methods, but no publishable results have been obtained yet. The last year was supposed to be crucial for bringing together the experience of the last three years and building the final dataset, but I was forced to apply for an extension and will deliver the planned output in fiscal year 2021.
|
Strategy for Future Research Activity |
The goal for the last year of the project is to combine the building blocks from the previous years and create an expandable dataset that contains causal relations in the form of longer narratives. Because of the tacit knowledge problem (what is obvious is usually not stated in text), I will employ crowdsourcing on a bigger scale to add what I call “perception channels” to the mini-stories. Very fine-grained information on contextual details, such as descriptions of color or sound, has never been added to knowledge graphs in the shape of mini-stories. Ontologies like ConceptNet or COMET generalize human knowledge from short recollections of annotators and are not able to infer the subtle changes in context that influence the narrative flow. If I am not able to solve the unnatural translation problem with post-editing, and my two above-mentioned datasets are not sufficient as seeds for story creation by annotators, I will also consider using crowdsourcing for manual post-editing of the translated English datasets.
|
Reason for Carryover to the Next Fiscal Year |
The plan for the last year had to be changed because of the COVID-19 situation. Most of the costs will be shifted from international in-person collaboration to crowdsourcing the data in Japan. I also plan to use part of the remaining funds to develop a more sophisticated (semi-automatic) system for generating story-like knowledge: a web-based solution that would allow newly generated knowledge to be automatically redirected to another annotator. This would eliminate the need for physical interaction, as the vaccination situation in Japan might remain uncertain.
|