2022 Fiscal Year Research-status Report
Creation and Evaluation of Tacit Knowledge Based on Semantic Primes
Project/Area Number | 22K12160 |
Research Institution | Hokkaido University |
Principal Investigator | RZEPKA Rafal, Hokkaido University, Faculty of Information Science and Technology, Assistant Professor (80396316) |
Project Period (FY) | 2022-04-01 – 2025-03-31 |
Keywords | Semantic Primes / Dataset construction / Tacit knowledge / Perception |
Outline of Annual Research Achievements |
During the first year I concentrated on developing the first set of tacit knowledge for further experiments. I used the limited number of semantic primes to address the question "which types of cognitive functionality should an agent possess before exploring the environment and understanding or learning about the world?". Inspired by the exponents grouped into related categories of semantic primitives and by their translations into Japanese, I created a list of 23 perception-related prompt types. After a series of preliminary experiments and annotator tests, I simplified, aggregated and specified some of the prompts to make the annotation shorter and easier. To prepare the final dataset I used a set of Japanese sentences from the previous project. Designed for detecting changes in danger level across slightly different contexts, it consists of short sentence pairs such as "child eats a soap" and "child eats a soap-shaped candy". I created a program that generates prompts querying the agents, patients and acts of a sentence in the context of semantic primes, and hired 66 annotators to choose answers for each sentence. The final gold-standard set for future experiments consists of 62,687 annotated sentence-prompt-choice triples and is described in detail in an international conference paper that is currently under review. Experimental results show that although the agreement between annotators was high, classic language models such as BERT and RoBERTa performed poorly on the task of answering cognitive, perception-related questions based on semantic primes.
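As a rough illustration of the sentence-prompt-choice triples described above, the following Python sketch groups annotator choices per sentence and prompt, takes a majority vote, and reports a simple agreement ratio. The field names, the example prompt, and the agreement measure are assumptions made for illustration; the report does not specify the actual data format or the agreement metric that was used.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotated triple (hypothetical field names)."""
    sentence: str
    prompt: str
    choice: str


def aggregate(annotations):
    """Return {(sentence, prompt): (majority_choice, agreement_ratio)}."""
    grouped = defaultdict(list)
    for a in annotations:
        grouped[(a.sentence, a.prompt)].append(a.choice)

    summary = {}
    for key, choices in grouped.items():
        # most_common(1) yields the most frequent choice and its vote count
        label, votes = Counter(choices).most_common(1)[0]
        summary[key] = (label, votes / len(choices))
    return summary


# Invented example: three annotators answer one hypothetical prompt.
PROMPT = "Is the patient visible to the agent?"
annotations = [
    Annotation("child eats a soap", PROMPT, "visible"),
    Annotation("child eats a soap", PROMPT, "visible"),
    Annotation("child eats a soap", PROMPT, "not visible"),
]
summary = aggregate(annotations)
```

A chance-corrected measure could replace the plain ratio in the same place if a stricter notion of agreement is needed.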
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
The progress is rather smooth, but there were time-consuming obstacles during the preparation of the first dataset. The main difficulty was to find a set of queries that covers as many of the existing semantic prime categories as possible while keeping the queries simple for annotators. Most of the queries have two answer options, for example "visible" / "not visible", but some categories involve more than two options; for example, frequency in the Time category has multiple choices (happen 0 times, 1 time, 2 times / several times / a lot of times). Because of the large number of required sentences, it was unrealistic to prepare them by hand. For that reason I decided to create a sentence generator that produces context sentences and prompts asking about particular parts of a sentence. To increase the variation of the tacit knowledge to be examined, I also added tenses to the sentences: past tense, continuous present, and willingness, which shows what an agent or patient wants to do but has not yet done. By doing so, I also increased the difficulty for language models, which often lose track of time. By choosing base sentences that represent only slight changes of context, I managed to come up with quite a hard benchmark for testing language models. There was a risk of delay in the automatic generation process, but the structure of the Japanese language helped to finish the main task on time. Unfortunately, the data creation process took more time than expected, and the paper describing it was submitted at the end of the first year, so no paper has been accepted yet.
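The generator idea described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the program used in the project: the original generator worked on Japanese sentences, whereas the templates, query wording, and names such as TENSE_TEMPLATES, QUERIES, and generate below are hypothetical English stand-ins. Only the three tense variants and the answer options are taken from the description above.

```python
# Hypothetical English templates for the three variants named in the report.
TENSE_TEMPLATES = {
    "past": "{agent} {past} {patient}",
    "present_continuous": "{agent} is {gerund} {patient}",
    "willingness": "{agent} wants to {base} {patient}",
}

# Hypothetical queries: (question template, roles it applies to, answer options).
QUERIES = {
    "visibility": (
        "Is {target} visible?",
        ["agent", "patient"],
        ["visible", "not visible"],
    ),
    "frequency": (
        "How many times does {target} happen?",
        ["act"],
        ["0 times", "1 time", "2 times", "several times", "a lot of times"],
    ),
}


def generate(base):
    """Build context sentences and prompts about the agent, patient and act."""
    targets = {
        "agent": base["agent"],
        "patient": base["patient"],
        "act": base["gerund"],
    }
    items = []
    for tense, template in TENSE_TEMPLATES.items():
        sentence = template.format(**base).capitalize() + "."
        for category, (question, roles, options) in QUERIES.items():
            for role in roles:
                items.append({
                    "tense": tense,
                    "sentence": sentence,
                    "category": category,
                    "role": role,
                    "prompt": question.format(target=targets[role]),
                    "options": options,
                })
    return items


# Base sentence adapted from the example pair mentioned in the report.
base = {"agent": "a child", "patient": "a soap",
        "base": "eat", "past": "ate", "gerund": "eating"}
items = generate(base)
```

Each base sentence here yields three tense variants with three prompts each, which shows how a small set of base sentences and query types multiplies into a large annotation set.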
|
Strategy for Future Research Activity |
The two main keywords for the second year are "stories" and "heuristics". The newly created dataset will be tested to answer the research question "how useful is the tacit knowledge based on semantic primes in understanding not only sentences but also stories?". I plan to use a short story dataset from the previous project and extend it automatically (first step) and manually with human annotators (second step). As current language models abstract human knowledge into obvious patterns, I plan to ask the annotators to be creative and come up with more inconspicuous changes to the existing stories. Besides collecting these heuristics, I plan to confirm whether the prompts regarding tacit knowledge created during the first year can be used to semantically follow the applied changes. If this is confirmed, I will examine how semantic primitives could play a role in learning, reasoning and memorizing knowledge in a life-long learning agent. One thing I discovered during the first year is that abstract thinking and analogy making with semantic primes could be made easier by using so-called molecules alongside the categories. Molecules are basic concepts grouped in categories such as Physical, Environmental or Body parts. The challenge will be to find a method of incorporating them into the prompts used for acquiring heuristics from annotators. This endeavor could bring new ideas on how we use basic perception-related knowledge of the world and on how new paradigms for creating artificial agents could lead to more human-like intelligence, and it may result in further publications.
|
Causes of Carryover |
As the paper describing the first-year achievements was submitted at the end of the fiscal year, the planned travel expenses were not incurred. In addition, by finding a way to generate prompts for annotation automatically, I saved part of the funds, which I plan to use for increasing the volume of data in the following year.
|