2020 Fiscal Year Research-status Report

Development of Interactive Advisory System for Security Export Control

Research Project

Project/Area Number	20K12556
Research Institution	Hokkaido University
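Principal Investigator	大林明彦, Hokkaido University, Institute for the Promotion of Business-Regional Collaboration, Professor (80798124)
Co-Investigator(Kenkyū-buntansha)	RZEPKA Rafal, Hokkaido University, Faculty of Information Science and Technology, Assistant Professor (80396316)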
Project Period (FY)	2020-04-01 – 2023-03-31
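Keywords	Artificial intelligence / Security export control / Ontology / Classification of controlled items (該非判定) / Expert system / Dialogue system / Text classification / Question answering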
Outline of Annual Research Achievements	During the first year of the project we built a prototype of our conversational expert system, which supports non-experts in deciding whether their research or goods for export require governmental permission. Using grant money, we purchased a GPU machine to enable deep learning experiments and outsourced some programming tasks. We prepared and evaluated a dataset of questions and answers for machine learning experiments. We also started building a seed ontology for the planned large knowledge graph, which will contain both the controlled Goods/Technologies Matrix Table and external knowledge such as Wikipedia or ConceptNet for future calculation of the degree of potential danger. Because both the QA data and the legal regulation texts are comparatively small, the results of machine learning experiments on classifying user input and Matrix Table categories are not yet satisfactory.
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: This project combines two approaches: the classic artificial intelligence (natural language processing) approach of expert systems, which require exact and trustworthy output, and the modern machine learning approach, which makes it possible to guess the intent of a user who asks the system about controlled goods. As the first step of the first approach, we used the Protege ontology editing tool to prepare the seed of a knowledge graph that is to be extended semi-automatically. As the first step of the machine learning approach, we implemented a basic research paper analysis tool that accepts a PDF file from the user inside the dialog system. We tested standard text classification methods so that the system can recognize one of the 15 basic topics of the Controlled Goods and Technology Matrix Table. The results are not yet good enough for a research paper, because most of the scientific works we experimented with are only loosely connected to prohibited items. In addition, the parsing algorithms for PDF files are far from perfect and need further improvement. Because our conversational dialog system is meant to accept any utterance, we implemented Sentence-BERT, a state-of-the-art model, to find Matrix Table passages related to the input sentence. This is useful when a user uses synonyms that are not in the Table. However, this approach is not perfect and cannot guarantee sufficient relevance, so we are currently experimenting with additional knowledge from the ontology, which contains article numbers and Wikipedia synonyms or CAS numbers for chemical substances.
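The retrieval step described above, in which an input sentence is matched to Matrix Table passages and ontology synonyms help when the user's wording is absent from the Table, can be illustrated with a minimal sketch. It is not the project's implementation. A toy bag-of-words vector stands in for the Sentence-BERT embedding so that the example runs with the Python standard library alone, and all passage identifiers, passage texts, the CAS-like number, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_embed(text):
    """Toy embedding: lowercase token counts (a stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_query(query, synonyms):
    """Append the canonical term for every known alias found in the query."""
    lowered = query.lower()
    extra = [canon for alias, canon in synonyms.items() if alias.lower() in lowered]
    return " ".join([query] + extra)


def rank_passages(query, passages, synonyms=None, embed=bow_embed, similarity=cosine):
    """Return (score, passage_id) pairs, best match first."""
    q_vec = embed(expand_query(query, synonyms or {}))
    scored = [(similarity(q_vec, embed(text)), pid) for pid, text in passages.items()]
    return sorted(scored, reverse=True)


# Hypothetical data: not taken from the actual Matrix Table.
PASSAGES = {
    "item-a": "pumps designed for corrosive fluids",
    "item-b": "example-compound-x and related precursor chemicals",
    "item-c": "machine tools for grinding",
}
SYNONYMS = {"000-00-0": "example-compound-x"}  # hypothetical CAS-like alias

QUERY = "Can I ship 000-00-0 abroad?"
print(rank_passages(QUERY, PASSAGES))            # no lexical overlap: all scores 0
print(rank_passages(QUERY, PASSAGES, SYNONYMS))  # alias expansion surfaces item-b
```

In a setup closer to the one described, `embed` would be replaced by a dense sentence encoder and `similarity` by a matching vector similarity. The synonym expansion step is independent of that choice.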
Strategy for Future Research Activity	The next important step is to enlarge the concrete knowledge inside the ontology. We will experiment with the Owlready2 library for the Python programming language, which can not only combine knowledge graphs with retrieved textual resources but also perform reasoning within an ontology. Taking inspiration from large common sense graphs such as ConceptNet or ATOMIC, we are going to combine specialized knowledge from the Matrix Table with the common sense that humans usually use to interpret more difficult texts. Text classification on scientific papers was not successful, mostly because of the lack of problematic input documents, so we will concentrate on the question-and-answer dataset we prepared from government guidelines. Besides Q&A sections, these guidelines also contain explanations of legal issues, which can be added as input to machine learning algorithms. So far we have used BERT-based sentence similarity algorithms, but more and more Japanese language models are being published and we plan to test them. There are also large datasets in English, and we have already translated some of them automatically. Next we will experiment with such translated datasets (e.g. SQuAD, the Stanford Question Answering Dataset) to compare their effectiveness with that of Japanese models. We have also started experiments with GPT-2, which can generate moderately fluent replies when our system does not recognize any export control topic in the user's utterance.
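The planned combination of specialized Matrix Table knowledge with common sense relations can be sketched as transitive inference over a small triple store. This is only an illustration of the idea, written with the Python standard library. It does not use Owlready2, whose reasoning would replace the hand-written traversal, and the class name, relation names (`is_a`, `listed_in`), and all entries are hypothetical.

```python
from collections import defaultdict


class TinyGraph:
    """Minimal triple store: (subject, relation) -> set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        # .get avoids creating empty entries in the defaultdict on lookup.
        return set(self._edges.get((subject, relation), set()))

    def ancestors(self, term):
        """All classes reachable from term through transitive is_a links."""
        seen, stack = set(), [term]
        while stack:
            node = stack.pop()
            for parent in self.objects(node, "is_a"):
                if parent not in seen:  # also guards against cycles
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def controlled_entries(self, term):
        """Table entries listed for the term itself or any of its ancestors."""
        hits = set()
        for node in {term} | self.ancestors(term):
            hits |= self.objects(node, "listed_in")
        return hits


# Hypothetical knowledge: one common sense style link, one table style link.
graph = TinyGraph()
graph.add("example-alloy-x", "is_a", "example-metal")        # common sense link
graph.add("example-metal", "is_a", "material")               # common sense link
graph.add("example-metal", "listed_in", "hypothetical-entry-1")  # table link

print(graph.controlled_entries("example-alloy-x"))
print(graph.controlled_entries("material"))
```

The first query returns the hypothetical entry even though the alloy itself is never listed, because the entry is inherited through the `is_a` link. The second returns an empty set, since listing a subclass does not propagate upward to its parent class.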
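Causes of Carryover	Because of the COVID-19 pandemic, we refrained from activities outside the university, so no travel expenses were incurred. In the next fiscal year, we would like to attend KES2021 in Poland together with the co-investigator to present our research results so far and exchange opinions.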