2018 Fiscal Year Annual Research Report

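良質な用例を大規模なコーパスから自動的に抽出できるモデルの構築および試作版の開発 (Construction of a model for automatically extracting good example sentences from a large-scale corpus, and development of a prototype)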

Research Project

Project/Area Number	18F18808
Research Institution	National Institute for Japanese Language and Linguistics
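Principal Investigator	PARDESHI P.V. National Institutes for the Humanities, National Institute for Japanese Language and Linguistics, Theory & Typology Division, Professor (00374984)
Co-Investigator (Kenkyū-buntansha)	HMELJAK MARIJA National Institutes for the Humanities, National Institute for Japanese Language and Linguistics, Theory & Typology Division, International Research Fellow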
Project Period (FY)	2018-11-09 – 2020-03-31
Keywords	example sentences / learners' dictionary / lexicography
Outline of Annual Research Achievements	As the project title suggests, the aim of this project is to develop a model for selecting pedagogically valid Japanese example sentences from a general corpus. To build such a filter, we began investigating automatically measurable criteria of readability, typicality and informativity. We collected example sentences from learners' dictionaries, reference works and graded readers, and are in the process of constructing a graded corpus of example sentences. This corpus will serve as a data set for verifying the usability of existing readability formulas on single sentences or short usage examples for learners of Japanese as a foreign language.
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: We collected example sentences from learners' dictionaries, reference works and graded readers, and are in the process of constructing a graded corpus of example sentences, to be used as a data set for verifying the usability of existing readability formulas on single sentences or short usage examples for learners of Japanese as a foreign language.
Strategy for Future Research Activity	Firstly, we plan to verify the usability of existing readability formulas on the graded corpus of example sentences. Secondly, using the same data set, we plan to develop a methodology to assess the typicality of example candidates by comparing their syntactic and collocational patterns to those found in NINJAL-LWP for BCCWJ. Thirdly, to assess the informativity of example sentence candidates, we plan to collect examples of different lengths from other corpora, annotate the informativity level of each example sentence, investigate measurable criteria (including the presence of typical syntactic patterns and their elements, and the proportion of proper nouns, pronouns, etc.), and produce a statistical model for the assessment of informativity. Finally, we plan to implement this model in an openly accessible online example search system.
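As an illustration of what "measurable criteria" such as the proportion of proper nouns and pronouns could look like in practice, the following is a minimal sketch and not the project's actual implementation. It assumes each candidate sentence has already been tokenized and part-of-speech tagged by some external tool. The names `extract_features` and `SentenceFeatures`, and the use of Universal Dependencies-style tags (`PROPN`, `PRON`), are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A token is a (surface form, POS tag) pair. The tag set is an assumption:
# Universal Dependencies-style labels such as "PROPN" and "PRON".
Token = Tuple[str, str]


@dataclass
class SentenceFeatures:
    length: int               # number of tokens in the sentence
    proper_noun_ratio: float  # share of tokens tagged as proper nouns
    pronoun_ratio: float      # share of tokens tagged as pronouns


def extract_features(tokens: List[Token]) -> SentenceFeatures:
    """Compute simple surface features for one pre-tagged sentence (hypothetical helper)."""
    n = len(tokens)
    if n == 0:
        # Avoid division by zero for empty input.
        return SentenceFeatures(0, 0.0, 0.0)
    proper_nouns = sum(1 for _, pos in tokens if pos == "PROPN")
    pronouns = sum(1 for _, pos in tokens if pos == "PRON")
    return SentenceFeatures(n, proper_nouns / n, pronouns / n)


# Hand-tagged illustrative sentence: 彼は東京に行った ("He went to Tokyo").
example = [
    ("彼", "PRON"), ("は", "ADP"), ("東京", "PROPN"),
    ("に", "ADP"), ("行っ", "VERB"), ("た", "AUX"),
]
features = extract_features(example)
```

Features of this kind could then be used as inputs to a statistical model trained on the annotated informativity levels; which features actually prove useful is the empirical question the plan above sets out to investigate.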