Automatic collocation generation for English learners as a foreign language using document similarity analysis

Research Project

Project/Area Number	16K00489
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Research Field	Learning support system
Research Institution	Tsuda University
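Principal Investigator	Kishi Nobuko, Tsuda University, College of Liberal Arts, Professor (50245990)
Co-Investigator (Kenkyū-buntansha)	岸康人, Kanagawa University, Affiliated Research Institute, Researcher (50552999); 田近裕子, Tsuda University, College of Policy Studies, Professor (80188268); 久島智津子, Tsuda University, Institute for Research in Language and Culture, Researcher (80623876)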
Project Period (FY)	2016-04-01 – 2020-03-31
Project Status	Completed (Fiscal Year 2019)
Budget Amount	¥4,420,000 (Direct Cost: ¥3,400,000, Indirect Cost: ¥1,020,000); Fiscal Year 2018: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000); Fiscal Year 2017: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000); Fiscal Year 2016: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
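Keywords	English learning / document similarity / document classification / latent semantic analysis / automatic generation of teaching materials / machine learning / vocabulary learning / automatic creation of teaching materials / teaching material creation / teaching material generation / support for learning content development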
Outline of Final Research Achievements	This study uses three document similarity measures, latent semantic analysis, bag of words, and term frequency-inverse document frequency (TF-IDF), to generate English collocations for learners of English as a foreign language. In our previous study, we found that latent semantic analysis is more suitable for generating collocations for English for specific purposes. However, the generated collocations were not usable as real learning materials because the difficulty level of the collocations was not considered and the subject area was limited. In this study, we used more computational resources to increase the speed of calculation and the number of documents processed. Furthermore, we used two sets of documents, an easy set and a difficult set, to estimate the difficulty level of each collocation from the difference between its similarities to the two sets. We also added similarity measures based on shallow machine learning models such as word2vec.
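Academic Significance and Societal Importance of the Research Achievements	This research is part of a broader effort to automatically generate learning materials matched to the interests and proficiency of learners of English as a second language. For working adults and university students learning English, it is desirable to efficiently acquire the expressions actually used in their own work or field of specialization, but suitable materials (textbooks, books, videos, and so on) are very scarce. Meanwhile, the spread of Wikipedia and various kinds of open content has made English text easy to obtain. We therefore applied methods used in information retrieval, such as latent semantic analysis and frequency analysis, to automatically extract usage examples (英語の分離, literally "separations of English") from large-scale text data as raw material for teaching materials.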
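
The difficulty estimation described in the Outline of Final Research Achievements, comparing a collocation's similarity to an easy document set and to a difficult document set, can be illustrated with a minimal sketch. This is not the project's implementation: it uses only TF-IDF with cosine similarity (not latent semantic analysis or word2vec), and the function names, the toy documents, and the scoring rule (mean similarity to the difficult set minus mean similarity to the easy set) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms found in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_similarity(vec, others):
    """Average cosine similarity of vec to a list of vectors (0.0 if empty)."""
    return sum(cosine(vec, o) for o in others) / len(others) if others else 0.0


def difficulty_score(candidate, easy_docs, hard_docs):
    """Hypothetical score: positive means closer to the difficult set,
    negative means closer to the easy set."""
    docs = [tokenize(d) for d in easy_docs + hard_docs + [candidate]]
    vecs = tfidf_vectors(docs)
    cand = vecs[-1]
    easy = vecs[:len(easy_docs)]
    hard = vecs[len(easy_docs):-1]
    return mean_similarity(cand, hard) - mean_similarity(cand, easy)


# Toy document sets, invented for this example only.
EASY_DOCS = [
    "the cat sat on the mat",
    "i like to eat an apple every day",
]
HARD_DOCS = [
    "the court issued an injunction against the defendant",
    "the plaintiff filed a motion to dismiss the case",
]

if __name__ == "__main__":
    for phrase in ["eat an apple", "file a motion"]:
        print(phrase, round(difficulty_score(phrase, EASY_DOCS, HARD_DOCS), 3))
```

With these toy sets, "eat an apple" scores below zero and "file a motion" scores above zero. In a realistic setting the two sets would be large corpora, and the TF-IDF vectors could be replaced by latent semantic analysis or word2vec representations without changing the comparison step.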