2022 Fiscal Year Final Research Report

Study on automatic extraction of language teaching materials from a large closed-caption corpus by bottom-up assembly of linguistic units such as words, phrases, and conversations

Project/Area Number	19H04224
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Review Section	Basic Section 62030: Learning support system-related
Research Institution	Tokyo University of Foreign Studies
Principal Investigator	Mochizuki Hajime, Tokyo University of Foreign Studies, Institute of Global Studies, Associate Professor (70313707)
Co-Investigator (Kenkyū-buntansha)	Shibano Kohji (芝野耕司), Tokyo University of Foreign Studies, Other Departments, Professor Emeritus (50216024)
Project Period (FY)	2019-04-01 – 2023-03-31
Keywords	Learning content development support / e-learning / Japanese language education / natural language processing / Formulaic Sequences
Outline of Final Research Achievements	We developed an integrated context n-gram that enables batch comparison of the frequencies of n-grams of different sizes, and extracted formulaic sequences (FSs) that combine multiple words. By clustering the FSs using distributed representations, we confirmed that we could obtain clusters of FSs that form "functionally similar phrase sets" with different surface expressions. We also attempted to automatically extract the conversation parts of the corpus using a deep learning model, and obtained certain results. The subtitle corpus has been continuously expanded to 2.2 billion words. The research results were presented mainly as peer-reviewed papers at international conferences such as EDMEDIA and E-Learn.
Free Research Field	Information science
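Academic Significance and Societal Importance of the Research Achievements	We continued building a large-scale Japanese conversation corpus, curating closed-caption data from more than ten years of Japanese television programs; the corpus has reached 598,000 programs, more than 200 million sentences, and more than 2.25 billion words. A vocabulary survey of the corpus confirmed that TV caption data is sufficiently useful as language teaching material. We developed the integrated context n-gram, which enables batch comparison of the frequencies of n-grams of different sizes, something that had not been achieved before. We extracted formulaic sequences (FSs) as fixed expressions from all sentences in the corpus and confirmed that the FSs include important phrases from Japanese language textbooks. We prepared conversation data corresponding to the Japanese Can-do statements as training data and automatically extracted conversation segments with a machine learning model.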