2022 Fiscal Year Annual Research Report

Bottom-up Automatic Extraction of Words, Phrases, and Conversations as Language Teaching Materials from a Large-Scale Subtitle Corpus

Research Project

Project/Area Number	19H04224
Research Institution	Tokyo University of Foreign Studies
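Principal Investigator	Hajime Mochizuki, Tokyo University of Foreign Studies, Institute of Global Studies, Associate Professor (70313707)
Co-Investigator(Kenkyū-buntansha)	Kohji Shibano, Tokyo University of Foreign Studies, Other Departments, Professor Emeritus (50216024)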
Project Period (FY)	2019-04-01 – 2023-03-31
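Keywords	Learning content development support / Natural language processing / e-learning / Japanese language education / Distributed representations
Outline of Annual Research Achievements	In this research, working on a large-scale subtitle corpus, we computed all character-level N-grams occurring in each sentence, as well as all word-level N-grams based on morphological analysis results. For the distinct N-grams at each level, we compared the lists of all sentences in the corpus in which they occur, sorted the distinct N-grams whose occurrence sentences match exactly, and extracted the longest N-gram within each set of N-grams in an inclusion relation as a significant N-gram. We developed this procedure as an original method, Consolidated Contextualized N-gram Analysis. We evaluated the resulting N-grams, which are made up of combinations of multiple words, by their agreement with key phrases in Japanese language education textbooks and confirmed that approximately 83.2% of 334 key phrases were included, showing that phrases important as language teaching material can be obtained. In addition, treating the extracted N-grams as formulaic sequences (FS), we segmented every sentence in the subtitle corpus into FS units and then computed FS2vec, a distributed representation over FS, so that similarity between FS can be computed directly. Clustering the extracted FS on the basis of FS2vec showed a clear tendency for expressions that differ in surface form but have functionally similar effects to gather in the same cluster. Furthermore, using about 1,600 conversation data items in which conversation portions of the large-scale subtitle data had been aligned with a Can-do list, we performed conversation extraction by machine-learning-based segment boundary estimation. We fine-tuned a pre-trained BERT model to extract Can-do conversation portions from the original text and obtained an accuracy of about 63 percent. Although there is still room to improve conversation extraction accuracy, overall our large-scale subtitle corpus proved effective as teaching material, and we obtained a certain level of results on the automatic extraction of effective language teaching materials.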
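The grouping step described in the outline can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names (`all_ngrams`, `contains`, `consolidate`), the `min_sentences` frequency threshold, and the toy sentences are assumptions made for illustration, and the input is assumed to be already tokenized (by character or by morphological analysis). The sketch collects every N-gram per sentence, groups distinct N-grams whose sets of occurrence sentences are identical, and keeps from each group only the N-grams not contained in a longer N-gram of the same group.

```python
from collections import defaultdict


def all_ngrams(tokens, max_n=None):
    """Yield every contiguous n-gram (as a tuple) of a token sequence."""
    limit = len(tokens) if max_n is None else min(max_n, len(tokens))
    for n in range(1, limit + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous subsequence of `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def consolidate(sentences, min_sentences=2, max_n=None):
    """Map each retained n-gram to the sorted ids of the sentences it occurs in."""
    occurrences = defaultdict(set)  # n-gram -> ids of sentences containing it
    for sid, tokens in enumerate(sentences):
        for gram in set(all_ngrams(tokens, max_n)):
            occurrences[gram].add(sid)

    groups = defaultdict(list)  # identical occurrence set -> n-grams sharing it
    for gram, sids in occurrences.items():
        if len(sids) >= min_sentences:  # hypothetical threshold, not from the report
            groups[frozenset(sids)].append(gram)

    selected = {}
    for sids, grams in groups.items():
        grams.sort(key=len, reverse=True)  # visit the longest n-grams first
        kept = []
        for gram in grams:
            # Drop an n-gram that is contained in a longer one already kept.
            if not any(contains(longer, gram) for longer in kept):
                kept.append(gram)
        for gram in kept:
            selected[gram] = sorted(sids)
    return selected


# Toy word-level input; a real run would use the tokenized subtitle corpus.
toy_sentences = [
    ["よろしく", "お願い", "し", "ます"],
    ["どうぞ", "よろしく", "お願い", "し", "ます"],
    ["お願い", "し", "ます"],
]
toy_result = consolidate(toy_sentences)
```

On this toy input, the shorter N-grams that always co-occur with a longer one are absorbed, so only the two longest shared sequences remain. Enumerating all N-grams is quadratic in sentence length, so a corpus-scale run would likely need `max_n` or a more compact index; that choice is left open here.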
Research Progress Status	Not entered because FY2022 is the final fiscal year.
Strategy for Future Research Activity	Not entered because FY2022 is the final fiscal year.

Research Products
(4 results)


All Presentation (4 results) (of which Int'l Joint Research: 4 results)

[Presentation] Extracting Japanese Sentence-Ending Expressions using Formulaic Sequences with Consolidated Contextualized N-gram Analysis (2023)
- Author(s)
  Hajime Mochizuki, Kohji Shibano
- Organizer
  The 21st Annual Conference of Hawaii International Conference on Education
- Int'l Joint Research
[Presentation] Mining Formulaic Sequences from a Spoken Japanese Based on Consolidated Contextualized N-gram Analyses and Its Verification with Key Phrases in Japanese Language Textbooks (2022)
- Author(s)
  Hajime Mochizuki, Kohji Shibano
- Organizer
  World Conference On Educational Media and Technology + INNOVATE LEARNING 2022
- Int'l Joint Research
[Presentation] Investigation of Formulaic Sequences at The End of Sentence in Japanese Closed Caption TV Corpus (2022)
- Author(s)
  Hajime Mochizuki, Kohji Shibano
- Organizer
  2023 STEM/STEAM and Education Conference
- Int'l Joint Research
[Presentation] Real Word Statistics and End of Sentence Expressions in Japanese Closed Caption TV Corpus (2022)
- Author(s)
  Hajime Mochizuki
- Organizer
  9th International Conference on Language, Literature and Linguistics (LLL2022)
- Int'l Joint Research
