Development of a Large-Scale Data-Driven Learning System for Vocabulary Learning

Research Project

Project/Area Number	23K21732
Project/Area Number (Other)	21H03564 (2021-2023)
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Multi-year Fund (2024) Single-year Grants (2021-2023)
Section	General
Review Section	Basic Section 62030:Learning support system-related
Research Institution	Tokyo Institute of Technology (2024) Osaka University (2021-2023)
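Principal Investigator	Yuki Arase, Tokyo Institute of Technology, School of Computing, Professor (00747165)
Co-Investigator(Kenkyū-buntansha)	Satoru Uchida, Kyushu University, Faculty of Languages and Cultures, Associate Professor (20589254); Tomoyuki Kajiwara, Ehime University, Graduate School of Science and Engineering (Engineering), Lecturer (70824960)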
Project Period (FY)	2021-04-01 – 2025-03-31
Project Status	Granted (Fiscal Year 2024)
Budget Amount	¥17,030,000 (Direct Cost: ¥13,100,000, Indirect Cost: ¥3,930,000); Fiscal Year 2024: ¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000); Fiscal Year 2023: ¥4,420,000 (Direct Cost: ¥3,400,000, Indirect Cost: ¥1,020,000); Fiscal Year 2022: ¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000); Fiscal Year 2021: ¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000)
Keywords	Data-Driven Learning / CEFR / text simplification / paraphrase generation / English sentence difficulty estimation / paraphrasing / language education / vocabulary learning
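Outline of Research at the Start	This project realizes Data-Driven Learning (DDL) for vocabulary learning and builds a learning platform that English teachers and students can use freely from a web browser. For DDL to work for vocabulary learning, it is effective for learners to observe a large number of sentences that contain the target item at an appropriate difficulty level, and then to move step by step to more difficult sentences. This project therefore defines a difficulty scale for English sentences that conforms to the Common European Framework of Reference for Languages, and automatically paraphrases sentences to adjust their difficulty, so that usage examples at a wide range of difficulty levels can be acquired on a large scale.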
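Outline of Annual Research Achievements	This project realizes Data-Driven Learning (DDL) for vocabulary learning and builds a learning platform that English teachers and students can use freely from a web browser. For DDL to work for vocabulary learning, it is effective for learners to observe a large number of sentences that contain the target item at an appropriate difficulty level, and then to move step by step to more difficult sentences. However, very few English sentences on the web satisfy this condition, and how to measure the difficulty of an English sentence is itself non-trivial. This project therefore defines a difficulty scale for English sentences that conforms to the Common European Framework of Reference for Languages (CEFR), and automatically paraphrases sentences to adjust their difficulty, so that usage examples at a wide range of difficulty levels can be acquired on a large scale. In FY2021 we constructed a corpus (CEFR-SP) in which CEFR levels are assigned to about 20,000 English sentences. In the fiscal year under review we analyzed the corpus in detail and clarified the linguistic characteristics of each CEFR level. We also ran a crowdsourcing task in which workers wrote paraphrases of CEFR-SP sentences at different difficulty levels, and began building a parallel corpus. Using the CEFR-SP corpus, we built a high-accuracy English sentence difficulty estimation model and compiled the results of its evaluation experiments into a paper. The paper was accepted and has been presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP), one of the most important international conferences in natural language processing. In addition, we built an English sentence difficulty conversion model based on reinforcement learning using existing text simplification corpora, compiled the results of its evaluation experiments into a paper, and presented it at the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL-IJCNLP), an international conference.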
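Current Status of Research Progress	1: Research has progressed more than it was originally planned. Reason: In FY2021 we constructed the CEFR-SP corpus, in which CEFR levels are assigned to about 20,000 English sentences, and in the fiscal year under review we analyzed it in detail. We measured representative indicators used in text difficulty estimation: basic statistics such as sentence length, grammatical properties such as parse tree depth and the part-of-speech distribution of words, and lexical properties such as word character length and word CEFR level. We then identified the indicators that characterize sentences at each CEFR level and the indicators that are effective for distinguishing adjacent levels. We also extracted and analyzed typical sentences at each CEFR level. These results have been submitted to an international journal. We ran a crowdsourcing task in which workers wrote paraphrases of CEFR-SP sentences at different difficulty levels, and began building a parallel corpus. In FY2023 we will continue to expand this parallel corpus. Using the CEFR-SP corpus, we also built a high-accuracy English sentence difficulty estimation model and compiled the results of its evaluation experiments into a paper. The paper was accepted and has been presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP), one of the most important international conferences in natural language processing. In addition, we built an English sentence difficulty conversion model based on reinforcement learning using existing text simplification corpora, compiled the results of its evaluation experiments into a paper, and presented it at the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL-IJCNLP).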
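The kind of per-level indicator analysis described above can be illustrated with a minimal Python sketch. It covers only two surface indicators (sentence length and mean word character length); parse tree depth, part-of-speech distribution, and word CEFR level would need a parser and a graded word list and are left out. The whitespace tokenization, the function names, and the three toy sentences with their level labels are all assumptions made for illustration. They are not taken from CEFR-SP and do not reproduce the project's actual analysis.

```python
from collections import defaultdict
from statistics import mean


def surface_features(sentence: str) -> dict:
    """Return two simple surface indicators for one sentence."""
    # Whitespace tokenization with punctuation stripping (an assumption;
    # a real analysis would use a proper tokenizer).
    words = [t.strip(".,;:!?\"'()") for t in sentence.split()]
    words = [w for w in words if w]
    return {
        "sentence_length": len(words),
        "mean_word_length": mean(len(w) for w in words) if words else 0.0,
    }


def mean_features_by_level(labeled):
    """Average each indicator over the sentences of each CEFR level."""
    buckets = defaultdict(list)
    for sentence, level in labeled:
        buckets[level].append(surface_features(sentence))
    return {
        level: {key: mean(f[key] for f in feats) for key in feats[0]}
        for level, feats in buckets.items()
    }


# Invented toy sentences and labels, for illustration only.
toy = [
    ("I like cats.", "A1"),
    ("She has a red bag.", "A1"),
    ("The committee postponed its decision indefinitely.", "C1"),
]

if __name__ == "__main__":
    for level, stats in mean_features_by_level(toy).items():
        print(level, stats)
```

Comparing the per-level averages for adjacent levels is one simple way to see which indicators separate them; the report does not state which statistical procedure was actually used.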
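Strategy for Future Research Activity	This project develops a large-scale DDL system that supplies abundant usage examples at a wide range of difficulty levels for vocabulary learning. To achieve this goal, we address two research tasks: (i) construction of a parallel corpus annotated with English sentence difficulty, and (ii) development of a difficulty-adjusting paraphrase model. In (i), we build a parallel corpus annotated with difficulty levels that conform to the Common European Framework of Reference for Languages (CEFR), and prepare it so that it can be used to build English paraphrase models. In (ii), we build a model that automatically paraphrases authentic English sentences, and acquire usage examples at a range of difficulty levels that make scaffolding possible. The plan for FY2023 is as follows. (i) Construction of a parallel corpus annotated with English sentence difficulty: In FY2022 we built a difficulty-annotated parallel corpus by writing paraphrases for the approximately 20,000 CEFR-annotated English sentences constructed so far and assigning CEFR levels to those paraphrases. In FY2023 we will extend this corpus further, aiming for a parallel corpus of about 40,000 sentence pairs. (ii) Development of a difficulty-adjusting paraphrase model: Building such a model requires a parallel corpus of sentence pairs in which an English sentence at one difficulty level is paraphrased into a sentence at another level. However, constructing a large-scale parallel corpus is very costly. We will therefore use the paraphrase generation model and the automatic difficulty estimation model built so far to construct pseudo training data automatically. By further training a language generation model on this pseudo data, and then applying transfer learning with a small parallel corpus, we aim to achieve high-quality difficulty-specific paraphrase generation from a small dataset.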
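The pseudo training data construction in (ii) can be sketched as follows: a paraphrase generator proposes candidates, a difficulty estimator labels the source and each candidate, and only pairs whose levels differ are kept. This is a minimal sketch under stated assumptions, not the project's implementation. `build_pseudo_pairs`, `PseudoPair`, and the two toy stand-in functions are hypothetical names; in practice the trained paraphrase generation and difficulty estimation models would be passed in as the two callables, and the rule of keeping only level-changing pairs is an assumption about one reasonable filter.

```python
from typing import Callable, Iterable, List, NamedTuple


class PseudoPair(NamedTuple):
    source: str
    source_level: str
    target: str
    target_level: str


def build_pseudo_pairs(
    sentences: Iterable[str],
    paraphrase: Callable[[str], List[str]],
    estimate_level: Callable[[str], str],
) -> List[PseudoPair]:
    """Pair each sentence with paraphrases whose estimated level differs."""
    pairs = []
    for src in sentences:
        src_level = estimate_level(src)
        for cand in paraphrase(src):
            if cand == src:
                continue  # an identical output carries no training signal
            cand_level = estimate_level(cand)
            if cand_level != src_level:
                pairs.append(PseudoPair(src, src_level, cand, cand_level))
    return pairs


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_paraphrase(sentence: str) -> List[str]:
    table = {
        "He commenced the task.": [
            "He started the task.",
            "He commenced the task.",
        ],
    }
    return table.get(sentence, [])


def toy_estimate_level(sentence: str) -> str:
    # Placeholder rule: any word longer than 7 characters means "B2".
    return "B2" if any(len(w) > 7 for w in sentence.split()) else "A2"


if __name__ == "__main__":
    corpus = ["He commenced the task.", "I like cats."]
    for pair in build_pseudo_pairs(corpus, toy_paraphrase, toy_estimate_level):
        print(pair)
```

Passing the models in as plain callables keeps the filtering logic independent of any particular model library, so the same loop can be reused when either model is replaced.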

Report

(2 results)

2022 Annual Research Report
2021 Annual Research Report

Research Products
(7 results)

All Presentation (6 results) (of which Int'l Joint Research: 5 results) Remarks (1 results)

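[Presentation] Automatic Generation of Distractors for English Vocabulary Fill-in-the-Blank Questions Considering Question Types (2023)
- Author(s)
  Nana Yoshimi, Tomoyuki Kajiwara, Satoru Uchida, Yuki Arase, Takashi Ninomiya
- Organizer
  The 29th Annual Meeting of the Association for Natural Language Processing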
- Related Report
  2022 Annual Research Report
[Presentation] CEFR-Based Sentence Difficulty Annotation and Assessment (2022)
- Author(s)
  Yuki Arase, Satoru Uchida, and Tomoyuki Kajiwara
- Organizer
  The Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Controllable Text Simplification with Deep Reinforcement Learning (2022)
- Author(s)
  Daiki Yanamoto, Tomoki Ikawa, Tomoyuki Kajiwara, Takashi Ninomiya, Satoru Uchida, and Yuki Arase
- Organizer
  The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] JADE: Corpus for Japanese Definition Modelling (2022)
- Author(s)
  Han Huang, Tomoyuki Kajiwara, Yuki Arase
- Organizer
  13th Edition of the Language Resources and Evaluation Conference (LREC)
- Related Report
  2021 Annual Research Report
- Int'l Joint Research
[Presentation] Definition Modelling for Appropriate Specificity (2021)
- Author(s)
  Han Huang, Tomoyuki Kajiwara, Yuki Arase
- Organizer
  2021 Conference on Empirical Methods in Natural Language Processing
- Related Report
  2021 Annual Research Report
- Int'l Joint Research
[Presentation] Toward constructing a corpus with CEFR-based sentence level annotations (2021)
- Author(s)
  Satoru Uchida, Yuki Arase and Tomoyuki Kajiwara
- Organizer
  Workshop on Building CEFR-graded resources for second and foreign language learning
- Related Report
  2021 Annual Research Report
- Int'l Joint Research
[Remarks] CEFR-SP
- URL
  https://github.com/yukiar/CEFR-SP
- Related Report
  2022 Annual Research Report
