
Study on Automatic extraction of language teaching materials from a large closed caption corpus by bottom-up assembly of linguistic units such as words, phrases, and conversations.

Research Project

Project/Area Number 19H04224
Research Category

Grant-in-Aid for Scientific Research (B)

Allocation Type Single-year Grants
Section General
Review Section Basic Section 62030: Learning support system-related
Research Institution Tokyo University of Foreign Studies

Principal Investigator

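Mochizuki Hajime  Tokyo University of Foreign Studies, Institute of Global Studies, Associate Professor (70313707)

Co-Investigator(Kenkyū-buntansha) Shibano Kohji  Tokyo University of Foreign Studies, Other Departments, Professor Emeritus (50216024)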
Project Period (FY) 2019-04-01 – 2023-03-31
Project Status Completed (Fiscal Year 2022)
Budget Amount
¥17,160,000 (Direct Cost: ¥13,200,000, Indirect Cost: ¥3,960,000)
Fiscal Year 2022: ¥3,640,000 (Direct Cost: ¥2,800,000, Indirect Cost: ¥840,000)
Fiscal Year 2021: ¥3,900,000 (Direct Cost: ¥3,000,000, Indirect Cost: ¥900,000)
Fiscal Year 2020: ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2019: ¥5,330,000 (Direct Cost: ¥4,100,000, Indirect Cost: ¥1,230,000)
Keywords learning content development support / e-learning / Japanese language education / natural language processing / Formulaic Sequences / distributed representations
Outline of Research at the Start

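In this research, we attempt to automatically extract language teaching materials corresponding to language-education Can-do statements from our large-scale spoken-language corpus based on TV subtitles, through the bottom-up assembly of linguistic units (words, phrases, and conversations). We also map Can-do statements to key phrases.
Specifically, (1) we explore a computational method for splitting distributed word representations by word sense. (2) We exhaustively extract formulaic sequences (FSs), phrases that carry meaning as a single chunk, and build knowledge from them through pragmatic meaning analysis, making it possible to compute FS distributed representations and to split them according to differences in how each FS is used. (3) By automatically extracting Can-do conversation materials through deep learning on FS distributed representations, we extract the key phrases corresponding to each Can-do statement.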

Outline of Final Research Achievements

We developed an integrated context n-gram that allows the frequencies of n-grams of different sizes to be compared together, and used it to extract formulaic sequences (FSs), which combine multiple words. By clustering FSs using distributed representations, we confirmed that the resulting clusters form "functionally similar phrase sets" whose members differ in surface expression. We also attempted to automatically extract the conversational parts of the corpus using a deep learning model, and obtained some results. The subtitle corpus has been continuously expanded to 2.2 billion words. The research results were presented in peer-reviewed papers, mainly at international conferences such as EDMEDIA and E-Learn.
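The idea of comparing n-grams of different sizes in one place can be illustrated with a minimal sketch. This is not the project's integrated context n-gram, whose definition is not given on this page; it is a generic frequency count that stores n-grams of every size up to `max_n` in a single table. The function names, thresholds, and the pre-tokenized toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3):
    """Count n-grams of every size from 1 to max_n in one shared Counter."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_sequences(counts, min_count=2, min_len=2):
    """Return multi-word n-grams seen at least min_count times.

    Sorted by frequency (descending), then length (descending),
    then the n-gram itself so the order is deterministic.
    """
    items = [(gram, c) for gram, c in counts.items()
             if len(gram) >= min_len and c >= min_count]
    return sorted(items, key=lambda gc: (-gc[1], -len(gc[0]), gc[0]))


# Hypothetical, already tokenized toy sentences (not taken from the corpus).
sentences = [
    ["よろしく", "お願い", "し", "ます"],
    ["どうぞ", "よろしく", "お願い", "し", "ます"],
    ["お願い", "し", "ます"],
]

counts = count_ngrams(sentences, max_n=4)
candidates = frequent_sequences(counts)
for gram, c in candidates[:5]:
    print(" ".join(gram), c)
```

Because all sizes share one table, a 3-gram and the 4-gram that contains it can be ranked side by side; a real pipeline would also need a Japanese tokenizer and some criterion for preferring a longer sequence over its sub-sequences.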

Academic Significance and Societal Importance of the Research Achievements

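We continued building a large-scale Japanese conversation corpus, curating subtitle data from more than ten years of Japanese television programs; it now covers 598,000 programs, over 200 million sentences, and more than 2.25 billion words. We also surveyed the vocabulary of the corpus and confirmed that TV subtitle data is sufficiently useful as language teaching material. We developed an integrated context n-gram, not previously available, that allows the frequencies of n-grams of different sizes to be compared together. We extracted formulaic sequences (FSs) as fixed expressions from every sentence in the corpus and confirmed that the FSs include important phrases from Japanese language textbooks. We prepared conversation data corresponding to Japanese Can-do statements as training data and automatically extracted conversation segments with machine learning models.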

Report

(5 results)
  • 2022 Annual Research Report   Final Research Report ( PDF )
  • 2021 Annual Research Report
  • 2020 Annual Research Report
  • 2019 Annual Research Report
  • Research Products

    (9 results)

All Presentation (9 results) (of which Int'l Joint Research: 4 results)

  • [Presentation] Extracting Japanese Sentence-Ending Expressions using Formulaic Sequences with Consolidated Contextualized N-gram Analysis (2023)

    • Author(s)
      Hajime Mochizuki, Kohji Shibano
    • Organizer
      The 21st Annual Conference of Hawaii International Conference on Education
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Mining Formulaic Sequences from a Spoken Japanese Based on Consolidated Contextualized N-gram Analyses and Its Verification with Key Phrases in Japanese Language Textbooks (2022)

    • Author(s)
      Hajime Mochizuki, Kohji Shibano
    • Organizer
      World Conference On Educational Media and Technology + INNOVATE LEARNING 2022
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Investigation of Formulaic Sequences at The End of Sentence in Japanese Closed Caption TV Corpus (2022)

    • Author(s)
      Hajime Mochizuki, Kohji Shibano
    • Organizer
      2023 STEM/STEAM and Education Conference
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Real Word Statistics and End of Sentence Expressions in Japanese Closed Caption TV Corpus (2022)

    • Author(s)
      Hajime Mochizuki
    • Organizer
      9th International Conference on Language, Literature and Linguistics (LLL2022)
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Analysis of Animation Subtitles as a Resource for Can-do-Based Japanese Language Learning (2022)

    • Author(s)
      大河原龍太朗, 望月源
    • Organizer
      The 28th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2021 Annual Research Report
  • [Presentation] A Study on Estimating "the Mood of Japan on a Given Day" through Sentiment Analysis of TV Subtitle Data (2022)

    • Author(s)
      イーフエイチー, 望月源
    • Organizer
      The 28th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2021 Annual Research Report
  • [Presentation] A Study of L1/L2 Subtitle Use in Japanese Language Learning by Native Chinese Speakers (2020)

    • Author(s)
      王 楽淑, 望月 源, 鈴木 美加
    • Organizer
      The 26th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2019 Annual Research Report
  • [Presentation] Investigation of Words in a Japanese Closed Caption TV Corpus (2019)

    • Author(s)
      Hajime Mochizuki
    • Organizer
      Hawaii University Conferences, STEM/STEAM Education Conference, 2019
    • Related Report
      2019 Annual Research Report
  • [Presentation] Incorporating a State-of-the-Art Speech Recognition to a Japanese Language e-Learning System (2019)

    • Author(s)
      Hajime Mochizuki, Kohji Shibano
    • Organizer
      E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2019
    • Related Report
      2019 Annual Research Report

Published: 2019-04-18   Modified: 2024-01-30  
