2017 Fiscal Year Annual Research Report

Development of CEFR Can-do Language Teaching Materials through FS2vec Processing of a Large-Scale Conversation Corpus

Research Project

Project/Area Number	15H02794
Research Institution	Tokyo University of Foreign Studies
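Principal Investigator	Hajime Mochizuki (望月 源), Tokyo University of Foreign Studies, Institute of Global Studies, Associate Professor (70313707)
Co-Investigator(Kenkyū-buntansha)	Kohji Shibano (芝野 耕司), Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (50216024); Hiroshi Sano (佐野 洋), Tokyo University of Foreign Studies, Institute of Global Studies, Professor (30282776); Tomoko Fujimura (藤村 知子), Tokyo University of Foreign Studies, Institute of Japan Studies, Professor (20229040)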
Project Period (FY)	2015-04-01 – 2019-03-31
Keywords	Learning content development support / e-learning / Japanese language education / natural language processing
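Outline of Annual Research Achievements	In this project we are building a large-scale conversation corpus of a size that has not existed before. It has grown from the 330 million words and 50,003 hours of TV closed-caption data assumed at the planning stage to 1.146 billion words, about 104.78 million sentences, 185,000 hours, and 294,000 programs. From this corpus, a program extracts conversation segments by combining sentences according to their caption display times. At the end of FY2017 the number of segments had reached 31 million. Using a MapReduce-style program that we developed, we generated word N-gram combination patterns and extracted important Formulaic Sequence (FS) candidates. In FY2017, to pick out the most useful FS from the large number extracted, we treated FS that consist of relatively long strings and also occur frequently as useful, and extracted them with thresholds of at least 9 characters in length and at least 9 occurrences. We also used the chi-square value to measure how unevenly each FS is distributed across genres, and examined the expressive characteristics of the 100 FS with the highest chi-square values in dramas, variety shows, and information programs. About 80% of these FS relate to expressions that indicate a purpose, such as greeting, thanking, requesting, apologizing, congratulating, or conjecturing, which confirmed that mapping Can-do statements to specific FS by purpose is realistic. In addition, we clustered the segments that contain the same FS in order to classify them by topic and scene. We first retrieved conversation segments using the FS as the key, computed vectors for inter-segment similarity with Doc2vec, reduced their dimensionality with SVD, and then clustered them with the k-means method. A sampling check of the conversations within each cluster confirmed that segments with similar topics and scenes were grouped together. We have begun mapping Can-do statements to both the conversational purpose expressed by an FS and the classification by topic and scene expressed by nouns and other words within the conversation segments.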
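The FS selection and genre-bias steps described above can be illustrated with a minimal sketch. It is not the project's MapReduce program: the function names `candidate_counts`, `select_fs`, and `chi_square` are hypothetical, sentences are assumed to be already tokenized into words, and the chi-square value is assumed to be a goodness-of-fit statistic against genre sizes, because the report does not state which formulation was used. Only the two thresholds (length of at least 9 characters, frequency of at least 9) come from the report.

```python
from collections import Counter

MIN_LEN = 9   # minimum FS length in characters (threshold stated in the report)
MIN_FREQ = 9  # minimum number of occurrences (threshold stated in the report)


def candidate_counts(sentences, max_n=5):
    """Count word n-grams (n = 1..max_n) over pre-tokenized sentences.

    `sentences` is an iterable of token lists; `max_n` is an assumed cap.
    Each n-gram is keyed by its tuple of tokens.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def select_fs(counts, min_len=MIN_LEN, min_freq=MIN_FREQ):
    """Keep n-grams that are long enough (in characters) and frequent enough."""
    return {
        fs: c
        for fs, c in counts.items()
        if sum(len(tok) for tok in fs) >= min_len and c >= min_freq
    }


def chi_square(observed, genre_totals):
    """Chi-square of one FS's per-genre counts against genre sizes.

    Assumed formulation: expected count in a genre is proportional to that
    genre's share of the corpus. Genres with zero expectation are skipped.
    """
    total_obs = sum(observed.values())
    total = sum(genre_totals.values())
    chi2 = 0.0
    for genre, size in genre_totals.items():
        expected = total_obs * size / total
        if expected == 0:
            continue
        chi2 += (observed.get(genre, 0) - expected) ** 2 / expected
    return chi2
```

Ranking the selected FS by `chi_square` in descending order and keeping the first 100 would correspond to the top-100 list the report mentions, under the assumptions stated above.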
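Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: The caption data acquisition system continues to operate stably, and the TV closed-caption corpus under construction is growing steadily. So far the corpus covers roughly five years of TV captions and has reached 294,000 programs, 1.146 billion words, and about 104.78 million sentences. Starting from the Formulaic Sequences narrowed down in the previous fiscal year, we designed, implemented, and ran a method that further extracts the FS considered useful on the basis of string length and frequency of occurrence. Analysis of the resulting FS confirmed that about 80% relate to the purpose of a conversation. By retrieving the conversation segments that contain the same FS and clustering them, we were able to associate FS with segments that share similar topics and scenes, and to begin mapping them to Can-do statements.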
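Strategy for Future Research Activity	Continuing from FY2017, we will improve the MapReduce-style algorithm developed in FY2015 so that it can handle data continuously. Because the collection period now exceeds five years, in FY2018 we will make the Formulaic Sequence (FS) computation independent for each month and refine the algorithm so that FS can be computed over any combination of periods and separately by genre. We will refine the algorithm developed in FY2017 that clusters the conversation segments retrieved by FS. Continuing from FY2017, we will analyze the characteristics of each cluster formed from conversation segments that contain the same FS and map them to the descriptor statements that define Can-do items. We will map conversation segments containing FS to Can-do descriptor statements and, using machine learning methods, develop a program that creates Can-do language teaching materials from the corpus.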
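The plan to compute FS independently per month and recombine arbitrary periods or genres can be sketched as follows. This is an illustration under assumptions, not the project's algorithm: the `(month, genre)` key layout and the function name `merge_counts` are hypothetical.

```python
from collections import Counter


def merge_counts(monthly, months, genres=None):
    """Sum per-month FS counts over a chosen set of months and genres.

    `monthly` maps a (month, genre) pair to a Counter of FS occurrences
    (an assumed storage layout). `genres=None` means all genres.
    """
    total = Counter()
    for (month, genre), counts in monthly.items():
        if month in months and (genres is None or genre in genres):
            total.update(counts)  # Counter.update adds counts together
    return total
```

One design consequence of this layout is that any frequency threshold has to be applied after merging rather than per month, since an FS that is rare in every single month can still pass the threshold over a longer period.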

Research Products
(8 results)

Presentation (8 results) (of which Int'l Joint Research: 8 results, Invited: 1 result)

[Presentation] Analyzing Usefulness of Dialogues from Closed Caption TV Corpus as an Example of Can-do Statements for Language Learning (2018)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  2018 Hawaii University Conference, Arts, Humanities, Social Sciences & Education (AHSE)
- Int'l Joint Research
[Presentation] Developing Intimacy by Style-shifting in Japanese: A TV Subtitle Corpus-based Study (2017)
- Author(s)
  XIAO Tingting and Kohji Shibano
- Organizer
  The 2017 conference of the American Association for Applied Linguistics (AAAL 2017)
- Int'l Joint Research
[Presentation] The Acquisition of a Japanese Practical Formulaic Sequences List from a Closed Caption TV Corpus (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  Hawaii University Conferences, STEM/STEAM Education Conference
- Int'l Joint Research
[Presentation] Augmented Reality Applications for Multilingual Learning with Intuitive Understanding (2017)
- Author(s)
  Hajime Mochizuki
- Organizer
  World Conference on Educational Media and Technology (EDMEDIA) 2017
- Int'l Joint Research
[Presentation] Analyzing formulaic sequences in spoken Japanese from a large Japanese TV closed caption corpus (2017)
- Author(s)
  Kohji Shibano
- Organizer
  The 18th World Congress of Applied Linguistics (AILA 2017)
- Int'l Joint Research
[Presentation] Discourse Segment Clustering with Word Embedding based on Formulaic Sequences for Language Education (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  2017 International Conference on Education and Multimedia Technology (ICEMT 2017)
- Int'l Joint Research
[Presentation] Building a Very Large Spoken Language Corpus from Closed Caption TV and Extracting Practical Formulaic Sequences for Language Learning (2017)
- Author(s)
  Hajime Mochizuki
- Organizer
  The 10th International Conference on Advanced Computer Theory and Engineering
- Int'l Joint Research / Invited
[Presentation] Searching Discourse Segments for Formulaic Sequences in a Closed Caption TV Corpus for Language Learning (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2017
- Int'l Joint Research
