Development of CEFR Can-do Language Learning Materials by FS2vec Processing of Large-scale Spoken Language Corpus

Research Project

Project/Area Number	15H02794
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Learning support system
Research Institution	Tokyo University of Foreign Studies
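Principal Investigator	Mochizuki Hajime, Tokyo University of Foreign Studies, Institute of Global Studies, Associate Professor (70313707)
Co-Investigator(Kenkyū-buntansha)	芝野耕司, Tokyo University of Foreign Studies, Other Departments, Professor Emeritus (50216024); 佐野洋, Tokyo University of Foreign Studies, Institute of Global Studies, Professor (30282776); 藤村知子, Tokyo University of Foreign Studies, Institute of Japan Studies, Professor (20229040)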
Project Period (FY)	2015-04-01 – 2019-03-31
Project Status	Completed (Fiscal Year 2018)
Budget Amount	¥15,340,000 (Direct Cost: ¥11,800,000, Indirect Cost: ¥3,540,000); Fiscal Year 2018: ¥3,120,000 (Direct Cost: ¥2,400,000, Indirect Cost: ¥720,000); Fiscal Year 2017: ¥3,640,000 (Direct Cost: ¥2,800,000, Indirect Cost: ¥840,000); Fiscal Year 2016: ¥3,640,000 (Direct Cost: ¥2,800,000, Indirect Cost: ¥840,000); Fiscal Year 2015: ¥4,940,000 (Direct Cost: ¥3,800,000, Indirect Cost: ¥1,140,000)
Keywords	Learning content development support / e-learning / Japanese language education / natural language processing / Formulaic Sequences / Formulaic Sequence / learning content development
Outline of Final Research Achievements	We developed a method for extracting formulaic sequences from a corpus of Japanese closed caption TV (CCTV) broadcasts. We extract significant n-grams of contiguous words from the CCTV corpus as candidates for formulaic sequences. To calculate n-gram frequencies, we developed programs that sort, merge, and count based on the MapReduce algorithm. We examined clustering of discourse segments by topic and scene and confirmed that suitable can-do statements exist for them. We have continued to build the CCTV corpus, which now exceeds 1,300 million morphemes. We presented the research results in peer-reviewed papers, mainly at international conferences such as AAAL, EDMEDIA, and E-Learn.
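Academic Significance and Societal Importance of the Research Achievements	We continued building a large-scale Japanese conversation corpus of a kind that had not previously existed, compiling closed caption data from more than six years of Japanese TV programs. The corpus has reached 350,000 programs, 124 million sentences, and more than 1,336 million words. From this corpus we extracted a large number of formulaic sequences, multi-word units that carry a special meaning and can be applied to Japanese language learning materials. Using formulaic sequences as the core, we retrieved conversation segments from the corpus and confirmed that useful learning materials can be created by mapping the functions expressed by the formulaic sequences in each segment, together with the segment's topic and scene, to Can-do statements.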
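The n-gram counting step mentioned in the outline (sort, merge, and count in a MapReduce style) can be illustrated with a minimal single-machine sketch. This is not the project's actual program: the function names (`map_ngrams`, `reduce_counts`, `count_ngrams`, `frequent_ngrams`), the pre-tokenized input, and the use of a plain frequency threshold as the "significance" criterion are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List, Tuple

NGram = Tuple[str, ...]


def map_ngrams(tokens: List[str], n: int) -> Counter:
    """Map step: count the contiguous n-grams of one tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables into one."""
    merged = Counter(left)
    merged.update(right)
    return merged


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Run the map step per sentence, then fold the partial counts together."""
    return reduce(reduce_counts, (map_ngrams(s, n) for s in sentences), Counter())


def frequent_ngrams(counts: Counter, min_count: int) -> List[Tuple[NGram, int]]:
    """Keep n-grams at or above a frequency threshold (an assumed criterion),
    sorted by descending count and then by the n-gram itself."""
    kept = [(gram, c) for gram, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical morpheme-segmented caption lines.
    sentences = [
        ["よろしく", "お願い", "し", "ます"],
        ["どうぞ", "よろしく", "お願い", "し", "ます"],
        ["お願い", "し", "ます"],
    ]
    for gram, count in frequent_ngrams(count_ngrams(sentences, 3), 2):
        print(" ".join(gram), count)
```

In a real MapReduce deployment the map and reduce steps would run on separate workers over shards of the corpus; the fold above only mimics that data flow on one machine.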

Report

(5 results)

2018 Annual Research Report, Final Research Report (PDF)
2017 Annual Research Report
2016 Annual Research Report
2015 Annual Research Report

Research Products
(25 results)

Journal Article (1 result) (of which Peer Reviewed: 1 result, Acknowledgement Compliant: 1 result); Presentation (24 results) (of which Int'l Joint Research: 20 results, Invited: 1 result)

[Journal Article] Re-Mining Topics Popular in the Recent Past from a Large-Scale Closed Caption TV Corpus (2015)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Journal Title
  International Journal of Future Computer and Communication
  Volume: 4 Pages: 98-103
- Related Report
  2015 Annual Research Report
- Peer Reviewed / Acknowledgement Compliant
[Presentation] Investigation of Words in Japanese Closed Caption TV Corpus (2019)
- Author(s)
  Hajime Mochizuki
- Organizer
  STEM & STEAM Education Conference, 2019
- Related Report
  2018 Annual Research Report
- Int'l Joint Research
[Presentation] Analyzing Usefulness of Dialogues from Closed Caption TV Corpus as an Example of Can-do Statements for Language Learning (2018)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  2018 Hawaii University Conference, Arts, Humanities, Social Sciences & Education (AHSE)
- Related Report
  2017 Annual Research Report
- Int'l Joint Research
[Presentation] Modification of word2vec by Formulaic Sequences and Extraction of Useful Expressions for Language Learning from Closed Caption TV Corpus (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  The IAFOR International Conference on Language Learning Hawaii
- Place of Presentation
  Honolulu, USA
- Year and Date
  2017-01-08
- Related Report
  2016 Annual Research Report
- Int'l Joint Research
[Presentation] Developing Intimacy by Style-shifting in Japanese: A TV Subtitle Corpus-based Study (2017)
- Author(s)
  XIAO Tingting and Kohji Shibano
- Organizer
  The 2017 conference of the American Association for Applied Linguistics (AAAL 2017)
- Related Report
  2017 Annual Research Report, 2016 Annual Research Report
- Int'l Joint Research
[Presentation] The Acquisition of a Japanese Practical Formulaic Sequences List from a Closed Caption TV Corpus (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  Hawaii University Conferences, STEM/STEAM Education Conference
- Related Report
  2017 Annual Research Report
- Int'l Joint Research
[Presentation] Augmented Reality Applications for Multilingual Learning with Intuitive Understanding (2017)
- Author(s)
  Hajime Mochizuki
- Organizer
  World Conference on Educational Media and Technology (EDMEDIA) 2017
- Related Report
  2017 Annual Research Report
- Int'l Joint Research
[Presentation] Analyzing formulaic sequences in spoken Japanese from a large Japanese TV closed caption corpus (2017)
- Author(s)
  Kohji Shibano
- Organizer
  The 18th World Congress of Applied Linguistics (AILA 2017)
- Related Report
  2017 Annual Research Report, 2016 Annual Research Report
- Int'l Joint Research
[Presentation] Discourse Segment Clustering with Word Embedding based on Formulaic Sequences for Language Education (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  2017 International Conference on Education and Multimedia Technology (ICEMT 2017)
- Related Report
  2017 Annual Research Report
- Int'l Joint Research
[Presentation] Building a Very Large Spoken Language Corpus from Closed Caption TV and Extracting Practical Formulaic Sequences for Language Learning (2017)
- Author(s)
  Hajime Mochizuki
- Organizer
  The 10th International Conference on Advanced Computer Theory and Engineering
- Related Report
  2017 Annual Research Report
- Int'l Joint Research / Invited
[Presentation] Searching Discourse Segments for Formulaic Sequences in a Closed Caption TV Corpus for Language Learning (2017)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2017
- Related Report
  2017 Annual Research Report
- Int'l Joint Research
[Presentation] Extracting Formulaic Sequences Containing Useful Expressions for Language Learning from Closed Caption TV Corpus (2016)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, E-Learn 2016
- Place of Presentation
  Alexandria, USA
- Year and Date
  2016-11-14
- Related Report
  2016 Annual Research Report
- Int'l Joint Research
[Presentation] Development of a Closed Caption TV Corpus Retrieval System for Language Learning (2016)
- Author(s)
  Hajime Mochizuki
- Organizer
  8th International Conference on Education Technology and Computers (ICETC 2016)
- Place of Presentation
  Singapore
- Year and Date
  2016-09-28
- Related Report
  2016 Annual Research Report
- Int'l Joint Research
[Presentation] Straightforward Expansion of word2vec by Formulaic Sequences in CCTV corpus (2016)
- Author(s)
  Hajime Mochizuki
- Organizer
  Ninth International Conference on Advanced Computer Theory and Engineering, ICACTE 2016
- Place of Presentation
  Hong Kong
- Year and Date
  2016-08-19
- Related Report
  2016 Annual Research Report
- Int'l Joint Research
[Presentation] Development of AR Materials for Understanding Roles of Japanese Particles (2016)
- Author(s)
  Hajime Mochizuki
- Organizer
  2016 STEM & STEAM Education Conference
- Place of Presentation
  Honolulu, USA
- Year and Date
  2016-06-10
- Related Report
  2016 Annual Research Report
- Int'l Joint Research
[Presentation] Japanese Language Learning System for Understanding a Sentence that has Correct Syntax but has Semantic Errors (2016)
- Author(s)
  Hajime Mochizuki
- Organizer
  the 2nd International Conference on Information Technology (ICIT 2016)
- Place of Presentation
  Melbourne, Australia
- Year and Date
  2016-03-03
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
[Presentation] Analyzing Attractiveness of Specific Location Names of Tourist Destination from a Closed Caption TV Corpus (2016)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  Hawaii University Conferences, Arts, Humanities, Social Sciences & Education (AHSE)
- Place of Presentation
  Hawaii, USA
- Year and Date
  2016-01-08
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
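[Presentation] A Proposal for Language Learning Materials Using Differences in Construal (2) (2016)
- Author(s)
  佐野洋
- Organizer
  135th SIG-CE (Computers and Education) Meeting, Information Processing Society of Japan
- Place of Presentation
  Shinshu University, Nagano Prefecture
- Related Report
  2016 Annual Research Report
[Presentation] A Language Learning Method Using Differences in Construal (2) (2016)
- Author(s)
  佐野洋
- Organizer
  Technical Committee on Thought and Language (TL), Institute of Electronics, Information and Communication Engineers
- Place of Presentation
  Waseda University, Tokyo
- Related Report
  2016 Annual Research Report
[Presentation] A Language Learning Method Using Differences in Construal (3) (2016)
- Author(s)
  佐野洋
- Organizer
  Technical Committee on Thought and Language (TL), Institute of Electronics, Information and Communication Engineers
- Place of Presentation
  Port Island, Hyogo Prefecture
- Related Report
  2016 Annual Research Report
[Presentation] A Proposal for Language Learning Materials Using Differences in Construal (3) (2016)
- Author(s)
  佐野洋
- Organizer
  136th SIG-CE (Computers and Education) Meeting, Information Processing Society of Japan
- Place of Presentation
  University of Nagasaki, Siebold Campus, Nagasaki Prefecture
- Related Report
  2016 Annual Research Report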
[Presentation] Detecting Topics Popular in the Recent Past from a Closed Caption TV Corpus as a Categorized Chronicle data (2015)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KMIS)
- Place of Presentation
  Lisbon, Portugal
- Year and Date
  2015-11-12
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
[Presentation] Construction of a Japanese Spoken Language Corpus and a Conversation Example Retrieval System (2015)
- Author(s)
  芝野耕司
- Organizer
  6th CASTEL/J Hawaii 2015
- Place of Presentation
  Hawaii, USA
- Year and Date
  2015-08-07
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
[Presentation] A Quantitative Formulaic Analysis of Large TV Closed Caption Corpus – Pragmatic Use of Utterance End in Japanese Animation Languages (2015)
- Author(s)
  Kohji Shibano
- Organizer
  14th International Pragmatics Conference
- Place of Presentation
  Antwerp, Belgium
- Year and Date
  2015-07-26
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
[Presentation] Development of a Closed Caption TV Corpus Retrieval System to Seek Video Scenes Containing Useful Expressions for Language Learning (2015)
- Author(s)
  Hajime Mochizuki and Kohji Shibano
- Organizer
  World Conference on Educational Media and Technology (EDMEDIA)
- Place of Presentation
  Montreal, Canada
- Year and Date
  2015-06-22
- Related Report
  2015 Annual Research Report
- Int'l Joint Research
