2004 Fiscal Year Final Research Report Summary

Analysis of all-digitized textbook data and the construction of a large-scale Japanese corpus

Research Project

Project/Area Number	14380174
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Information systems science (including information and library science)
Research Institution	Tokyo University of Foreign Studies
Principal Investigator	SANO Hiroshi, Tokyo University of Foreign Studies, Faculty of Foreign Studies, Professor (30282776)
Co-Investigator (Kenkyū-buntansha)	SHIBANO Koji, Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (50216024); MINEGISHI Makoto, Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (20190712); FUJIMURA Tomoko, Tokyo University of Foreign Studies, Japanese Language Center for International Studies, Associate Professor (20229040)
Project Period (FY)	2002 – 2004
Keywords	Japanese language processing / Large-scale Japanese corpus / Mathematical linguistics / Corpus linguistics
Research Abstract	Research results
In 2002: We applied morphological analysis to a Japanese corpus built from 285 files of digitized textbook data covering three school subjects: Japanese language (65 files), social studies (209 files), and home economics (11 files). Because hiragana occurs relatively often in primary school textbooks, the automatic morphological analysis produced many erroneous results. A manual, linguistically informed review was needed to analyze these errors and improve the methodology. We found that the errors were caused not only by the high occurrence of hiragana in primary and secondary school textbooks but also by a lack of linguistic consideration in some parts of the morphological analysis. The error types were classified and organized. In 2005, the plan is to refer to this classification and add more textbook data to the corpus for processing. For the morphological analysis, one of the tools used was ChaSen (developed by the Matsumoto Lab at the Nara Institute of Science and Technology). Additional software was developed to support analysis of the morphologically processed corpus, and the Japanese language analysis software was improved as it was used. These programs were made available as freeware on the Internet.
In 2003: A Japanese language corpus was created from Japanese textbooks (Japanese language, 65 files; social studies, 209 files; home economics, 11 files; 285 files in total). The corpus was morphologically processed and the results were analyzed. Some software was developed for this analysis in 2004, and this Japanese text analysis software was made available as freeware in 2004. The software made it possible to analyze Japanese efficiently and to collect information from a large-scale corpus to a degree not previously possible. Graduate students used it for their own Japanese language research. A book introducing methods for analyzing Japanese was published, with the software attached on CD-ROM. Using the textbook corpus, our research also analyzed kanji usage attributes, the word as a unit, the contextual usage of words, and the usage ratios of kanji characters in context. We found that the use of kanji characters depends largely on the type of word in which they appear. Prior to this research, no definitive study existed of the usage qualities of kanji, their distribution, and their usage frequencies in written text.
In 2004: Further incremental improvements were made to our methodologies. The usage attributes of kanji and words were further analyzed with the paragraph as the unit. English textbooks were collected for the concurrent analysis of an English corpus.
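
The abstract refers to the proportion of hiragana in textbook text and to usage ratios of kanji. As a rough illustration only, and not the project's own software, the following Python sketch counts characters by script using Unicode block ranges. The names `SCRIPT_RANGES`, `classify`, and `script_ratios` are hypothetical, and the kanji range is assumed to be the basic CJK Unified Ideographs block only.

```python
from collections import Counter

# Assumed Unicode block ranges (inclusive) for each script.
# These are an illustrative choice, not taken from the report.
SCRIPT_RANGES = {
    "hiragana": (0x3041, 0x309F),
    "katakana": (0x30A0, 0x30FF),  # includes the middle dot and long-vowel mark
    "kanji": (0x4E00, 0x9FFF),     # basic CJK Unified Ideographs block only
}


def classify(ch):
    """Return the script name for one character, or None if it is not counted."""
    code = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= code <= high:
            return name
    return None


def script_ratios(text):
    """Return each script's share of the counted Japanese characters in text."""
    counts = Counter()
    for ch in text:
        name = classify(ch)
        if name is not None:
            counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        # No Japanese characters at all: report zero for every script.
        return {name: 0.0 for name in SCRIPT_RANGES}
    return {name: counts[name] / total for name in SCRIPT_RANGES}


if __name__ == "__main__":
    sample = "わたしは学校へいく"
    for name, ratio in script_ratios(sample).items():
        print(f"{name}: {ratio:.3f}")
```

Running such a count per textbook file would give one simple indicator of how hiragana-heavy a text is. It says nothing about word boundaries, which is where a morphological analyzer is needed.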

Research Products
(5 results)

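[Journal Article] 日本語学習素材作成のための日本語処理ソフトウェア (Japanese Language Processing Software for Creating Japanese Learning Materials), 2003
- Author(s)
  SANO Hiroshi (佐野洋)
- Journal Title
  Computer & Education (CIEC journal), Vol. 15
  Pages: 8
- Description
  From the Final Research Report Summary (Japanese)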
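[Journal Article] 日本語調査用ツール CLTOOL (CLTOOL: A Tool for Japanese Language Research), 2003
- Author(s)
  SANO Hiroshi (佐野洋)
- Journal Title
  Tokyo University of Foreign Studies, 語学研究書論集, No. 8
  Pages: 8
- Description
  From the Final Research Report Summary (Japanese)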
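[Journal Article] ESP適合の教材コンテンツを実現する語学教育支援システム (A Language Education Support System for Realizing ESP-Compatible Teaching Material Content), 2003
- Author(s)
  SANO Hiroshi (佐野洋)
- Journal Title
  CIEC (Council for Improvement of Education through Computers), Foreign Language Education Research Group (published as a CD-ROM book)
  Pages: 10
- Description
  From the Final Research Report Summary (Japanese)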
[Journal Article] Development of a Software for Creating Japanese Educational Materials, 2003
- Author(s)
  SANO, Hiroshi
- Journal Title
  Computer & Education, Council for Improvement of Education through Computers, Vol. 15
  Pages: 85-90
- Description
  From the Final Research Report Summary (English)
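[Book] WindowsPCによる日本語研究法 (Methods for Japanese Language Research Using a Windows PC), 2003
- Author(s)
  SANO Hiroshi (佐野洋)
- Total Pages
  148
- Publisher
  Kyoritsu Shuppan (共立出版)
- Description
  From the Final Research Report Summary (Japanese)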
