
2004 Fiscal Year Final Research Report Summary

Analysis of all-digitized textbook data and the construction of a large-scale Japanese corpus

Research Project

Project/Area Number 14380174
Research Category

Grant-in-Aid for Scientific Research (B)

Allocation Type Single-year Grants
Section General
Research Field Information systems (including information and library science)
Research Institution Tokyo University of Foreign Studies

Principal Investigator

SANO Hiroshi  Tokyo University of Foreign Studies, Faculty of Foreign Studies, Professor (30282776)

Co-Investigator (Kenkyū-buntansha) SHIBANO Koji  Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (50216024)
MINEGISHI Makoto  Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (20190712)
FUJIMURA Tomoko  Tokyo University of Foreign Studies, Japanese Language Center for International Studies, Associate Professor (20229040)
Project Period (FY) 2002 – 2004
Keywords Japanese language processing / Large-scale Japanese corpus / Mathematical linguistics / Corpus linguistics
Research Abstract

Research results
In 2002
We applied morphological analysis to a Japanese corpus of 285 digitized textbook files covering Japanese language (65 files), social studies (209 files), and home economics (11 files). Because primary school textbooks contain a relatively high proportion of hiragana, the automatic morphological analysis produced many errors. Analyzing these errors and improving the methodology required a manual, linguistically informed approach. The errors turned out to stem not only from the high proportion of hiragana in primary and secondary school textbooks but also from a lack of linguistic consideration in parts of the morphological analysis. The error types were classified and organized. In 2005, the plan is to refer to this classification and add more textbook data to the corpus for processing.
One piece of software used for the morphological analysis was ChaSen (developed by the Matsumoto Laboratory at the Nara Institute of Science and Technology). Additional software was developed to aid analysis of the morphologically processed corpus, and the Japanese analysis software was improved over time as it was used. These programs were made available as freeware on the Internet.
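The point about hiragana-heavy text can be made concrete with a small sketch. The Python below is not the project's software; it is an illustrative helper (the names `script_of` and `script_ratios` are hypothetical) that measures what share of a text's characters are hiragana, katakana, or kanji, using Unicode block ranges as a simplifying assumption. A sentence written entirely in hiragana gives a segmenter no script boundaries to rely on, which is one plausible reason such text is harder to analyze.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs only
        return "kanji"
    return "other"


def script_ratios(text: str) -> dict:
    """Return the share of each script among non-whitespace characters."""
    counts = Counter(script_of(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    scripts = ("hiragana", "katakana", "kanji", "other")
    if total == 0:
        return {name: 0.0 for name in scripts}
    return {name: counts[name] / total for name in scripts}


if __name__ == "__main__":
    # The same sentence written all in hiragana and with kanji.
    for sample in ("にわにはにわにわとりがいる", "庭には二羽鶏がいる"):
        print(sample, script_ratios(sample))
```

Running the same measure over textbook files by grade level would be one way to quantify the hiragana proportion the report describes; the block ranges ignore rarer characters such as the iteration mark 々.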
In 2003
A Japanese language corpus was created from Japanese textbooks: Japanese language (65 files), social studies (209 files), and home economics (11 files), for a total of 285 files. The corpus was morphologically processed and the results were analyzed. Some software was developed for this analysis in 2004. This software for Japanese text analysis was made available as freeware in 2004.
The software made it possible to analyze Japanese efficiently and to collect information from a large-scale corpus to a degree that was not possible before. Graduate students used it for their own Japanese language research. A book introducing a methodology for analyzing Japanese was published, with the software included on an accompanying CD-ROM.
Using the textbook corpus, we also analyzed kanji usage attributes, the word as a unit, the contextual usage of words, and the usage ratios of kanji characters in context. We found that the use of kanji characters depends largely on the type of word they appear in. Prior to this research, no definitive study existed of the usage qualities of kanji, their distribution, and their usage frequencies in written text.
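As a rough illustration of what a kanji usage ratio by word type might look like, the sketch below assumes the text has already been split into (surface form, word type) pairs by a morphological analyzer. The function names, the word-type labels, and the sample tokens are all hypothetical and are not taken from the project's tools or data.

```python
from collections import defaultdict


def contains_kanji(surface: str) -> bool:
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in surface)


def kanji_ratio_by_word_type(tokens):
    """For each word type, return the share of tokens written with kanji.

    `tokens` is an iterable of (surface, word_type) pairs, assumed to come
    from an earlier morphological analysis step.
    """
    totals = defaultdict(int)
    with_kanji = defaultdict(int)
    for surface, word_type in tokens:
        totals[word_type] += 1
        if contains_kanji(surface):
            with_kanji[word_type] += 1
    return {wt: with_kanji[wt] / totals[wt] for wt in totals}


if __name__ == "__main__":
    # Hypothetical, hand-made sample; not data from the textbook corpus.
    sample = [
        ("学校", "noun"), ("に", "particle"), ("行く", "verb"),
        ("ともだち", "noun"), ("と", "particle"), ("あそぶ", "verb"),
    ]
    print(kanji_ratio_by_word_type(sample))
```

Counting at the token level keeps the sketch short; a fuller analysis along the lines the report describes would presumably also track individual kanji and their surrounding context.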
In 2004
Further incremental improvements were made to our methodologies. The usage attributes of kanji and words were analyzed further, taking the paragraph as the unit. English textbooks were collected for the concurrent analysis of an English corpus.

  • Research Products

    (5 results)


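  • [Journal Article] 日本語学習素材作成のための日本語処理ソフトウェア (Japanese-language processing software for creating Japanese learning materials), 2003

    • Author(s)
      SANO Hiroshi
    • Journal Title

      Computer & Education (CIEC journal), Vol. 15

      Pages: 8

    • Description
      From the Research Report Summary (Japanese)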
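  • [Journal Article] 日本語調査用ツール CLTOOL (CLTOOL: a tool for Japanese language surveys), 2003

    • Author(s)
      SANO Hiroshi
    • Journal Title

      Tokyo University of Foreign Studies, 語学研究書論集, No. 8

      Pages: 8

    • Description
      From the Research Report Summary (Japanese)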
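  • [Journal Article] ESP適合の教材コンテンツを実現する語学教育支援システム (A language education support system that delivers ESP-adapted teaching content), 2003

    • Author(s)
      SANO Hiroshi
    • Journal Title

      CIEC (Council for Improvement of Education through Computers), Foreign Language Education Research Group (published as a CD-ROM book)

      Pages: 10

    • Description
      From the Research Report Summary (Japanese)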
  • [Journal Article] Development of a Software for Creating Japanese Educational Materials, 2003

    • Author(s)
      SANO, Hiroshi
    • Journal Title

      Computer & Education, Council for Improvement of Education through Computers, Vol. 15

      Pages: 85-90

    • Description
      From the Research Report Summary (English)
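  • [Book] WindowsPCによる日本語研究法 (Japanese language research methods using a Windows PC), 2003

    • Author(s)
      SANO Hiroshi
    • Total Pages
      148
    • Publisher
      Kyoritsu Shuppan
    • Description
      From the Research Report Summary (Japanese)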


Published: 2006-07-11  
