2004 Fiscal Year Final Research Report Summary
Analysis of all-digitized textbook data and the construction of a large-scale Japanese corpus
Project/Area Number |
14380174
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Information systems (including information and library science)
|
Research Institution | Tokyo University of Foreign Studies |
Principal Investigator |
SANO Hiroshi Tokyo University of Foreign Studies, Faculty of Foreign Studies, Professor (30282776)
|
Co-Investigator (Kenkyū-buntansha) |
SHIBANO Koji Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (50216024)
MINEGISHI Makoto Tokyo University of Foreign Studies, Research Institute for Languages and Cultures of Asia and Africa, Professor (20190712)
FUJIMURA Tomoko Tokyo University of Foreign Studies, Japanese Language Center for International Studies, Associate Professor (20229040)
|
Project Period (FY) |
2002 – 2004
|
Keywords | Japanese language processing / Large-scale Japanese corpus / Mathematical linguistics / Corpus linguistics |
Research Abstract |
Research results

In 2002, we used morphological analysis to process a Japanese corpus of digitized textbook data totaling 285 files, covering the school subjects Japanese language (65 files), social studies (209 files), and home economics (11 files). Because hiragana occurs relatively often in primary school textbooks, the automatic morphological analysis produced many errors. These erroneous results had to be examined manually, from a linguistic standpoint, in order to improve the methodology. We found that the errors were caused not only by the high proportion of hiragana in primary and secondary school textbooks but also by a lack of linguistic consideration in some parts of the morphological analysis. The error types were classified and organized. In 2005, the plan is to refer to this classification and add more textbook data to the corpus for processing. For the morphological analysis, one of the programs used was "ChaSen" (developed by the Matsumoto Lab at Nara Institute of Science and Technology). Additional software was developed to aid in analyzing the morphologically processed corpus, and this Japanese language analysis software was improved over time as it was used. These programs were made available as freeware on the Internet.

In 2003, a Japanese language corpus was created from Japanese textbooks (Japanese language (65 files), social studies (209 files), and home economics (11 files), for a total of 285 files). The corpus was morphologically processed and the results were analyzed. Some software was developed for this analysis in 2004, and this software for Japanese text analysis was made available as freeware in 2004. The software made it possible to analyze Japanese efficiently and to collect information from a large-scale corpus to a degree that was not previously possible. Graduate students used it for their own Japanese language research. A book introducing a methodology for the analysis of Japanese was published, with the software included on a CD-ROM. Using the textbook corpus, our research also analyzed kanji usage attributes, the word as a unit, the contextual usage of words, and the usage ratios of kanji characters in context. We found that the use of kanji characters depends largely on the type of word in which they appear. Prior to this research, no definitive study existed of the usage qualities of kanji, their distribution, and their usage frequencies in written text.

In 2004, further incremental improvements were made to our methodologies. The usage attributes of kanji and words were further analyzed with the paragraph as the unit. English textbooks were collected for the concurrent analysis of an English corpus.
|
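As a rough illustration of the kanji usage ratios by word type mentioned in the abstract, the following Python sketch computes, for each word type, the share of characters that are kanji. It is not the project's software and does not call ChaSen. It assumes the text has already been segmented into hypothetical (surface, word type) pairs, and it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as kanji. The function names and the sample tokens are invented for illustration.

```python
from collections import defaultdict


def is_kanji(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji;
    # extension blocks and compatibility ideographs are ignored in this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_ratio_by_word_type(tokens):
    """Return {word_type: kanji characters / all characters}.

    tokens: iterable of (surface, word_type) pairs, assumed to come from
    some morphological analyzer (hypothetical input format).
    """
    kanji_chars = defaultdict(int)
    total_chars = defaultdict(int)
    for surface, word_type in tokens:
        total_chars[word_type] += len(surface)
        kanji_chars[word_type] += sum(1 for ch in surface if is_kanji(ch))
    return {
        wt: kanji_chars[wt] / total_chars[wt]
        for wt in total_chars
        if total_chars[wt] > 0
    }


# Hypothetical, hand-segmented sample; not taken from the textbook corpus.
sample = [
    ("学校", "noun"),
    ("に", "particle"),
    ("行く", "verb"),
    ("こと", "noun"),
]
ratios = kanji_ratio_by_word_type(sample)
for word_type, ratio in sorted(ratios.items()):
    print(f"{word_type}: {ratio:.2f}")
```

A real analysis along these lines would take its tokens from the output of a morphological analyzer and would need a more careful definition of kanji and of word types than this sketch uses.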
Research Products
(5 results)