Development of a Concordance-Making System Utilizing an Optical Character Reader

Research Project

Project/Area Number	62450054
Research Category	Grant-in-Aid for General Scientific Research (B)
Allocation Type	Single-year Grants
Research Field	Japanese linguistics
Research Institution	National Language Research Institute
Principal Investigator	HIDA Yoshifumi, Director, Department of Language Change, NLRI (40000418)
Co-Investigator(Kenkyū-buntansha)	SAITOO Hidenori, Head, 3rd Research Section, Department of Computational Linguistics, NLRI (70000429); HAYASHI Ooki, Emeritus Researcher, Section for Dictionary Research, NLRI (20000002); KIMURA Mutsuko, Head, Section for Dictionary Research, NLRI; KENBOO Hidetoshi, Researcher, Section for Dictionary Research, NLRI; KATOO Nobuaki, Researcher, Section for Dictionary Research, NLRI
Project Period (FY)	1987 – 1988
Project Status	Completed (Fiscal Year 1988)
Budget Amount	¥6,200,000 (Direct Cost: ¥6,200,000) Fiscal Year 1988: ¥2,000,000 (Direct Cost: ¥2,000,000) Fiscal Year 1987: ¥4,200,000 (Direct Cost: ¥4,200,000)
Keywords	OCR / Concordance / 『尋常小学国語読本』 / 『尋常小学国語読』
Research Abstract	The purpose of this study is to develop a system for making concordances efficiently by using an optical character reader (OCR). The OCR reads and identifies characters hand-written on worksheets, including Katakana, letters of the alphabet, numerals, and other symbols, and feeds them into a computer. It can process information written on the worksheets, such as word units, entry words, parts of speech, and homonym ID codes. The text chosen for this study is the "Jinjoo Shoogaku Kokugo Tokuhon", the state-compiled elementary school readers used nationwide from 1918 to 1938. The text contains about 100,000 words in total. The following work was completed during the two-year period: (1) Word unit identification for each entry word. (2) Entry of style information (spoken, written, dialogue, or verse) for each quotation via OCR worksheets. (3) Entry of data such as entry words, parts of speech, and homonym IDs via OCR worksheets. (4) Readout by the OCR. (5) Correction of data. (6) Programming for processing the revised data. (7) Programming for KWIC output. (8) Printout of KWIC lists.
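As an illustration of the KWIC (keyword-in-context) output step named in the abstract, the following is a minimal sketch in Python. It is not the project's program, which the record does not describe. The `Token` fields mirror the worksheet items listed in the abstract (word unit, entry word, part of speech, homonym ID), but the field names, the `kwic_rows` function, the context width, and the English sample sentence are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str     # word unit as it appears in the text
    entry: str       # entry word (headword) assigned on the worksheet
    pos: str         # part of speech
    homonym_id: int  # separates homonymous entry words


def kwic_rows(tokens, width=2):
    """Build one KWIC row per token, sorted by entry word, homonym ID, position.

    Each row is (entry, homonym_id, position, left, keyword, right).
    """
    rows = []
    for i, tok in enumerate(tokens):
        # Up to `width` word units of context on each side of the keyword.
        left = " ".join(t.surface for t in tokens[max(0, i - width):i])
        right = " ".join(t.surface for t in tokens[i + 1:i + 1 + width])
        rows.append((tok.entry, tok.homonym_id, i, left, tok.surface, right))
    rows.sort(key=lambda r: (r[0], r[1], r[2]))
    return rows


# Hypothetical sample data, not taken from the readers.
sample = [
    Token("the", "the", "DET", 1),
    Token("cat", "cat", "N", 1),
    Token("saw", "see", "V", 1),
    Token("the", "the", "DET", 1),
    Token("dog", "dog", "N", 1),
]

if __name__ == "__main__":
    for entry, hid, pos, left, kw, right in kwic_rows(sample):
        print(f"{left:>12} [{kw}] {right:<12} {entry}#{hid}")
```

Sorting on the entry word rather than the surface form groups inflected occurrences (here "saw" under "see") into one block of the list, which is one reason to record entry words and homonym IDs separately from the word units.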

Report

(3 results)

1988 Annual Research Report / Final Research Report Summary
1987 Annual Research Report