1992 Fiscal Year Final Research Report Summary

ON THE OCR APPROACH TO CREATING FULL-TEXT DATA BASE OF JAPANESE CLASSICAL LITERATURE

Research Project

Project/Area Number	04610271
Research Category	Grant-in-Aid for General Scientific Research (C)
Allocation Type	Single-year Grants
Research Field	Japanese Literature
Research Institution	National Institute of Japanese Literature
Principal Investigator	HARA Shoichiro, National Institute of Japanese Literature, Research Information Department, Associate Professor (50218616)
Project Period (FY)	1992
Keywords	IMAGE PROCESSING / JAPANESE CLASSICAL LITERATURE / OCR / IMAGE CLASSIFICATION / DISCRIMINANT THRESHOLD SELECTION METHOD / CLUSTER ANALYSIS / NOISE REDUCTION
Research Abstract	We studied a new approach to reducing the image noise that interferes with optical character recognition (OCR). The distinctive feature of the study is its use of color information to better separate "true" letters from image noise such as red letters, the paper itself, and pseudo-letters (writing on the reverse side of translucent paper). Original Japanese classical books written in black Chinese ink on white traditional Japanese paper were selected as the research samples. The results are as follows. (1) Characteristics of the color distribution: The original images were digitized with a color image scanner (100 dpi, 256 gray levels for each of R, G, and B), and each picture cell (pixel) was represented as a 3-dimensional vector in RGB chromaticity coordinates and then analyzed. The color distribution has the following characteristics: (a) most pixels are distributed along the line R=G=B; (b) red letters have a color distribution different from that in (a); (c) the brightness histograms of the R, G, and B channels are almost bimodal. (2) Classification of images: (a) Characteristics (a) and (b) in (1) are useful for distinguishing red letters from the other image components. (b) The discriminant threshold selection method (Otsu's method) was applied to each brightness histogram to determine thresholds between black-letter and paper segments. This method separates the two segments sharply, but it tends to slice off the peripheral pixels of the "true" black letters. (c) Cluster analysis was introduced to classify "true" black letters and paper segments more precisely, and it gave better results. This study verifies the usefulness of color information for eliminating image noise.
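The two classification steps described in the abstract can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the `min_dist` cutoff of 40, and the use of mean brightness (rather than separate R, G, and B histograms, as in the study) are assumptions made here for brevity. Red letters are flagged by their distance from the gray axis R=G=B, and the remaining pixels are split into ink and paper with the discriminant threshold selection method (Otsu's method), which picks the threshold that maximizes the between-class variance of a histogram.

```python
import math


def distance_from_gray_axis(pixel):
    """Euclidean distance of an (R, G, B) pixel from the line R=G=B."""
    r, g, b = pixel
    m = (r + g + b) / 3.0  # projection onto the gray axis
    return math.sqrt((r - m) ** 2 + (g - m) ** 2 + (b - m) ** 2)


def is_red(pixel, min_dist=40.0):
    """Flag a pixel as red: far from the gray axis and R dominant.

    min_dist is a hypothetical cutoff, not a value from the study.
    """
    r, g, b = pixel
    return distance_from_gray_axis(pixel) > min_dist and r > g and r > b


def otsu_threshold(hist):
    """Return the level t that maximizes between-class variance.

    Levels <= t form the dark class, levels > t the bright class.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of levels in the dark class so far
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def classify_pixels(pixels, min_dist=40.0):
    """Label each 8-bit (R, G, B) pixel as 'red', 'ink', or 'paper'."""
    red_flags = [is_red(p, min_dist) for p in pixels]
    # Brightness histogram of the non-red pixels only.
    hist = [0] * 256
    for p, red in zip(pixels, red_flags):
        if not red:
            hist[sum(p) // 3] += 1
    t = otsu_threshold(hist)
    labels = []
    for p, red in zip(pixels, red_flags):
        if red:
            labels.append("red")
        else:
            labels.append("ink" if sum(p) // 3 <= t else "paper")
    return labels


if __name__ == "__main__":
    sample = [(30, 28, 27)] * 5 + [(230, 225, 215)] * 5 + [(200, 60, 50)] * 2
    print(classify_pixels(sample))
```

A hard threshold of this kind reproduces the limitation noted in the abstract: pixels at the edge of a stroke have intermediate brightness and can fall on the paper side of the threshold, which is the case the study addressed with cluster analysis.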