Study on the Digitization Support System for Historical Documents
Grant-in-Aid for Scientific Research (B)
|Allocation Type||Single-year Grants |
|Research Institution||National Institute of Japanese Literature |
HARA Shoichiro National Institute of Japanese Literature, Department of Interdisciplinary Studies, Associate Professor (50218616)
YASUNAGA Hisashi National Institute of Japanese Literature, Department of Interdisciplinary Studies, Professor (20017411)
SHIBAYAMA Mamoru Kyoto University, Center for Southeast Asian Studies, Professor (10162645)
YAMADA Shoji International Research Center for Japanese Studies, Research Department, Associate Professor (20248751)
AIDA Mitsuru National Institute of Japanese Literature, Department of Literary Development Studies, Research Associate (00249921)
IWASAKI Hiroyuki Tokiwa University, College of Community Development, Professor (50087904)
KATSUMURA Tetsuya University of Shimane, Faculty of Policy Studies, Professor (50066411)
|Project Period (FY)
2002 – 2004
Completed (Fiscal Year 2004)
|Budget Amount
¥9,700,000 (Direct Cost : ¥9,700,000)
Fiscal Year 2004 : ¥3,100,000 (Direct Cost : ¥3,100,000)
Fiscal Year 2003 : ¥3,800,000 (Direct Cost : ¥3,800,000)
Fiscal Year 2002 : ¥2,800,000 (Direct Cost : ¥2,800,000)
|Keywords||OCR / Image Processing / Multi-resolution Analysis / n-gram / Historical Document / Historical Manuscript OCR|
Historical documents often suffer from wormholes and age-related discoloration, and seals and annotations sometimes overlap the characters. These defects make character extraction and recognition difficult. Moreover, characters in classical texts are often written cursively.
Segmenting a cursive string into individual characters is therefore important. Because of these difficulties in preprocessing historical documents, new methods of character segmentation were examined.
The proposed method begins with a chain of filters: a color filter that extracts candidate character pixels according to their color, several noise-reduction filters, and converters that produce a grayscale image and then a binarized image. Layout information, namely whether the text is written vertically or horizontally and the average character size on a page, is obtained by analyzing peripheral projection histograms. Characters are then built up gradually from pixels. Finally, a cursive string is segmented, in principle, along the line connecting the nearest concavities on the same contour. The strength of the new method is that it avoids the need for language-specific knowledge of character styles and for layout information.
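The layout analysis step can be illustrated with a minimal sketch. It assumes a binarized page given as a list of rows of 0/1 values (1 = ink). The function names, the variance heuristic for choosing the writing direction, and the use of line thickness as a stand-in for character size are illustrative assumptions, not the procedure used in the project.

```python
from statistics import pvariance


def projection_profiles(binary):
    """Return (row_profile, col_profile) as the ink fraction of each row and column."""
    height, width = len(binary), len(binary[0])
    rows = [sum(row) / width for row in binary]
    cols = [sum(col) / height for col in zip(*binary)]
    return rows, cols


def run_lengths(profile):
    """Lengths of consecutive runs of non-empty profile entries (text lines)."""
    runs, current = [], 0
    for value in profile:
        if value > 0:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def analyze_layout(binary):
    """Guess the writing direction and estimate the average character size.

    Assumption: the profile taken across the text lines alternates between
    ink and blank gaps, so it has the larger variance of the two profiles.
    """
    rows, cols = projection_profiles(binary)
    if pvariance(rows) >= pvariance(cols):
        direction, line_profile = "horizontal", rows
    else:
        direction, line_profile = "vertical", cols
    runs = run_lengths(line_profile)
    # Line thickness approximates character height (or width for vertical text).
    average_size = sum(runs) / len(runs) if runs else 0.0
    return direction, average_size


# Toy page: two horizontal text lines, each two pixels thick.
page = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(analyze_layout(page))
```

Transposing the toy page turns its text lines into columns, so the same function then reports a vertical layout with the same size estimate.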
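The final segmentation step, cutting along the line that joins the nearest concavities on one contour, can be sketched as follows. The sketch assumes the contour is already available as a closed polygon of (x, y) vertices and treats a vertex as a concavity when the contour turns against its overall orientation there. The names `concave_indices` and `nearest_concavity_cut` are hypothetical, and the abstract does not specify how concavities are actually detected.

```python
from itertools import combinations
from math import dist


def signed_area(contour):
    """Shoelace formula: positive for counter-clockwise vertex order."""
    n = len(contour)
    return 0.5 * sum(
        contour[i][0] * contour[(i + 1) % n][1]
        - contour[(i + 1) % n][0] * contour[i][1]
        for i in range(n)
    )


def concave_indices(contour):
    """Indices of vertices where the contour turns against its orientation."""
    n = len(contour)
    orientation = 1 if signed_area(contour) > 0 else -1
    result = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = contour[i - 1], contour[i], contour[(i + 1) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross * orientation < 0:
            result.append(i)
    return result


def nearest_concavity_cut(contour):
    """Return the closest pair of concave vertices as a candidate cut, or None."""
    n = len(contour)
    best = None
    for i, j in combinations(concave_indices(contour), 2):
        if (j - i) % n in (1, n - 1):
            continue  # neighbours on the contour already share an edge
        d = dist(contour[i], contour[j])
        if best is None or d < best[0]:
            best = (d, contour[i], contour[j])
    return None if best is None else (best[1], best[2])


# Toy contour: two touching blobs with a notch at the bottom and at the top.
blob = [(0, 0), (3, 0), (4, 1), (5, 0), (8, 0),
        (8, 3), (5, 3), (4, 2), (3, 3), (0, 3)]
print(nearest_concavity_cut(blob))
```

In this toy contour the only concavities are the two notch vertices, so the candidate cut is the short vertical segment between them. Because every vertex is tested independently, a single noisy pixel on a real contour can create a spurious concavity.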
The drawback of this procedure is that its results are strongly affected by the local shape of contours. To compensate, a form of multi-resolution analysis is introduced. The basic idea is to blur the original image I by convolving it with a Gaussian function G, giving G * I, and then to apply the Laplacian operator ∇², locating edges where ∇²(G * I) = (∇²G) * I = 0. The Gaussian acts as a low-pass filter that wipes out structures at scales smaller than its parameter σ (the standard deviation), so the image becomes coarser as σ grows. A concavity that survives in a coarser image indicates that the shape changes substantially at that point. The key observation is that a large change of shape in a coarser image is also preserved in the detailed image; that is, separation lines found in a coarser image must also exist in the detailed image. Experiments showed that this method is more robust than the contour-based procedure described above in choosing appropriate lines for segmenting a cursive string. Multi-resolution analysis by wavelets is introduced to facilitate this procedure.
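For reference, the operator in this paragraph can be written out explicitly. The block below is a sketch that assumes an isotropic two-dimensional Gaussian with standard deviation σ and the non-unitary angular-frequency Fourier convention; the abstract itself does not state the normalization used in the project.

```latex
% Assumed kernel: isotropic 2-D Gaussian with standard deviation \sigma
G_\sigma(x, y) = \frac{1}{2\pi\sigma^2}
    \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)

% Differentiation commutes with convolution, so blurring followed by the
% Laplacian equals a single convolution with the Laplacian of the Gaussian
\nabla^2 (G_\sigma * I) = (\nabla^2 G_\sigma) * I,
\qquad
\nabla^2 G_\sigma(x, y) = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4}\, G_\sigma(x, y)

% Frequency responses (\omega is the 2-D angular frequency)
\hat{G}_\sigma(\omega) = e^{-\sigma^2 |\omega|^2 / 2},
\qquad
\widehat{\nabla^2 G_\sigma}(\omega) = -|\omega|^2\, e^{-\sigma^2 |\omega|^2 / 2}
```

The Gaussian factor alone attenuates high frequencies, which is what removes structures smaller than σ. The combined operator ∇²G suppresses both very low and very high frequencies, and its response peaks at |ω| = √2/σ, so increasing σ shifts the analysis toward coarser structures.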
In addition, title extraction using page layout information and handwritten character recognition using n-grams were carried out as preliminary studies toward full character recognition.
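The abstract gives no detail on how n-grams were applied, so the following is only a generic sketch of one common use: choosing among per-character recognition candidates with a character bigram model. The toy corpus, the `START` marker, the add-one smoothing, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log

START = "^"  # hypothetical start-of-string marker


def train_bigrams(corpus):
    """Count character bigrams and how often each character occurs as a history."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for text in corpus:
        padded = START + text
        vocab.update(text)
        for a, b in zip(padded, padded[1:]):
            history[a] += 1
            bigrams[(a, b)] += 1
    return history, bigrams, len(vocab)


def log_prob(text, model):
    """Log-probability of a string under the bigram model with add-one smoothing."""
    history, bigrams, vocab_size = model
    padded = START + text
    return sum(
        log((bigrams[(a, b)] + 1) / (history[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def best_reading(candidates, model):
    """Pick the most probable string from per-position candidate characters."""
    readings = ("".join(chars) for chars in product(*candidates))
    return max(readings, key=lambda s: log_prob(s, model))


# Toy data: three training strings and ambiguous candidates for three glyphs.
model = train_bigrams(["the cat", "the hat", "that"])
candidates = [["t", "f"], ["h", "b"], ["e", "c"]]
print(best_reading(candidates, model))
```

Enumerating every combination is practical only for short strings. For longer ones, a dynamic-programming search such as the Viterbi algorithm finds the same best reading without exhaustive enumeration.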
Research Products (10 results)