Study on the Digitization Support System for Historical Documents

Research Project

Project/Area Number	14310166
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Japanese history
Research Institution	National Institute of Japanese Literature
Principal Investigator	HARA Shoichiro, National Institute of Japanese Literature, Department of Interdisciplinary Studies, Associate Professor (50218616)
Co-Investigator (Kenkyū-buntansha)	YASUNAGA Hisashi, National Institute of Japanese Literature, Department of Interdisciplinary Studies, Professor (20017411); SHIBAYAMA Mamoru, Kyoto University, Center for Southeast Asian Studies, Professor (10162645); YAMADA Shoji, International Research Center for Japanese Studies, Research Department, Associate Professor (20248751); AIDA Mitsuru, National Institute of Japanese Literature, Department of Literary Development Studies, Research Associate (00249921); IWASAKI Hiroyuki, Tokiwa University, College of Community Development, Professor (50087904); KATSUMURA Tetsuya (勝村哲也), University of Shimane, Faculty of Policy Studies, Professor (50066411)
Project Period (FY)	2002 – 2004
Project Status	Completed (Fiscal Year 2004)
Budget Amount	¥9,700,000 (Direct Cost: ¥9,700,000); Fiscal Year 2004: ¥3,100,000 (Direct Cost: ¥3,100,000); Fiscal Year 2003: ¥3,800,000 (Direct Cost: ¥3,800,000); Fiscal Year 2002: ¥2,800,000 (Direct Cost: ¥2,800,000)
Keywords	OCR / Image Processing / Multi-resolution Analysis / n-gram / Historical Document / OCR for Historical Documents (古文書OCR)
Research Abstract	Classical documents often suffer from wormholes and discoloration due to aging, and seals and annotations sometimes overlap the characters. These defects make character extraction and recognition difficult. Moreover, characters in classical texts are often cursive, so segmenting a cursive string into individual characters is important. Because of these problems in preprocessing historical documents, new means of character segmentation were examined. The proposed method begins with filtering: a color filter that extracts candidate character pixels according to their color, several noise reduction filters, and converters that create gray-scale images and then binarized images. Layout information, namely whether the text is written vertically or horizontally, and the average character size on a page are obtained by analyzing a peripherally projected histogram. Characters are then built up gradually from pixels. Finally, a cursive string is segmented basically along the line connecting the nearest concavities on the same contour. The strength of this method is that it needs neither language-specific knowledge of character styles nor layout information. Its weakness is that the results are strongly affected by the local shapes of contours. To compensate for this problem, a kind of multi-resolution analysis is introduced. The basic idea is that an original image $I$ is blurred by convolving it with a Gaussian function $G$, giving $G * I$, and then the Laplacian operator $\nabla^2$ is applied, $\nabla^2 (G * I) = (\nabla^2 G) * I = 0$, to get edges. The Gaussian function behaves as a low-pass filter that wipes out structures at scales smaller than the parameter $\sigma$ (the standard deviation). As $\sigma$ becomes larger, the picture becomes coarser. A concavity that is conserved in a coarser picture indicates that the shape changes greatly around it.
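The blur-then-Laplacian idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names (`blur`, `log_response`, `zero_crossings`), the 3σ kernel truncation, the 5-point Laplacian stencil, and the `min_strength` threshold are all assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about 3 sigma, normalised to sum to 1."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur (G * I) with edge-replicated borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    # Convolve along rows, then along columns; "valid" undoes the padding.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def laplacian(image):
    """Discrete 5-point Laplacian with edge-replicated borders."""
    p = np.pad(image, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def log_response(image, sigma):
    """Laplacian of the Gaussian-blurred image at scale sigma."""
    return laplacian(blur(image, sigma))

def zero_crossings(response, min_strength=1e-6):
    """Mark pixels whose right or lower neighbour has the opposite sign.

    Both values must exceed min_strength in magnitude (an assumed threshold)
    so that floating-point noise in flat regions is not reported as an edge.
    """
    zc = np.zeros(response.shape, dtype=bool)
    a, b = response[:, :-1], response[:, 1:]
    zc[:, :-1] |= (a * b < 0) & (np.minimum(np.abs(a), np.abs(b)) > min_strength)
    a, b = response[:-1, :], response[1:, :]
    zc[:-1, :] |= (a * b < 0) & (np.minimum(np.abs(a), np.abs(b)) > min_strength)
    return zc

# Usage: compare edge maps of the same image at a fine and a coarse scale.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
fine_edges = zero_crossings(log_response(image, 1.0))
coarse_edges = zero_crossings(log_response(image, 3.0))
```

Running the same image through several values of `sigma` gives the kind of fine-to-coarse stack the abstract refers to; how concavities are then matched across scales is not specified in the abstract and is left out here.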
The important point is that a large change of shape in a coarser picture is also conserved in the detailed picture; that is, separation lines found in a coarser picture must also exist in the detailed picture. Experiments showed that this method is more robust than the aforementioned procedure in choosing appropriate lines for segmenting a cursive string. Multi-resolution analysis by wavelets is introduced to facilitate this procedure. In addition, title extraction using page layout information and handwritten character recognition using n-grams were carried out as preliminary examinations for full character recognition.
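The layout analysis step mentioned in the abstract (deciding between vertical and horizontal writing and estimating a size from a projected histogram) might look roughly like the following sketch. It uses plain row and column projection profiles, which is a simplification and not necessarily the peripheral histogram used in the project; the names `ink_runs` and `guess_layout` and the "more separate runs wins" heuristic are assumptions made for illustration.

```python
import numpy as np

def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-empty bins."""
    inked = np.concatenate(([False], np.asarray(profile) > 0, [False]))
    edges = np.flatnonzero(inked[1:] != inked[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]

def guess_layout(binary):
    """Guess writing direction and mean line thickness of a binarized page.

    binary: 2-D array in which nonzero pixels are ink.
    Assumed heuristic: text lines show up as separate runs in one projection
    profile, so the profile with more runs decides the direction.
    Ties fall back to "horizontal", which is an arbitrary choice.
    """
    ink = np.asarray(binary) > 0
    col_runs = ink_runs(ink.sum(axis=0))  # ink per x position
    row_runs = ink_runs(ink.sum(axis=1))  # ink per y position
    if not col_runs:
        return ("unknown", 0.0)  # blank page
    if len(col_runs) > len(row_runs):
        direction, runs = "vertical", col_runs
    else:
        direction, runs = "horizontal", row_runs
    mean_thickness = sum(end - start for start, end in runs) / len(runs)
    return (direction, float(mean_thickness))

# Usage: a toy page with two vertical text lines, each 4 pixels wide.
page = np.zeros((20, 20), dtype=int)
page[1:19, 2:6] = 1
page[1:19, 10:14] = 1
direction, thickness = guess_layout(page)
```

For vertically written text, the mean line thickness is a rough stand-in for the average character width; a real page would first need the filtering and binarization steps described in the abstract.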

Report

(4 results)

2004 Annual Research Report
2004 Final Research Report Summary
2003 Annual Research Report
2002 Annual Research Report

Research Products
(10 results)

[Journal Article] OCR for Japanese Classical Documents - Segmentation of Cursive Characters (2004)
- Author(s)
  Shoichiro HARA
- Journal Title
  
  PNC 2004 Annual Conference in Conjunction with PRDLA Program Abstracts
  
  Pages: 121-121
- Description
  From the Final Research Report Summary (Japanese version)
- Related Report
  2004 Annual Research Report; 2004 Final Research Report Summary
[Journal Article] OCR for Japanese Classical Documents -Segmentation of Cursive Characters- (2004)
- Author(s)
  Shoichiro HARA
- Journal Title
  
  PNC 2004 Annual Conference in Conjunction with PRDLA Program Abstracts
  
  Pages: 121-121
- Description
  From the Final Research Report Summary (English version)
- Related Report
  2004 Final Research Report Summary
[Journal Article] OCR for Japanese Classical Documents (2003)
- Author(s)
  Shoichiro Hara, Mamoru Shibayama
- Journal Title
  
  2003 PNC Annual Conference and Joint Meetings Program and Abstracts
  
  Pages: 126-127
- Description
  From the Final Research Report Summary (Japanese version)
- Related Report
  2004 Final Research Report Summary
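[Journal Article] Character Segmentation for OCR of Historical Documents (古文書OCRのための文字切り出し) (2002)
- Author(s)
  HARA Shoichiro (原正一郎)
- Journal Title
  
  IPSJ SIG Technical Report (情報処理学会研究報告) 2002-CH-55, Vol.2002, No.73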
  
  Pages: 51-56
- NAID
  110002930162
- Description
  From the Final Research Report Summary (Japanese version)
- Related Report
  2004 Final Research Report Summary
[Journal Article] OCR for Japanese Classical Documents - Segmentation of Cursive Characters - (2002)
- Author(s)
  Shoichiro Hara
- Journal Title
  
  Conference Proceedings IEEE ICITA 2002 (CD-ROM)
- Description
  From the Final Research Report Summary (Japanese version)
- Related Report
  2004 Final Research Report Summary
[Journal Article] Segmentation of Cursive Character for Classical Literal OCR (2002)
- Author(s)
  Shoichiro HARA
- Journal Title
  
  IPSJ SIG Technical Report 2002-CH-55 Vol.2002,No.73
  
  Pages: 51-56
- NAID
  110002930162
- Description
  From the Final Research Report Summary (English version)
- Related Report
  2004 Final Research Report Summary
[Journal Article] OCR for Japanese Classical Documents -Segmentation of Cursive Characters- (2002)
- Author(s)
  Shoichiro Hara
- Journal Title
  
  Conference Proceedings IEEE ICITA 2002 (149-10)1-6(CD-ROM)
- Description
  From the Final Research Report Summary (English version)
- Related Report
  2004 Final Research Report Summary
[Publications] Shoichiro Hara, Mamoru Shibayama: "OCR for Japanese Classical Documents" 2003 PNC Annual Conference and Joint Meetings Program and Abstracts. 126-127 (2003)
- Related Report
  2003 Annual Research Report
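[Publications] HARA Shoichiro (原正一郎): "Character Segmentation for OCR of Historical Documents (古文書OCRのための文字切り出し)" IPSJ SIG Technical Report (情報処理学会研究報告) 2002-CH-55. Vol.2002, No.73. 51-56 (2002)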
- Related Report
  2002 Annual Research Report
[Publications] Shoichiro Hara: "OCR for Japanese Classical Documents -Segmentation of Cursive Characters-" Conference Proceedings IEEE ICITA 2002. (CD-ROM). (149-10)1-(149-10)6 (2002)
- Related Report
  2002 Annual Research Report
