Text Recognition of Historical Japanese Documents

Research Project

Project/Area Number	18K19800
Research Category	Grant-in-Aid for Challenging Research (Exploratory)
Allocation Type	Multi-year Fund
Review Section	Medium-sized Section 61: Human informatics and related fields
Research Institution	Gunma University
Principal Investigator	Nagai Ayumu Gunma University, Graduate School of Science and Technology, Assistant Professor (70375567)
Project Period (FY)	2018-06-29 – 2023-03-31
Project Status	Completed (Fiscal Year 2022)
Budget Amount	¥2,340,000 (Direct Cost: ¥1,800,000, Indirect Cost: ¥540,000); Fiscal Year 2020: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000); Fiscal Year 2019: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000); Fiscal Year 2018: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
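Keywords	kuzushiji recognition / character recognition / deep learning / transcription / cursive characters / language model / data expansion / historical documents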
Outline of Final Research Achievements	We developed a deep neural network that takes an image of a line of historical Japanese cursive as input and outputs its transcription. We placed 2nd out of 41 teams in the PRMU competition, where the task was recognition of 3-character strings, and 6th out of 293 teams in the Kaggle contest, where the task was recognition of whole pages. Furthermore, with the aim of improving the recognition rate for autograph (handwritten) historical Japanese cursive, we developed a system that takes as input a page image of cursive and the ground-truth text for that page, and outputs pairs consisting of a line image and its ground-truth text. The resulting data is the first dataset of appreciable size to consist solely of autograph historical Japanese cursive. Training on this data together with the conventional public data improved the accuracy rate by about 4.5% compared with training on the conventional data alone.
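Academic Significance and Societal Importance of the Research Achievements	The significance of this research lies in using computers to automatically transcribe, that is, convert into modern print type, woodblock-printed books and manuscripts written in cursive script. More than 99% of the historical documents of the Edo period have not been transcribed, and they constitute the largest body of written culture that remains untapped. However, most people today cannot read them easily. Reading historical documents requires specialized knowledge and training, and at present there is an overwhelming shortage of people able to do so. To address this problem, this research contributed to the automatic transcription of historical documents by computer. Accuracy on woodblock-printed books now reaches around 95%, but the cursive script of handwritten historical documents, as opposed to printed ones, still contains many characters that are hard to read. With the recognition of such difficult cursive characters also in view, we proposed one method for raising the accuracy rate.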