
Text Recognition of Historical Japanese Documents

Research Project

Project/Area Number 18K19800
Research Category

Grant-in-Aid for Challenging Research (Exploratory)

Allocation Type Multi-year Fund
Review Section Medium-sized Section 61: Human informatics and related fields
Research Institution Gunma University

Principal Investigator

Nagai Ayumu  Gunma University, Graduate School of Science and Technology, Assistant Professor (70375567)

Project Period (FY) 2018-06-29 – 2023-03-31
Project Status Completed (Fiscal Year 2022)
Budget Amount
¥2,340,000 (Direct Cost: ¥1,800,000, Indirect Cost: ¥540,000)
Fiscal Year 2020: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000)
Fiscal Year 2019: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000)
Fiscal Year 2018: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
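Keywords Kuzushiji (cursive script) recognition / Character recognition / Deep learning / Transcription / Cursive characters / Language model / Data expansion / Historical documents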
Outline of Final Research Achievements

We developed a deep neural network that takes an image of a line of historical Japanese cursive as input and outputs its transcription. We placed 2nd out of 41 teams in the PRMU competition, where the task was recognition of 3-character strings, and 6th out of 293 teams in the Kaggle contest, where the task was page-level recognition.
Furthermore, to improve the recognition rate for autograph historical Japanese cursive, we developed a system that takes page images of cursive together with the ground-truth text of each page and outputs pairs of a line image and its ground-truth text. This is the first dataset of appreciable size that consists solely of autograph historical Japanese cursive. Training on this data together with the conventional public data improved the accuracy rate by about 4.5% compared with training on the conventional data alone.
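
This record does not describe how the project's system aligns page-level text with line images, so the following is only a naive illustrative baseline and not the project's method. It assumes a binarized page (1 = ink, 0 = background), vertical text lines read right to left, and ground-truth text that already has one text line per row; the names find_columns and pair_lines are hypothetical.

```python
from typing import List, Tuple

Page = List[List[int]]  # binary image, rows of pixels: 1 = ink, 0 = background


def find_columns(page: Page, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) x-ranges of ink-bearing runs, left to right."""
    width = len(page[0]) if page else 0
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]
    runs: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = x                      # a text line begins
        elif ink < min_ink and start is not None:
            runs.append((start, x))        # a text line ends at a blank gap
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


def pair_lines(page: Page, page_text: str) -> List[Tuple[Page, str]]:
    """Pair each cropped line image with one ground-truth line (assumed pre-split)."""
    runs = find_columns(page)
    runs.reverse()                         # assumed right-to-left reading order
    texts = [t for t in page_text.splitlines() if t]
    if len(runs) != len(texts):
        raise ValueError(f"{len(runs)} line images but {len(texts)} text lines")
    return [([row[a:b] for row in page], t) for (a, b), t in zip(runs, texts)]
```

A real page would need more than this: touching or slanted lines defeat a plain projection profile, and page-level ground truth without line breaks requires an actual alignment step between text and image.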

Academic Significance and Societal Importance of the Research Achievements

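The significance of this research lies in using computers to automatically convert woodblock-printed books and manuscripts written in cursive script (kuzushiji) into modern type, that is, to transcribe them. More than 99% of Edo-period historical documents have never been transcribed, and they constitute the last and largest body of written culture still left untouched. However, most people today cannot read them easily. Reading historical documents requires specialized knowledge and training, and at present there is an overwhelming shortage of people able to do so. To address this problem, this research contributed to the automatic transcription of historical documents by computer. Recognition accuracy for woodblock-printed books is now around 95%, but autograph documents, which are handwritten rather than printed, still contain many cursive characters that are hard to read. With the recognition of these more difficult cursive characters also in view, we proposed one method for raising the accuracy rate.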

Report

(6 results)
  • 2022 Annual Research Report   Final Research Report ( PDF )
  • 2021 Research-status Report
  • 2020 Research-status Report
  • 2019 Research-status Report
  • 2018 Research-status Report
  • Research Products

    (3 results)


All Presentation (2 results) (of which Int'l Joint Research: 2 results) Remarks (1 results)

  • [Presentation] Generation of a Large-Scale Line Image Dataset with Ground Truth Texts from Page-Level Autograph Documents (2021)

    • Author(s)
      Ayumu Nagai
    • Organizer
      International Conference on Neural Information Processing (ICONIP 2021)
    • Related Report
      2021 Research-status Report
    • Int'l Joint Research
  • [Presentation] On the Improvement of Recognizing Single-line Strings of Japanese Historical Cursive (2019)

    • Author(s)
      Ayumu Nagai
    • Organizer
      15th International Conference on Document Analysis and Recognition
    • Related Report
      2019 Research-status Report
    • Int'l Joint Research
  • [Remarks] Autograph kuzushiji (cursive script) data (with ground truth)

    • URL

      https://gadget.inf.gunma-u.ac.jp/dl/autograph.tar.gz

    • Related Report
      2021 Research-status Report

Published: 2018-07-25   Modified: 2024-01-30  
