Image-based contents analysis for untranscribed document image archives

Research Project

Project/Area Number	17K00241
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Research Field	Perceptual information processing
Research Institution	Future University-Hakodate
Principal Investigator	Terasawa Kengo, Future University-Hakodate, School of Systems Information Science, Associate Professor (10435985)
Project Period (FY)	2017-04-01 – 2020-03-31
Project Status	Completed (Fiscal Year 2019)
Budget Amount	¥2,730,000 (Direct Cost: ¥2,100,000, Indirect Cost: ¥630,000); Fiscal Year 2019: ¥910,000 (Direct Cost: ¥700,000, Indirect Cost: ¥210,000); Fiscal Year 2018: ¥910,000 (Direct Cost: ¥700,000, Indirect Cost: ¥210,000); Fiscal Year 2017: ¥910,000 (Direct Cost: ¥700,000, Indirect Cost: ¥210,000)
Keywords	Recognition of images, text, speech, etc. / Pattern recognition / Database / Digital archive
Outline of Final Research Achievements	In this study, we extracted frequently appearing words from untranscribed document images that are not machine-readable, and evaluated the importance of the extracted words, using image-based analysis of the frequency and pattern of occurrence of certain text strings. We also succeeded in summarizing the content of each document and in extracting the parts that are highly related to a specific topic. We conducted an experiment on untranscribed newspaper images published in the Meiji era and confirmed the performance and effectiveness of the proposed method. Our results will promote the effective use of digital archives of document images.
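Academic Significance and Societal Importance of the Research Achievements	The results of this research make it possible to present viewers with a summary of the content, and with the passages most closely related to a specific topic, even for document images that are difficult to read by machine because they are handwritten or have deteriorated with age. As a result, the digital archives of document images that are being developed and accumulated in many regions are expected to grow in value as easy-to-use, convenient reference materials, not only for specialist researchers but also for the general public and for people interested in local history.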