2021 Fiscal Year Final Research Report

Devanagari OCR and Sanskrit E-Text Archive

Project/Area Number	20K20692
Research Category	Grant-in-Aid for Challenging Research (Exploratory)
Allocation Type	Multi-year Fund
Review Section	Medium-sized Section 2: Literature, linguistics, and related fields
Research Institution	The University of Tokyo
Principal Investigator	KATO Takahiro, The University of Tokyo, Graduate School of Humanities and Sociology (Faculty of Letters), Associate Professor (80637934)
Project Period (FY)	2020-07-30 – 2022-03-31
Keywords	Sanskrit / OCR / Devanagari
Outline of Final Research Achievements	This project aimed to develop a Devanagari OCR system. We surveyed and evaluated earlier and currently developed OCR software. We also reviewed the Devanagari writing system and described how we map each combining letter to the Unicode encoding scheme. We treated each letter as a composite of several elements and, on that basis, defined a unit of writing called the "character shape." We then described how we designed the training data from which an AI-OCR model is generated, and generated the model through machine learning on the prepared datasets. The following is a brief overview of the results obtained with the generated AI-OCR model. Results of single-character recognition (out of 2,434 sample letters): a. 2,340 letters exactly recognized (accuracy rate 96.14%); b. 2,397 letters correctly listed (accuracy rate 98.48%).
Free Research Field	Indian philosophy and Sanskrit philology
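Academic Significance and Societal Importance of the Research Achievements	The Devanagari OCR developed in this project is intended first of all for converting printed editions of Sanskrit texts into text data. As an extension of that goal, we also have in view its application to the large number of Sanskrit manuscripts preserved in India and abroad. Materials that were once recorded on microfilm are now being archived electronically through digital photography and digital scanning. From here on, converting such manuscript materials into text data, and then structuring that data, will likely become necessary. This joint OCR development project was undertaken in anticipation of those advances in research.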
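
The outline above mentions mapping combining letters to Unicode and reports two accuracy rates. The following is only an illustrative sketch of those two ideas in Python, not the project's own tooling. The helper names `code_points` and `accuracy` are hypothetical, and the conjunct क्ष is an example chosen here, not one taken from the report. It shows how one printed shape can correspond to several Unicode code points, and it recomputes the reported percentages from the reported counts.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def accuracy(correct, total):
    """Return the percentage of correct items, rounded to two decimals."""
    return round(100 * correct / total, 2)


# Illustrative example (an assumption, not from the report): the conjunct
# "kṣa" is printed as one shape but encoded as KA + VIRAMA + SSA.
conjunct = "\u0915\u094D\u0937"
for cp, name in code_points(conjunct):
    print(cp, name)

# Counts as reported in the outline: 2,434 sample letters in total.
print(accuracy(2340, 2434))  # letters exactly recognized
print(accuracy(2397, 2434))  # letters correctly listed
```

An OCR model that predicts whole printed shapes therefore needs some table from each shape to its code point sequence. How the project's "character shape" unit is actually defined is not specified in this summary.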