Devanagari OCR and Sanskrit E-Text Archive

Research Project

Project/Area Number	20K20692
Research Category	Grant-in-Aid for Challenging Research (Exploratory)
Allocation Type	Multi-year Fund
Review Section	Medium-sized Section 2: Literature, linguistics, and related fields
Research Institution	The University of Tokyo
Principal Investigator	KATO Takahiro The University of Tokyo, Graduate School of Humanities and Sociology (Faculty of Letters), Associate Professor (80637934)
Project Period (FY)	2020-07-30 – 2022-03-31
Project Status	Completed (Fiscal Year 2021)
Budget Amount	¥6,370,000 (Direct Cost: ¥4,900,000, Indirect Cost: ¥1,470,000)
Fiscal Year 2021: ¥2,340,000 (Direct Cost: ¥1,800,000, Indirect Cost: ¥540,000)
Fiscal Year 2020: ¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000)
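Keywords	Sanskrit / OCR / Devanagari / optical character recognition / database
Outline of Research at the Start	This research develops optical character recognition (OCR) software for reading Devanagari, one of the Indic scripts used for languages such as Hindi, Sanskrit, and Nepali, and builds a database of the texts read with that technology. In the first stage, we attempt the joint development of an OCR more accurate than any available so far. In the second stage, using the resulting character recognition software, we aim to prepare for the construction of an electronic text database on a scale that could surpass similar projects already under way around the world.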
Outline of Final Research Achievements	This project aimed to develop a Devanagari OCR system. We surveyed and evaluated earlier and ongoing OCR software. We also reviewed the Devanagari writing system and described how we map each combining letter to the Unicode encoding scheme. We treated each letter as a composite of several elements and, on that basis, defined a unit of letter called the “character shape.” We then set out the process of designing the “training data” from which an AI-OCR is generated, and generated an AI-OCR model through machine learning on the prepared datasets. The outcomes obtained from the generated model are summarized below. Single character recognition (2,434 sample letters): a. 2,340 letters exactly recognized (accuracy rate 96.14%); b. 2,397 letters correctly listed (accuracy rate 98.48%).
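The relation between a combined Devanagari letter and its Unicode encoding, and the arithmetic behind the two accuracy rates, can be illustrated with a short Python sketch. This is an illustration only, not the project's code: the helper names `code_points` and `accuracy_percent` and the choice of the conjunct kṣa as a sample are assumptions made here, and the project's own “character shape” unit is not reproduced.

```python
import unicodedata


def code_points(text):
    """Return a (U+XXXX, Unicode name) pair for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def accuracy_percent(correct, total):
    """Return correct/total as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


# The conjunct "kṣa" is drawn as one visual unit but is encoded as
# three code points: KA + VIRAMA + SSA (sample chosen for illustration).
conjunct = "\u0915\u094D\u0937"
for cp, name in code_points(conjunct):
    print(cp, name)

# Counts taken from the report: 2,434 sample letters in total.
print(accuracy_percent(2340, 2434))  # 96.14 (exactly recognized)
print(accuracy_percent(2397, 2434))  # 98.48 (correctly listed)
```

The decomposition shows why an OCR for this script cannot treat one printed shape as one code point: a single conjunct on the page corresponds to a sequence in the encoded text.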
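Academic Significance and Societal Importance of the Research Achievements	The Devanagari OCR developed in this research is intended primarily for converting printed editions of Sanskrit texts into text data. Beyond that, we also have in view its application to the Sanskrit manuscript materials preserved in large quantities in India and abroad. Materials once kept on microfilm are now being archived electronically through digital photography and digital scanning. Converting such manuscript materials into text data, and then structuring that data, will become necessary in the future. This joint OCR development project was undertaken in anticipation of those developments.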

Report

(3 results)

2021 Annual Research Report
Final Research Report (PDF)
2020 Research-status Report

Research Products
(2 results)

Journal Article (1 result; Peer Reviewed: 1, Open Access: 1) / Presentation (1 result)

[Journal Article] Development of a Devanagari OCR (デーヴァナーガリー文字OCRの開発), 2021
- Author(s)
  加藤隆宏、友成有紀、谷口力光、大澤留次郎、藤巻聡、岡田崇、橋本江美
- Journal Title
  研究報告人文科学とコンピュータ
  Volume: 2021-CH-127 Pages: 1-4
- Related Report
  2021 Annual Research Report
- Peer Reviewed / Open Access
[Presentation] Development of a Devanagari OCR (デーヴァナーガリー文字OCRの開発), 2021
- Author(s)
  加藤隆宏、友成有紀、谷口力光、大澤留次郎、藤巻聡、岡田崇、橋本江美
- Organizer
  127th Meeting of the SIG on Computers and the Humanities (第127回人文科学とコンピュータ研究会)
- Related Report
  2021 Annual Research Report