Improvement of Modern Document Textualization System with Integrated Use of Letter Shape Information and Language Model
Project/Area Number |
26730161
|
Research Category |
Grant-in-Aid for Young Scientists (B)
|
Allocation Type | Multi-year Fund |
Research Field |
Library and information science/Humanistic social informatics
|
Research Institution | The University of Tokyo |
Principal Investigator |
Masuda Katsuya, The University of Tokyo, Center for Research and Development of Higher Education, Project Assistant Professor (20512114)
|
Project Period (FY) |
2014-04-01 – 2018-03-31
|
Project Status |
Completed (Fiscal Year 2017)
|
Budget Amount |
¥3,380,000 (Direct Cost: ¥2,600,000, Indirect Cost: ¥780,000)
Fiscal Year 2016: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2015: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
Fiscal Year 2014: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
|
Keywords | OCR / Text digitization / Error correction / Natural language processing / Digital archive / Modern books / Digital humanities |
Outline of Final Research Achievements |
In this research, we developed an OCR error correction system aimed at improving the accuracy of digitizing modern documents. We constructed language resources of modern documents, both to evaluate the system and to build a language model for modern documents. The error correction system consists of three parts: OCR error detection, candidate character generation, and selection of a character from the candidates. Each part uses both letter shape information and a language model to detect errors or to generate candidates. We confirmed that feeding the OCR error corrections back to the OCR system improves the accuracy of the OCR system.
|
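The three-part structure described in the outline (error detection, candidate generation, candidate selection) can be illustrated with a minimal sketch. This is not the project's implementation. The shape-similarity table, the character bigram model, the threshold, the toy Latin-alphabet corpus, and all function names below are assumptions chosen for illustration only; the actual system targets modern Japanese documents and its models are not described in this record.

```python
from collections import Counter

# Hypothetical shape-confusion table: visually similar characters with
# assumed similarity weights (stand-in for real letter shape information).
SHAPE_SIMILAR = {
    "1": {"l": 0.8, "i": 0.6},
    "0": {"o": 0.8},
    "l": {"1": 0.8, "i": 0.5},
    "o": {"0": 0.8},
}


class BigramModel:
    """Toy character bigram language model with add-one smoothing."""

    def __init__(self, corpus):
        self.unigrams = Counter(corpus)
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.vocab_size = len(set(corpus)) + 1  # +1 slot for unseen characters

    def prob(self, prev, ch):
        return (self.bigrams[(prev, ch)] + 1) / (self.unigrams[prev] + self.vocab_size)


def context_score(lm, chars, i, ch):
    """Language model score of placing `ch` at position i given both neighbours."""
    left = lm.prob(chars[i - 1], ch) if i > 0 else 1.0
    right = lm.prob(ch, chars[i + 1]) if i + 1 < len(chars) else 1.0
    return left * right


def is_suspicious(lm, chars, i, threshold=0.01):
    """Part 1, detection: confusable shape AND low language model score."""
    return chars[i] in SHAPE_SIMILAR and context_score(lm, chars, i, chars[i]) < threshold


def generate_candidates(ch):
    """Part 2, generation: the OCR output itself plus shape-similar characters."""
    candidates = {ch: 1.0}
    candidates.update(SHAPE_SIMILAR.get(ch, {}))
    return candidates


def select_candidate(lm, chars, i, candidates):
    """Part 3, selection: maximize shape weight times language model score."""
    return max(candidates, key=lambda c: candidates[c] * context_score(lm, chars, i, c))


def correct_text(text, lm):
    chars = list(text)
    # Left to right, so earlier corrections improve the context of later ones.
    for i in range(len(chars)):
        if is_suspicious(lm, chars, i):
            chars[i] = select_candidate(lm, chars, i, generate_candidates(chars[i]))
    return "".join(chars)


lm = BigramModel("clean the old book and close the cool room")
print(correct_text("c1ean b0ok", lm))  # prints "clean book"
```

In this sketch each stage combines the two signals named in the outline: detection requires both a confusable shape and a low language model score, and selection multiplies the shape weight by the language model score. A real system would replace the hand-written table and toy corpus with models trained on the constructed language resources.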
Report
(5 results)
Research Products
(3 results)