Improvement of Modern Document Textualization System with Integrated Use of Letter Shape Information and Language Model

Research Project

Project/Area Number	26730161
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Multi-year Fund
Research Field	Library and information science/Humanistic social informatics
Research Institution	The University of Tokyo
Principal Investigator	Masuda Katsuya, The University of Tokyo, Center for Research and Development of Higher Education, Project Assistant Professor (20512114)
Project Period (FY)	2014-04-01 – 2018-03-31
Project Status	Completed (Fiscal Year 2017)
Budget Amount	¥3,380,000 (Direct Cost: ¥2,600,000, Indirect Cost: ¥780,000)
	Fiscal Year 2016: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
	Fiscal Year 2015: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
	Fiscal Year 2014: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Keywords	OCR / digital textualization / error correction / natural language processing / digital archive / modern books / digital humanities
Outline of Final Research Achievements	In this research, we developed an OCR error correction system aimed at improving the accuracy of digitizing modern documents. We constructed language resources of modern documents, both for evaluating the system and for building a language model for modern documents. The error correction system consists of three parts: OCR error detection, candidate character generation, and selection of a character from the candidates. Each part uses both letter shape information and a language model to detect errors or to generate candidates. We confirmed that feeding the OCR error correction results back to the OCR system improves the accuracy of the OCR system.
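
The three-part structure described in the outline (error detection, candidate generation, candidate selection) can be illustrated with a small sketch. This is not the project's implementation; it is a toy under stated assumptions. The shape-confusion table SHAPE_SIMILAR, its similarity values, the character bigram model with add-0.1 smoothing, the Latin-alphabet toy corpus, and the detection threshold are all hypothetical stand-ins chosen for illustration. In this sketch, detection uses only the language model, candidate generation uses only shape similarity, and selection combines both, whereas the outline reports that each part of the actual system uses both sources.

```python
import math
from collections import Counter

# Hypothetical shape-confusion table (an assumption, not project data):
# for each character, visually similar characters with a similarity in (0, 1].
SHAPE_SIMILAR = {
    "1": {"l": 0.9, "i": 0.7},
    "0": {"o": 0.9},
    "l": {"1": 0.9, "i": 0.6},
    "o": {"0": 0.9},
}

# Toy corpus standing in for the modern-document language resources.
TOY_CORPUS = ["clean", "clear", "close", "book", "look", "cool"]


class BigramLM:
    """Character bigram model with add-alpha smoothing (illustrative only)."""

    def __init__(self, corpus, alpha=0.1):
        self.alpha = alpha
        self.unigrams = Counter()
        self.bigrams = Counter()
        for word in corpus:
            padded = "^" + word + "$"  # ^ and $ mark word boundaries
            self.unigrams.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(set("".join(corpus)) | {"^", "$"})

    def logprob(self, prev, cur):
        num = self.bigrams[(prev, cur)] + self.alpha
        den = self.unigrams[prev] + self.alpha * self.vocab_size
        return math.log(num / den)


def context_score(lm, chars, i, c):
    """Log score of character c at position i given its left and right neighbors."""
    prev = chars[i - 1] if i > 0 else "^"
    nxt = chars[i + 1] if i + 1 < len(chars) else "$"
    return lm.logprob(prev, c) + lm.logprob(c, nxt)


def detect_errors(lm, text, threshold):
    """Part 1: flag positions whose context score falls below the threshold."""
    chars = list(text)
    return [i for i, c in enumerate(chars)
            if context_score(lm, chars, i, c) < threshold]


def generate_candidates(c):
    """Part 2: the OCR output itself plus visually similar characters."""
    candidates = {c: 1.0}
    candidates.update(SHAPE_SIMILAR.get(c, {}))
    return candidates


def select_candidate(lm, chars, i, candidates):
    """Part 3: pick the candidate maximizing log shape similarity + LM score."""
    return max(candidates,
               key=lambda c: math.log(candidates[c]) + context_score(lm, chars, i, c))


def correct(lm, text, threshold=-5.0):
    """Run detection, candidate generation, and selection over one string."""
    chars = list(text)
    for i in detect_errors(lm, text, threshold):
        chars[i] = select_candidate(lm, chars, i, generate_candidates(chars[i]))
    return "".join(chars)
```

With these toy values, correct(BigramLM(TOY_CORPUS), "c1ean") flags only the second character and replaces it with "l". A realistic system for Japanese modern books would need a far larger character inventory, shape similarities derived from glyph images or OCR confidences, and a language model trained on period text.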

Report

(5 results)

2017 Annual Research Report / Final Research Report (PDF)
2016 Research-status Report
2015 Research-status Report
2014 Research-status Report

Research Products
(3 results)

Journal Article (1 result; Peer Reviewed: 1, Open Access: 1), Presentation (2 results)

[Journal Article] Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization (2015)
- Author(s)
  Katsuya Masuda, Makoto Tanji, Hideki Mima
- Journal Title
  Journal of the Japanese Association for Digital Humanities
  Volume: 1 Issue: 1 Pages: 37-43
- DOI
  10.17928/jjadh.1.1_37
- NAID
  130005096576
- ISSN
  2188-7276
- Related Report
  2015 Research-status Report
- Peer Reviewed / Open Access
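[Presentation] OCR Error Correction for Modern Books Using Linguistic Information and Letter Shape Information [言語情報と字形情報を用いた近代書籍に対するOCR誤り訂正] (2016)
- Author(s)
  Katsuya Masuda
- Organizer
  Computers and the Humanities Symposium (Jinmoncom) 2016
- Place of Presentation
  National Institute for Japanese Language and Linguistics (Tachikawa, Tokyo)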
- Year and Date
  2016-12-10
- Related Report
  2016 Research-status Report
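[Presentation] OCR Character Error Correction Using Global Information [大域的情報を用いたOCR文字誤り訂正] (2015)
- Author(s)
  Katsuya Masuda
- Organizer
  The 21st Annual Meeting of the Association for Natural Language Processing
- Place of Presentation
  Kyoto University (Kyoto, Kyoto Prefecture)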
- Year and Date
  2015-03-17
- Related Report
  2014 Research-status Report
