2019 Fiscal Year Annual Research Report

Integration of Crowdsourcing and Machine Learning for Large-scale Transcription of Pre-modern Historical Manuscripts

Research Project

Project/Area Number	18K18338
Research Institution	National Museum of Japanese History
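Principal Investigator	Yuta Hashimoto (橋本雄太), National Museum of Japanese History, Department of an Inter-University Research Institute, etc., Assistant Professor (10802712)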
Project Period (FY)	2018-04-01 – 2020-03-31
Keywords	Character recognition / OCR / Crowdsourcing / Transcription (翻刻)
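Outline of Annual Research Achievements	The aim of this research was to develop a method for efficiently converting the vast number of surviving Japanese historical documents into text by integrating machine-learning-based character recognition with the large-scale manual effort of crowdsourcing. The original plan was to develop character recognition technology within this project. However, because research on the automatic recognition of kuzushiji (cursive script) advanced faster than the applicant had anticipated, the plan was changed: the project instead worked on making crowdsourced transcription more efficient through collaboration with character recognition researchers. Under this policy, a new version of みんなで翻刻 (Minna de Honkoku) incorporating AI-based character recognition was released in July 2019. This version includes both the character recognition model developed by the Center for Open Data in the Humanities (CODH) and the one developed by Toppan Printing Co., Ltd. (凸版印刷株式会社). Since the release of the new version, transcription has proceeded at a steady pace. As of today, 279 days after the release, there are 687 participants and 1.922 million characters have been transcribed, which corresponds to a pace of roughly 6,800 characters per day. Two tasks remain: (1) a sampling survey of transcription accuracy, and (2) an investigation of how the AI character recognition is used and how much it contributes to the transcription work. In addition, the character recognition models currently included in みんなで翻刻 recognize one character at a time; more advanced models that support multi-character recognition and layout analysis are planned to be introduced. Furthermore, as an extension of the collaboration with character recognition researchers, research has begun on using the transcriptions entered in みんなで翻刻 as training data for character recognition AI. If this is realized, the 8 million characters transcribed so far, including those from the old version of みんなで翻刻, will become available as training data, which is expected to contribute to further improving the accuracy of character recognition AI.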
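The daily pace quoted in the outline can be checked directly from the reported totals. The following is a minimal sketch: the numbers are the ones given in the outline, while the function and variable names are illustrative assumptions and not part of the project.

```python
# Figures as reported in the outline; names are hypothetical, for illustration only.
def daily_rate(total_chars: int, days: int) -> float:
    """Average number of characters transcribed per day."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_chars / days

total_chars = 1_922_000      # 192.2万字 transcribed since the new version was released
days_since_release = 279     # days elapsed since the release

rate = daily_rate(total_chars, days_since_release)
print(f"{rate:.0f} characters per day")
```

The result comes out slightly above the rounded figure of 6,800 characters per day given in the outline.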

Research Products
(4 results)

Presentation (2 results; of which Int'l Joint Research: 2, Invited: 1) / Book (1 result) / Remarks (1 result)

[Presentation] Digital Humanities Research in National Museum of Japanese History (2019)
- Author(s)
  Yuta Hashimoto
- Organizer
  The International Conference for Museums of Language & Writing 2019
- Int'l Joint Research / Invited
[Presentation] Honkoku2: Towards a Large-scale Transcription of Pre-modern Japanese Manuscripts (2019)
- Author(s)
  Yuta Hashimoto
- Organizer
  The 9th Conference of Japanese Association for Digital Humanities (JADH2019)
- Int'l Joint Research
[Book] デジタルアーカイブ・ベーシックス2 (2019)
- Author(s)
  今村文彦 (Fumihiko Imamura), supervising editor / 鈴木親彦 (Chikahiko Suzuki), managing editor
- Total Pages
  208
- Publisher
  勉誠出版 (Bensei Shuppan)
- ISBN
  978-4-585-20282-0
[Remarks] みんなで翻刻 (Minna de Honkoku)
- URL
  https://honkoku.org/
