Reconstructing Knowledge from Early-Modern Books

Research Project

Project/Area Number	20H04483
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Review Section	Basic Section 90020:Library and information science, humanistic and social informatics-related
Research Institution	Nara Women's University
Principal Investigator	Jo Kazuki, Nara Women's University, Division of Human Life and Environmental Sciences, Professor (90283928)
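Co-Investigator(Kenkyū-buntansha)	Takata Masami, Nara Women's University, Division of Human Life and Environmental Sciences, Lecturer (20397574); Ishikawa Yu, Shiga University, Data Science Division, Assistant Professor (20814370)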
Project Period (FY)	2020-04-01 – 2023-03-31
Project Status	Completed (Fiscal Year 2022)
Budget Amount	¥17,680,000 (Direct Cost: ¥13,600,000, Indirect Cost: ¥4,080,000); Fiscal Year 2022: ¥5,330,000 (Direct Cost: ¥4,100,000, Indirect Cost: ¥1,230,000); Fiscal Year 2021: ¥6,110,000 (Direct Cost: ¥4,700,000, Indirect Cost: ¥1,410,000); Fiscal Year 2020: ¥6,240,000 (Direct Cost: ¥4,800,000, Indirect Cost: ¥1,440,000)
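Keywords	digital archive / character recognition / layout analysis / deep learning / neural translation / early-modern printed character recognition / deep metric learning / automatic translation of early-modern literary style / CRAFT / resolution pyramid / automatic text conversion / automatic translation / machine learning / automatic translation between early-modern literary style and present colloquial style / deep learning / low-appearance-frequency character crawler / general-purpose layout analysis / mutual translation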
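Outline of Research at the Start	Our research group has so far conducted basic research on early-modern printed character recognition, on mutual automatic translation between early-modern literary style and present colloquial style, and on layout analysis specialized for specific early-modern books. This project addresses three research topics: early-modern printed character recognition using a low-appearance-frequency character crawler; mutual automatic translation between early-modern literary style and present colloquial style using neural machine translation; and layout analysis for early-modern books that combines multiple layout analysis techniques in a hybrid manner. Furthermore, to show that these results can actually achieve "reconstructing knowledge from early-modern books," we apply them to the Hoji Shinbun Digital Collection being developed at the Hoover Institution, Stanford University.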
Outline of Final Research Achievements	For layout analysis in early-modern printed character recognition, we proposed a method suited to multi-column publications with multiple heading sizes, such as newspapers, and confirmed its effectiveness. For recognition, we implemented a method that retrieves training data by crawling and built an environment that collects training data hundreds of times faster than manual collection. We also established a GAN-based method for artificially generating character types that are absent from the data of specific early-modern book publishers. Furthermore, by changing the recognition engine from a CNN to deep metric learning, we confirmed a recognition rate of over 99%, which completes our research on early-modern printed character recognition. For neural translation from early-modern literary style to present colloquial style, we prepared 60,000 training pairs and showed that the Transformer translates with sufficient accuracy.
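Academic Significance and Societal Importance of the Research Achievements	These results show that early-modern books archived as images can be converted to text automatically. By automatically translating the resulting early-modern literary-style text into present colloquial style, the knowledge in early-modern books can be reconstructed and put to use. The results are planned to be used for the Hoji Shinbun (a collective term for the Japanese-language newspapers published locally by Japanese emigrants from the Meiji era onward), which are currently being archived at the Hoover Institution, Stanford University. In addition, findings from this research are partly used in NDLOCR2 of the National Diet Library, to be released in fiscal year 2024 (Reiwa 6); NDLOCR2 will be the first OCR to support early-modern books.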
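
The Outline of Final Research Achievements mentions replacing a CNN classifier with deep metric learning. The following is a minimal sketch of that general idea only, not the project's implementation: characters are mapped to embedding vectors, a triplet loss pulls same-class embeddings together during training, and recognition becomes a nearest-prototype lookup. The 2-D embeddings, the margin value, and the names `sq_dist`, `triplet_loss`, and `nearest_prototype` are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the positive is closer to the
    anchor than the negative by at least `margin` (assumed value)."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)


def nearest_prototype(query, prototypes):
    """Return the label whose prototype embedding is closest to the query."""
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Hypothetical 2-D embeddings standing in for the output of a trained network.
prototypes = {"國": [0.0, 0.0], "学": [1.0, 1.0], "體": [0.0, 2.0]}
query = [0.9, 1.2]

predicted = nearest_prototype(query, prototypes)

# A well-separated triplet contributes no loss; a violated one is penalized.
loss_easy = triplet_loss(prototypes["学"], query, prototypes["國"])
loss_hard = triplet_loss(prototypes["学"], prototypes["國"], query)

print(predicted, loss_easy, loss_hard)
```

One property of this design is that a character type with few samples only needs a prototype embedding rather than a retrained output layer, which may be one reason metric learning is attractive when rare character types are involved.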

Report

(4 results)

2022 Annual Research Report Final Research Report ( PDF )
2021 Annual Research Report
2020 Annual Research Report

Research Products
(15 results)

Journal Article (7 results, of which Peer Reviewed: 5 results); Presentation (8 results, of which Int'l Joint Research: 1 result)

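[Journal Article] Layout Analysis of Early-Modern Books Composed of Multiple Columns and Multi-Size Headings (2023)
- Author(s)
  Sayaka Iida, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe
- Journal Title
  IPSJ Transactions on Mathematical Modeling and Its Applications
  Volume: -
- Related Report
  2022 Annual Research Report
- Peer Reviewed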
[Journal Article] Application of Deep Metric Learning to Early-modern Japanese Printed Character Recognition (2023)
- Author(s)
  Norie Koiso, Yuki Takemoto, Sayaka Iida, Yu Ishikawa, Masami Takata, Kazuki Joe
- Journal Title
  Proceedings of The 2022 International Conference on Parallel and Distributed Processing Techniques and Applications
  Volume: -
- Related Report
  2022 Annual Research Report
- Peer Reviewed
[Journal Article] Translating Early-modern Written Style into Current Colloquial Style in Hoji Shinbun (2023)
- Author(s)
  Honoka Nishikawa, Yuki Takemoto, Sayaka Iida, Yu Ishikawa, Masami Takata, Kaoru Ueda, Kazuki Joe
- Journal Title
  Proceedings of The 2022 International Conference on Parallel and Distributed Processing Techniques and Applications
  Volume: -
- Related Report
  2022 Annual Research Report
- Peer Reviewed
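[Journal Article] A Method for Acquiring Low-Appearance-Frequency Character Types for Specific Early-Modern Book Publishers (2022)
- Author(s)
  Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe
- Journal Title
  IPSJ Transactions on Mathematical Modeling and Its Applications
  Volume: -
- Related Report
  2021 Annual Research Report
- Peer Reviewed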
[Journal Article] Crawling Low Appearance Frequency Characters Images for Early-Modern Japanese Printed Character Recognition (2021)
- Author(s)
  Nanami Fujisaki, Yu Ishikawa, Masami Takata, Kazuki Joe
- Journal Title
  Proceedings of PDPTA 2020 (in press)
  Volume: -
- Related Report
  2020 Annual Research Report
- Peer Reviewed
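[Journal Article] A Study of Character Segmentation Methods for Early-Modern Books (2021)
- Author(s)
  Sayaka Iida, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe
- Journal Title
  IPSJ SIG Technical Reports: Mathematical Modeling and Problem Solving (MPS)
  Volume: 2020-MPS-132(4) Pages: 1-6
- Related Report
  2020 Annual Research Report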
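[Journal Article] A Study of Automatic Translation between Early-Modern Literary Style and Present Colloquial Style in Hoji Shinbun (2020)
- Author(s)
  稲見郁乃, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kaoru Ueda, Kazuki Joe
- Journal Title
  IPSJ SIG Technical Reports: Mathematical Modeling and Problem Solving (MPS)
  Volume: 2020-MPS-131(12) Pages: 1-6
- Related Report
  2020 Annual Research Report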
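[Presentation] Erroneous Character Detection for Early-Modern Printed Character Recognition (2022)
- Author(s)
  福元春奈, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2022 Annual Research Report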
[Presentation] A Layout Analysis Method Using CRAFT for Early-Modern Books (2022)
- Author(s)
  Sayaka Iida
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2021 Annual Research Report
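[Presentation] A Study of Mutual Translation between Early-Modern Literary Style and Present Colloquial Style Using Unsupervised Learning (2021)
- Author(s)
  藤井千香子
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2021 Annual Research Report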
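[Presentation] A Data Augmentation Method Effective for Early-Modern Printed Character Recognition (2021)
- Author(s)
  倉田帆風
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2021 Annual Research Report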
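[Presentation] Generating Early-Modern-Book-Style Characters with CycleGAN and Applying Them to Data Augmentation (2021)
- Author(s)
  角張凜
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2021 Annual Research Report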
[Presentation] A Study of Character Segmentation Methods for Early-Modern Books (2021)
- Author(s)
  Sayaka Iida
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2020 Annual Research Report
[Presentation] Crawling Low Appearance Frequency Characters Images for Early-Modern Japanese Printed Character Recognition (2020)
- Author(s)
  Nanami Fujisaki
- Organizer
  PDPTA2020
- Related Report
  2020 Annual Research Report
- Int'l Joint Research
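[Presentation] A Study of Automatic Translation between Early-Modern Literary Style and Present Colloquial Style in Hoji Shinbun (2020)
- Author(s)
  稲見郁乃
- Organizer
  IPSJ SIG on Mathematical Modeling and Problem Solving (MPS)
- Related Report
  2020 Annual Research Report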
