Intelligent Information Retrieval Systems for Text Databases of Japanese and Chinese Classics

Research Project

Project/Area Number	23K25157
Project/Area Number (Other)	22H03903 (2022-2023)
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Multi-year Fund (2024); Single-year Grants (2022-2023)
Section	General
Review Section	Basic Section 90020: Library and information science, humanistic and social informatics-related
Research Institution	Osaka University
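Principal Investigator	肖川 (Xiao Chuan), Osaka University, Graduate School of Information Science and Technology, Associate Professor (10643900)
Co-Investigator(Kenkyū-buntansha)	佐々木勇和 (Sasaki Yuya), Osaka University, Graduate School of Information Science and Technology, Assistant Professor (40745147); 石川佳治 (Ishikawa Yoshiharu), Nagoya University, Graduate School of Informatics, Professor (80263440); 程永超, Tohoku University, Center for Northeast Asian Studies, Associate Professor (80823103)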
Project Period (FY)	2022-04-01 – 2026-03-31
Project Status	Granted (Fiscal Year 2024)
Budget Amount	¥16,900,000 (Direct Cost: ¥13,000,000, Indirect Cost: ¥3,900,000); Fiscal Year 2025: ¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000); Fiscal Year 2024: ¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000); Fiscal Year 2023: ¥4,420,000 (Direct Cost: ¥3,400,000, Indirect Cost: ¥1,020,000); Fiscal Year 2022: ¥4,420,000 (Direct Cost: ¥3,400,000, Indirect Cost: ¥1,020,000)
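Keywords	Information retrieval / Japanese and Chinese classics / Database / Knowledge base
Outline of Research at the Start	This research aims to fuse information science with historical and cultural studies. Targeting Japanese and Chinese classics that have been converted to text, it studies the development of intelligent information retrieval methods and systems applicable to text databases of these classics. To that end, the project pursues (1) extraction and integration of named entities in Classical Chinese, (2) construction of a knowledge base tightly coupled with the text database of Japanese and Chinese classics, (3) coreference resolution of proper nouns in the texts of these classics, and (4) construction of an information retrieval framework and implementation of the system. The results are expected not only to actively support research on East Asian history and culture but also to be applicable to other research fields in the humanities and social sciences.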
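Outline of Annual Research Achievements	In this fiscal year, we carried out named entity extraction for Classical Chinese, coreference resolution, and knowledge base construction on the texts of Japanese and Chinese classics. Specifically, we used a token-free pre-trained model (ByT5). The most widely used pre-trained language models to date operate on sequences of tokens that correspond to words or subword units. Token-free models, in contrast, operate directly on raw text (bytes or characters) and have many advantages. We therefore developed a ByT5-based pre-trained language model for Classical Chinese and fine-tuned it for named entity recognition in Classical Chinese. The fine-tuned model substantially outperforms existing methods and can even correct errors in the so-called ground truth (C-CLUE). In information retrieval over Japanese and Chinese classics, using a knowledge base clarifies the relationships between proper nouns and improves the quality of search results. We therefore built a knowledge base tightly coupled with the text database of Japanese and Chinese classics, in particular one covering relationships between persons and relationships between persons and official posts. The results were presented at DEIM 2024, and the detailed results are planned for submission to ACL ARR. In addition, to facilitate integration across databases, we developed a table embedding method that can integrate heterogeneous data. To cope with the enormous volume of data, we focused on large-scale high-dimensional data retrieval and used efficient high-dimensional indexing techniques and similarity-based query processing methods.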
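The role of the knowledge base in retrieval can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triples, the relation names (alias, held_office, served), the function names, and the sample strings are hypothetical and are not taken from the project's knowledge base or system. The sketch only shows how alias links let a query for one name also match passages that use another name for the same person.

```python
from collections import defaultdict

# Illustrative triples only; NOT taken from the project's knowledge base.
TRIPLES = [
    ("諸葛亮", "alias", "孔明"),        # courtesy name
    ("諸葛亮", "held_office", "丞相"),  # person-office relation
    ("諸葛亮", "served", "劉備"),       # person-person relation
    ("曹操", "alias", "孟德"),
]

# Invented sample strings, not quotations from any classic.
DOCUMENTS = [
    "孔明乃出師北伐",
    "諸葛亮為丞相",
    "孟德引兵南下",
]


def build_index(triples):
    """Index triples by subject and map every alias to its canonical name."""
    by_subject = defaultdict(list)
    canonical = {}
    for subj, rel, obj in triples:
        by_subject[subj].append((rel, obj))
        canonical[subj] = subj
        if rel == "alias":
            canonical[obj] = subj
    return by_subject, canonical


def expand_query(term, by_subject, canonical):
    """Return the canonical name plus all aliases for a query term."""
    name = canonical.get(term, term)
    aliases = [obj for rel, obj in by_subject.get(name, []) if rel == "alias"]
    return [name] + aliases


def search(term, documents, by_subject, canonical):
    """Return documents containing any surface form of the queried entity."""
    forms = expand_query(term, by_subject, canonical)
    return [doc for doc in documents if any(form in doc for form in forms)]


BY_SUBJECT, CANONICAL = build_index(TRIPLES)

if __name__ == "__main__":
    for hit in search("諸葛亮", DOCUMENTS, BY_SUBJECT, CANONICAL):
        print(hit)
```

A plain substring search for 諸葛亮 would miss the passage that only uses the courtesy name; the alias lookup is what recovers it. A real system would also need coreference resolution to decide which mentions refer to which person, which this sketch does not attempt.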
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: As originally planned, we used a token-free pre-trained model. In particular, the model can process text in any language, is more robust to noise, and allowed us to remove complex and error-prone text preprocessing pipelines.
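The token-free properties mentioned above (any language, no preprocessing pipeline) follow from working on UTF-8 bytes. The sketch below is a simplified illustration, not the project's code. It assumes a ByT5-style scheme in which three ids are reserved for special tokens and each byte value b is mapped to id b + 3; the constant and function names are hypothetical, and the exact id layout should be checked against the tokenizer actually used.

```python
# Assumption: ids 0-2 are reserved for special tokens (padding,
# end-of-sequence, unknown), so byte value b becomes id b + 3.
NUM_SPECIAL = 3
EOS_ID = 1


def encode(text):
    """Map a string to byte-level ids, followed by an end-of-sequence id."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids):
    """Invert encode(), skipping special ids and ignoring malformed bytes."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    sample = "學而時習之"
    ids = encode(sample)
    print(len(sample), "characters ->", len(ids), "ids")
    print(decode(ids) == sample)
```

Because every string is just a byte sequence, no word segmentation or vocabulary lookup is needed, and a rare or miswritten character never falls outside the vocabulary. The cost is longer input sequences, since each of these characters occupies three bytes in UTF-8.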
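Strategy for Future Research Activity	Beyond named entity extraction and coreference resolution for Classical Chinese, we will work on developing a model with generalization ability in order to solve a variety of Classical Chinese tasks. In particular, using large language models that can run locally on low-cost GPUs, such as Llama 3, we will customize a model with built-in Classical Chinese domain knowledge.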

Report

(2 results)

2023 Annual Research Report
2022 Annual Research Report

Research Products
(23 results)

Int'l Joint Research (3 results) Journal Article (5 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 5 results, Open Access: 4 results) Presentation (14 results) (of which Int'l Joint Research: 5 results) Remarks (1 result)

[Int'l Joint Research] Fordham University (USA)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] University of New South Wales (Australia)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] Ant Group / The University of Hong Kong / Guangzhou University (China)
- Related Report
  2023 Annual Research Report
[Journal Article] Utilization of Information Entropy in Training and Evaluation of Students’ Abstraction Performance and Algorithm Efficiency in Programming (2024)
- Author(s)
  Wu Zengqing, Liu Huizhong, Xiao Chuan
- Journal Title
  IEEE Transactions on Education
  Volume: 67 Issue: 2 Pages: 266-281
- DOI
  10.1109/te.2024.3354297
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Benchmark for Personalized Federated Learning (2024)
- Author(s)
  Matsuda Koji, Sasaki Yuya, Xiao Chuan, Onizuka Makoto
- Journal Title
  IEEE Open Journal of the Computer Society
  Volume: 5 Pages: 2-13
- DOI
  10.1109/ojcs.2023.3332351
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] High-Ratio Compression for Machine-Generated Data (2023)
- Author(s)
  Zhang Jiujing, Shen Zhitao, Yang Shiyu, Meng Lingkai, Xiao Chuan, Jia Wei, Li Yue, Sun Qinhui, Zhang Wenjie, Lin Xuemin
- Journal Title
  Proceedings of the ACM on Management of Data
  Volume: 1 Issue: 4 Pages: 1-27
- DOI
  10.1145/3626732
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] DeepJoin: Joinable Table Discovery with Pre-Trained Language Models (2023)
- Author(s)
  Dong Yuyang, Xiao Chuan, Nozawa Takuma, Enomoto Masafumi, Oyamada Masafumi
- Journal Title
  Proceedings of the VLDB Endowment
  Volume: 16 Issue: 10 Pages: 2458-2470
- DOI
  10.14778/3603581.3603587
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] MQH: Locality Sensitive Hashing on Multi-level Quantization Errors for Point-to-Hyperplane Distances (2022)
- Author(s)
  Kejing Lu, Yoshiharu Ishikawa, Chuan Xiao
- Journal Title
  Proceedings of the VLDB Endowment
  Volume: 16 Issue: 4 Pages: 864-876
- DOI
  10.14778/3574245.3574269
- Related Report
  2022 Annual Research Report
- Peer Reviewed / Open Access
[Presentation] An Efficient Diversity-Aware Method for the Empty-Answer Problem (2024)
- Author(s)
  Yuto Ikeda, Chuan Xiao, Makoto Onizuka
- Organizer
  26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] A Token-Free Approach to Entity-Based Keyword Search in Classical Chinese (2024)
- Author(s)
  蔣中慶, 呉増青, 肖川, 佐々木勇和, 鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
- Related Report
  2023 Annual Research Report
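[Presentation] Performance Analysis of Dejima, an Autonomous Decentralized Data Integration Technology [自律分散型データ統合技術Dejimaの性能分析] (2024)
- Author(s)
  吉田凌河, 肖川, 鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)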
- Related Report
  2023 Annual Research Report
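[Presentation] An Efficient Diversity-Aware Search Method for the Empty-Answer Problem [empty-answer問題に対する多様性を考慮した効率的な探索手法] (2024)
- Author(s)
  池田悠人, 肖川, 鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)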
- Related Report
  2023 Annual Research Report
[Presentation] Jellyfish: A Large Language Model for Data Preprocessing [Jellyfish: データ前処理のための大規模言語モデル] (2024)
- Author(s)
  張皓辰, 董于洋, 肖川, 小山田昌史
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
- Related Report
  2023 Annual Research Report
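[Presentation] SABM: Agent-Based Real-World Simulation Based on Large Language Models [SABM：大規模言語モデルに基づくエージェントベース実世界シミュレーション] (2024)
- Author(s)
  呉増青, 彭潤, 韓勗, 鄭舒元, 肖川
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)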
- Related Report
  2023 Annual Research Report
[Presentation] Smart Agent-Based Modeling: On the Use of Large Language Models in Computer Simulations (2023)
- Author(s)
  Chuan Xiao
- Organizer
  5th joint Korea-Japan Workshop on Management of Data (KJDM)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] CAGAIN: Column Attention Generative Adversarial Imputation Networks (2023)
- Author(s)
  Kawagoshi Jun, Dong Yuyang, Nozawa Takuma, Xiao Chuan
- Organizer
  34th International Conference on Database and Expert Systems Applications (DEXA)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] A Method of Image Dehazing Based on Atmospheric Veil Prediction by ResNet (2023)
- Author(s)
  Zhang Jie, Li Fan, Kang Mengfei, Luo Xiongbiao, Zhao Jing, Xiao Chuan, Du Haipeng, Wang Huaijun
- Organizer
  2nd Workshop on User-Centric Narrative Summarization of Long Videos (NarSUM)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] "Guinea Pig Trials" Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion (2023)
- Author(s)
  Xu Han, Zengqing Wu, Chuan Xiao
- Organizer
  Conference on Information Systems and Technology (CIST)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] A Molecular Graph Recommendation System for Drug Discovery [創薬のための分子グラフ推薦システム] (2023)
- Author(s)
  Sheng Hu, Ichigaku Takigawa, Chuan Xiao
- Organizer
  15th Forum on Data Engineering and Information Management (DEIM)
- Related Report
  2022 Annual Research Report
[Presentation] Token-Free Cross-Lingual Named Entity Recognition for Classical Chinese (2023)
- Author(s)
  Zhongqing Jiang, Zengqing Wu, Chuan Xiao
- Organizer
  15th Forum on Data Engineering and Information Management (DEIM)
- Related Report
  2022 Annual Research Report
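[Presentation] Performance Analysis of an Autonomous Decentralized Data Integration Technology That Guarantees Global Consistency [大域的一貫性を保証する自律分散型データ統合技術の性能分析] (2023)
- Author(s)
  吉田凌河, 伊藤竜一, 肖川, 鬼塚真
- Organizer
  15th Forum on Data Engineering and Information Management (DEIM)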
- Related Report
  2022 Annual Research Report
[Presentation] Fast Subgraph Edit Distance Queries Using Paths [経路を用いた高速なサブグラフ編集距離問合せ] (2023)
- Author(s)
  堀内美聡, 佐々木勇和, 肖川, 鬼塚真
- Organizer
  15th Forum on Data Engineering and Information Management (DEIM)
- Related Report
  2022 Annual Research Report
[Remarks] Researcher's homepage
- URL
  https://sites.google.com/site/chuanxiao1983/
- Related Report
  2023 Annual Research Report
