2023 Fiscal Year Annual Research Report

Intelligent Information Retrieval Systems for Text Databases of Japanese and Chinese Classics

Research Project

Project/Area Number	22H03903
Allocation Type	Single-year Grants
Research Institution	Osaka University
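Principal Investigator	Xiao Chuan (肖川), Osaka University, Graduate School of Information Science and Technology, Associate Professor (10643900)
Co-Investigator(Kenkyū-buntansha)	Sasaki Yuya (佐々木勇和), Osaka University, Graduate School of Information Science and Technology, Assistant Professor (40745147); Ishikawa Yoshiharu (石川佳治), Nagoya University, Graduate School of Informatics, Professor (80263440); Cheng Yongchao (程永超), Tohoku University, Center for Northeast Asian Studies, Associate Professor (80823103)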
Project Period (FY)	2022-04-01 – 2026-03-31
Keywords	Information retrieval / Japanese and Chinese classics / Knowledge base / Database
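Outline of Annual Research Achievements	This fiscal year, we carried out named entity extraction, coreference resolution, and knowledge base construction for Classical Chinese in the texts of Japanese and Chinese classics. Specifically, we used a token-free pre-trained model (ByT5). The most widely used pre-trained language models to date operate on sequences of tokens that correspond to words or subword units. Token-free models, by contrast, operate directly on raw text (bytes or characters) and offer many advantages. We therefore developed a ByT5-based pre-trained language model for Classical Chinese and fine-tuned it for named entity recognition in Classical Chinese. The fine-tuned model substantially outperforms existing methods and can even correct errors in the so-called ground truth (C-CLUE). In information retrieval over Japanese and Chinese classics, using a knowledge base clarifies the relationships among named entities and improves the quality of search results. We therefore built a knowledge base tightly coupled with the text database of Japanese and Chinese classics, covering in particular relations between persons and relations between persons and official posts. These results were presented at DEIM 2024, and the detailed results are to be submitted to ACL ARR. In addition, to facilitate integration across databases, we developed a table embedding method that can integrate heterogeneous data. To handle massive data volumes, we focused on large-scale high-dimensional data retrieval and employed efficient high-dimensional indexing techniques and similarity-based query processing methods.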
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: As originally planned, we used a token-free pre-trained model. In particular, the model can process text in any language, is more robust to noise, and allowed us to remove a complex and error-prone text preprocessing pipeline.
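Strategy for Future Research Activity	To address a variety of Classical Chinese tasks beyond named entity extraction and coreference resolution, we will work on developing a model with generalization ability. In particular, using large language models such as Llama 3 that can run locally on low-cost GPUs, we will customize a model that carries built-in Classical Chinese domain knowledge.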
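
The outline above notes that token-free models operate directly on raw bytes rather than on word or subword tokens. The following is a minimal sketch of what byte-level input looks like for a Classical Chinese phrase. It is not the project's code: the function names and the sample phrase are illustrative, and the offset of 3 (ids 0 to 2 reserved for padding, end-of-sequence, and unknown) is assumed from the publicly released ByT5 vocabulary.

```python
# Assumption: the public ByT5 vocabulary reserves ids 0-2 for <pad>, </s>
# and <unk>, so a byte with value b is mapped to id b + 3.
SPECIAL_TOKEN_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> list[int]:
    """Map text to byte-level ids; no word or subword vocabulary is needed."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping the reserved special ids."""
    raw = bytes(i - SPECIAL_TOKEN_OFFSET for i in ids if i >= SPECIAL_TOKEN_OFFSET)
    return raw.decode("utf-8", errors="ignore")


sample = "子曰學而時習之"  # illustrative phrase, not taken from the project data
ids = encode_bytes(sample)
print(len(sample), len(ids))        # characters vs. byte-level ids
print(decode_bytes(ids) == sample)  # the mapping is lossless
```

Because every character is representable as UTF-8 bytes, rare or variant characters never fall outside the vocabulary, which is one reason such models need no separate segmentation or preprocessing step. The cost is longer input sequences, since each of these characters occupies three bytes.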

Research Products
(18 results)

All 2024 2023 Other

All Int'l Joint Research (3 results) Journal Article (4 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 4 results, Open Access: 3 results) Presentation (10 results) (of which Int'l Joint Research: 5 results) Remarks (1 results)

[Int'l Joint Research] Fordham University (U.S.A.)
- Country Name
  U.S.A.
- Counterpart Institution
  Fordham University
[Int'l Joint Research] University of New South Wales (Australia)
- Country Name
  AUSTRALIA
- Counterpart Institution
  University of New South Wales
[Int'l Joint Research] Ant Group / The University of Hong Kong / Guangzhou University (China)
- Country Name
  CHINA
- Counterpart Institution
  Ant Group / The University of Hong Kong / Guangzhou University
- # of Other Institutions
  5
[Journal Article] Utilization of Information Entropy in Training and Evaluation of Students’ Abstraction Performance and Algorithm Efficiency in Programming (2024)
- Author(s)
  Wu Zengqing, Liu Huizhong, Xiao Chuan
- Journal Title
  IEEE Transactions on Education
  Volume: 67 Pages: 266-281
- DOI
  10.1109/TE.2024.3354297
- Peer Reviewed / Int'l Joint Research
[Journal Article] Benchmark for Personalized Federated Learning (2024)
- Author(s)
  Matsuda Koji, Sasaki Yuya, Xiao Chuan, Onizuka Makoto
- Journal Title
  IEEE Open Journal of the Computer Society
  Volume: 5 Pages: 2-13
- DOI
  10.1109/OJCS.2023.3332351
- Peer Reviewed / Open Access
[Journal Article] High-Ratio Compression for Machine-Generated Data (2023)
- Author(s)
  Zhang Jiujing, Shen Zhitao, Yang Shiyu, Meng Lingkai, Xiao Chuan, Jia Wei, Li Yue, Sun Qinhui, Zhang Wenjie, Lin Xuemin
- Journal Title
  Proceedings of the ACM on Management of Data
  Volume: 1 Pages: 1-27
- DOI
  10.1145/3626732
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] DeepJoin: Joinable Table Discovery with Pre-Trained Language Models (2023)
- Author(s)
  Dong Yuyang, Xiao Chuan, Nozawa Takuma, Enomoto Masafumi, Oyamada Masafumi
- Journal Title
  Proceedings of the VLDB Endowment
  Volume: 16 Pages: 2458-2470
- DOI
  10.14778/3603581.3603587
- Peer Reviewed / Open Access
[Presentation] An Efficient Diversity-Aware Method for the Empty-Answer Problem (2024)
- Author(s)
  Yuto Ikeda, Chuan Xiao, Makoto Onizuka
- Organizer
  26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP)
- Int'l Joint Research
[Presentation] A Token-Free Approach to Entity-Based Keyword Search in Classical Chinese (2024)
- Author(s)
  蔣 中慶、呉増青、肖川、佐々木勇和、鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
[Presentation] Performance Analysis of Dejima, an Autonomous Decentralized Data Integration Technology (2024)
- Author(s)
  吉田凌河、肖川、鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
[Presentation] An Efficient Diversity-Aware Search Method for the Empty-Answer Problem (2024)
- Author(s)
  池田悠人、肖川、鬼塚真
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
[Presentation] Jellyfish: A Large Language Model for Data Preprocessing (2024)
- Author(s)
  張皓辰、董于洋、肖川、小山田昌史
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
[Presentation] SABM: Agent-Based Real-World Simulation Based on Large Language Models (2024)
- Author(s)
  呉増青、彭潤、韓勗、鄭舒元、肖川
- Organizer
  16th Forum on Data Engineering and Information Management (DEIM)
[Presentation] Smart Agent-Based Modeling: On the Use of Large Language Models in Computer Simulations (2023)
- Author(s)
  Chuan Xiao
- Organizer
  5th joint Korea-Japan Workshop on Management of Data (KJDM)
- Int'l Joint Research
[Presentation] CAGAIN: Column Attention Generative Adversarial Imputation Networks (2023)
- Author(s)
  Kawagoshi Jun, Dong Yuyang, Nozawa Takuma, Xiao Chuan
- Organizer
  34th International Conference on Database and Expert Systems Applications (DEXA)
- Int'l Joint Research
[Presentation] A Method of Image Dehazing Based on Atmospheric Veil Prediction by ResNet (2023)
- Author(s)
  Zhang Jie, Li Fan, Kang Mengfei, Luo Xiongbiao, Zhao Jing, Xiao Chuan, Du Haipeng, Wang Huaijun
- Organizer
  2nd Workshop on User-Centric Narrative Summarization of Long Videos (NarSUM)
- Int'l Joint Research
[Presentation] "Guinea Pig Trials" Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion (2023)
- Author(s)
  Xu Han, Zengqing Wu, Chuan Xiao
- Organizer
  Conference on Information Systems and Technology (CIST)
- Int'l Joint Research
[Remarks] Researcher's homepage
- URL
  https://sites.google.com/site/chuanxiao1983/
