Integrated Ensemble Learning with Embedded Vectors in Authorship Attribution

Research Project

Project/Area Number	22K12726
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 90020: Library and information science, humanistic and social informatics-related
Research Institution	Doshisha University
Principal Investigator	金明哲 (Jin Mingzhe), Doshisha University, Organization for Research Initiatives and Development, Part-time Researcher (60275469)
Project Period (FY)	2022-04-01 – 2025-03-31
Project Status	Granted (Fiscal Year 2023)
Budget Amount	¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
	Fiscal Year 2024: ¥390,000 (Direct Cost: ¥300,000, Indirect Cost: ¥90,000)
	Fiscal Year 2023: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
	Fiscal Year 2022: ¥2,470,000 (Direct Cost: ¥1,900,000, Indirect Cost: ¥570,000)
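Keywords	authorship attribution / BERT / stylometric features / integrated ensemble learning / language generation model ChatGPT / artificial intelligence (AI) / embedding vectors / ensemble learning / deep learning / pre-trained models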
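Outline of Research at the Start	BERT is a general-purpose model: a model pre-trained on a large amount of data is adapted to tasks in individual domains. Because it is difficult to build comprehensive, large-scale training data and pre-train on it, BERT models trained on Wikipedia, Web articles, and similar sources are being released one after another in Japan. Focusing on the authorship attribution task, this research first analyzes how the pre-training data of several publicly available BERT models affects the task. It then studies how to improve authorship attribution accuracy through ensemble learning over these BERT models and through integrated ensemble learning that combines multiple stylometric features with multiple BERT models.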
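Outline of Annual Research Achievements	In the previous fiscal year, we compared multiple BERT models and showed that pre-training data affects the task and that ensemble learning over BERT models pre-trained on different data can improve authorship attribution accuracy. This fiscal year, we first re-verified the previous year's experimental results, compiled them, and submitted a paper. Next, as planned, we ran experiments on integrated ensemble learning that combines the ensemble results of multiple BERT models with four types of stylometric features (character bigrams, tag bigrams, tagged morphemes, and phrase (bunsetsu) patterns). The results indicate that this approach may identify authors with higher accuracy than any single stylometric feature or an ensemble of BERT models alone. We also advanced research on stock price prediction using BERT trained on news articles; the results were compiled, submitted to a specialized artificial intelligence journal, and accepted. In addition, given the strong public response to ChatGPT, a generative AI, we studied the style of text generated by ChatGPT. ChatGPT and BERT have in common that both use a Transformer-based embedding-vector architecture and are pre-trained on large-scale datasets. This fiscal year we therefore focused on distinguishing ChatGPT-generated text from human-written text and conducted empirical studies. Two papers reporting the results were submitted to international journals and accepted. The papers were covered by several Japanese newspapers and by Communications of the ACM in the United States (March 25, 2024). The related papers have been uploaded to researchmap.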
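The report does not state which rule was used to combine the BERT outputs with the stylometric classifiers. As a minimal sketch, assuming simple soft voting (averaging each model's class probabilities), the idea can be illustrated as follows. The function names, model labels, and probability values are hypothetical and are not taken from the project.

```python
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams, one of the four stylometric features named above."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def soft_vote(prob_rows, weights=None):
    """Average per-model probability vectors (optionally weighted).

    Returns (index of the winning class, averaged probability vector).
    Soft voting is an assumption here, not the project's documented method.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: avg[c])
    return winner, avg


# Hypothetical probabilities over three candidate authors, one row per model.
model_probs = {
    "bert_a": [0.50, 0.30, 0.20],       # BERT pre-trained on one corpus
    "bert_b": [0.20, 0.50, 0.30],       # BERT pre-trained on another corpus
    "char_bigram": [0.25, 0.60, 0.15],  # classifier on character bigrams
}

winner, averaged = soft_vote(list(model_probs.values()))
print("predicted author index:", winner)
print("averaged probabilities:", [round(p, 3) for p in averaged])
```

In this toy case the first BERT model alone would pick a different author than the averaged ensemble, which is the kind of disagreement an ensemble is meant to resolve. Stacking or weighted voting would be equally plausible alternatives.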
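Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned.
Reason	In the previous fiscal year, we first laid the groundwork for the research by selecting the BERT models to use and building the required corpora. We then studied how a BERT model's training data affects individual tasks, and showed that pre-training data affects model performance on an individual task and that ensemble learning over BERT models trained on different corpora can improve accuracy. This fiscal year, we re-verified the previous year's results, wrote them up, and submitted the paper to a research journal. We also ran ensemble learning experiments that integrate multiple stylometric features with multiple BERT models and obtained preliminary results. In addition, we advanced research on stock price prediction using a BERT model trained on news articles and published a paper on the results in a specialized artificial intelligence journal. Furthermore, we studied the style of text generated by ChatGPT, a generative language model built on a Transformer-based embedding-vector architecture; two papers on the results were submitted to international journals and accepted. The research is proceeding smoothly and as planned.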
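Strategy for Future Research Activity	In fiscal year 2024, we will carefully review the results obtained so far, write them up, and submit them to international journals. We will also explore the latest trends and new challenges concerning the relationship between large language models and writing style.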

Report

(2 results)

2023 Research-status Report
2022 Research-status Report

Research Products
(12 results)

Journal Article (8 results; of which Int'l Joint Research: 4, Peer Reviewed: 8, Open Access: 6), Presentation (3 results; of which Invited: 1), Book (1 result)

[Journal Article] Can we spot fake public comments generated by ChatGPT(-3.5, -4)?: Japanese stylometric analysis expose emulation created by one-shot learning (2024)
- Author(s)
  Zaitsu Wataru, Jin Mingzhe, Ishihara Shunichi, Tsuge Satoru, Inaba Mitsuyuki
- Journal Title
  PLOS ONE
  Volume: 19 Issue: 3 Pages: 1-10
- DOI
  10.1371/journal.pone.0299031
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis (2023)
- Author(s)
  Zaitsu Wataru, Jin Mingzhe
- Journal Title
  PLOS ONE
  Volume: 18 Issue: 8 Pages: 1-10
- DOI
  10.1371/journal.pone.0288453
- Related Report
  2023 Research-status Report
- Peer Reviewed
[Journal Article] Analysis of Stock Market Movement Prediction with Pre-trained Language model (2023)
- Author(s)
  李金陽、Doshisha University、金明哲、宿久洋、Doshisha University & Kyoto University、Doshisha University
- Journal Title
  人工智能研究
  Volume: 1 Issue: 2 Pages: 26-39
- DOI
  10.55375/aif.2023.2.3
- Related Report
  2023 Research-status Report
- Peer Reviewed
[Journal Article] Is word length inaccurate for authorship attribution? (2022)
- Author(s)
  Zheng Wanwan, Jin Mingzhe
- Journal Title
  Digital Scholarship in the Humanities
  Volume: 38 Issue: 2 Pages: 875-890
- DOI
  10.1093/llc/fqac067
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] A review on authorship attribution in text mining (2022)
- Author(s)
  Zheng Wanwan, Jin Mingzhe
- Journal Title
  WIREs Computational Statistics
  Volume: 15 Issue: 2
- DOI
  10.1002/wics.1584
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] 異ジャンル文章が混在した場合における著者識別分析 [Authorship identification analysis when texts of different genres are mixed] (2022)
- Author(s)
  柳燁佳, 金明哲
- Journal Title
  データ分析の理論と応用
  Volume: 11 Pages: 1-14
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access
[Journal Article] Improving the Performance of Feature Selection Methods with Low-Sample-Size Data (2022)
- Author(s)
  Zheng Wanwan, Jin Mingzhe
- Journal Title
  The Computer Journal
  Volume: 00 Issue: 7 Pages: 00-00
- DOI
  10.1093/comjnl/bxac033
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
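[Journal Article] 現代小説の文末表現における通時変化の統計モデリングと分析 [Statistical modeling and analysis of diachronic change in sentence-final expressions in modern novels] (2022)
- Author(s)
  李広微, 金明哲
- Journal Title
  計量国語学
  Volume: 33(5) Pages: 309-324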
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access
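[Presentation] 著者推定における事前学習済みBERTを用いたアンサンブル学習法の提案 [Proposal of an ensemble learning method using pre-trained BERT models for authorship attribution] (2022)
- Author(s)
  神田泰誠, 柳燁佳, 金明哲
- Organizer
  IEICE Technical Report (信学技報), The Institute of Electronics, Information and Communication Engineers (電子情報通信学会)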
- Related Report
  2022 Research-status Report
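[Presentation] 著者推定における異なる事前学習データを持つ日本語版BERTの性能比較分析 [Comparative performance analysis of Japanese BERT models with different pre-training data in authorship attribution] (2022)
- Author(s)
  神田泰誠, 柳燁佳, 金明哲
- Organizer
  日本行動計量学会 (The Behaviormetric Society)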
- Related Report
  2022 Research-status Report
[Presentation] Stylometryから連想する計量的表現研究 [Quantitative research on expression inspired by stylometry] (2022)
- Author(s)
  金明哲
- Organizer
  表現学会
- Related Report
  2022 Research-status Report
- Invited
[Book] テキストデータマネジメント [Text Data Management] (2022)
- Author(s)
  波多野賢治, 天笠俊之, 鈴木優, 宮崎純, 楠和馬
- Total Pages
  242
- Publisher
  Iwanami Shoten (岩波書店)
- ISBN
  4000298992
- Related Report
  2022 Research-status Report
