Topic models bridging between documents as members composing a corpus and documents as sequences composed by words

Research Project

Project/Area Number	21K12017
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 61030:Intelligent informatics-related
Research Institution	Rikkyo University
Principal Investigator	Masada Tomonari Rikkyo University, Graduate School of Artificial Intelligence and Science, Professor (60413928)
Project Period (FY)	2021-04-01 – 2024-03-31
Project Status	Completed (Fiscal Year 2023)
Budget Amount	¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000)
Fiscal Year 2023: ¥390,000 (Direct Cost: ¥300,000, Indirect Cost: ¥90,000)
Fiscal Year 2022: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2021: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Keywords	Machine learning / Text mining / Natural language processing / Topic model / Language model / Deep learning / Embedding / Automated scoring / Bayesian statistics
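Outline of Research at the Start	Topic models excel at revealing where each document stands within a corpus collected for a particular purpose, and they provide a bird's-eye view of diverse content by extracting multiple word lists corresponding to the varied topics latent in the corpus. Meanwhile, models such as BERT, which model documents precisely as word sequences, have advanced rapidly in deep learning in recent years. Because BERT-based modeling can reflect the linguistic features generally found in word sequences of a given language, such as English, it is general-purpose. This research combines topic models with BERT, aiming to achieve topic extraction that is strong at identifying topics within an individual corpus while also reflecting the general linguistic features of the language in question.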
Outline of Final Research Achievements	The aim of our study was to combine a topic model, serving as a domain-specific encoder, with a deep learning-based language model, serving as a general-purpose encoder, in order to improve the quality of topic analysis. After the project began, however, language models made remarkable progress in both effectiveness and efficiency. By using them to produce text embeddings and analyzing various corpora, we found that a domain-specific encoder can be obtained simply by fine-tuning a language model. Topic modeling, and indeed any text mining based solely on word frequencies, no longer has any technical importance. Our conclusion is that research should now focus on how to use the text embeddings provided by language models to improve the quality of topic analysis.
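Academic Significance and Societal Importance of the Research Achievements	The academic significance of this research lies in having found a routine procedure that replaces topic modeling, conventionally carried out with mini-batch variational inference, with the use of text embeddings obtained from pre-trained language models. Its societal significance is that, because variational inference no longer needs to be managed, mistakes such as stopping variational inference before it has sufficiently converged or failing to tune hyperparameters do not arise, so even beginners can extract topics with a low risk of failure. Even when a language model is fine-tuned to raise the quality of the extracted topics, the relevant technical information is more abundant and easier to find than for topic models, making the procedure approachable for beginners as well.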

Report

(4 results)

2023 Annual Research Report Final Research Report ( PDF )
2022 Research-status Report
2021 Research-status Report

Research Products
(5 results)

All Journal Article (2 results) (of which Peer Reviewed: 2 results) Presentation (3 results) (of which Int'l Joint Research: 2 results)

[Journal Article] Sentence-BERT Distinguishes Good and Bad Essays in Cross-prompt Automated Essay Scoring (2022)
- Author(s)
  Sasaki Toru, Masada Tomonari
- Journal Title
  Proceedings of 2022 IEEE International Conference on Data Mining Workshops (ICDMW)
  Volume: 1 Pages: 274-281
- DOI
  10.1109/icdmw58026.2022.00045
- Related Report
  2022 Research-status Report
- Peer Reviewed
[Journal Article] AmLDA: A Non-VAE Neural Topic Model (2022)
- Author(s)
  Tomonari MASADA
- Journal Title
  Springer Communications in Computer and Information Science
  Volume: 1577 Pages: 281-295
- DOI
  10.1007/978-3-031-04447-2_19
- ISBN
  9783031044465, 9783031044472
- Related Report
  2021 Research-status Report
- Peer Reviewed
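[Presentation] Emotional Development and Classification of Japanese Literature Using Language Models [言語モデルを使用した日本文学の感情展開と分類] (2024)
- Author(s)
  冨名腰哲, Masada Tomonari
- Organizer
  The 86th National Convention of IPSJ (Information Processing Society of Japan)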
- Related Report
  2023 Annual Research Report
[Presentation] Sentence-BERT Distinguishes Good and Bad Essays in Cross-prompt Automated Essay Scoring (2022)
- Author(s)
  Toru Sasaki
- Organizer
  The 1st Workshop on Data Mining in Learning Science (at the 22nd IEEE International Conference on Data Mining, ICDM2022)
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] AmLDA: A Non-VAE Neural Topic Model (2021)
- Author(s)
  Masada Tomonari
- Organizer
  8th International Conference on Information Management and Big Data (SIMBig 2021)
- Related Report
  2021 Research-status Report
- Int'l Joint Research
