Development of a platform for automatic data science by using text information in metadata

Research Project

Project/Area Number	22K21288
Research Category	Grant-in-Aid for Research Activity Start-up
Allocation Type	Multi-year Fund
Review Section	1001:Information science, computer engineering, and related fields
Research Institution	Tokyo City University
Principal Investigator	Masuda Satoshi, Tokyo City University, Faculty of Informatics (メディア情報学部), Professor (60947927)
Project Period (FY)	2022-08-31 – 2024-03-31
Project Status	Completed (Fiscal Year 2023)
Budget Amount	¥2,860,000 (Direct Cost: ¥2,200,000, Indirect Cost: ¥660,000); Fiscal Year 2023: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000); Fiscal Year 2022: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
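Keywords	Data science / Automated feature engineering / Natural language processing / Time-series features / Automation of time-series data extraction / datetime API / Feature engineering / Text analysis / Software engineering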
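Outline of Research at the Start	Analysis that derives new insights from massive amounts of data is called data science, and its spread is now socially important. Among the tasks in data science, this research focuses on feature extraction, which requires relatively more experience and where automation has a large effect. It takes a new approach that automates feature extraction from the text information in metadata, such as data item names and data descriptions, rather than from numerical information as in conventional methods. Specifically, we will develop technology that applies natural language processing and source code analysis to the data descriptions and source code of existing data science work and reuses them in a form from which features can be extracted.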
Outline of Final Research Achievements	In this research, we took a new approach that automates feature extraction from the textual information in data item names. Specifically, we created a knowledge database focused on time-series (datetime) features by applying natural language processing and source code analysis techniques to the data item names and source code of existing data science work. Furthermore, we developed a system that uses the knowledge database to recommend datetime features for newly provided text information. For the feature recommendation mechanism, we improved the accuracy of word vectorization by using one-hot vector and word embedding methods. In experiments, we confirmed the classification accuracy of the knowledge database and applied the system to actual forecasting tasks, such as house price forecasting, where we confirmed an improvement in forecasting accuracy.
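The recommendation idea described above can be illustrated with a minimal sketch. It is not the project's system: the knowledge base entries, the feature names, the `recommend` function, and the similarity threshold are all hypothetical, and a simple token-count vector with cosine similarity stands in for the one-hot and word embedding methods mentioned in the outline.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: data item names seen in earlier analyses,
# paired with the datetime features that were derived from them.
KNOWLEDGE_BASE = {
    "sale_date": ["year", "month", "day_of_week"],
    "pickup_datetime": ["hour", "day_of_week", "is_weekend"],
    "year_built": ["age_in_years"],
}


def tokenize(name):
    """Split a data item name such as 'SaleDate' or 'sale_date' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # break camelCase
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def vectorize(name):
    """Represent a name as a sparse token-count vector (a sum of one-hot token vectors)."""
    return Counter(tokenize(name))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(name, threshold=0.3):
    """Return the features of the most similar known item name, or [] if none is close enough."""
    query = vectorize(name)
    best_name, best_score = None, 0.0
    for known in KNOWLEDGE_BASE:
        score = cosine(query, vectorize(known))
        if score > best_score:
            best_name, best_score = known, score
    if best_name is None or best_score < threshold:
        return []
    return list(KNOWLEDGE_BASE[best_name])


if __name__ == "__main__":
    for column in ["SaleDate", "YearBuilt", "pickup_time", "customer_id"]:
        print(column, "->", recommend(column))
```

Because exact token overlap cannot relate synonyms such as "purchase" and "sale", a word embedding would be the natural replacement for `vectorize` in a fuller version.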
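Academic Significance and Societal Importance of the Research Achievements	Analysis that derives new insights from massive amounts of data is called data science, and its adoption is being promoted. The task of extracting features from data is called feature engineering and is one of the steps in the data science workflow. Because feature engineering currently relies on the experience of experts, research on automating it is under way. This research proposed a method for recommending datetime features from text information, developed a system, and confirmed its effectiveness. It thereby contributed to the academic field of automated feature engineering.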

Report

(3 results)

2023 Annual Research Report, Final Research Report (PDF)
2022 Research-status Report

Research Products
(2 results)

Presentation (2 results, of which Int'l Joint Research: 1 result)

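[Presentation] A Recommendation System for Time-Series Features Using Data Item Names (2024)
- Author(s)
  増田聡 (Satoshi Masuda), 武田友宏
- Organizer
  IEICE Technical Committee on Knowledge-Based Software Engineering (KBSE)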
- Related Report
  2023 Annual Research Report
[Presentation] Datetime Feature Recommendation Using Textual Information (2023)
- Author(s)
  Satoshi Masuda, Takaaki Tateishi, Toshihiro Takahashi
- Organizer
  27th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2023)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
