Research Project Number | 19K11993 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Application Category | General |
Review Section | Basic Section 60090: High performance computing-related |
Research Institution | RIKEN |
Principal Investigator | GEROFI Balazs, RIKEN Center for Computational Science, Senior Research Scientist (70633501) |
Project Period (FY) | 2019-04-01 – 2023-03-31 |
Project Status | Discontinued (FY2022) |
Budget Amount *Note | 4,290 thousand yen (Direct: 3,300 thousand yen, Indirect: 990 thousand yen) |
FY2021: 1,300 thousand yen (Direct: 1,000 thousand yen, Indirect: 300 thousand yen)
FY2020: 2,080 thousand yen (Direct: 1,600 thousand yen, Indirect: 480 thousand yen)
FY2019: 910 thousand yen (Direct: 700 thousand yen, Indirect: 210 thousand yen)
Keywords | Memory access tracing / Runtime approximation / Distributed ML / Neural network training / I/O of deep learning / Distributed learning / Memory access tracking / heterogeneous memory / gem5 / architectural simulator / non-uniform memory / machine learning / reinforcement learning / long-short term memory / transformer attention / Memory management / Machine learning / HPC |
Outline of Research at the Start |
This research studies the combination of system software level mechanisms with machine learning driven policies for heterogeneous memory management in high-performance computing. It involves automatic discovery and characterization of memory devices, online application profiling based on hardware performance counters, machine learning driven decision processes for data management, and transparent, operating system level data movement.
|
Outline of Research Achievements |
Results have been achieved in two parallel efforts of the project.

First, on heterogeneous memory management: we found that system-software-level solutions driven by machine learning, in particular unsupervised learning-based methods such as reinforcement learning, require rapid estimation of execution runtime as a function of the data layout across memory devices in order to explore different data placement strategies, which renders architecture-level simulators impractical for this purpose. We therefore proposed a differential tracing-based approach that uses memory access traces obtained by high-frequency sampling (e.g., Intel's PEBS) on real hardware equipped with different memory devices. Based on such traces, we developed a runtime estimator that provides an execution-time estimate orders of magnitude faster than full-system simulators. On a number of HPC mini-applications, we showed that the estimator predicts runtime with an average error of 4.4% relative to measurements on real hardware.

Second, on the deep learning data-shuffling subtopic: we investigated the viability of partitioning the dataset among DL workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrated that in practice the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provided a PyTorch implementation that lets users control the proposed data exchange scheme.
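To illustrate the differential tracing idea, the sketch below uses a deliberately simplified linear cost model (an assumption for exposition, not the project's actual estimator): given per-page access counts sampled on real hardware and hypothetical per-access latencies for each memory device, it adjusts a measured baseline runtime by the latency delta of every page whose device changes in a candidate layout.

```python
# Hypothetical per-access latencies in nanoseconds (illustrative values,
# not measured figures from the project).
LATENCY_NS = {"DRAM": 80.0, "NVRAM": 300.0}

def estimate_runtime(baseline_s, access_counts, baseline_placement,
                     candidate_placement, sampling_period):
    """Estimate runtime (seconds) of a candidate data layout by adjusting
    a measured baseline with the latency delta of relocated pages.

    access_counts: sampled accesses per page (e.g., from PEBS-style sampling)
    sampling_period: one recorded sample per this many actual accesses
    """
    delta_ns = 0.0
    for page, count in access_counts.items():
        old_dev = baseline_placement[page]
        new_dev = candidate_placement.get(page, old_dev)
        # Scale the sampled count back to an estimated total access count.
        total = count * sampling_period
        delta_ns += total * (LATENCY_NS[new_dev] - LATENCY_NS[old_dev])
    return baseline_s + delta_ns * 1e-9

# Toy example: move page 1 from NVRAM to DRAM and re-estimate a 10 s run.
counts = {0: 1000, 1: 5000}
base = {0: "DRAM", 1: "NVRAM"}
cand = {0: "DRAM", 1: "DRAM"}
est = estimate_runtime(10.0, counts, base, cand, sampling_period=1000)
```

Because the model only re-prices relocated accesses, evaluating a candidate layout is a single pass over the trace, which is what makes this kind of estimator orders of magnitude cheaper than re-running a full-system simulator per placement.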
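The partial distributed exchange can be sketched as follows. This is a minimal single-process simulation (workers modeled as plain lists; the function name and structure are illustrative, not the project's PyTorch API): each epoch, every worker shuffles its local partition and sends a tunable fraction of its samples to one randomly chosen peer, instead of performing a full global shuffle.

```python
import random

def partial_exchange(partitions, fraction, rng):
    """Exchange `fraction` of each worker's local samples with a random peer.

    partitions: list of per-worker sample lists (mutated in place)
    fraction: share of each partition sent away per epoch (0.0 = no exchange,
              larger values approach the effect of a global shuffle)
    """
    n_workers = len(partitions)
    incoming = [[] for _ in range(n_workers)]
    for w, part in enumerate(partitions):
        rng.shuffle(part)                      # local shuffle is always cheap
        k = int(len(part) * fraction)          # samples to send this epoch
        peer = rng.choice([p for p in range(n_workers) if p != w])
        incoming[peer].extend(part[:k])
        del part[:k]
    for w in range(n_workers):                 # deliver after all sends,
        partitions[w].extend(incoming[w])      # mimicking a collective step
    return partitions

# 4 simulated workers, 100 samples each, exchanging 25% per epoch.
rng = random.Random(0)
parts = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
parts = partial_exchange(parts, fraction=0.25, rng=rng)
```

The `fraction` parameter plays the role of the tuning knob mentioned above: it trades inter-node communication volume against how closely the sample distribution seen by each worker approximates that of a global shuffle.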
|