Research Project/Area Number |
19K11993
|
Research Institution | RIKEN |
Principal Investigator |
GEROFI BALAZS RIKEN, Center for Computational Science, Senior Scientist (70633501)
|
Project Period (FY) |
2019-04-01 – 2023-03-31
|
Keywords | Memory access tracing / Runtime approximation / Distributed learning / Neural network training / I/O of deep learning |
Outline of Annual Research Achievements |
Results have been achieved in two parallel efforts of the project. We found that system-software-level heterogeneous memory management solutions utilizing machine learning, in particular non-supervised methods such as reinforcement learning, require rapid estimation of execution runtime as a function of the data layout across memory devices in order to explore different data placement strategies, which renders architecture-level simulators impractical for this purpose. We proposed a differential tracing-based approach that uses memory access traces obtained by high-frequency sampling (e.g., Intel's PEBS) on real hardware equipped with different memory devices. We developed a runtime estimator based on such traces that provides execution time estimates orders of magnitude faster than full-system simulators. On a number of HPC mini-applications, we showed that the estimator predicts runtime with an average error of 4.4% compared to measurements on real hardware.

For the deep learning data shuffling subtopic, we investigated the viability of partitioning the dataset among DL workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2048 GPUs of ABCI and 4096 compute nodes of Fugaku, we demonstrated that in practice the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provided a PyTorch implementation that lets users control the proposed data exchange scheme.
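The idea of a partial distributed exchange can be illustrated with a minimal, framework-free sketch: each worker keeps its dataset partition, sends only a fraction of its samples to a randomly chosen peer per epoch, and shuffles locally. This is a pure-Python simulation, not the project's actual PyTorch implementation; all function and parameter names here are illustrative.

```python
import random

def partial_exchange(partitions, fraction, rng):
    """One epoch of partial distributed sample exchange.

    partitions: list of per-worker sample lists (the partitioned dataset)
    fraction:   share of each worker's samples exchanged with a random peer
    """
    n_workers = len(partitions)
    # Each worker selects a random subset of its samples to send out.
    outgoing = []
    for part in partitions:
        k = int(len(part) * fraction)
        rng.shuffle(part)
        outgoing.append([part.pop() for _ in range(k)])
    # Deliver each worker's outgoing samples to a randomly chosen peer.
    for w, batch in enumerate(outgoing):
        peer = rng.choice([i for i in range(n_workers) if i != w])
        partitions[peer].extend(batch)
    # Local shuffle before the next epoch's mini-batching.
    for part in partitions:
        rng.shuffle(part)
    return partitions

rng = random.Random(0)
parts = [list(range(w * 100, (w + 1) * 100)) for w in range(4)]  # 4 workers
parts = partial_exchange(parts, fraction=0.1, rng=rng)
# The exchange neither loses nor duplicates any sample.
assert sorted(x for p in parts for x in p) == list(range(400))
```

Compared with a full global shuffle, only `fraction` of the samples cross worker boundaries each epoch, which bounds the communication volume while still mixing partitions over time.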
|
Current Status of Research Progress |
1: Research has progressed more than it was originally planned.
Reason
We have made significant progress on two fronts of the project, which was mainly achievable thanks to successful collaborations with Argonne National Laboratory in the US, AIST in Japan, and Telecom SudParis in France. We foresee an additional two publications as the likely outcome of the overall effort.
|
Strategy for Future Research Activity |
For the two respective subtopics we plan to take the following steps. With respect to the reinforcement-learning-based memory management topic, we are working on integrating our differential tracing-based runtime estimator into the OpenAI Gym environment framework, which we are coupling with the PFRL reinforcement learning framework developed by Preferred Networks in Japan. On the deep learning data shuffling and I/O optimization topic, we are investigating the feasibility of importance-sampling-based input sample shuffling and its integration into the distributed learning scheme. In particular, early experiments show that importance-sampling-based dataset decay, i.e., actively discarding input samples that are less important, can lead to significant runtime improvements.
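The dataset-decay step can be sketched as follows: each sample carries an importance score (e.g., its recent per-sample loss), and only the most important fraction is retained for subsequent epochs. This is a toy illustration under assumed scores, not the method's actual implementation; the function name and the score values are hypothetical.

```python
import heapq

def decay_dataset(samples, importances, keep_fraction):
    """Actively discard the least important input samples.

    samples:       list of training samples (indices here, for simplicity)
    importances:   per-sample importance scores (e.g., recent per-sample loss)
    keep_fraction: share of the dataset retained for subsequent epochs
    """
    k = max(1, int(len(samples) * keep_fraction))
    # Keep the k samples with the highest importance scores.
    kept = heapq.nlargest(k, zip(importances, samples))
    return [s for _, s in kept]

# Toy example: importance proxied by made-up per-sample losses.
samples = list(range(10))
losses = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05]
kept = decay_dataset(samples, losses, keep_fraction=0.5)
assert len(kept) == 5
assert 0 in kept and 9 not in kept  # high-loss sample kept, near-zero-loss one dropped
```

The runtime benefit comes directly from the shrinking epoch size: each decay step removes samples that contribute little gradient signal, so later epochs process fewer inputs.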
|
Causes of Carryover |
Not applicable.
|