2022 Fiscal Year Annual Research Report
Machine learning driven system level heterogeneous memory management for high-performance computing
Project/Area Number |
19K11993
|
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator |
GEROFI BALAZS 国立研究開発法人理化学研究所, 計算科学研究センター, 上級研究員 (70633501)
|
Project Period (FY) |
2019-04-01 – 2023-03-31
|
Keywords | Memory access tracing / Runtime approximation / Distributed ML / Neural network training / I/O of deep learning |
Outline of Annual Research Achievements |
Results have been achieved in two parallel efforts of the project. We found that system-software-level heterogeneous memory management solutions utilizing machine learning, in particular nonsupervised learning- based methods such as reinforcement learning, require rapid estimation of execution runtime as a function of the data layout across memory devices for exploring different data placement strategies, which renders architecture-level simulators impractical for this purpose. We proposed a differential tracing-based approach using memory access traces obtained by high-frequency sampling-based methods (e.g., Intel's PEBS) on real hardware using of different memory devices. We developed a runtime estimator based on such traces that provides an execution time estimate orders of magnitude faster than full-system simulators. On a number of HPC mini applications we showed that the estimator predicts runtime with an average error of 4.4% compared to measurements on real hardware. For the deep learning data shuffling subtopic, we investigated the viability of partitioning the dataset among DL workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2048 GPUs of ABCI and 4096 compute nodes of Fugaku, we demonstrated that in practice validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provided an implementation in PyTorch that enables users to control the proposed data exchange scheme.
|
Research Products
(4 results)