2020 Fiscal Year Research-status Report
Machine learning driven system level heterogeneous memory management for high-performance computing
Project/Area Number |
19K11993
|
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator |
GEROFI BALAZS RIKEN, Center for Computational Science, Senior Scientist (70633501)
|
Project Period (FY) |
2019-04-01 – 2022-03-31
|
Keywords | Memory access tracking / Neural network training / I/O of deep learning |
Outline of Annual Research Achievements |
We have completed the extension of the gem5 simulator to support heterogeneous memory systems by adding the ability to define an arbitrary number of memory devices, each with its own performance characteristics. We also completed the Python interface for real-time communication of memory accesses between gem5 and PyTorch, and developed test codes that run simple analyses on the captured data. Because of the high runtime overhead of gem5, we also started working on a simplified simulator that is based on the leading-loads model and uses gem5 results; this runtime estimator will be better suited to being plugged into a reinforcement learning framework. As a side topic, we started exploring the I/O implications of large-scale training, which need to be addressed for distributed training of large neural networks in supercomputing environments.
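To illustrate the idea behind a leading-loads runtime estimator, the following is a minimal sketch and not the project's implementation. It assumes a trace of cache-miss records given as (issue time in ns, device name) pairs and a fixed average latency per memory device; the function names, the trace format, and the latency values are all hypothetical. A miss counts as a "leading load" only if it is issued while no earlier leading load is still outstanding, and only leading loads contribute stall time, on the assumption that overlapping misses are hidden behind them.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical trace record: (issue_time_ns, device_name).
Miss = Tuple[float, str]


def leading_load_stall_ns(misses: Iterable[Miss],
                          latency_ns: Dict[str, float]) -> float:
    """Sum the latencies of leading loads only.

    A miss issued while a previous leading load is still outstanding
    is assumed to overlap with it and adds no stall time.
    """
    stall = 0.0
    busy_until = float("-inf")  # completion time of the current leading load
    for issue_ns, device in sorted(misses):
        if issue_ns >= busy_until:
            # No leading load outstanding: this miss becomes the leading load.
            latency = latency_ns[device]
            stall += latency
            busy_until = issue_ns + latency
    return stall


def estimate_runtime_ns(compute_ns: float,
                        misses: Iterable[Miss],
                        latency_ns: Dict[str, float]) -> float:
    """Estimated runtime = compute time + leading-load stall time."""
    return compute_ns + leading_load_stall_ns(misses, latency_ns)


if __name__ == "__main__":
    # Assumed (illustrative) average latencies for two memory devices.
    latencies = {"dram": 100.0, "nvm": 300.0}
    trace = [(0.0, "dram"), (50.0, "nvm"), (200.0, "nvm"),
             (250.0, "dram"), (600.0, "dram")]
    print(estimate_runtime_ns(1000.0, trace, latencies))
```

Because the estimate depends only on the miss trace and a per-device latency table, changing the assumed placement of data across devices only requires relabeling the trace, which is far cheaper than rerunning a full simulation.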
|
Current Status of Research Progress |
4: Progress in research has been delayed.
Reason
The postdoctoral researcher who was scheduled to work on this project could not come to Japan due to COVID-19 and resigned from his RIKEN position. We currently lack the staff needed for the agenda to progress as originally planned.
|
Strategy for Future Research Activity |
Continue implementation of the leading-loads-based runtime estimator. Continue exploration of memory-sensitive applications. Start investigating an alternative runtime estimator based on precise event-based sampling on heterogeneous memory platforms (Intel Optane+DRAM or DRAM+MCDRAM configurations as primary targets). Continue development of I/O improvements for large-scale training.
|
Causes of Carryover |
Most of the funds will be used for renting compute capacity to run experiments. Depending on the COVID-19 situation, some of the funds may be used for international travel.
|
Research Products
(2 results)