2019 Fiscal Year Research-status Report
Machine learning driven system level heterogeneous memory management for high-performance computing
Project/Area Number | 19K11993
Research Institution | Institute of Physical and Chemical Research (RIKEN)
Principal Investigator | GEROFI BALAZS, RIKEN, Center for Computational Science, Senior Scientist (70633501)
Project Period (FY) | 2019-04-01 – 2022-03-31
Keywords | heterogeneous memory / gem5 / architectural simulator / non-uniform memory / machine learning / reinforcement learning / long short-term memory / transformer attention
Outline of Annual Research Achievements
Modified the gem5 architecture simulator to support heterogeneous memory device technologies, exposing the different devices as separate NUMA nodes. The simulator has also been extended with NUMA-specific system calls that allow controlling access to the underlying memory devices. Implemented a distributed reinforcement learning framework and explored various learning algorithms, e.g., vanilla policy gradients and proximal policy optimization. Explored various neural network architectures, such as basic recurrent networks, long short-term memory (LSTM) networks, and attention-based (Transformer) networks. Investigated PyTorch memory management issues on many-core CPUs and identified a number of operating system level improvements that can accelerate the training process on CPUs. Presented the idea and project status at the following venues: 1) the Fifth Workshop on Programming Abstractions for Data Locality (PADAL'19); 2) the Data-Centric Operating Systems and Runtimes mini-symposium at the 2020 SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP'20). Part of this work was performed by Aleix Roca Nonell from the Barcelona Supercomputing Center, who visited us as an internship student between November 2018 and February 2019.
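For illustration, the following is a minimal sketch of how two memory technologies can be placed behind separate physical address ranges in a gem5 Python configuration, assuming a recent gem5 (v21 or later, where DRAM interfaces such as DDR4_2400_16x4 and HBM_1000_4H_1x128 are wrapped by MemCtrl). The NUMA node exposure and the NUMA-specific system calls are our local modifications and are not part of stock gem5; CPU and workload setup are omitted.

    # Sketch: two memory technologies behind separate address ranges,
    # one range per emulated NUMA node (gem5 v21+ API assumed).
    from m5.objects import (System, SystemXBar, SrcClockDomain, VoltageDomain,
                            AddrRange, MemCtrl, DDR4_2400_16x4,
                            HBM_1000_4H_1x128)

    system = System()
    system.clk_domain = SrcClockDomain(clock='2GHz',
                                       voltage_domain=VoltageDomain())
    system.mem_mode = 'timing'

    # Node 0: 2 GB of DDR4; node 1: 1 GB of HBM directly above it.
    system.mem_ranges = [AddrRange('2GB'),
                         AddrRange(start='2GB', size='1GB')]

    system.membus = SystemXBar()

    ddr = MemCtrl(dram=DDR4_2400_16x4(range=system.mem_ranges[0]))
    hbm = MemCtrl(dram=HBM_1000_4H_1x128(range=system.mem_ranges[1]))
    ddr.port = system.membus.mem_side_ports
    hbm.port = system.membus.mem_side_ports
    system.mem_ctrls = [ddr, hbm]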
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
As opposed to the original proposal, in which all experiments were to be executed on actual hardware, we instead simulate the hardware in the gem5 simulator. The main advantage of this method is that it enables us to explore a wider range of possible hardware configurations. On the negative side, experiments take more time due to the slowdown incurred by simulation.
With respect to machine learning, we have already explored various learning algorithms and are in the process of integrating the simulator into a distributed machine learning framework that will in turn drive the simulation.
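As a concrete illustration of this class of algorithms, below is a minimal, self-contained PyTorch sketch of a vanilla policy gradient (REINFORCE) update with an LSTM policy. The observation dimension, action set, and the dummy trajectory are placeholders rather than our actual training configuration.

    import torch
    import torch.nn as nn

    class LSTMPolicy(nn.Module):
        # Recurrent policy: observation sequence -> placement-action logits.
        def __init__(self, obs_dim, hidden_dim, n_actions):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_actions)

        def forward(self, obs_seq, state=None):
            out, state = self.lstm(obs_seq, state)
            return self.head(out), state

    obs_dim, n_actions = 16, 4           # e.g., counter vector -> target node
    policy = LSTMPolicy(obs_dim, 64, n_actions)
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

    # One REINFORCE update on a dummy trajectory (batch=1, T=32 steps).
    obs = torch.randn(1, 32, obs_dim)
    logits, _ = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    returns = torch.randn(1, 32)         # stand-in for discounted rewards
    loss = -(dist.log_prob(actions) * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()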
Strategy for Future Research Activity
This year we will continue working on extensions to the gem5 simulator. We are switching to RIKEN's ARM-based branch, which was recently released to the public and which also supports OpenMP in system call emulation mode. The first immediate step is to port our previous changes to this new baseline. This work is performed in collaboration with Swann Perarnau and Nicolas Denoyelle from Argonne National Laboratory in the US.
The next step for gem5 will be an extension that allows machine learning frameworks to interact with the simulation. We envision this as an inter-process communication channel through which PyTorch or TensorFlow can obtain state information from the simulation, e.g., memory access patterns and performance counters. At the same time, it will also enable the ML algorithm to control memory address space layout changes in the gem5 simulation, so that data can be placed in, and moved across, the different memory devices.
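To make the envisioned channel concrete, the sketch below shows what the framework-side client could look like. The Unix socket path, the newline-delimited JSON wire format, and the command names are hypothetical design placeholders at this stage.

    import json
    import socket

    class SimChannel:
        # Hypothetical client for a gem5-side IPC endpoint:
        # newline-delimited JSON requests/replies over a Unix socket.
        def __init__(self, path='/tmp/gem5.sock'):
            self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self.sock.connect(path)
            self.rfile = self.sock.makefile('r')

        def _call(self, msg):
            self.sock.sendall((json.dumps(msg) + '\n').encode())
            return json.loads(self.rfile.readline())

        def get_state(self):
            # E.g., {'counters': [...], 'cycles': ..., 'finished': ...}
            return self._call({'cmd': 'state'})

        def move_range(self, vaddr, length, node):
            # Ask the simulated OS to migrate [vaddr, vaddr+length) to node.
            return self._call({'cmd': 'move', 'addr': vaddr,
                               'len': length, 'node': node})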
As for machine learning, a new postdoctoral researcher, Marco Capucinni, has joined RIKEN's system software team as of April 2020. He will work on exploring further learning algorithms and on implementing a reinforcement learning Gym environment that drives the gem5 simulation. In addition, he will work on the overall integration of the distributed training that drives the gem5 simulator.
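A rough sketch of how such a Gym environment could wrap the simulator is shown below. The observation layout, the reward definition (negative simulated cycle count), and the fixed hot region in step() are hypothetical placeholders; channel refers to an IPC client such as the SimChannel sketched above.

    import gym
    import numpy as np
    from gym import spaces

    class Gem5PlacementEnv(gym.Env):
        # Hypothetical Gym wrapper driving a running gem5 instance
        # over an IPC channel.
        def __init__(self, channel, n_nodes=2, obs_dim=16):
            super().__init__()
            self.chan = channel
            self.action_space = spaces.Discrete(n_nodes)  # target memory node
            self.observation_space = spaces.Box(
                low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)

        def reset(self):
            state = self.chan.get_state()
            return np.asarray(state['counters'], dtype=np.float32)

        def step(self, action):
            # Placeholder: migrate one fixed 2 MiB hot region; a real
            # environment would track candidate regions dynamically.
            self.chan.move_range(0x10000000, 2 << 20, int(action))
            state = self.chan.get_state()
            obs = np.asarray(state['counters'], dtype=np.float32)
            reward = -float(state['cycles'])  # fewer cycles is better
            done = bool(state.get('finished', False))
            return obs, reward, done, {}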
Causes of Carryover
In this fiscal year the main source of expenditure will be the cost of renting compute resources. The combination of architectural simulation and reinforcement learning based exploration requires a significant amount of computation. We are considering using a large-scale CPU-based cluster, e.g., the OBCX machine at the University of Tokyo or the upcoming Fugaku machine at RIKEN.
Research Products | (2 results)