Research Project/Area Number | 19K11993
Research Institution | RIKEN
Principal Investigator | GEROFI BALAZS, RIKEN Center for Computational Science, Senior Research Scientist (70633501)
Project Period (FY) | 2019-04-01 – 2022-03-31
Keywords | heterogeneous memory / gem5 / architectural simulator / non-uniform memory / machine learning / reinforcement learning / long short-term memory / transformer attention
Outline of Annual Research Achievements |
Modified the gem5 architectural simulator to support heterogeneous memory device technologies, where the different devices are exposed as separate NUMA nodes. The simulator has also been extended to support the NUMA-specific system calls that control access to the underlying memory devices. Implemented a distributed reinforcement learning framework and explored various learning algorithms, e.g., vanilla policy gradient and proximal policy optimization (PPO). Explored various neural network architectures, such as basic recurrent networks, long short-term memory (LSTM) networks, and attention-based (Transformer) networks. Investigated PyTorch memory management issues on many-core CPUs and identified a number of operating system level improvements that can accelerate the training process on CPUs.
Presented the idea and project status at the following venues: 1.) the Fifth Workshop on Programming Abstractions for Data Locality (PADAL'19), and 2.) the Data-Centric Operating Systems and Runtimes mini-symposium at the 2020 SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP'20). Part of this work was performed by Aleix Roca Nonell from the Barcelona Supercomputing Center, who visited us as an intern between November 2018 and February 2019.
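To illustrate the kind of placement control the extended simulator now emulates at the system-call level (mbind/set_mempolicy and friends), the minimal sketch below binds an allocation to a chosen NUMA node through libnuma. It is an illustrative example only, not part of the gem5 changes themselves, and assumes libnuma is installed on the host.

    # Illustrative only: per-node allocation via libnuma, the same kind
    # of placement control the extended simulator emulates.
    import ctypes

    libnuma = ctypes.CDLL("libnuma.so.1", use_errno=True)
    libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
    libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
    libnuma.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    def alloc_on_node(size, node):
        """Allocate `size` bytes physically backed by NUMA node `node`."""
        if libnuma.numa_available() < 0:
            raise RuntimeError("NUMA is not available on this system")
        ptr = libnuma.numa_alloc_onnode(size, node)
        if not ptr:
            raise MemoryError(f"allocation on node {node} failed")
        return ptr

    buf = alloc_on_node(1 << 20, 1)   # 1 MiB on node 1 (e.g., HBM vs. DRAM)
    libnuma.numa_free(buf, 1 << 20)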
|
Current Status of Progress (Category) |
2: Progressing rather smoothly
Reason
As opposed to the original proposal of executing all experiments on actual hardware, we took a different approach of simulating the hardware in the gem5 simulator. The main advantage of this method is that it enables us to explore a wider range of possible hardware configurations; on the negative side, experiments take more time due to the slower execution speed of the simulator.
With respect to machine learning, we have already explored various learning algorithms and are in the process of integrating the simulator into a distributed machine learning framework that will in turn drive the simulation.
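As an illustration of the simplest of the learning algorithms explored, the sketch below shows a vanilla policy-gradient (REINFORCE) update in PyTorch; the network dimensions and the batch interface are placeholders, not the framework's actual code.

    # Minimal REINFORCE-style policy-gradient update (illustrative only;
    # layer sizes and the data interface are placeholders).
    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def update(states, actions, returns):
        """One gradient step on a batch of (state, action, return) samples."""
        logits = policy(states)                        # [batch, num_actions]
        log_probs = torch.log_softmax(logits, dim=-1)
        taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = -(taken * returns).mean()               # maximize expected return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()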
|
Strategy for Future Research Activity |
This year we will continue working on extensions to the gem5 simulator. We are switching to RIKEN's Arm-based branch, which was recently released to the public and which also supports OpenMP in system-call emulation mode. The first immediate step is to port our previous changes to this new baseline. This work is performed in collaboration with Swann Perarnau and Nicolas Denoyelle from Argonne National Laboratory in the US.
The next step for gem5 will be an extension that allows machine learning frameworks to interact with the simulation. We envision this through an inter-process communication channel that will allow PyTorch or TensorFlow to obtain state information from the simulation, e.g., memory access patterns and performance counters. At the same time, it will also enable the ML algorithm to control memory address space layout changes in the gem5 simulation, so that data can be placed in and moved across different memory devices.
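A minimal sketch of the ML-side end of such a channel is shown below, assuming a hypothetical newline-delimited JSON protocol over a Unix domain socket; the socket path and message fields are illustrative, since this interface does not exist in gem5 yet.

    # Hypothetical ML-side client for the envisioned gem5 IPC channel.
    # Protocol, socket path, and message fields are all placeholders.
    import json
    import socket

    SOCK_PATH = "/tmp/gem5-ml.sock"   # placeholder path

    def step(sock, placement):
        """Send a placement decision, receive the resulting simulator state."""
        sock.sendall((json.dumps({"move_pages": placement}) + "\n").encode())
        reply = sock.makefile().readline()
        return json.loads(reply)      # e.g., performance counters, access stats

    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCK_PATH)
        state = step(s, {"region": 0, "target_node": 1})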
As for machine learning, a new postdoctoral researcher, Marco Capucinni, joined RIKEN's system software team as of April 2020. He will work on exploring further learning algorithms and on implementing a reinforcement learning Gym environment that drives the gem5 simulation. In addition, he will work on the overall integration of the distributed training that drives the simulator.
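As a rough sketch of the planned Gym environment, the skeleton below casts the placement of memory regions across NUMA nodes as the action space; the observation space, reward, and the bridge to gem5 are placeholders for the actual integration.

    # Skeleton of the planned Gym environment driving gem5 (illustrative:
    # spaces, reward, and the gem5 bridge are placeholders).
    import gym
    import numpy as np
    from gym import spaces

    class Gem5MemoryEnv(gym.Env):
        """Agent decides on which memory device (NUMA node) to place data."""

        def __init__(self, num_regions=16, num_nodes=2):
            self.action_space = spaces.MultiDiscrete([num_regions, num_nodes])
            self.observation_space = spaces.Box(
                0.0, np.inf, shape=(num_regions,), dtype=np.float32)

        def reset(self):
            # Would restart or checkpoint the gem5 simulation here.
            return np.zeros(self.observation_space.shape, dtype=np.float32)

        def step(self, action):
            region, node = action
            # Would forward the placement to gem5 over the IPC channel and
            # read back performance counters; reward could be, e.g., negative
            # stall cycles. Placeholder values below.
            obs = np.zeros(self.observation_space.shape, dtype=np.float32)
            reward, done = 0.0, False
            return obs, reward, done, {}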
|
Causes of Carryover |
In this fiscal year, the main source of expenditure will be the cost of renting compute resources. The combination of architectural simulation and reinforcement learning based exploration requires a significant amount of computation. We are considering using a large-scale CPU-based cluster, e.g., the OBCX machine at the University of Tokyo or the upcoming Fugaku machine at RIKEN.
|