
2020 Fiscal Year Research-status Report

Machine learning driven system level heterogeneous memory management for high-performance computing

Research Project

Project/Area Number 19K11993
Research Institution Institute of Physical and Chemical Research

Principal Investigator

GEROFI BALAZS  RIKEN, Center for Computational Science, Senior Scientist (70633501)

Project Period (FY) 2019-04-01 – 2022-03-31
Keywords Memory access tracking / Neural network training / I/O of deep learning
Outline of Annual Research Achievements

We completed the extension of the gem5 simulator for heterogeneous memory systems, adding the capability to define an arbitrary number of memory devices, each with its own performance characteristics.
We completed the Python interface for real-time communication of memory accesses between gem5 and PyTorch, and developed test codes that run simple analyses on the captured data.
Because of the high runtime overhead of gem5, we also started working on a simplified simulator based on the leading-loads model and calibrated with gem5 results. This runtime estimator will be better suited for plugging into a reinforcement learning framework.
As a side topic, we started exploring the I/O implications of large-scale distributed training of large neural networks in supercomputing environments.
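
The leading-loads idea mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which runtime is split into a part that does not depend on memory and a part spent waiting on "leading" loads, that is, the first miss in each group of overlapping misses. The class `MemoryDevice`, the function `estimate_runtime_ns`, and all latency and count values are hypothetical and are not taken from the project's code or from gem5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDevice:
    """Hypothetical description of one memory device (illustrative fields only)."""
    name: str
    load_latency_ns: float  # assumed average latency of a load served by this device


def estimate_runtime_ns(compute_ns, leading_loads, devices):
    """Leading-loads style estimate: compute time plus latency of each leading load.

    compute_ns    -- time not stalled on memory, assumed independent of data placement
    leading_loads -- mapping from device name to the number of leading loads it serves
    devices       -- mapping from device name to MemoryDevice
    """
    memory_ns = sum(
        count * devices[name].load_latency_ns
        for name, count in leading_loads.items()
    )
    return compute_ns + memory_ns


# Two made-up devices; the latencies are placeholders, not measurements.
devices = {
    "fast": MemoryDevice("fast", load_latency_ns=100.0),
    "slow": MemoryDevice("slow", load_latency_ns=300.0),
}

# Same workload under two hypothetical data placements.
all_fast = estimate_runtime_ns(1_000_000.0, {"fast": 5_000}, devices)
split = estimate_runtime_ns(1_000_000.0, {"fast": 3_000, "slow": 2_000}, devices)
print(all_fast, split)
```

Because such an estimate is a handful of multiplications, it can be evaluated many times per second, which is what makes this kind of model attractive as an environment for a learning agent compared with a full cycle-level simulation.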

Current Status of Research Progress

4: Progress in research has been delayed.

Reason

The postdoctoral researcher who was scheduled to work on this project could not come to Japan due to COVID-19 and resigned from his RIKEN position. We currently lack the staff needed to progress as originally planned.

Strategy for Future Research Activity

Continue implementing the leading-loads-based runtime estimator.
Continue exploring memory-sensitive applications.
Start investigating an alternative runtime estimator based on precise event-based sampling on heterogeneous memory platforms (Intel Optane+DRAM or DRAM+MCDRAM configurations as primary targets).
Continue developing I/O improvements for large-scale training.

Causes of Carryover

Most of the funds will be used for renting compute capacity to run experiments.
Depending on the COVID situation, some of the funds may be used for international travel.

  • Research Products

    (2 results)

All Journal Article (1 results) (of which Int'l Joint Research: 1 results,  Peer Reviewed: 1 results) Presentation (1 results) (of which Int'l Joint Research: 1 results,  Invited: 1 results)

  • [Journal Article] Why Globally Re-shuffle? An I/O Perspective on Data Shuffling in Large Scale Deep Learning (2021)

    • Author(s)
      TruongThao Nguyen, Balazs Gerofi, Liao Jianwei, Francois Trahay, Mohamed Wahib
    • Journal Title

      International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) [submitted]

      Volume: 1 Pages: 10

    • Peer Reviewed / Int'l Joint Research
  • [Presentation] Directions for Operating Systems Research (2021)

    • Author(s)
      Balazs Gerofi
    • Organizer
      DOE ASCR OS Research Roundtable'21
    • Int'l Joint Research / Invited

Published: 2021-12-27  
