2021 Fiscal Year Final Research Report

Acceleration of large-scale deep learning by optimizing parallel I/O

Research Project

Project/Area Number	20K19811
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 60090: High performance computing-related
Research Institution	Institute of Physical and Chemical Research
Principal Investigator	Sato Kento RIKEN (National Research and Development Agency), Center for Computational Science, Team Leader (50739696)
Project Period (FY)	2020-04-01 – 2022-03-31
Keywords	High-performance computing / Large-scale computing / Deep learning / Machine learning / I/O / Storage
Outline of Final Research Achievements	Applications that read large amounts of training data, such as large-scale distributed deep learning, are often limited by insufficient system I/O performance, so I/O performance is becoming increasingly important for supporting them. To optimize I/O performance, we investigated I/O performance on the supercomputer Fugaku and accelerated I/O through data compression. In particular, findings from this project contributed in part to the development of software for deep learning frameworks and to the MLPerf HPC benchmark evaluation. As a result, we achieved the world's fastest performance on CosmoFlow, one of the MLPerf HPC benchmarks, using about half of Fugaku's nodes.
Free Research Field	High-performance computing
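Academic Significance and Societal Importance of the Research Achievements	In recent years, research on artificial intelligence, typified by deep learning, has been actively pursued, and in industry artificial intelligence has reached the level of practical use in various forms. Computation in deep learning is divided into a "training phase," in which a model is built, and an "inference phase," in which the trained model is used to perform actual prediction and recognition tasks such as image recognition. In deep learning, quickly building a model that enables more accurate prediction and recognition is an important factor. This research project aims to accelerate the training phase on large-scale systems such as supercomputers, and we believe its academic and societal significance is high.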