Acceleration of large-scale deep learning by optimizing parallel I/O

Research Project

Project/Area Number	20K19811
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 60090: High performance computing-related
Research Institution	Institute of Physical and Chemical Research
Principal Investigator	Sato Kento RIKEN, Center for Computational Science, Team Leader (50739696)
Project Period (FY)	2020-04-01 – 2022-03-31
Project Status	Completed (Fiscal Year 2021)
Budget Amount	¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000); Fiscal Year 2021: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000); Fiscal Year 2020: ¥2,470,000 (Direct Cost: ¥1,900,000, Indirect Cost: ¥570,000)
Keywords	High-performance computing / Large-scale computing / Deep learning / Machine learning / I/O / Storage / Fugaku / Arm / Tuning / Parallel I/O
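Outline of Research at the Start	In recent years deep learning has been pursued actively, and large-scale computers have become essential for training larger models on more complex problems. However, in large-scale deep learning on large shared computers used by multiple users, the performance of the shared global file system (GFS) is low relative to compute performance. This causes a "scaling problem of large-scale deep learning": no matter how many compute resources (CPUs and GPUs) are added for training, training speed does not improve any further. This research proposal aims to accelerate large-scale deep learning in large shared computing environments by improving parallel I/O performance by a factor of 10 or more.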
Outline of Final Research Achievements	Applications that read large amounts of training data, such as large-scale distributed deep learning, are limited by insufficient system I/O performance, so I/O performance is becoming increasingly important for supporting such applications. To optimize I/O performance, we investigated I/O performance on the supercomputer Fugaku and accelerated I/O through data compression. In particular, findings from this project contributed in part to the development of software for deep learning frameworks and to the MLPerf HPC benchmark evaluation. As a result, we achieved the world's fastest performance on CosmoFlow, one of the MLPerf HPC benchmarks, using about half of the Fugaku nodes.
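Academic Significance and Societal Importance of the Research Achievements	In recent years, research on artificial intelligence, typified by deep learning, has been pursued actively, and in industry artificial intelligence has reached practical use in various forms. Computation in deep learning is divided into a "training phase", which builds a model, and an "inference phase", which uses the built model to perform actual prediction and recognition tasks such as image recognition. In deep learning, quickly building models that enable more accurate prediction and recognition is a key factor. This project addresses the acceleration of the training phase on large-scale systems such as supercomputers, and we consider its academic and societal significance to be high.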

Report

(3 results)

2021 Annual Research Report
Final Research Report (PDF)
2020 Research-status Report

Research Products
(14 results)

Int'l Joint Research (3 results), Journal Article (5 results; of which Int'l Joint Research: 4 results, Peer Reviewed: 5 results), Presentation (2 results; of which Int'l Joint Research: 1 result), Remarks (4 results)

[Int'l Joint Research] Sun Yat-Sen University/Xi'an Univ. of Finance and Economics (China)
- Related Report
  2021 Annual Research Report
[Int'l Joint Research] Lawrence Berkeley National Laboratory/Argonne National Laboratory/Oak Ridge National Laboratory (USA)
- Related Report
  2021 Annual Research Report
[Int'l Joint Research] Florida State University (USA)
- Related Report
  2020 Research-status Report
[Journal Article] Social Media Driven Big Data Analysis for Disaster Situation Awareness: A Tutorial (2022)
- Author(s)
  Pal Amitangshu, Wang Junbo, Wu Yilang, Kant Krishna, Liu Zhi, Sato Kento
- Journal Title
  IEEE Transactions on Big Data
  Volume: - Issue: 1 Pages: 1-1
- DOI
  10.1109/tbdata.2022.3158431
- Related Report
  2021 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Semi-Synchronous Federated Learning Protocol with Dynamic Aggregation in Internet of Vehicles (2022)
- Author(s)
  Liang Feiyuan, Yang Qinglin, Liu Ruiqi, Wang Junbo, Sato Kento, Guo Jian
- Journal Title
  IEEE Transactions on Vehicular Technology
  Volume: - Issue: 5 Pages: 1-1
- DOI
  10.1109/tvt.2022.3148872
- Related Report
  2021 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] The 16,384-node Parallelism of 3D-CNN Training on An Arm CPU based Supercomputer (2021)
- Author(s)
  Tabuchi Akihiro, Shirahata Koichi, Yamazaki Masafumi, Kasagi Akihiko, Honda Takumi, Kurihara Kouji, Kawakami Kentaro, Tabaru Tsuguchika, Fukumoto Naoto, Kuroda Akiyoshi, Fukai Takaaki, Sato Kento
- Journal Title
  2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)
  Volume: - Pages: 152-161
- DOI
  10.1109/hipc53243.2021.00029
- Related Report
  2021 Annual Research Report
- Peer Reviewed
[Journal Article] MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems (2021)
- Author(s)
  Farrell Steven, Emani Murali, Balma Jacob, Drescher Lukas, Drozd Aleksandr, Fink Andreas, Fox Geoffrey, Kanter David, Kurth Thorsten, Mattson Peter, Mu Dawei, Ruhela Amit, Sato Kento, Shirahata Koichi, Tabaru Tsuguchika, et al.
- Journal Title
  2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
  Volume: - Pages: 33-45
- DOI
  10.1109/mlhpc54614.2021.00009
- Related Report
  2021 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Compression of Time Evolutionary Image Data through Predictive Deep Neural Networks (2021)
- Author(s)
  Rupak Roy, Kento Sato, Subhadeep Bhattacharya, Xingang Fang, Yasumasa Joti, Takaki Hatsui, Toshiyuki Hiraki, Jian Guo, and Weikuan Yu
- Journal Title
  21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
  Volume: -
- Related Report
  2020 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Presentation] Measurement of I/O Performance on a Hierarchical File System for Distributed Deep Neural Network (2022)
- Author(s)
  Takaaki Fukai, Kento Sato
- Organizer
  The 4th R-CCS International Symposium (RCCS-IS4)
- Related Report
  2021 Annual Research Report
- Int'l Joint Research
[Presentation] Measurement of I/O performance for distributed deep neural networks on Fugaku (2021)
- Author(s)
  Takaaki Fukai, Kento Sato
- Organizer
  The 3rd R-CCS International Symposium
- Related Report
  2020 Research-status Report
[Remarks] High Performance Big Data Research Team
- URL
  https://www.hpbd.r-ccs.riken.jp
- Related Report
  2021 Annual Research Report
[Remarks] Compression of Time Evolutionary Image Data ... (abbreviated)
- URL
  https://www.hpbd.r-ccs.riken.jp/hpbd/en/research/
- Related Report
  2020 Research-status Report
[Remarks] HPC and AI Initiatives for Supercomputer Fugaku (abbreviated)
- URL
  https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article09.pdf
- Related Report
  2020 Research-status Report
[Remarks] Building and Optimizing Deep Learning Frameworks on Fugaku and the MLPerf HPC Benchmark
- URL
  https://www.riken.jp/pr/news/2020/20201119_1/index.html
- Related Report
  2020 Research-status Report
