2021 Fiscal Year Research-status Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751 |
Research Institution | National Institute of Advanced Industrial Science and Technology |
Principal Investigator | Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Keywords | Deep Learning / Large-scale / Distributed computing |
Outline of Annual Research Achievements |
In this fiscal year, we studied a method to reduce the I/O time of large-scale training with an extremely large dataset. We revisited data shuffling in deep learning workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrated that, in practice, the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme. Our submission to IPDPS 2022 was accepted.
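To make the exchange step concrete, the following is a minimal sketch of a per-epoch partial exchange, assuming torch.distributed has been initialized and every worker holds an equally sized list of local sample indices. Names such as `partial_exchange` and `exchange_fraction` are illustrative and are not the exact API of our released PyTorch solution.

```python
import torch
import torch.distributed as dist

def partial_exchange(local_indices, exchange_fraction=0.1, epoch=0):
    """Swap a fraction of this worker's sample indices with a ring neighbour,
    then shuffle locally; a cheap stand-in for a full global shuffle."""
    rank, world = dist.get_rank(), dist.get_world_size()

    # Deterministic per-epoch, per-rank permutation so each rank picks a fresh chunk to send.
    g = torch.Generator().manual_seed(epoch * world + rank)
    idx = torch.tensor(local_indices, dtype=torch.int64)
    idx = idx[torch.randperm(len(idx), generator=g)]

    n_swap = int(len(idx) * exchange_fraction)
    send_buf, keep = idx[:n_swap].contiguous(), idx[n_swap:]

    # Ring exchange: send the selected chunk to rank+1, receive from rank-1
    # (assumes equal partition sizes, so recv_buf matches send_buf in shape).
    recv_buf = torch.empty_like(send_buf)
    right, left = (rank + 1) % world, (rank - 1) % world
    reqs = [dist.isend(send_buf, dst=right), dist.irecv(recv_buf, src=left)]
    for r in reqs:
        r.wait()

    # New local partition: retained indices plus the received samples,
    # shuffled again before building this epoch's sampler.
    new_idx = torch.cat([keep, recv_buf])
    return new_idx[torch.randperm(len(new_idx), generator=g)].tolist()
```

A sampler then simply iterates over the returned index list for that epoch, so only a small fraction of the dataset crosses the network instead of a full global shuffle.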
We also studied a method to reduce the communication time of deep learning training through a co-design of the network architecture and the collective communication algorithm. We propose to use a Kautz network as the inter-memory network, built on the switchless OPTWEB FPGA interconnect, together with multi-port collective communications to mitigate the influence of startup latency on the execution time. Based on our experimental results with OPTWEB on custom Stratix 10 FPGA cards, SimGrid simulations show that our collective communication is 7x faster than that of a Dragonfly network with 272 FPGAs. This work was also accepted at IPDPS 2022.
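As a rough illustration of why multi-port collectives reduce the influence of startup latency, the sketch below uses the standard latency-bandwidth (alpha-beta) cost model for an allgather. The constants, message size, and port counts are placeholders for illustration only; they are not measurements of the OPTWEB system and do not reproduce our SimGrid configuration.

```python
import math

def allgather_cost(num_nodes, bytes_per_node, ports, alpha=1.0e-6, beta=0.1e-9):
    """Latency-bandwidth estimate of a k-port allgather: each round the data
    held by a node grows by a factor of (ports + 1), so the number of startup
    latencies paid drops from log2(P) to log_{ports+1}(P)."""
    steps = math.ceil(math.log(num_nodes, ports + 1))
    startup = steps * alpha                                      # per-message latency term
    transfer = (num_nodes - 1) * bytes_per_node * beta / ports   # receive-bandwidth term
    return startup + transfer

if __name__ == "__main__":
    for ports in (1, 2, 4, 8):
        t = allgather_cost(num_nodes=272, bytes_per_node=4 * 1024, ports=ports)
        print(f"{ports}-port allgather over 272 nodes: ~{t * 1e6:.2f} us")
```

For small messages the startup term dominates the total time, which is where increasing the number of ports used per node pays off the most.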
Current Status of Research Progress |
1: Research has progressed more than it was originally planned.
Reason
Our proposed partial local shuffling enables training a model on very large datasets, which was not possible before. For example, it improves the training accuracy on the DEEPCAM dataset by about 2%. For common datasets, it reduces the total training time by up to 70%.
Strategy for Future Research Activity |
We are going to extend the work on I/O to reduce the overhead of partial local shuffling, e.g., the exchange phase, at scale.
In the next fiscal year, we will also develop new methods to reduce the computing time by eliminating unimportant samples during the training process. We will also study a method to reduce the communication time by overlapping communication with computation.
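One candidate pattern for this overlap is to launch a non-blocking all-reduce on each gradient as soon as it has been accumulated, so the collective proceeds while backward propagation continues for earlier layers. The sketch below assumes torch.distributed is initialized and a recent PyTorch that provides register_post_accumulate_grad_hook; the helper name `enable_gradient_overlap` is illustrative, and this is not the final implementation planned for next year.

```python
import torch
import torch.distributed as dist

def enable_gradient_overlap(model):
    """Launch an async all-reduce on each parameter's gradient as soon as it has
    been accumulated, so communication overlaps with the remaining backward pass."""
    handles = []
    world = dist.get_world_size()

    def hook(param):
        param.grad.div_(world)                              # average across workers
        handles.append(dist.all_reduce(param.grad, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return handles

# Per-iteration usage:
#   loss.backward()                 # hooks fire layer by layer, overlapping comms
#   for h in handles: h.wait()      # ensure all reductions have finished
#   handles.clear()
#   optimizer.step()
```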
Causes of Carryover |
In the next fiscal year, we will conduct a wide range of large-scale experiments on supercomputer systems. The carried-over funds will be used to pay for ABCI supercomputer usage.
Research Products
(2 results)