2021 Fiscal Year Research-status Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751 |
Research Institution | National Institute of Advanced Industrial Science and Technology |
Principal Investigator | Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Keywords | Deep Learning / Large-scale / Distributed computing |
Outline of Annual Research Achievements |
In this fiscal year, we studied a method to reduce the I/O time of large-scale training with an extremely large dataset. We revisited data shuffling in deep learning workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrated that, in practice, the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme. Our submission to IPDPS 2022 was accepted.
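To make the exchange step concrete, the following is a minimal sketch of a per-epoch partial exchange, assuming torch.distributed has been initialized and every worker holds an equally sized list of local sample indices. Names such as `partial_exchange` and `exchange_fraction` are illustrative and are not the exact API of our released PyTorch solution.

```python
import torch
import torch.distributed as dist

def partial_exchange(local_indices, exchange_fraction=0.1, epoch=0):
    """Swap a fraction of this worker's sample indices with a ring neighbour,
    then shuffle locally; a cheap stand-in for a full global shuffle."""
    rank, world = dist.get_rank(), dist.get_world_size()

    # Deterministic per-epoch, per-rank permutation so each rank picks a fresh chunk to send.
    g = torch.Generator().manual_seed(epoch * world + rank)
    idx = torch.tensor(local_indices, dtype=torch.int64)
    idx = idx[torch.randperm(len(idx), generator=g)]

    n_swap = int(len(idx) * exchange_fraction)
    send_buf, keep = idx[:n_swap].contiguous(), idx[n_swap:]

    # Ring exchange: send the selected chunk to rank+1, receive from rank-1
    # (assumes equal partition sizes, so recv_buf matches send_buf in shape).
    recv_buf = torch.empty_like(send_buf)
    right, left = (rank + 1) % world, (rank - 1) % world
    reqs = [dist.isend(send_buf, dst=right), dist.irecv(recv_buf, src=left)]
    for r in reqs:
        r.wait()

    # New local partition: retained indices plus the received samples,
    # shuffled again before building this epoch's sampler.
    new_idx = torch.cat([keep, recv_buf])
    return new_idx[torch.randperm(len(new_idx), generator=g)].tolist()
```

A sampler then simply iterates over the returned index list for that epoch, so only a small fraction of the dataset crosses the network instead of a full global shuffle.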
We also studied a method to reduce the communication time of deep learning training through a co-design of the network architecture and the collective communication algorithm. We propose to use a Kautz network as the inter-memory network, built on the switchless OPTWEB FPGA interconnect, together with multi-port collective communications to mitigate the influence of startup latency on the execution time. Based on our experimental results with OPTWEB on custom Stratix 10 FPGA cards, SimGrid simulations show that our collective communication is 7x faster than that of a Dragonfly network with 272 FPGAs. This work was also accepted at IPDPS 2022.
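As a rough illustration of why multi-port collectives reduce the influence of startup latency, the sketch below uses the standard latency-bandwidth (alpha-beta) cost model for an allgather. The constants, message size, and port counts are placeholders for illustration only; they are not measurements of the OPTWEB system and do not reproduce our SimGrid configuration.

```python
import math

def allgather_cost(num_nodes, bytes_per_node, ports, alpha=1.0e-6, beta=0.1e-9):
    """Latency-bandwidth estimate of a k-port allgather: each round the data
    held by a node grows by a factor of (ports + 1), so the number of startup
    latencies paid drops from log2(P) to log_{ports+1}(P)."""
    steps = math.ceil(math.log(num_nodes, ports + 1))
    startup = steps * alpha                                      # per-message latency term
    transfer = (num_nodes - 1) * bytes_per_node * beta / ports   # receive-bandwidth term
    return startup + transfer

if __name__ == "__main__":
    for ports in (1, 2, 4, 8):
        t = allgather_cost(num_nodes=272, bytes_per_node=4 * 1024, ports=ports)
        print(f"{ports}-port allgather over 272 nodes: ~{t * 1e6:.2f} us")
```

For small messages the startup term dominates the total time, which is where increasing the number of ports used per node pays off the most.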
Current Status of Research Progress |
1: Research has progressed more than it was originally planned.
Reason
Our proposed partial local shuffling enables training a model on very large datasets, which was not possible before. For example, it improves the training accuracy on the DEEPCAM dataset by about 2%. For common datasets, it reduces the total training time by up to 70%.
Strategy for Future Research Activity |
We are going to extend the work on I/O to reduce the overhead of partial local shuffling, e.g., the exchange phase, at scale.
In the next fiscal year, we will also develop new methods to reduce the computing time by eliminating unimportant samples during the training process. We will also study a method to reduce the communication time by overlapping communication with computation.
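One candidate pattern for this overlap is to launch a non-blocking all-reduce on each gradient as soon as it has been accumulated, so the collective proceeds while backward propagation continues for earlier layers. The sketch below assumes torch.distributed is initialized and a recent PyTorch that provides register_post_accumulate_grad_hook; the helper name `enable_gradient_overlap` is illustrative, and this is not the final implementation planned for next year.

```python
import torch
import torch.distributed as dist

def enable_gradient_overlap(model):
    """Launch an async all-reduce on each parameter's gradient as soon as it has
    been accumulated, so communication overlaps with the remaining backward pass."""
    handles = []
    world = dist.get_world_size()

    def hook(param):
        param.grad.div_(world)                              # average across workers
        handles.append(dist.all_reduce(param.grad, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return handles

# Per-iteration usage:
#   loss.backward()                 # hooks fire layer by layer, overlapping comms
#   for h in handles: h.wait()      # ensure all reductions have finished
#   handles.clear()
#   optimizer.step()
```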
Causes of Carryover |
In the next fiscal year, we will conduct a wide range of large-scale experiments on supercomputer systems. The carried-over funds will be used to pay for ABCI supercomputer usage.
Research Products
(2 results)