2023 Fiscal Year Final Research Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 60090: High performance computing-related
Research Institution | National Institute of Advanced Industrial Science and Technology
Principal Investigator | Nguyen TRUONG, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | Distributed Training / Large Model / Large dataset / Large scale system
Outline of Final Research Achievements | We addressed the memory-capacity limitation of training a large model by partitioning the model into multiple smaller parts (published in a Q1 journal, TNSM23). We also found that 3D parallelism (data + pipeline + tensor) has become the standard for training large-scale Deep Learning models on large datasets, and we proposed methods to speed up this training process. To reduce the I/O time, we use local shuffling together with pair-wise data exchange and model exchange to maintain the accuracy of the model (see the first sketch below); we published 3 papers (IPDPS22a, CCGRID23, CANDAR23) and a poster (HPCAsia24), and received 2 best paper awards. To reduce the computing time, we eliminate the processing of non-important samples during training (see the second sketch below; published at an A* conference, NeurIPS23). To reduce the communication time, we co-design the network architecture and the collective communication; we published 2 rank-A papers (IPDPS22b, CCGRID24), a Q1 journal paper (JPDC23), and a poster (HPCAsia23).
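The report only names the local-shuffling technique, so the following is a minimal, hypothetical Python sketch of the general idea rather than the published algorithm: each worker shuffles only its own shard (avoiding a full global shuffle over the dataset on disk), and randomly paired workers swap a small fraction of samples each epoch so the data gradually mixes across nodes. The names Worker, pairwise_exchange, and fraction are illustrative assumptions.

# Minimal sketch (hypothetical names): local shuffling with pair-wise
# sample exchange between workers, simulated in one process without MPI.
import random

class Worker:
    def __init__(self, rank, shard):
        self.rank = rank          # worker id
        self.shard = list(shard)  # data shard held locally by this worker

    def local_shuffle(self):
        # Shuffle only the local shard: no global I/O over the whole dataset.
        random.shuffle(self.shard)

def pairwise_exchange(a, b, fraction=0.1):
    # Two paired workers swap a small fraction of their samples each epoch,
    # gradually approximating a global shuffle at much lower I/O cost.
    k = max(1, int(len(a.shard) * fraction))
    a.shard[:k], b.shard[:k] = b.shard[:k], a.shard[:k]

# Toy run: 4 workers over a dataset of 40 samples.
data = list(range(40))
workers = [Worker(r, data[r * 10:(r + 1) * 10]) for r in range(4)]
for epoch in range(3):
    order = random.sample(workers, len(workers))  # random pairing per epoch
    for a, b in zip(order[::2], order[1::2]):
        pairwise_exchange(a, b)
    for w in workers:
        w.local_shuffle()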
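Likewise, the sample-elimination idea can be illustrated with a generic loss-based filter: samples whose current loss falls below a threshold are treated as non-important and excluded from the backward pass. This is a hedged sketch under that assumption, not the method of the NeurIPS23 paper; train_step and threshold are hypothetical names.

# Hedged sketch of importance-based sample skipping with PyTorch:
# per-sample losses below `threshold` are dropped from the backward pass.
import torch

def train_step(model, optimizer, loss_fn, x, y, threshold=0.05):
    optimizer.zero_grad()
    per_sample_loss = loss_fn(model(x), y)       # loss_fn uses reduction='none'
    keep = per_sample_loss > threshold           # mask of "important" samples
    if keep.any():
        per_sample_loss[keep].mean().backward()  # backprop on kept samples only
        optimizer.step()
    return int(keep.sum())                       # number of samples actually used

# Usage on a toy model and batch.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
used = train_step(model, opt, loss_fn, x, y)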
Free Research Field | High performance computing
Academic Significance and Societal Importance of the Research Achievements | Our research supports the research and development of big models. It brings groundbreaking new solutions to the urgent requirements of modern AI, e.g., ChatGPT, and can ultimately contribute to the advancement of AI models, particularly foundation models, in the context of Society 5.0.