2023 Fiscal Year Final Research Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 60090: High performance computing-related
Research Institution | National Institute of Advanced Industrial Science and Technology
Principal Investigator | Nguyen TRUONG, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | Distributed Training / Large Model / Large dataset / Large scale system
Outline of Final Research Achievements | We addressed the memory-capacity limitation of training a large model by partitioning the model into multiple smaller parts (published in a Q1 journal, TNSM23). We also found that 3D parallelism (data + pipeline + tensor) has become the standard for training large-scale Deep Learning models on large datasets, and we proposed methods to speed up this training process. To reduce the I/O time, we use local shuffling together with pair-wise data exchange and model exchange to maintain the accuracy of the model (see the first sketch below); we published 3 papers (IPDPS22a, CCGRID23, CANDAR23) and a poster (HPCAsia24), and received 2 best paper awards. To reduce the computing time, we eliminate the processing of non-important samples during training (see the second sketch below; published at an A* conference, NeurIPS23). To reduce the communication time, we co-design the network architecture and the collective communication; we published 2 rank-A papers (IPDPS22b, CCGRID24), a Q1 journal paper (JPDC23), and a poster (HPCAsia23).
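The report only names the local-shuffling technique, so the following is a minimal, hypothetical Python sketch of the general idea rather than the published algorithm: each worker shuffles only its own shard (avoiding a full global shuffle over the dataset on disk), and randomly paired workers swap a small fraction of samples each epoch so the data gradually mixes across nodes. The names Worker, pairwise_exchange, and fraction are illustrative assumptions.

# Minimal sketch (hypothetical names): local shuffling with pair-wise
# sample exchange between workers, simulated in one process without MPI.
import random

class Worker:
    def __init__(self, rank, shard):
        self.rank = rank          # worker id
        self.shard = list(shard)  # data shard held locally by this worker

    def local_shuffle(self):
        # Shuffle only the local shard: no global I/O over the whole dataset.
        random.shuffle(self.shard)

def pairwise_exchange(a, b, fraction=0.1):
    # Two paired workers swap a small fraction of their samples each epoch,
    # gradually approximating a global shuffle at much lower I/O cost.
    k = max(1, int(len(a.shard) * fraction))
    a.shard[:k], b.shard[:k] = b.shard[:k], a.shard[:k]

# Toy run: 4 workers over a dataset of 40 samples.
data = list(range(40))
workers = [Worker(r, data[r * 10:(r + 1) * 10]) for r in range(4)]
for epoch in range(3):
    order = random.sample(workers, len(workers))  # random pairing per epoch
    for a, b in zip(order[::2], order[1::2]):
        pairwise_exchange(a, b)
    for w in workers:
        w.local_shuffle()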
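Likewise, the sample-elimination idea can be illustrated with a generic loss-based filter: samples whose current loss falls below a threshold are treated as non-important and excluded from the backward pass. This is a hedged sketch under that assumption, not the method of the NeurIPS23 paper; train_step and threshold are hypothetical names.

# Hedged sketch of importance-based sample skipping with PyTorch:
# per-sample losses below `threshold` are dropped from the backward pass.
import torch

def train_step(model, optimizer, loss_fn, x, y, threshold=0.05):
    optimizer.zero_grad()
    per_sample_loss = loss_fn(model(x), y)       # loss_fn uses reduction='none'
    keep = per_sample_loss > threshold           # mask of "important" samples
    if keep.any():
        per_sample_loss[keep].mean().backward()  # backprop on kept samples only
        optimizer.step()
    return int(keep.sum())                       # number of samples actually used

# Usage on a toy model and batch.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
used = train_step(model, opt, loss_fn, x, y)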
Free Research Field | High performance computing
Academic Significance and Societal Importance of the Research Achievements | Our research supports the research and development of big models. It brings groundbreaking new solutions to the urgent requirements of modern AI, e.g., ChatGPT, and can ultimately contribute to the advancement of AI models, particularly foundation models, in the context of Society 5.0.