2023 Fiscal Year Annual Research Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751
Research Institution | National Institute of Advanced Industrial Science and Technology
Principal Investigator | Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | Distributed Training / Large Model / Large Dataset / Large-scale System
Outline of Annual Research Achievements |
We found that 3D parallelism (data + pipeline + model) has become the standard approach to training large-scale Deep Learning models on large datasets. We proposed methods to speed up this training process:
+ To reduce I/O time, we use local shuffling (IPDPS22a) together with pair-wise data exchange (CCGRID23, Best Paper Candidate; HPCAsia24) and model exchange (CANDAR23, Best Paper Award) to maintain model accuracy (a sketch of this shuffling scheme follows this list).
+ To reduce computing time, we eliminate unimportant samples during training (NeurIPS23).
+ To reduce communication time, we co-design the network architecture and the collective communication algorithms (IPDPS22b, HPCAsia23, JPDC23, CCGRID24).
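The first item can be made concrete with a minimal sketch, assuming an mpi4py job, an in-memory local partition, and a power-of-two number of ranks; the names here (EXCHANGE_FRACTION, shuffle_and_exchange, the XOR-based pairing) are illustrative assumptions, not the papers' actual implementation.

    import random
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    EXCHANGE_FRACTION = 0.01  # hypothetical: share of the local partition traded per epoch

    def shuffle_and_exchange(local_samples, epoch):
        """One epoch of local shuffling plus a pair-wise sample swap."""
        # Shuffle only the local partition: no epoch-wide re-read of the
        # global dataset from storage, which is where the I/O saving comes from.
        random.shuffle(local_samples)
        if size == 1:
            return local_samples
        # XOR pairing: partner(partner(r)) == r, so each rank has exactly one
        # symmetric partner per epoch (assumes a power-of-two number of ranks).
        partner = rank ^ ((epoch % (size - 1)) + 1)
        # Swap a small random slice so each worker's partition drifts toward
        # a globally shuffled one over the course of training.
        k = max(1, int(len(local_samples) * EXCHANGE_FRACTION))
        outgoing = [local_samples.pop() for _ in range(k)]
        local_samples.extend(comm.sendrecv(outgoing, dest=partner, source=partner))
        return local_samples

Because the partner changes with the epoch, samples migrate across all workers over time without ever paying the cost of a full global shuffle.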
We also address the memory capacity limitation by partitioning the large model into multiple smaller parts and assembling them only at the end of training (TNSM23); see the sketch below.
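As a rough illustration of this idea, the sketch below assumes PyTorch, one model part per worker, and a full model built as an nn.ModuleList named parts; the file names and key-prefix scheme are hypothetical.

    import torch
    import torch.nn as nn

    def save_part(part: nn.Module, rank: int) -> None:
        # Each worker trains and checkpoints only its own slice of the model,
        # so no single node ever needs memory for the full parameter set.
        torch.save(part.state_dict(), f"part_{rank}.pt")

    def assemble_full_model(num_parts: int) -> dict:
        # Only at the very end are the per-part checkpoints merged into one
        # state dict for the full model (assumed to be nn.ModuleList "parts").
        full_state = {}
        for r in range(num_parts):
            shard = torch.load(f"part_{r}.pt", map_location="cpu")
            full_state.update({f"parts.{r}.{k}": v for k, v in shard.items()})
        return full_state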