Project/Area Number | 21K17751 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 60090: High performance computing-related |
Research Institution | National Institute of Advanced Industrial Science and Technology |
Principal Investigator | Nguyen TRUONG, National Institute of Advanced Industrial Science and Technology, Information Technology and Human Factors, Researcher (60835346) |
Project Period (FY) | 2021-04-01 – 2024-03-31 |
Project Status | Completed (Fiscal Year 2023) |
Budget Amount | ¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000)
Fiscal Year 2023: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2022: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2021: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000) |
Keywords | Distributed Training / Large Model / Large dataset / Large scale system / Deep Learning / Large-scale / Distributed Computing / Non-IID / Hybrid parallelism |
Outline of Research at the Start | This project aims to find techniques that speed up the training and inference of distributed Deep Learning. It covers several research topics: (1) hybrid-parallelism design: (1.1) study the limitations of different parallelism strategies and (1.2) find novel fine-grained hybrid-parallelism strategies for each type of application; (2) methods to reduce communication time: (2.1) optimize the communication mechanism for each type of supercomputer network architecture and (2.2) study methods to reduce network contention. |
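As an illustration of the hybrid-parallelism idea, the sketch below (not the project's code) shows the rank-to-group mapping that hybrid/3D parallelism relies on: a pool of workers is factored into data-, pipeline-, and tensor-parallel groups. The group sizes, the row-major ordering, and the build_3d_groups helper are assumptions chosen for the example.

# Minimal sketch: factor `world_size` ranks into a 3D grid of
# data- / pipeline- / tensor-parallel coordinates (illustrative only).
from itertools import product

def build_3d_groups(world_size, dp, pp, tp):
    """Return each rank's (data, pipeline, tensor) group coordinates."""
    assert world_size == dp * pp * tp, "group sizes must factor the world size"
    coords = {}
    for rank, (d, p, t) in enumerate(product(range(dp), range(pp), range(tp))):
        coords[rank] = {"data": d, "pipeline": p, "tensor": t}
    return coords

if __name__ == "__main__":
    # 8 workers split into 2-way data, 2-way pipeline, and 2-way tensor parallelism.
    for rank, c in build_3d_groups(8, dp=2, pp=2, tp=2).items():
        print(rank, c)

Under this mapping, ranks that share the same pipeline and tensor coordinates form one data-parallel group (e.g., for gradient all-reduce), while ranks that share the same data and tensor coordinates form one pipeline.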
Outline of Final Research Achievements | We addressed the memory-capacity limitation in training a large model by partitioning the model into multiple smaller parts (published in a Q1 journal, TNSM23). We also found that 3D parallelism (data + pipeline + tensor) has become standard for training large-scale Deep Learning models on large datasets, and we proposed methods to speed up this training process. To reduce I/O time, we use local shuffling together with pair-wise data exchange and model exchange to maintain model accuracy. We published three papers (IPDPS22a, CCGRID23, CANDAR23) and a poster (HPCAsia24), and received two best paper awards. To reduce computing time, we skip non-important samples during training (published at an A* conference, NeurIPS23). We reduce communication time by co-designing the network architecture and the collective communication. We published two rank-A papers (IPDPS22b, CCGRID24), a Q1 journal paper (JPDC23), and a poster (HPCAsia23). |
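A minimal sketch (not the published algorithm) of the local-shuffling idea described above: each worker shuffles only its own data shard and periodically swaps a small fraction of samples with a randomly paired partner, so shards mix over epochs without global re-shuffling I/O. The exchange fraction, the pairing rule, and the in-memory simulation are assumptions made for the example.

# Minimal sketch: local shuffling with a pair-wise sample exchange
# (illustrative simulation on in-memory shards, not the published method).
import random

def pairwise_exchange(shards, exchange_frac=0.1, rng=random.Random(0)):
    """Swap a fraction of samples between randomly paired shards, in place."""
    order = list(range(len(shards)))
    rng.shuffle(order)
    for a, b in zip(order[0::2], order[1::2]):
        k = max(1, int(len(shards[a]) * exchange_frac))
        out_a = [shards[a].pop(rng.randrange(len(shards[a]))) for _ in range(k)]
        out_b = [shards[b].pop(rng.randrange(len(shards[b]))) for _ in range(k)]
        shards[a].extend(out_b)    # partner's samples replace the ones sent away
        shards[b].extend(out_a)

if __name__ == "__main__":
    # 4 workers, each starting with a contiguous (non-IID) shard of 0..39.
    shards = [list(range(w * 10, (w + 1) * 10)) for w in range(4)]
    rng = random.Random(42)
    for epoch in range(5):
        for s in shards:
            rng.shuffle(s)         # cheap local shuffle, no global I/O
        pairwise_exchange(shards)  # gradual cross-worker mixing
    print(shards)

In an actual training pipeline the exchanged items would be sample indices or files on node-local storage, and each swap would be a point-to-point message between the paired workers; the project additionally exchanges models to maintain accuracy when local data remain non-IID.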
Academic Significance and Societal Importance of the Research Achievements | Our research supports the research and development of large models. It brings groundbreaking new solutions to the urgent requirements of modern AI, e.g., ChatGPT, and can ultimately contribute to the advancement of AI models, particularly foundation models, in the context of Society 5.0. |