Research Project/Area Number | 21K17751
Research Category | Grant-in-Aid for Early-Career Scientists
Allocation Type | Multi-year Fund
Review Section | Basic Section 60090: High performance computing-related
Research Institution | National Institute of Advanced Industrial Science and Technology (AIST)
Principal Investigator | Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)
Project Period (FY) | 2021-04-01 – 2024-03-31
Project Status | Granted (FY2022)
Budget Amount *Note | ¥4,680,000 (Direct Cost: ¥3,600,000; Indirect Cost: ¥1,080,000)
  FY2023: ¥1,430,000 (Direct Cost: ¥1,100,000; Indirect Cost: ¥330,000)
  FY2022: ¥1,820,000 (Direct Cost: ¥1,400,000; Indirect Cost: ¥420,000)
  FY2021: ¥1,430,000 (Direct Cost: ¥1,100,000; Indirect Cost: ¥330,000)
Keywords | Deep Learning / Large Scale / Distributed Computing / Non-IID / Hybrid parallelism

Outline of Research at the Start |
This proposal aims to find techniques that speed up the training and inference of distributed deep learning. The project includes several research topics: (1) hybrid-parallelism design: (1.1) study the limitations of different parallelism strategies and (1.2) find novel fine-grained hybrid parallelism strategies for each type of application; (2) methods to reduce communication time by (2.1) optimizing the communication mechanism for each type of supercomputer network architecture and (2.2) studying methods to reduce network contention.
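
As a rough illustration of the "fine-grained hybrid parallelism" mentioned above, the following minimal, CPU-only Python sketch (not the project's code; the sizes and two-way splits are hypothetical) shows how one linear layer's work can be partitioned along both the batch dimension (data parallelism) and the weight dimension (model parallelism) without changing the result:

    import torch

    torch.manual_seed(0)
    batch, d_in, d_out = 8, 16, 12          # hypothetical sizes
    x = torch.randn(batch, d_in)            # one mini-batch of activations
    w = torch.randn(d_in, d_out)            # weights of a single linear layer

    # Reference: the full, unpartitioned forward pass.
    y_ref = x @ w

    # Data parallelism: split the mini-batch across 2 "nodes".
    x_shards = torch.chunk(x, 2, dim=0)
    # Model (tensor) parallelism: split the weight columns across 2 "devices" per node.
    w_shards = torch.chunk(w, 2, dim=1)

    # Each (node, device) pair works only on its shard of data and its shard of
    # the layer; concatenating the partial outputs recovers the full result.
    y_hybrid = torch.cat(
        [torch.cat([xs @ ws for ws in w_shards], dim=1) for xs in x_shards],
        dim=0,
    )
    print(torch.allclose(y_ref, y_hybrid))  # True: partitioning only redistributes the work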

Outline of Annual Research Achievements |
This year, we developed new methods that reduce computing time by eliminating non-important samples during training (submitted to ICML 2023). In our previous work (IPDPS 2022), we found that local shuffling cannot achieve good accuracy in large-scale training because of non-IID data and overfitting. We address the non-IID issue by dynamically assigning an impact factor to the models from different workers, and we use knowledge distillation to deal with overfitting; this work was a Best Paper Award finalist at CCGRID 2023. We also studied how to reduce communication time through a co-design of the collective communication algorithm with the intra-node network architecture (accepted in JPDC, a Q1 journal) and with the inter-node network architecture (poster at HPCA-Asia 2023).
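
The short Python sketch below illustrates, in a generic way, what eliminating non-important samples during training can look like. It is not the submitted method; the loss-based importance score and the keep_ratio value are assumptions made only for this example:

    import torch
    import torch.nn as nn

    # Hypothetical small model and pruning level, chosen only for this sketch.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss(reduction="none")   # keep per-sample losses
    keep_ratio = 0.7

    for step in range(5):
        x = torch.randn(128, 32)                  # stand-in mini-batch
        y = torch.randint(0, 10, (128,))

        per_sample_loss = criterion(model(x), y)  # one forward pass, one loss per sample
        k = max(1, int(keep_ratio * x.size(0)))
        # "Importance" here is simply the current loss: low-loss (easy) samples are
        # dropped, so they contribute nothing to the gradient. A real pipeline would
        # also skip them in later passes to actually save compute.
        _, keep_idx = torch.topk(per_sample_loss.detach(), k)

        optimizer.zero_grad()
        loss = per_sample_loss[keep_idx].mean()
        loss.backward()
        optimizer.step()
        print(f"step {step}: kept {k}/128 samples, loss {loss.item():.3f}")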

Current Status of Research Progress (Section) |
1: Research has progressed more than it was originally planned.

Reason |
We have expanded our international collaborative research with Telecom SudParis (France), Hanoi University of Science and Technology (HUST, Vietnam), and the VinUni-Illinois Smart Health Center at VinUniversity (Vietnam). The CCGRID 2023 paper (with the PI as corresponding author) was selected as a Best Paper Award finalist (top 4 of 58 accepted papers, out of more than 275 submissions). In the ICML 2023 submission, empirical results on various large-scale datasets and models for image classification and segmentation show that, while the with-replacement importance sampling algorithm performs poorly on large datasets, our method reduces total training time by up to 22% while affecting accuracy by only 0.4% compared to the baseline.

Strategy for Future Research Activity |
We will continue to investigate (1) extending the I/O work to reduce the overhead of partial local shuffling at scale, and (2) extending the methods that reduce computing time by eliminating non-important samples during training. We will also study (3) methods to reduce communication time by overlapping communication with computation.
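
As a rough illustration of direction (3), the sketch below overlaps an asynchronous gradient all-reduce with independent local computation using torch.distributed. The setup (gloo backend, torchrun launch, tensor sizes) is assumed for the example only and is not the project's implementation:

    import torch
    import torch.distributed as dist

    # Launch with, e.g.:  torchrun --nproc_per_node=2 overlap_sketch.py
    dist.init_process_group(backend="gloo")      # CPU-friendly backend for the sketch

    grad_a = torch.randn(1_000_000)  # a gradient that is already computed
    grad_b = torch.randn(1_000_000)  # data that still needs local computation

    # 1) Start the all-reduce of grad_a without blocking.
    work = dist.all_reduce(grad_a, op=dist.ReduceOp.SUM, async_op=True)

    # 2) Overlap: do useful local computation while the all-reduce is in flight.
    grad_b = torch.tanh(grad_b) * 0.5

    # 3) Block only at the point where the reduced gradient is actually needed.
    work.wait()
    grad_a /= dist.get_world_size()

    if dist.get_rank() == 0:
        print("overlapped all-reduce finished:", grad_a.shape, grad_b.shape)
    dist.destroy_process_group()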