2023 Fiscal Year Annual Research Report
Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model
Project/Area Number | 21K17751
Research Institution | National Institute of Advanced Industrial Science and Technology
Principal Investigator | Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)
Project Period (FY) | 2021-04-01 – 2024-03-31
Keywords | Distributed Training / Large Model / Large Dataset / Large-scale System
Outline of Annual Research Achievements |
We found that 3D parallelism (data + pipeline + model) has become the standard approach to training large-scale Deep Learning models on large datasets. We proposed methods to speed up this training process:
+ To reduce I/O time, we use local shuffling (IPDPS22a) together with pair-wise data exchange (CCGRID23, Best Paper Candidate; HPCAsia24) and model exchange (CANDAR23, Best Paper Award) to maintain model accuracy (a sketch of this shuffling scheme follows this list).
+ To reduce computing time, we eliminate unimportant samples during training (NeurIPS23).
+ To reduce communication time, we co-design the network architecture and the collective communication algorithms (IPDPS22b, HPCAsia23, JPDC23, CCGRID24).
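The first item can be made concrete with a minimal sketch, assuming an mpi4py job, an in-memory local partition, and a power-of-two number of ranks; the names here (EXCHANGE_FRACTION, shuffle_and_exchange, the XOR-based pairing) are illustrative assumptions, not the papers' actual implementation.

    import random
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    EXCHANGE_FRACTION = 0.01  # hypothetical: share of the local partition traded per epoch

    def shuffle_and_exchange(local_samples, epoch):
        """One epoch of local shuffling plus a pair-wise sample swap."""
        # Shuffle only the local partition: no epoch-wide re-read of the
        # global dataset from storage, which is where the I/O saving comes from.
        random.shuffle(local_samples)
        if size == 1:
            return local_samples
        # XOR pairing: partner(partner(r)) == r, so each rank has exactly one
        # symmetric partner per epoch (assumes a power-of-two number of ranks).
        partner = rank ^ ((epoch % (size - 1)) + 1)
        # Swap a small random slice so each worker's partition drifts toward
        # a globally shuffled one over the course of training.
        k = max(1, int(len(local_samples) * EXCHANGE_FRACTION))
        outgoing = [local_samples.pop() for _ in range(k)]
        local_samples.extend(comm.sendrecv(outgoing, dest=partner, source=partner))
        return local_samples

Because the partner changes with the epoch, samples migrate across all workers over time without ever paying the cost of a full global shuffle.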
We also address the memory capacity limitation by partitioning the large model into multiple smaller parts and assembling them only at the end of training (TNSM23); see the sketch below.
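As a rough illustration of this idea, the sketch below assumes PyTorch, one model part per worker, and a full model built as an nn.ModuleList named parts; the file names and key-prefix scheme are hypothetical.

    import torch
    import torch.nn as nn

    def save_part(part: nn.Module, rank: int) -> None:
        # Each worker trains and checkpoints only its own slice of the model,
        # so no single node ever needs memory for the full parameter set.
        torch.save(part.state_dict(), f"part_{rank}.pt")

    def assemble_full_model(num_parts: int) -> dict:
        # Only at the very end are the per-part checkpoints merged into one
        # state dict for the full model (assumed to be nn.ModuleList "parts").
        full_state = {}
        for r in range(num_parts):
            shard = torch.load(f"part_{r}.pt", map_location="cpu")
            full_state.update({f"parts.{r}.{k}": v for k, v in shard.items()})
        return full_state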