2023 Fiscal Year Annual Research Report

Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model

Research Project

Project/Area Number 21K17751
Research Institution National Institute of Advanced Industrial Science and Technology

Principal Investigator

Nguyen Truong, National Institute of Advanced Industrial Science and Technology, Department of Information Technology and Human Factors, Researcher (60835346)

Project Period (FY) 2021-04-01 – 2024-03-31
Keywords Distributed Training / Large Model / Large Dataset / Large-Scale System
Outline of Annual Research Achievements

We found that 3D parallelism (data + pipeline + model) has become the standard approach for training large-scale deep learning models on large datasets. We proposed methods to speed up this training process (short illustrative sketches follow this list):
+ To reduce the I/O time, we use local shuffling (IPDPS22a paper) together with pairwise data exchange (CCGRID23, Best Paper Candidate; HPCAsia24) and model exchange (CANDAR23, Best Paper Award) to maintain the accuracy of the model.
+ To reduce the computing time, we eliminate non-important samples during training (NeurIPS23).
+ To reduce the communication time, we co-design the network architecture and the collective communication (IPDPS22b, HPCAsia23, JPDC23, CCGRID24).
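
As a concrete reading of the first bullet: instead of globally shuffling the whole dataset every epoch (heavy I/O), each worker shuffles only its local partition and then swaps a small fraction of samples with a partner worker so the data still mixes across the cluster over epochs. The minimal Python sketch below only illustrates this shuffle-then-exchange pattern; the pairing rule, the exchange fraction, and the function name are illustrative assumptions, not the exact algorithms of the IPDPS22a/CCGRID23 papers.

```python
import random

def local_shuffle_with_pairwise_exchange(partitions, exchange_fraction=0.1, seed=0):
    """One epoch of local shuffling plus pairwise sample exchange (illustrative).

    partitions[w] is the list of sample indices owned by worker w. Each worker
    shuffles only its own partition (cheap, local I/O); then workers are paired
    at random and swap a small fraction of their samples.
    """
    rng = random.Random(seed)
    # 1) Local shuffle: each worker permutes only the data it already holds.
    for part in partitions:
        rng.shuffle(part)
    # 2) Pairwise exchange: pair workers (0,1), (2,3), ... and swap a slice.
    workers = list(range(len(partitions)))
    rng.shuffle(workers)
    for a, b in zip(workers[0::2], workers[1::2]):
        k = int(len(partitions[a]) * exchange_fraction)
        partitions[a][:k], partitions[b][:k] = partitions[b][:k], partitions[a][:k]
    return partitions

# Example: 4 workers, 8 samples each, exchanging a quarter of each partition.
parts = [list(range(w * 8, (w + 1) * 8)) for w in range(4)]
print(local_shuffle_with_pairwise_exchange(parts, exchange_fraction=0.25))
```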

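For the second bullet, a minimal sketch of loss-based sample hiding: samples whose loss was small in the previous epoch are skipped in the current epoch, so fewer forward/backward passes are computed. The hide fraction and the selection rule here are illustrative assumptions, not the exact adaptive policy of the NeurIPS23 (KAKURENBO) paper.

```python
import numpy as np

def select_visible_samples(losses, hide_fraction=0.3):
    """Keep the samples with the highest previous-epoch loss; hide the rest.

    Returns the indices to train on this epoch and the indices hidden from it.
    """
    n_hide = int(len(losses) * hide_fraction)
    order = np.argsort(losses)       # ascending: lowest-loss (least informative) first
    hidden = order[:n_hide]          # skipped this epoch
    visible = order[n_hide:]         # actually used for training this epoch
    return visible, hidden

# Example: 10 samples, hide the 3 with the smallest loss.
losses = np.array([0.9, 0.1, 0.5, 0.05, 1.2, 0.3, 0.7, 0.02, 0.4, 0.6])
visible, hidden = select_visible_samples(losses, hide_fraction=0.3)
print("train on:", sorted(visible.tolist()))
print("hidden  :", sorted(hidden.tolist()))
```
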
We also address the memory-capacity limitation by separating the big model into multiple smaller parts and assembling them only at the end (TNSM23); a sketch of this idea follows.
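
A minimal sketch of this split-then-assemble pattern, assuming a simple column-wise partition of each weight matrix. The actual TNSM23 (FedDCT) method divides the network into smaller sub-models trained collaboratively, so this only illustrates how parts are kept separate during training and re-assembled at the end.

```python
import numpy as np

def split_model(weights, num_parts):
    """Split each weight matrix column-wise into num_parts shards.

    Each worker keeps only its own shard, so no single device has to hold
    the full model in memory. The column-wise split is an illustrative choice.
    """
    return [np.array_split(w, num_parts, axis=1) for w in weights]

def assemble_model(sharded_weights):
    """Re-assemble the full model from the per-worker shards at the end."""
    return [np.concatenate(shards, axis=1) for shards in sharded_weights]

# Example: a toy "model" with two weight matrices, split across 4 workers.
model = [np.random.randn(8, 16), np.random.randn(16, 4)]
shards = split_model(model, num_parts=4)
restored = assemble_model(shards)
assert all(np.allclose(a, b) for a, b in zip(model, restored))
print([w.shape for w in restored])
```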

  • Research Products (5 results)


Journal Article (4 results; of which Int'l Joint Research: 4, Peer Reviewed: 4, Open Access: 2) / Presentation (1 result; of which Int'l Joint Research: 1)

  • [Journal Article] KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training2024

    • Author(s)
      Truong Thao Nguyen, Balazs Gerofi, Edgar Josafat Martinez-Noriega, Francois Trahay, and Mohamed Wahib
    • Journal Title

      37th Conference on Neural Information Processing Systems (NeurIPS 2023)

      Volume: - Pages: 1-23

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] FedDCT: Federated Learning of Large Convolutional Neural Networks on Resource-Constrained Devices Using Divide and Collaborative Training2024

    • Author(s)
      Nguyen Quan, Pham Hieu H., Wong Kok-Seng, Le Nguyen Phi, Nguyen Truong Thao, Do Minh N.
    • Journal Title

      IEEE Transactions on Network and Service Management

      Volume: 21 Pages: 418-436

    • DOI

      10.1109/TNSM.2023.3314066

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] A Bandwidth-Optimal All-to-All Communication in Two-Dimensional Fully Connected Network2024

    • Author(s)
      Kien Trung Pham, Thao Nguyen Truong and Michihiro Koibuchi
    • Journal Title

      24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing

      Volume: - Pages: 1-7

    • DOI

      10.1109/CCGrid59990.2024.00010

    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] SEM: A Simple Yet Efficient Model-agnostic Local Training Mechanism to Tackle Data Sparsity and Scarcity in Federated Learning2023

    • Author(s)
      Pham Quang Ha, Nguyen Nang Hung, Nguyen Thanh Hung, Pham Huy Hieu, Nguyen Phi Le, Nguyen Truong Thao
    • Journal Title

      Eleventh International Symposium on Computing and Networking (CANDAR)

      Volume: - Pages: 120-126

    • DOI

      10.1109/CANDAR60563.2023.00023

    • Peer Reviewed / Int'l Joint Research
  • [Presentation] Efficient Sample Exchanging for Large-Scale Training Distributed Deep Learning with Local Sampling2024

    • Author(s)
      Truong Thao Nguyen, Yusuke Tanimura
    • Organizer
      International Conference on High Performance Computing in Asia-Pacific Region 2024
    • Int'l Joint Research

Published: 2024-12-25  
