
2021 Fiscal Year Research-status Report

Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model

Research Project

Project/Area Number 21K17751
Research Institution National Institute of Advanced Industrial Science and Technology

Principal Investigator

Nguyen Truong  National Institute of Advanced Industrial Science and Technology, Information Technology and Human Factors, Researcher (60835346)

Project Period (FY) 2021-04-01 – 2024-03-31
Keywords Deep Learning / Large-scale / Distributed computing
Outline of Annual Research Achievements

In this fiscal year, we study a method to reduce the I/O time of large-scale training with an extremely large dataset. We revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme. Our submission to IPDPS 2022 has been accepted.
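As an illustration only, the following minimal sketch shows the idea of partial local shuffling: dataset indices are statically partitioned among workers, and in each epoch only a small fraction of samples is exchanged between randomly paired workers before a purely local shuffle. The function and parameter names (partition, partial_exchange, exchange_fraction) are illustrative assumptions, not the interface of the released PyTorch solution.

    import random

    def partition(num_samples, num_workers):
        """Statically split dataset indices across workers (done once, no global shuffle afterwards)."""
        indices = list(range(num_samples))
        random.shuffle(indices)                     # one-time initial shuffle
        return [indices[w::num_workers] for w in range(num_workers)]

    def partial_exchange(partitions, exchange_fraction):
        """Each epoch, pair workers at random and swap only a fraction of their samples."""
        order = list(range(len(partitions)))
        random.shuffle(order)
        for a, b in zip(order[0::2], order[1::2]):
            k = int(len(partitions[a]) * exchange_fraction)
            sent_a = [partitions[a].pop(random.randrange(len(partitions[a]))) for _ in range(k)]
            sent_b = [partitions[b].pop(random.randrange(len(partitions[b]))) for _ in range(k)]
            partitions[a].extend(sent_b)
            partitions[b].extend(sent_a)

    if __name__ == "__main__":
        parts = partition(num_samples=1000, num_workers=8)
        for epoch in range(3):
            partial_exchange(parts, exchange_fraction=0.1)   # small inter-worker exchange
            for local in parts:
                random.shuffle(local)                        # the rest of the shuffle stays node-local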

We also study a method to reduce the communication time of Deep Learning training through a co-design of the network architecture and the collective communication algorithm. We propose to use a Kautz network as the inter-memory network, built on the switchless OPTWEB FPGA, together with multi-port collective communications to mitigate the influence of the startup latency on the execution time. Based on our experimental results with OPTWEB on custom Stratix 10 FPGA cards, SimGrid simulations show that our collective communication is 7x faster than that of Dragonfly with 272 FPGAs. Our submission to IPDPS 2022 has been accepted.
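The multi-port collective algorithm and the OPTWEB wiring are described in the IPDPS 2022 paper; as background only, the following minimal sketch generates the vertices and out-links of a Kautz graph, the topology adopted for the inter-memory network. The degree and label-length parameters shown are illustrative, not the actual 272-FPGA configuration.

    from itertools import product

    def kautz_vertices(degree, length):
        """Vertices of a Kautz graph: strings of the given length over degree+1
        symbols in which consecutive symbols differ."""
        symbols = range(degree + 1)
        return [v for v in product(symbols, repeat=length)
                if all(v[i] != v[i + 1] for i in range(length - 1))]

    def kautz_neighbors(v, degree):
        """Out-neighbors: shift the label left by one and append any symbol different
        from the new last symbol (each node has exactly `degree` out-links)."""
        return [v[1:] + (s,) for s in range(degree + 1) if s != v[-1]]

    if __name__ == "__main__":
        d, n = 3, 3                                   # (d+1) * d**(n-1) = 36 nodes
        nodes = kautz_vertices(d, n)
        print(len(nodes))                             # 36
        print(kautz_neighbors(nodes[0], d))           # the 3 neighbors of the first node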

Current Status of Research Progress

1: Research has progressed more than it was originally planned.

Reason

Our proposed partial local shuffling enables training a model on a very large dataset, which could not be done before. For example, we can increase the training accuracy on the DEEPCAM dataset by about 2%. For common datasets, we can reduce the total training time by up to 70%.

Strategy for Future Research Activity

We are going to extend the work on I/O to reduce the overhead of partial local shuffling, e.g., the exchange phase, at scale.

In the next fiscal year, we will also develop new methods to reduce the computing time by eliminating unimportant samples during the training process. We will also study methods to reduce the communication time by overlapping communication and computation.
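Since this overlap is still a plan for the next fiscal year, the following is only an assumed sketch of how it could look in PyTorch: gradient all-reduces are launched as non-blocking operations so that independent work (e.g., prefetching the next batch) can proceed while they complete. The single-process gloo group is used only to make the sketch self-contained; in practice the group would span all workers.

    import torch
    import torch.distributed as dist

    def allreduce_grads_async(model):
        """Launch a non-blocking all-reduce on every gradient and return the handles,
        so communication can overlap with other work before the optimizer step."""
        handles = []
        for p in model.parameters():
            if p.grad is not None:
                handles.append(dist.all_reduce(p.grad, async_op=True))
        return handles

    if __name__ == "__main__":
        dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                                rank=0, world_size=1)
        model = torch.nn.Linear(8, 1)
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()
        handles = allreduce_grads_async(model)
        # ... independent computation (e.g., loading the next batch) would go here ...
        for h in handles:
            h.wait()                                  # ensure all gradients are reduced
        dist.destroy_process_group()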

Causes of Carryover

In the next fiscal year, we will conduct a wide range of large-scale experiments on supercomputer systems. The carried-over funds will be used to pay for the use of the ABCI supercomputer.

  • Research Products

    (2 results)


All Journal Article (2 results) (of which Int'l Joint Research: 2 results)

  • [Journal Article] Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning (2022)

    • Author(s)
      Truong Thao Nguyen, Francois Trahay, Jens Domke, Aleksandr Drozd, Emil Vatai, Jianwei Liao, Mohamed Wahib, Balazs Gerofi
    • Journal Title

      36th IEEE International Parallel & Distributed Processing Symposium

      Volume: 0 Pages: 1-12

    • Int'l Joint Research
  • [Journal Article] Scalable Low-Latency Inter-FPGA Networks (2022)

    • Author(s)
      Kien Trung Pham, Truong Thao Nguyen, Hiroshi Yamaguchi, Yutaka Urino, Michihiro Koibuchi
    • Journal Title

      36th IEEE International Parallel & Distributed Processing Symposium

      Volume: 0 Pages: 1-12

    • Int'l Joint Research


Published: 2022-12-28  

