2021 Fiscal Year Annual Research Report
ExaPath: Hierarchical Routing for Next-Gen Supercomputers and Beyond
Project/Area Number |
19H04119
|
Research Institution | Institute of Physical and Chemical Research |
Principal Investigator |
Domke Jens RIKEN, Center for Computational Science, Team Leader (70815480)
|
Co-Investigator(Kenkyū-buntansha) |
Endo Toshio Tokyo Institute of Technology, Global Scientific Information and Computing Center, Professor (80396788)
|
Project Period (FY) |
2019-04-01 – 2024-03-31
|
Keywords | HPC interconnects |
Outline of Annual Research Achievements |
In FY2021, the third year of the ExaPath project, we conducted two distinct studies for routing in HPC interconnects. The first published paper of this FY is a benchmarking effort, which analyzes how modern HPC codes are accelerated by the choice of compiler and of on-node and off-node communication libraries. Understanding this behavior and the sensitivity to hardware and software parameters will help to improve system design. The paper, titled "A64FX - Your Compiler You Must Decide!", was published in the Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER). The second published work, a peer-reviewed poster, is based on our intern's work at TokyoTech and was presented at the 4th R-CCS International Symposium (RCCS-IS4). Building on our previous work from FY2020, where we demonstrated a prototype for running MPI-parallelized CUDA applications on Fugaku, the intern continued to work on the MocCUDA prototype. His poster, titled "Automatic translation of CUDA code into high performance CPU code using LLVM IR transformations", demonstrates our early attempts to automatically port MPI-parallelized CUDA code without human intervention. An approach to scaling the Nvidia NCCL communication library and Horovod alongside the node-local CUDA emulation will be the subject of future studies. Lastly, we disseminated our research findings via a talk at the SIAM Conference on Parallel Processing for Scientific Computing and discussed our work and related routing and network topics with colleagues at various online meetings and conferences.
|
Current Status of Research Progress |
3: Progress in research has been slightly delayed.
Reason
The original plan is slightly delayed because COVID continued to cause major disturbances in the research community as well as in international and domestic conference schedules. Hence, opportunities to seek new collaborators and to discuss and disseminate our research findings were fewer than expected.
|
Strategy for Future Research Activity |
The future direction of the research will primarily follow the plan initially outlined in the project proposal. We will try to establish more international and domestic collaborations to develop a suitable HPC routing library which, we hope, can be interfaced with the OpenFabrics Management Framework (OFMF) and other interconnection management frameworks. We also plan to develop new routing algorithms for current and future HPC installations, both on our own and through collaborations. In addition, we will work with international partners to benchmark existing installations and prototypes, such as Slingshot and Rockport networks, to understand those novel architectures more deeply. This knowledge will aid our R&D efforts.
|
Research Products
(3 results)