Project/Area Number |
19H04119
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Review Section |
Basic Section 60090: High performance computing-related
|
Research Institution | RIKEN |
Principal Investigator |
Jens Domke, RIKEN, Center for Computational Science, Team Leader (70815480)
|
Co-Investigator |
Toshio Endo, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Professor (80396788)
|
Project Period (FY) |
2019-04-01 – 2024-03-31
|
Project Status |
Granted (Fiscal Year 2023)
|
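Budget Amount (*note) |
Total: ¥17,160,000 (Direct Cost: ¥13,200,000, Indirect Cost: ¥3,960,000)
Fiscal Year 2023: ¥3,250,000 (Direct Cost: ¥2,500,000, Indirect Cost: ¥750,000)
Fiscal Year 2022: ¥3,250,000 (Direct Cost: ¥2,500,000, Indirect Cost: ¥750,000)
Fiscal Year 2021: ¥3,510,000 (Direct Cost: ¥2,700,000, Indirect Cost: ¥810,000)
Fiscal Year 2020: ¥3,250,000 (Direct Cost: ¥2,500,000, Indirect Cost: ¥750,000)
Fiscal Year 2019: ¥3,900,000 (Direct Cost: ¥3,000,000, Indirect Cost: ¥900,000)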
|
Keywords | routing / HPC interconnects / hierarchical / supercomputing |
Outline of Research at the Start |
The research objective is the invention and development of a novel class of algorithms that calculate the communication paths within supercomputer networks. These algorithms will be hierarchical in order to overcome the scalability challenges of existing algorithms, which are insufficient for future systems.
|
Outline of Annual Research Achievements |
In FY2022, the fourth year of the ExaPath project, we worked predominantly on enhancements of the MocCUDA approach to aid and speed up the large-scale execution of deep learning frameworks on Fugaku, which are bottlenecked by the network as well as by shortcomings in code portability from CUDA to A64FX. Thanks to our previous publications, we were able to establish new international collaborations with researchers from MIT, Google, and Argonne National Laboratory. The outcome of this productive collaboration was published as "High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs" in the proceedings of the 28th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '23, and was also disseminated in multiple peer-reviewed posters. Furthermore, the internship student who assisted with this research successfully defended his Master's thesis and moved on to a PhD program. We also established contact with the team at Rockport Networks in order to evaluate their novel interconnection network technology; the outcomes of this evaluation will contribute towards our project goal. A third collaboration, with ETH Zurich on routing for their Slim Fly proof of concept, is still ongoing and will likely yield a peer-reviewed publication in FY2023. Lastly, we disseminated our research findings via talks at the JLESC workshop and the Benchmarking in the Data Center: Expanding to the Cloud workshop, and discussed our work and related routing and network topics with colleagues at various online meetings and conferences.
|
Current Status of Research Progress |
Current Status of Research Progress (Category)
2: Progressing largely as planned
Reason
Most of the COVID-related backlog and R&D slowdowns have been resolved over time and work is returning to normal; the project can therefore be considered on track.
|
Strategy for Future Research Activity |
In the fifth fiscal year, the PI will continue the research into novel hierarchical, adaptive routing for near-term, large-scale interconnect deployments that use emerging technologies such as Rockport Networks' interconnect, RoCE, CXL, BXI, or Slingshot. This research will be performed with the assistance of a co-investigator, two internship students, and a JRA. The PI, with the assistance of the two internship students, will study the SparCML approach to accelerating Large Language Models (LLMs) on Supercomputer Fugaku. The interns will collaborate with RIKEN's AI teams to investigate current LLM communication patterns and to implement a lossy and/or lossless variant of MPI_Allreduce in the MPI library, taking topology placement and novel routing approaches into account. Network-related research results and development tasks will be summarized, and the PI will disseminate these documents among the broader network research community.
|