
Development of system reliability improvement technology based on medium- to long-term failure prediction

Research Project

Project/Area Number 21H03449
Research Category

Grant-in-Aid for Scientific Research (B)

Allocation Type Single-year Grants
Section General
Review Section Basic Section 60090: High performance computing-related
Research Institution Tokyo Denki University

Principal Investigator

Egawa Ryusuke  Tokyo Denki University, School of Engineering, Professor (80374990)

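Co-Investigator (Kenkyū-buntansha) Takizawa Hiroyuki  Tohoku University, Cyberscience Center, Professor (70323996)
Tanimura Yusuke  National Institute of Advanced Industrial Science and Technology (AIST), Department of Information Technology and Human Factors, Senior Researcher (80415710)
Takizawa Shinichiro  National Institute of Advanced Industrial Science and Technology (AIST), Department of Information Technology and Human Factors, Senior Researcher (80550483)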
Project Period (FY) 2021-04-01 – 2024-03-31
Project Status Completed (Fiscal Year 2023)
Budget Amount
¥16,770,000 (Direct Cost: ¥12,900,000, Indirect Cost: ¥3,870,000)
Fiscal Year 2023: ¥4,940,000 (Direct Cost: ¥3,800,000, Indirect Cost: ¥1,140,000)
Fiscal Year 2022: ¥5,200,000 (Direct Cost: ¥4,000,000, Indirect Cost: ¥1,200,000)
Fiscal Year 2021: ¥6,630,000 (Direct Cost: ¥5,100,000, Indirect Cost: ¥1,530,000)
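Keywords High-performance computing / Job scheduling / Failure / Prediction / Computing system / Failure occurrence prediction / Reliability / Failure occurrence / High-performance computing system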
Outline of Research at the Start

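Future high-performance computing systems are expected to grow larger and more complex, and their mean time between failures is expected to shrink drastically, to between several minutes and several tens of minutes. Maintaining the reliability and fault tolerance of high-performance computing systems has therefore become an important challenge for guaranteeing long-running application executions. In this project, we analyze system health monitoring information to make medium- to long-term predictions of failures that may occur in the future, and we work on developing technologies that enable stable system operation while avoiding those failures.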

Outline of Final Research Achievements

We conducted research on elemental technologies to improve and maintain the reliability of high-performance computing (HPC) systems, which are becoming increasingly large and complex. We developed technologies for collecting and aggregating system log messages and health monitoring information, and created a mechanism that predicts failures from these data. In addition, to promote the efficient use of HPC systems, we developed a job scheduling simulator capable of replicating system behavior, designed low-power job scheduling algorithms as well as scheduling algorithms for urgent jobs, and demonstrated their effectiveness. These technologies have the potential to enhance the reliability and throughput of future HPC systems.

Academic Significance and Societal Importance of the Research Achievements

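High-performance computing systems serve not only as computing infrastructure for science, technology, and manufacturing; in recent years they have also come to play an important role as social infrastructure, for example in weather forecasting and tsunami inundation damage prediction. Stable and efficient use of these systems is therefore strongly demanded. At the same time, as their performance increases, the systems keep growing larger and more complex, so improving and maintaining system reliability is also strongly demanded. This research, which addressed not only reliability but also the efficient use of systems, can be regarded as basic research toward more efficient operation of future high-performance computing infrastructure, and it is also socially meaningful.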

Report

(4 results)
  • 2023 Annual Research Report   Final Research Report ( PDF )
  • 2022 Annual Research Report
  • 2021 Annual Research Report
  • Research Products

    (16 results)


Journal Article (10 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 9 results, Open Access: 6 results), Presentation (6 results) (of which Invited: 1 result)

  • [Journal Article] AOBA: The Most Powerful Vector Supercomputer in the World (2024)

    • Author(s)
      Hiroyuki Takizawa, Keichi Takahashi, Yoichi Shimomura, Ryusuke Egawa, Kenji Oizumi, Satoshi Ono, Takeshi Yamashita, Atsuko Saito
    • Journal Title

      Sustained Simulation Performance 2022

      Volume: - Pages: 71-81

    • DOI

      10.1007/978-3-031-41073-4_6

    • ISBN
      9783031410727, 9783031410734
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed
  • [Journal Article] Balancing exploitation and exploration in parallel Bayesian optimization under computing resource constraint (2023)

    • Author(s)
      Moto Satake, Keichi Takahashi, Yoichi Shimomura, and Hiroyuki Takizawa
    • Journal Title

      2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

      Volume: - Pages: 706-713

    • DOI

      10.1109/ipdpsw59300.2023.00122

    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer (2023)

    • Author(s)
      Keichi Takahashi, Soya Fujimoto, Satoru Nagase, Yoko Isobe, Yoichi Shimomura, Ryusuke Egawa, and Hiroyuki Takizawa
    • Journal Title

      Lecture Notes in Computer Science

      Volume: 13948 Pages: 359-378

    • DOI

      10.1007/978-3-031-32041-5_19

    • ISBN
      9783031320408, 9783031320415
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Toward Building a Digital Twin of Job Scheduling and Power Management on an HPC System (2023)

    • Author(s)
      Tatsuyoshi Ohmura, Yoichi Shimomura, Ryusuke Egawa and Hiroyuki Takizawa
    • Journal Title

      Job Scheduling Strategies for Parallel Processing (JSSPP 2022)

      Volume: - Pages: 47-67

    • DOI

      10.1007/978-3-031-22698-4_3

    • ISBN
      9783031226977, 9783031226984
    • Related Report
      2022 Annual Research Report
    • Peer Reviewed
  • [Journal Article] A Task-Parallel Runtime for Heterogeneous Multi-node Vector Systems (2023)

    • Author(s)
      Kazuki Ide, Keichi Takahashi, Yoichi Shimomura, and Hiroyuki Takizawa
    • Journal Title

      Lecture Notes in Computer Science

      Volume: 13798 Pages: 331-343

    • DOI

      10.1007/978-3-031-29927-8_26

    • ISBN
      9783031299261, 9783031299278
    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Equivalence Checking of Code Transformation by Numerical and Symbolic Approaches (2023)

    • Author(s)
      Shunpei Sugawara, Keichi Takahashi, Yoichi Shimomura, Ryusuke Egawa, and Hiroyuki Takizawa
    • Journal Title

      Lecture Notes in Computer Science

      Volume: 13798 Pages: 373-386

    • DOI

      10.1007/978-3-031-29927-8_29

    • ISBN
      9783031299261, 9783031299278
    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Xevolver for Performance Tuning of C Programs (2023)

    • Author(s)
      Hiroyuki Takizawa, Shunpei Sugawara, Yoichi Shimomura, Keichi Takahashi, Ryusuke Egawa
    • Journal Title

      Sustained Simulation Performance 2021

      Volume: - Pages: 85-93

    • DOI

      10.1007/978-3-031-18046-0_6

    • ISBN
      9783031180453, 9783031180460
    • Related Report
      2022 Annual Research Report
  • [Journal Article] A Real-time Flood Inundation Prediction on SX-Aurora TSUBASA (2022)

    • Author(s)
      Yoichi Shimomura, Akihiro Musa, Yoshihiko Sato, Atsuhiko Konja, Guoqing Cui, Rei Aoyagi, Keichi Takahashi, and Hiroyuki Takizawa
    • Journal Title

      IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)

      Volume: - Pages: 192-197

    • DOI

      10.1109/hipc56025.2022.00035

    • Related Report
      2022 Annual Research Report
    • Peer Reviewed
  • [Journal Article] Evaluating the Performance and Conformance of a SYCL Implementation for SX-Aurora TSUBASA (2021)

    • Author(s)
      Jiahao Li, Mulya Agung, and Hiroyuki Takizawa
    • Journal Title

      Lecture Notes in Computer Science

      Volume: 13148 Pages: 36-47

    • DOI

      10.1007/978-3-030-96772-7_4

    • ISBN
      9783030967710, 9783030967727
    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Towards Conflict-Aware Workload Co-execution on SX-Aurora TSUBASA (2021)

    • Author(s)
      Riku Nunokawa, Yoichi Shimomura, Mulya Agung, Ryusuke Egawa, and Hiroyuki Takizawa
    • Journal Title

      Lecture Notes in Computer Science

      Volume: 13148 Pages: 163-174

    • DOI

      10.1007/978-3-030-96772-7_16

    • ISBN
      9783030967710, 9783030967727
    • Related Report
      2021 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Improving the Efficiency of Parallel Bayesian Optimization by Balancing Exploration and Exploitation (2023)

    • Author(s)
      Moto Satake, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    • Organizer
      188th HPC Research Meeting (第188回HPC研究発表会)
    • Related Report
      2023 Annual Research Report
  • [Presentation] A Study on Statistical Machine Learning Using Vector Processors (2023)

    • Author(s)
      幸田 涼詩, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    • Organizer
      xSIG 2023
    • Related Report
      2023 Annual Research Report
  • [Presentation] Toward Realizing Usable High-Performance Computing Systems (2023)

    • Author(s)
      Ryusuke Egawa
    • Organizer
      Academic Center for Computing and Media Studies Seminar "HPC Utilization Suited to the Times" (学術情報メディアセンターセミナー「時代に合ったHPCの活用」)
    • Related Report
      2022 Annual Research Report
    • Invited
  • [Presentation] Dynamic Resource Allocation for Real-Time Flood Simulation Based on Execution Time Prediction Focusing on Computational Characteristics (2022)

    • Author(s)
      Rei Aoyagi, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    • Organizer
      185th HPC Research Meeting (第185回HPC研究発表会)
    • Related Report
      2022 Annual Research Report
  • [Presentation] GAN-Based Data Augmentation for Machine-Learning-Based Job Scheduling (2022)

    • Author(s)
      石井翔, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    • Organizer
      185th HPC Research Meeting (第185回HPC研究発表会)
    • Related Report
      2022 Annual Research Report
  • [Presentation] A Study on Job Scheduling Considering Prioritized Execution of Urgent Jobs (2022)

    • Author(s)
      中井大貴, Tatsuyoshi Ohmura, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    • Organizer
      187th HPC Research Meeting (第187回HPC研究発表会)
    • Related Report
      2022 Annual Research Report

Published: 2021-04-28   Modified: 2025-01-30  
