2011 Fiscal Year Final Research Report

Scalable, Fast Checkpointing for Heterogeneous Supercomputers

Research Project

Project/Area Number 22700047
Research Category

Grant-in-Aid for Young Scientists (B)

Allocation Type Single-year Grants
Research Field Computer system/Network
Research Institution Tokyo Institute of Technology

Principal Investigator

MARUYAMA Naoya  Tokyo Institute of Technology, Global Scientific Information and Computing Center, Assistant Professor (60532801)

Project Period (FY) 2010 – 2011
Keywords Computer systems / High-performance computing / Fault tolerance / GPU computing
Research Abstract

We developed a fault-tolerant software technique that enables reliable execution of long-running applications on large-scale supercomputers, where component failures are not the exception but the norm. Such failures are a serious problem for scientific simulations whose results can be obtained only after days or weeks of execution. Our algorithm saves application runtime state very quickly, so that an application can be restarted from the saved state after a failure. We developed a prototype implementation of the proposed algorithm and demonstrated its highly scalable performance on a large-scale heterogeneous supercomputer.

  • Research Products

    (14 results)

Journal Article (1 result, of which Peer Reviewed: 1 result), Presentation (12 results), Book (1 result)

  • [Journal Article] Model-based Fault Localization: Finding Behavioral Outliers in Large-scale Computing Systems (2010)

    • Author(s)
      Naoya Maruyama and Satoshi Matsuoka
    • Journal Title

      New Generation Computing

      Volume: 28, No. 3 Pages: 237-255

    • Peer Reviewed
  • [Presentation] Towards an Asynchronous Checkpointing System (2011)

    • Author(s)
      Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka
    • Organizer
      IPSJ SIG Technical Reports 2011-ARC-197 2011-HPC-132(HOKKE-19)
    • Place of Presentation
      Sapporo
    • Year and Date
      2011-11-28
  • [Presentation] FTI: High performance Fault Tolerance Interface for hybrid systems (2011)

    • Author(s)
      Leonardo Bautista, Naoya Maruyama, Dimitri Komatitsch, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka, and Takeshi Nakamura
    • Organizer
      ACM/IEEE Supercomputing (SC'11)
    • Place of Presentation
      Seattle, USA
    • Year and Date
      2011-11-16
  • [Presentation] Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers (2011)

    • Author(s)
      Naoya Maruyama, Tatsuo Nomura, Kento Sato, and Satoshi Matsuoka
    • Organizer
      ACM/IEEE Supercomputing (SC'11)
    • Place of Presentation
      Seattle, USA
    • Year and Date
      2011-11-15
  • [Presentation] Accelerating the TSUBAME Supercomputer with Graphics Processing Units and its Implications for Systems Research (2011)

    • Author(s)
      Naoya Maruyama
    • Organizer
      Workshop on Large-Scale Parallel Processing (LSPP'11) in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS'11)
    • Place of Presentation
      Anchorage, USA
    • Year and Date
      2011-05-20
  • [Presentation] A Sequential Programming Framework for Large-Scale GPU-Accelerated Structured Grids (2011)

    • Author(s)
      Tatsuo Nomura, Naoya Maruyama, Toshio Endo, Satoshi Matsuoka
    • Organizer
      SIAM Conference on Computational Science and Engineering
    • Place of Presentation
      Reno, USA
    • Year and Date
      2011-03-03
  • [Presentation] Low-overhead checkpoint for large-scale GPU-accelerated systems (2010)

    • Author(s)
      Leonardo Bautista, Akira Nukada, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka
    • Organizer
      High Performance Computing Conference (HiPC)
    • Place of Presentation
      Goa, India
    • Year and Date
      2010-12-20
  • [Presentation] An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code (2010)

    • Author(s)
      Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, Satoshi Matsuoka
    • Organizer
      ACM/IEEE Supercomputing (SC'10)
    • Place of Presentation
      New Orleans, USA
    • Year and Date
      2010-11-16
  • [Presentation] MPI-CUDA Applications Checkpointing (2010)

    • Author(s)
      Toan Nguyen, Hideyuki Jitsumoto, Naoya Maruyama, Tatsuo Nomura, Toshio Endo, Satoshi Matsuoka
    • Organizer
      Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP)
    • Place of Presentation
      Kanazawa
    • Year and Date
      2010-08-04
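  • [Presentation] An Automatic Code Generation Framework for Parallel Stencil Computations on GPU Clusters (2010)

    • Author(s)
      Tatsuo Nomura, Naoya Maruyama, Toshio Endo, Satoshi Matsuoka
    • Organizer
      Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP)
    • Place of Presentation
      Kanazawa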
    • Year and Date
      2010-08-03
  • [Presentation] Distributed Diskless Checkpoint for Large Scale Systems (2010)

    • Author(s)
      Leonardo Bautista, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka
    • Organizer
      IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10)
    • Place of Presentation
      Melbourne, Australia
    • Year and Date
      2010-05-18
  • [Presentation] Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators (2010)

    • Author(s)
      Toshio Endo, Akira Nukada, Satoshi Matsuoka, and Naoya Maruyama
    • Organizer
      IEEE International Parallel & Distributed Processing Symposium(IPDPS2010)
    • Place of Presentation
      Atlanta, USA
    • Year and Date
      2010-04-21
  • [Presentation] A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs (2010)

    • Author(s)
      Naoya Maruyama, Akira Nukada, and Satoshi Matsuoka
    • Organizer
      IEEE International Parallel & Distributed Processing Symposium(IPDPS2010)
    • Place of Presentation
      Atlanta, USA
    • Year and Date
      2010-04-20
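  • [Book] TSUBAMEの造り方から探るPCクラスターと『スパコン』のあいだ (Between PC Clusters and "Supercomputers": Insights from How TSUBAME Was Built) (2010)

    • Author(s)
      Satoshi Matsuoka, Takayuki Aoki, Toshio Endo, Naoya Maruyama, Hitoshi Sato, Shinichiro Takizawa, Hideyuki Jitsumoto
    • Total Pages
      48
    • Publisher
      ASCII Media Works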

Published: 2013-07-31  
