Scalable, Fast Checkpointing for Heterogeneous Supercomputers
Project/Area Number | 22700047 |
Research Category | Grant-in-Aid for Young Scientists (B) |
Allocation Type | Single-year Grants |
Research Field | Computer system/Network |
Research Institution | Tokyo Institute of Technology |
Principal Investigator | MARUYAMA Naoya, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Assistant Professor (60532801) |
Project Period (FY) | 2010 – 2011 |
Project Status | Completed (Fiscal Year 2011) |
Budget Amount | ¥3,510,000 (Direct Cost: ¥2,700,000, Indirect Cost: ¥810,000)
Fiscal Year 2011: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2010: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000) |
Keywords | Computer systems / High-performance computing / Fault tolerance / GPU computing / Supercomputers / GPGPU |
Research Abstract | We developed a fault-tolerant software technique that enables reliable execution of long-running applications on large-scale supercomputers, where component failures are not the exception but the norm. Such failures are a serious problem for scientific simulations whose results can only be obtained after days or weeks of execution. Our algorithm saves application runtime state very quickly, so that an application can be restarted from the saved state after a failure. We developed a prototype implementation of the proposed algorithm and demonstrated its highly scalable performance on a large-scale heterogeneous supercomputer. |
Report
(3 results)
Research Products
(35 results)
[Presentation] Towards an Asynchronous Checkpointing System (2011)
Author(s)
Kento Satou, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. De Supinski, Naoya Maruyama, Satoshi Matsuoka
Organizer
IPSJ SIG Technical Reports 2011-ARC-197 2011-HPC-132 (HOKKE-19)
Place of Presentation
Sapporo
Year and Date
2011-11-28
Related Report
-
[Presentation] Towards an Asynchronous Checkpointing System (2011)
Author(s)
Kento Satou, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. De Supinski, Naoya Maruyama, Satoshi Matsuoka
Organizer
IPSJ SIG Technical Reports 2011-ARC-197 2011-HPC-132 (HOKKE-19)
Place of Presentation
Sapporo
Year and Date
2011-11-28
Related Report
-
[Presentation] An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code (2010)
Author(s)
Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, Satoshi Matsuoka
Organizer
International Conference for High Performance Computing, Networking, Storage and Analysis (SC10)
Place of Presentation
New Orleans
Year and Date
2010-11-17
Related Report
-
[Presentation] An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code (2010)
Author(s)
Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, Satoshi Matsuoka
Organizer
ACM/IEEE Supercomputing (SC'10)
Place of Presentation
New Orleans, USA
Year and Date
2010-11-16
Related Report
-
[Presentation] MPI-CUDA Application Checkpointing (2010)
Author(s)
Toan Nguyen, Tatsuo Nomura, Hideyuki Jitsumoto, Naoya Maruyama, Toshio Endo, Satoshi Matsuoka
Organizer
GPU Technology Conference 2010
Place of Presentation
San Jose, CA
Year and Date
2010-09-20
Related Report
-
[Presentation] MPI-CUDA Applications Checkpointing (2010)
Author(s)
Toan Nguyen, Hideyuki Jitsumoto, Naoya Maruyama, Tatsuo Nomura, Toshio Endo, Satoshi Matsuoka
Organizer
Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP 2010)
Place of Presentation
Kanazawa
Year and Date
2010-08-04
Related Report
-
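[Book] TSUBAMEの造り方から探るPCクラスターと『スパコン』のあいだ (Between PC Clusters and "Supercomputers," Explored Through How TSUBAME Was Built) (2010)
Author(s)
Satoshi Matsuoka, Takayuki Aoki, Toshio Endo, Naoya Maruyama, Hitoshi Sato, Shinichiro Takizawa, Hideyuki Jitsumoto
Total Pages
48
Publisher
ASCII Media Works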
Related Report
-