2011 Fiscal Year Final Research Report
Scalable, Fast Checkpointing for Heterogeneous Supercomputers
Project/Area Number |
22700047
|
Research Category |
Grant-in-Aid for Young Scientists (B)
|
Allocation Type | Single-year Grants |
Research Field |
Computer system/Network
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
MARUYAMA Naoya Tokyo Institute of Technology, Global Scientific Information and Computing Center, Assistant Professor (60532801)
|
Project Period (FY) |
2010 – 2011
|
Keywords | Computer systems / High-performance computing / Fault tolerance / GPU computing |
Research Abstract |
We developed a fault-tolerant software technique that enables reliable execution of long-running applications on large-scale supercomputers, where component failures are not the exception but the norm. Such failures are a serious problem for scientific simulations whose results can be obtained only after days or weeks of execution. Our algorithm saves application runtime state very quickly, so that an application can be restarted from the saved state after a failure. We developed a prototype implementation of the proposed algorithm and demonstrated its highly scalable performance on a large-scale heterogeneous supercomputer.
|
[Presentation] An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code (2010)
Author(s)
Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, Satoshi Matsuoka
Organizer
ACM/IEEE Supercomputing (SC'10)
Place of Presentation
New Orleans, USA
Year and Date
2010-11-16
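[Book] Exploring the Space Between PC Clusters and "Supercomputers" Through How TSUBAME Was Built (TSUBAMEの造り方から探るPCクラスターと『スパコン』のあいだ) (2010)
Author(s)
Satoshi Matsuoka, Takayuki Aoki, Toshio Endo, Naoya Maruyama, Hitoshi Sato, Shinichiro Takizawa, Hideyuki Jitsumoto
Total Pages
48
Publisher
ASCII Media Works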