Research Project
Grant-in-Aid for Young Scientists (B)
We developed a fault-tolerant software technique that enables reliable execution of long-running applications on large-scale supercomputers, where component failures are not the exception but the norm. Such failures are a serious problem for scientific simulations whose results can be obtained only after days or weeks of execution. Our algorithm saves the application's runtime state very quickly, so that the application can be restarted from the saved state after a failure. We developed a prototype implementation of the proposed algorithm and demonstrated its highly scalable performance on a large-scale heterogeneous supercomputer.
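The general checkpoint/restart idea behind the summary above can be illustrated with a minimal Python sketch. This is not the project's algorithm or prototype, whose details are not given here. It is a single-process toy that assumes the state fits in one pickled dictionary, and the names save_checkpoint, load_checkpoint, run, and fail_at are hypothetical, chosen only for this illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Atomically write `state` to `path`: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach stable storage
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy simulation that sums step indices, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for illustration) injects a failure at that step.
    """
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    state = load_checkpoint(path) or {"step": 0, "value": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated component failure")
        state["value"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["value"]
```

Calling run again with the same path after a failure resumes from the last saved step instead of step 0, so only the work done since the last checkpoint is repeated. On a real supercomputer the cost of the save step dominates, which is why fast state saving matters for this kind of technique.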
Journal Article (2 results, of which Peer Reviewed: 2 results), Presentation (31 results), Book (2 results)
New Generation Computing
Volume: Vol. 28, No. 3 Pages: 237-255