Co-Investigator (Kenkyū-buntansha)
ISHIKAWA Yutaka National Institute of Advanced Industrial Science and Technology, Researcher, Information Architecture Division, Senior Researcher
OGAWA Hirotaka Graduate School of Science, Dept. of Mathematical and Computer Sciences, Tokyo Institute of Technology, Assistant, Graduate School of Information Science and Engineering, Research Associate (90302968)
AIDA Kento Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Lecturer, Interdisciplinary Graduate School of Science and Engineering, Lecturer (80247212)
TAKAGI Hiromitsu National Institute of Advanced Industrial Science and Technology, Researcher, Information Architecture Division, Senior Researcher
Research Abstract
The objective of the project is to push the technological envelope of fault tolerance and reconfigurability in large-scale clustering so that clusters become almost self-sustaining and reconfiguration is a matter of "Plug&Play". Some of the salient results are as follows:
1) Construction of the "Plug&Play" clustering testbed (20 DELL Inspiron nodes, each with a Mobile Celeron 600 MHz, 128 MB memory, a 20 GB HDD, and a 3COM Plug&Play PCMCIA 100Base-T network card). This served as a flexible testbed for middleware development. It was also very compact (a small rack) and low-power (less than 400 watts for 20 nodes).
2) Development of Parakeet, a fault-tolerant, high-performance cluster MPI that lets the user select a checkpointing algorithm from a set of available algorithms according to the characteristics of the application. Parakeet is an entirely user-level implementation, is portable and efficient, and frees users from checkpointing concerns within their code. We implemented various checkpointing strategies to achieve the best efficiency, and conducted a detailed performance analysis comparing them with full restart.
3) Self-organizing cluster middleware, the Lucie prototype. As a basic technology, plug&play clustering requires hot swapping of nodes, reconfiguration of software organization within a node, and dynamic partition management. Lucie builds on existing Linux tools to implement full cluster configuration capabilities, allowing rapid, fully automated (re)installation and configuration of every node in a cluster.
4) Prototyping of scalable, secure, and self-organizing cluster communication. We have identified scalable, reliable, secure, and self-organizing communication within the cluster as the essential foundation for reliable plug&play clustering. We have prototyped some of these ideas in the job manager of Gfarm (cluster middleware for Petascale Datagrid processing): there, a self-organizing process ring structure governs all the nodes, and jobs can be started up rapidly in parallel in a safe and secure manner.
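The self-organizing process ring mentioned in result 4 can be illustrated with a minimal Python sketch. The abstract does not describe the actual Gfarm job manager protocol, so everything here is an assumption made for illustration: the class name `ProcessRing`, the use of integer node ids, and the rule that each node's successor is the next-higher id (wrapping around) are all hypothetical.

```python
from bisect import bisect_right, insort


class ProcessRing:
    """Illustrative ring of node ids (hypothetical, not Gfarm's API)."""

    def __init__(self):
        self._ids = []  # node ids, kept sorted

    def join(self, node_id):
        # A hot-plugged node inserts itself; no other node is reconfigured.
        if node_id not in self._ids:
            insort(self._ids, node_id)

    def leave(self, node_id):
        # A removed or failed node drops out; its neighbours close the gap.
        if node_id in self._ids:
            self._ids.remove(node_id)

    def successor(self, node_id):
        # Next-higher id, wrapping around to the smallest id.
        if not self._ids:
            raise ValueError("ring is empty")
        i = bisect_right(self._ids, node_id)
        return self._ids[i % len(self._ids)]


ring = ProcessRing()
for n in (3, 1, 2):
    ring.join(n)
ring.leave(2)                  # node 2 is unplugged
next_of_1 = ring.successor(1)  # node 1 now points past the gap
```

Because a node's successor is derived from the current membership alone, adding or removing a node needs no manual reconfiguration of the others, which is the property the "plug&play" goal depends on. A real implementation would also need failure detection and secure message passing, neither of which is modelled here.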