Project/Area Number |
09680332
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Computer Science
|
Research Institution | Tokyo Institute of Technology (1998) / Japan Advanced Institute of Science and Technology (1997) |
Principal Investigator |
YOKOTA Haruo Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Computer Science, Associate Professor (10242570)
|
Co-Investigator (Kenkyū-buntansha) |
SUGINO Eiji Iwate Prefecture University, Faculty of Software and Information Science, Lecturer (10293391)
|
Project Period (FY) |
1997 – 1998
|
Project Status |
Completed (Fiscal Year 1998)
|
Budget Amount |
¥3,700,000 (Direct Cost: ¥3,700,000)
Fiscal Year 1998: ¥400,000 (Direct Cost: ¥400,000)
Fiscal Year 1997: ¥3,300,000 (Direct Cost: ¥3,300,000)
|
Keywords | Massively Parallel System / Fault Tolerant Software / Program Translation / Primary-backup Method / State-machine Method / Replication / Primary / Backup / Replica |
Research Abstract |
As massively parallel systems become practical, the demand for fault tolerance in such systems grows. We proposed a method that masks a fault of a component processor in a massively parallel system by means of parallel software, without assuming dedicated hardware or operating systems. When a component processor fails, the fault-tolerant parallel software detects the fault and continues the job without the faulty processor by combining the primary-backup method and the state-machine method. Because writing parallel programs with fault tolerance in mind places a heavy burden on programmers, we provide a mechanism that uses a parallel logic programming language to automatically convert an original parallel program into fault-tolerant parallel software that masks a fault of a component processor. Since components of a fault-tolerant parallel system are generally used as redundant resources to implement fault tolerance, enhancing fault tolerance reduces system performance. Moreover, the overhead of software fault tolerance reduces performance further. It is not enough to show an improvement in reliability; the balance between reliability and performance must also be shown. We therefore introduce a criterion that evaluates both system reliability and performance, and consider the execution environments in which the fault-tolerant parallel software is effective.
|
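The fault-masking idea in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the project used a parallel logic programming language and automatic program conversion, whereas the sketch below is hand-written Python, and every name in it (`Replica`, `ReplicatedTask`, `run`, the `alive` flag used as a stand-in for fault detection) is a hypothetical chosen for illustration. It assumes a single processor fault, deterministic state updates, and one backup copy per task.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of a task's state, hosted on a simulated processor."""
    processor: int
    total: int = 0
    alive: bool = True

    def apply(self, value: int) -> None:
        # Deterministic update: copies that see the same inputs stay identical.
        self.total += value


@dataclass
class ReplicatedTask:
    """A task with a primary copy and a backup copy on another processor."""
    primary: Replica
    backup: Replica

    def step(self, value: int) -> None:
        # Fault detection is assumed to be a simple liveness flag here.
        # If the primary's processor has failed, promote the backup.
        if not self.primary.alive:
            self.primary, self.backup = self.backup, self.primary
        self.primary.apply(value)
        # Forward the same input to the backup while it is still alive.
        if self.backup.alive:
            self.backup.apply(value)

    def result(self) -> int:
        return self.primary.total


def run(inputs, fail_primary_at=None):
    """Process inputs, optionally injecting one processor fault at a step."""
    task = ReplicatedTask(Replica(processor=0), Replica(processor=1))
    for i, value in enumerate(inputs):
        if i == fail_primary_at:
            task.primary.alive = False  # inject a single processor fault
        task.step(value)
    return task.result()


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))                     # fault-free run
    print(run([1, 2, 3, 4], fail_primary_at=2))  # run with one masked fault
```

Both runs produce the same result, which is the sense in which the fault is masked. The sketch also shows the cost discussed in the abstract: every input is applied twice while both copies are alive, so the redundancy that provides fault tolerance consumes processor capacity that could otherwise do useful work.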