Redundancy: Software N-version programming is the most commonly used FTM. In it, N modules perform the same operation and their outputs are compared; if they differ, a voting mechanism is invoked to choose the fault-free result. However, this approach consumes more system resources for the redundant operations, and the system cost increases 5.
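The voting step described above can be sketched as follows. This is a minimal illustration, assuming a strict-majority rule and three hypothetical versions of a squaring routine (`square_a`, `square_b`, `square_faulty`); none of these names come from the surveyed work, and real N-version systems may use other voting rules.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    # Assumption: a strict majority is required; otherwise no output is trusted.
    if count <= len(outputs) // 2:
        raise RuntimeError(f"no majority among outputs {outputs}")
    return value


# Three hypothetical, independently written versions of the same operation.
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    return x * x + 1  # fault injected for illustration


result = n_version_vote([square_a, square_b, square_faulty], 4)
```

Every version runs on every input, so the computation cost grows roughly N-fold, which is the resource overhead noted above.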
Migration: In this approach, while a part of the application is running, failure prediction algorithms first identify the location, time and type of the failure that will occur at a node, and that portion of the application is migrated to a safe node where it can continue running without failure. Reliability, Availability and Serviceability (RAS) log files are commonly used to develop the prediction algorithm 4. However, in petascale and exascale systems, where a large number of processors and nodes are involved, these failure prediction algorithms fail to record failure logs accurately: they log failures that may never occur, or do not log failures that do occur. Therefore, unidentified faults are not addressed. Hence, when migration FTMs are used together with other FTMs such as checkpointing, the challenge is again when and where to place checkpoints, or at what intervals.
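The migration decision can be illustrated with a small sketch. It assumes that a predictor has already produced a failure risk between 0 and 1 for each node; the function name `choose_node`, the risk dictionary and the 0.5 threshold are hypothetical and are not taken from the cited work.

```python
from typing import Dict


def choose_node(current: str, risk: Dict[str, float], threshold: float = 0.5) -> str:
    """Return the node the task should run on, given the predicted risk per node."""
    if risk[current] < threshold:
        return current  # current node is predicted safe, so no migration
    # Otherwise migrate to the node with the lowest predicted failure risk.
    return min(risk, key=risk.get)


# Hypothetical predictions: node n1 is expected to fail soon.
target = choose_node("n1", {"n1": 0.9, "n2": 0.1, "n3": 0.4})
```

The sketch is only as good as its input: a risk that is wrongly high triggers an unnecessary migration, and a risk that is wrongly low leaves the task on a node that fails, which mirrors the two logging errors described above.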
Failure Semantics: Failure semantics help to understand the way in which failures happen. They are based on the fact that any error in the results indicates a fault in the operation, and that this erroneous state leads to subsequent failures 6. The various fault situations are listed in the semantics, and the recovery action is invoked upon detection of a fault. Some of the failure semantics are omission failure semantics, performance failure semantics and crash failure semantics. This technique depends on the ability of the fault detector: undetected faults, their semantics and their recovery actions are not addressed 7.
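One way to picture the mapping from detected fault situations to recovery actions is a lookup table. The sketch below is illustrative only: the classification rules in `classify` are simplified assumptions, and the recovery actions in `RECOVERY` are hypothetical placeholders rather than actions prescribed by the cited work.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    OMISSION = "omission"        # no response was produced for a request
    PERFORMANCE = "performance"  # a response arrived, but after its deadline
    CRASH = "crash"              # the component has stopped responding


# Hypothetical recovery actions, chosen only for illustration.
RECOVERY = {
    Failure.OMISSION: "resend request",
    Failure.PERFORMANCE: "retry with longer deadline",
    Failure.CRASH: "restart component",
}


def classify(alive: bool, response: Optional[str],
             elapsed: float, deadline: float) -> Optional[Failure]:
    """Map one observation to a failure class, or None if nothing was detected."""
    if not alive:
        return Failure.CRASH
    if response is None:
        return Failure.OMISSION
    if elapsed > deadline:
        return Failure.PERFORMANCE
    return None


detected = classify(alive=True, response="ok", elapsed=2.5, deadline=1.0)
action = RECOVERY[detected] if detected is not None else "none"
```

An observation that `classify` maps to `None` triggers no recovery at all, so a fault that the detector misses goes unhandled, which is the limitation stated above.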
Failure Masking: Operations are grouped with redundant processing members. In case of failure, these redundant members take over the operation and provide the required results without disrupting the end result. Hierarchical group masking and flat group masking are the most efficient masking techniques in use. In these techniques, a voting process or a coordinator selects one of the redundant members to take over the faulty operation and perform it again without fault. During this process the faulty member is masked and later replaced by the operational staff. This technique requires extra hardware and software for the voting process or coordinator and for the redundant processing units. It also incurs delays and overheads because of the voting process and the decision making involved in selecting a redundant processor. In addition, masking can create new errors, failures and hazards when the operational staff fail to replace failed components; such systems must be inspected regularly, which increases the overall cost 8 9.
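The takeover behaviour can be sketched with a simple coordinator that tries the group members in order. This is a minimal illustration of a flat group, assuming that a failure shows up as a raised exception; `masked_call`, `primary` and `backup` are hypothetical names, not part of any cited system.

```python
from typing import Callable, List, Tuple


def masked_call(members: List[Callable[[int], int]], x: int) -> Tuple[int, List[int]]:
    """Return the first successful result and the indices of members that failed."""
    failed: List[int] = []
    for index, member in enumerate(members):
        try:
            return member(x), failed
        except Exception:
            # Mask the failure: record it and let the next member take over.
            failed.append(index)
    raise RuntimeError("all redundant members failed")


def primary(x: int) -> int:
    raise IOError("primary unit is down")  # fault injected for illustration


def backup(x: int) -> int:
    return x * 2


value, to_replace = masked_call([primary, backup], 5)
```

The caller still receives a result, and `to_replace` lists the members that the operational staff would need to replace. If that list is ignored, the group silently loses its redundancy, which is the hazard described above.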
Based on the literature survey, the above-mentioned FTMs are reliable and efficient in distributed systems but fall short in addressing undetectable faults. In high-performance computations, such faults remain undetected by these FTMs.
Checkpointing is the most suitable way to address undetectable faults: the most recent consistent, fault-free state is saved, and the operation is restarted from that checkpoint 6. The shaded blocks are out of the scope of this research.
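Saving a state and restarting from it can be sketched for a single process as follows. This is a minimal illustration, assuming that the state fits in a small dictionary serialized with Python's `pickle` module; the file name `ckpt.pkl`, the checkpoint interval and the injected failure are hypothetical, and a distributed system would additionally need coordination to keep checkpoints consistent across nodes.

```python
import os
import pickle
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temporary file, then atomically replace the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path: str, steps: int, interval: int, fail_at: int = -1) -> int:
    """Sum 0..steps-1, checkpointing every `interval` steps; crash at `fail_at` if set."""
    state = load_checkpoint(path)
    for step in range(state["step"], steps):
        if step == fail_at:
            raise RuntimeError(f"injected failure at step {step}")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo() -> int:
    """Crash once at step 7, then restart from the last checkpoint and finish."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "ckpt.pkl")
        try:
            run(path, steps=10, interval=3, fail_at=7)
        except RuntimeError:
            pass  # only the work done after the last checkpoint is lost
        return run(path, steps=10, interval=3)
```

The `interval` parameter makes the placement trade-off concrete: a short interval writes checkpoints more often, while a long interval means more work is repeated after a restart.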