The performance of present-day computing systems has increased at the cost of considerably higher power consumption. Higher power consumption either shortens the operating time of battery-powered systems, such as hand-held mobile devices, or generates large amounts of heat that require expensive, sophisticated packaging and cooling technologies, especially in complex systems that consist of several processing units. The generated heat, if not removed efficiently, can also reduce system reliability, since the hardware failure rate increases with temperature [1][2]. In multiprocessor systems, such as space-based control systems or life maintenance systems, where a failure may cause catastrophic results, reliability …
During the execution of an application, a fault may occur for various reasons, such as hardware failures, software errors and electromagnetic effects. Fault tolerance is therefore an inherent requirement of systems that must deliver accurate results even in the presence of faults. In fault tolerance, redundancy is employed to mask or otherwise work around these faults, thereby preserving a desired level of functionality. Generally, redundancy is defined as the deployment of spare resources (spatial) for the application. Permanent faults are generally tolerated by hardware redundancy, also known as modular redundancy (MR), in which cloned tasks run concurrently on multiple processing units. Broadly, three techniques are used to implement temporal-redundancy-based fault tolerance in task scheduling: checkpointing, recovery block and recovery through …
1) In checkpointing, the state of the system is checked and correct states are saved to stable storage. When a fault is detected, execution is rolled back to the most recent correct checkpoint and the faulty section is recomputed by exploiting temporal redundancy. With a large number of checkpoints, the time overhead of this method may become unaffordable.
2) The recovery block approach provides a task with one or more backups. If the original copy of the program fails, the system switches to the execution of a backup [15][12][13]. The execution times of the original task and its backups may differ.
3) Recovery through re-execution is used to tolerate transient faults by re-executing the original task if a fault occurs. As soon as a fault is detected, the system restores its state to a previous safe state and the recovery task is dispatched in the form of a re-execution.
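
The three techniques listed above can be illustrated with a minimal Python sketch. It is not taken from the cited works: the names FaultDetected, fails_first, run_with_checkpoints, run_recovery_block and run_with_reexecution are hypothetical, fault detection is simulated by raising an exception, an in-memory copy stands in for stable storage, and faults are assumed to be transient.

```python
import copy


class FaultDetected(Exception):
    """Hypothetical signal that a fault was detected while a task ran."""


def fails_first(n, fn):
    """Wrap fn so its first n calls raise FaultDetected (a simulated transient fault)."""
    remaining = [n]

    def wrapped(*args):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise FaultDetected()
        return fn(*args)

    return wrapped


def run_with_checkpoints(steps, state):
    """Checkpointing: save each correct state and roll back to it on a fault."""
    checkpoint = copy.deepcopy(state)      # stands in for stable storage
    rollbacks = 0
    i = 0
    while i < len(steps):
        try:
            # Each step works on a copy, so the saved checkpoint stays intact.
            state = steps[i](copy.deepcopy(checkpoint))
        except FaultDetected:
            rollbacks += 1                 # discard faulty work, redo this step
            continue                       # assumes the fault is transient
        checkpoint = copy.deepcopy(state)  # step succeeded, save the new state
        i += 1
    return state, rollbacks


def run_recovery_block(primary, backups, acceptance_test):
    """Recovery block: try the primary, then each backup, until one is accepted."""
    for variant in [primary, *backups]:
        try:
            result = variant()
        except FaultDetected:
            continue                       # this variant failed, switch to the next
        if acceptance_test(result):
            return result
    raise RuntimeError("primary and all backups failed")


def run_with_reexecution(task, max_attempts=3):
    """Re-execution: rerun the same task when a transient fault is detected."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except FaultDetected:
            continue                       # state is untouched, simply run again
    raise RuntimeError("fault persisted after all attempts")


def always_faulty():
    """A primary task that fails on every call."""
    raise FaultDetected()


if __name__ == "__main__":
    steps = [lambda s: s + 1, fails_first(1, lambda s: s * 10), lambda s: s - 3]
    print(run_with_checkpoints(steps, 0))
    print(run_recovery_block(always_faulty, [lambda: 41], lambda r: r > 0))
    print(run_with_reexecution(fails_first(2, lambda: "ok")))
```

In this sketch the checkpointing run saves a checkpoint after every step, which makes the overhead concern visible: each checkpoint costs one deep copy of the state. The recovery block differs from re-execution in that the backup may be a different routine with a different execution time, whereas re-execution repeats the original task.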