iserc.ie - Software Engineering Research Information

Wednesday, October 18th, 2017

Fault Tolerance


"Not the power to remember, but its very opposite, the power to forget, is a necessary condition for our existence." - Sholem Asch


Fault tolerance is the ability of a system to handle an unexpected failure of hardware or software. At its simplest, this means being able to continue or resume operation after a power loss. Fault tolerance is often achieved by mirroring all resources, which means that every operation is carried out on two or more identical subsystems; if one drops out, one of the others takes over operation.
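The mirroring idea can be illustrated with a minimal Python sketch. The names `Replica`, `MirroredStore` and `ReplicaFailure` are hypothetical and chosen only for this example, and a failure is simulated by a simple `alive` flag. Every write goes to all subsystems, and a read is served by the first one that still responds.

```python
class ReplicaFailure(Exception):
    """Raised when a replica (or every replica) is unavailable."""


class Replica:
    """One of several identical subsystems; `alive` simulates a dropout."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ReplicaFailure(self.name)
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ReplicaFailure(self.name)
        return self.data[key]


class MirroredStore:
    """Conducts every operation on all replicas and fails over on reads."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        succeeded = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                succeeded += 1
            except ReplicaFailure:
                pass  # a dropped-out replica is skipped, the others carry on
        if succeeded == 0:
            raise ReplicaFailure("all replicas failed")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ReplicaFailure:
                continue  # take over with the next replica
        raise ReplicaFailure("all replicas failed")


store = MirroredStore([Replica("a"), Replica("b")])
store.write("x", 1)
store.replicas[0].alive = False   # simulate the first subsystem dropping out
value = store.read("x")           # served by the second replica
```

The sketch deliberately leaves out resynchronization: a replica that comes back after missing writes would hold stale data and would have to be brought up to date before it could serve reads again.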

A system must be available, reliable, safe and secure. Available means that it is usable when it is needed; reliable means that any two calculations with the same input produce the same output; safe means that it has no hazardous effects on its environment; and secure means that confidential data stays confidential. Fault tolerance has to ensure that these four properties are preserved.

There are several techniques for dealing with faults: avoiding them, removing them, and evading or tolerating them by detecting, diagnosing, confining, masking, compensating for and recovering from faults.
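As one illustration of detecting, diagnosing and masking, the following Python sketch uses majority voting over redundant computations. This is only one possible technique, not one the text prescribes, and the function names `vote` and `run_masked` are hypothetical. A disagreement between replicas detects a fault, the index of the dissenting replica serves as a simple diagnosis, and returning the majority value masks the fault from the caller.

```python
from collections import Counter


def vote(results):
    """Return the value produced by a strict majority of replicas.

    Results must be hashable. Raises RuntimeError if no value has a
    strict majority, in which case the fault cannot be masked.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority, fault cannot be masked")
    return value


def run_masked(replicas, x):
    """Run the same computation on every replica and mask minority faults.

    Returns the majority result together with the indices of the replicas
    that disagreed with it (the suspected faulty ones).
    """
    results = [replica(x) for replica in replicas]
    majority = vote(results)
    suspects = [i for i, r in enumerate(results) if r != majority]
    return majority, suspects


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # simulated fault: an off-by-one result


result, suspects = run_masked([good, faulty, good], 4)
```

Here `result` is the correct square and `suspects` lists the position of the faulty replica, which a real system could then isolate (confinement) or restart (recovery).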