Fault-Tolerant Systems Design Course

M. Keshtgary
Shiraz University of Technology
Fall 1392
REFERENCES
• E. Dubrova, Fault-Tolerant Design, Springer, 2013.
• I. Koren and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
• Mostafa Abd-El-Barr (Department of Information Science, Kuwait University, Kuwait), Design and Analysis of Reliable and Fault-Tolerant Computer Systems, Imperial College Press, 2007.

OBJECTIVES
• Understanding fault tolerance
  – faults and their effects (errors, failures)
  – redundancy techniques
  – evaluation of fault-tolerant systems
  – concepts and applications

OVERVIEW
• Introduction
  – definition of fault tolerance, applications
• Fundamentals of dependability
  – dependability attributes: reliability, availability, safety
  – dependability impairments: faults, errors, failures
  – dependability means
• Dependability evaluation techniques
  – common measures: failure rate, MTTF, MTTR
  – reliability block diagrams
  – Markov processes
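As a rough illustration of the reliability block diagram idea named above, the sketch below combines component reliabilities in series (every block must work) and in parallel (at least one block must work). It assumes independent component failures; the function names and the sample values are illustrative and do not come from the course material.

```python
from math import prod


def series_reliability(reliabilities):
    """Series arrangement: the system works only if every block works.

    Assumes independent blocks, so R = product of R_i.
    """
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel arrangement: the system works if at least one block works.

    Assumes independent blocks, so R = 1 - product of (1 - R_i).
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    blocks = [0.9, 0.9]  # made-up component reliabilities
    print("series:  ", series_reliability(blocks))
    print("parallel:", parallel_reliability(blocks))
```

With two blocks of reliability 0.9 each, the series arrangement is less reliable than either block alone, while the parallel arrangement is more reliable, which is the motivation for the redundancy techniques covered later.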

OVERVIEW

• Redundancy techniques
  – space redundancy
    • hardware redundancy: NMR
    • information redundancy: parity check
    • software redundancy: consistency check
  – time redundancy: re-computation
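The four techniques listed above can each be sketched in a few lines. This is a minimal illustration under simplifying assumptions (module outputs are directly comparable values, parity is even parity over a list of bits, the consistency check is a plain range check); all function names are hypothetical and not taken from the course.

```python
from collections import Counter


def nmr_vote(outputs):
    """Hardware redundancy (NMR): majority vote over N module outputs."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among module outputs")
    return value


def even_parity_bit(bits):
    """Information redundancy: extra bit that makes the number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits_with_parity):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(bits_with_parity) % 2 == 0


def consistency_check(value, low, high):
    """Software redundancy: reject values outside an expected range."""
    return low <= value <= high


def recompute(func, arg, runs=2):
    """Time redundancy: repeat the computation and compare the results."""
    results = [func(arg) for _ in range(runs)]
    if any(r != results[0] for r in results):
        raise RuntimeError("results differ between runs")
    return results[0]


if __name__ == "__main__":
    print(nmr_vote([5, 5, 7]))            # one faulty module is outvoted
    word = [1, 0, 1, 1]
    word_p = word + [even_parity_bit(word)]
    print(parity_ok(word_p))              # intact word passes the check
    word_p[0] ^= 1                        # inject a single-bit error
    print(parity_ok(word_p))              # single-bit error is detected
    print(consistency_check(150, 0, 100)) # out-of-range value is flagged
    print(recompute(lambda x: x * x, 4))
```

Note that a single parity bit only detects an odd number of bit errors and cannot correct them, and that re-computation as written only catches transient faults, since a permanent fault would produce the same wrong result on every run.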
FAULT TOLERANCE
• Fault tolerance is the ability of a system to continue performing its function in spite of faults
• Example faults:
  – hardware: broken connection
  – software: bug in program

GOALS OF FAULT TOLERANCE

The main goal of fault tolerance is to
increase the dependability of a system
DEPENDABILITY
EXAMPLES OF SPECIFICATIONS OF PROPER SERVICE
DEPENDABILITY TREE
AVAILABILITY
• A(t) is the probability that a system is functioning correctly at the instant of time t
• Depends on
  – how frequently the system becomes non-operational
  – how quickly it can be repaired
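The two factors above are commonly summarized by the steady-state availability MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The sketch below applies that standard formula; the formula is not spelled out in the surviving slide text, and the sample numbers are made up for illustration.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability, hours_per_year=8760):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 h, repaired in 1 h.
    a = steady_state_availability(mttf_hours=999, mttr_hours=1)
    print("availability:", a)
    print("downtime (h/year):", downtime_hours_per_year(a))
```

Halving the repair time improves availability about as much as doubling the time between failures, which is why both factors appear in the list.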

STEADY-STATE AVAILABILITY
HIGH AVAILABILITY EXAMPLES
RELIABILITY
• Reliability is a measure of the continuous delivery of service
• R(t) is the probability that a system operates without failure in the interval [0, t], given that it worked at time 0
• We need high reliability when:
  – even momentary periods of incorrect performance are unacceptable (e.g., aircraft)
  – no repair is possible (e.g., satellite, spacecraft)
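To make R(t) concrete, the sketch below evaluates it under the common assumption of a constant failure rate λ, in which case R(t) = exp(-λt) and MTTF = 1/λ. The constant-rate assumption and the sample failure rate are not stated in the slides; they are used here only for illustration.

```python
import math


def reliability(t_hours, failure_rate_per_hour):
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate_per_hour * t_hours)


def mttf(failure_rate_per_hour):
    """Mean time to failure under a constant failure rate: 1 / lambda."""
    return 1.0 / failure_rate_per_hour


if __name__ == "__main__":
    lam = 1e-4  # hypothetical failure rate: one failure per 10,000 hours
    for t in (0, 1_000, 10_000):
        print(f"R({t}) = {reliability(t, lam):.4f}")
    print("MTTF (h):", mttf(lam))
```

Under this model the probability of surviving up to the MTTF itself is only exp(-1), roughly 37%, so a long MTTF alone does not guarantee high reliability over a long mission.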

RELIABILITY VERSUS AVAILABILITY
RELIABILITY VERSUS FAULT TOLERANCE
HOW FAULT TOLERANCE HELPS
SAFETY
• Safety is the probability that a system will either perform its function correctly or will discontinue its operation in a safe way
• A system is safe
  – if it functions correctly, or
  – if it fails, it remains in a safe state

HIGH SAFETY EXAMPLES

• Nuclear energy
  – stop the reactor if a problem occurs
• Banking
  – don't give out the money if in doubt
RELIABILITY VERSUS SAFETY
• Reliability is the probability that a system will perform its functions correctly
• Safety is the probability that a system will either work correctly or will stop in a manner that causes no harm

HOW FAULT TOLERANCE HELPS
• Fault tolerance techniques can improve safety by turning a system off if a certain kind of failure is detected
• In a nuclear power plant, the reaction process should be stopped if a discrepancy is detected
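The "stop on discrepancy" idea can be sketched as a comparison between two redundant readings: if they disagree by more than a tolerance, the controller moves to a safe state instead of acting on possibly wrong data. The function name, the state labels, and the tolerance value are hypothetical; a real safety system would involve far more than this check.

```python
def control_step(reading_a, reading_b, tolerance=0.5):
    """Compare two redundant readings and fail safe on disagreement.

    Returns ("RUN", agreed_value) when the readings agree within the
    tolerance, or ("SAFE_SHUTDOWN", None) when a discrepancy is detected.
    """
    if abs(reading_a - reading_b) > tolerance:
        return ("SAFE_SHUTDOWN", None)
    return ("RUN", (reading_a + reading_b) / 2.0)


if __name__ == "__main__":
    print(control_step(300.0, 300.2))  # readings agree: keep running
    print(control_step(300.0, 350.0))  # discrepancy: stop in a safe state
```

This trades availability for safety: the system may shut down even when only one sensor is faulty, but it never continues on data it cannot trust.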

CONFIDENTIALITY

absence of unauthorized disclosure of information
INTEGRITY

absence of improper alterations of system state or information
MAINTAINABILITY

ability to undergo repairs and modifications
SECURITY

Security is the concurrent existence of (a) availability for authorized users only, (b) confidentiality, and (c) integrity, with 'improper' meaning 'unauthorized'.