Test Stories - references split out



System Agreement Failures

In the history of fault tolerance, there have been many examples of
system agreement failures that have caused serious impairments or
complete loss of fielded systems. Many of these failures are due to
Byzantine faults [1]. On this page, several
examples of faults that led to system disagreement are given. Particular
attention should be given to the proximate cause of the disagreement
(typically a poorly designed fault detection, isolation, and recovery
mechanism) rather than the phenomenological (physics based) event that
starts the chain of events leading to the disagreement.

Space Transportation System

NASA's space shuttle has experienced several examples of agreement
failures due to incorrect handling of Byzantine faults between its
multiplexer/demultiplexer (MDM) units and its general-purpose computers
(GPCs). These faults fall within the class
that the shuttle developers called "non-universal I/O error". The
MDMs act as remote I/O concentrators
for the GPCs. Data from the
MDMs are transferred to the
GPCs over data busses that are similar
to MIL STD 1553. The GPCs execute
redundancy-management algorithms that include
FDIR functions having specific
handling for the "non-universal I/O error" class of failure. However,
these FDIR algorithms were not
correctly designed to handle Byzantine faults. Given that there were
four GPCs, the shuttle had sufficient
redundancy to tolerate a Byzantine fault, if these FDIR algorithms had
been designed correctly.

In one of the earliest examples (some 25 years ago), this failure was
triggered by a technician putting incorrect terminating resistors on the
end of a data bus. Because of the impedance mismatch between the
characteristic impedance of the data bus and resistance of the
terminating resistors, signals on the data bus were reflected off of the
resistors. These reflections caused a standing wave on the data bus. Two
of the four GPCs happened to be
connected to the data bus at nodes of the standing wave and the other
two GPCs were connected to the data bus
at anti-nodes of the standing wave (see figure 1).
Because of this, two of the GPCs
disagreed with the other two GPCs. It
was lucky that this irreconcilable 2:2 disagreement occurred in the lab.
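
The effect of a wrong terminating resistor can be illustrated with the textbook transmission-line relations. The following Python sketch is only an idealization: it assumes a lossless line driven by a steady single-frequency sinusoid, and the 78 ohm characteristic impedance and both resistor values are illustrative assumptions, not figures from the shuttle incident.

```python
import cmath
import math


def reflection_coefficient(r_term, z0):
    """Voltage reflection coefficient of a purely resistive termination."""
    return (r_term - z0) / (r_term + z0)


def envelope(gamma, distance_in_wavelengths):
    """Relative standing-wave amplitude at a distance from the termination.

    Idealized: lossless line, steady-state sinusoid, incident amplitude 1.
    """
    phase = -4j * math.pi * distance_in_wavelengths  # round trip: 2 * beta * d
    return abs(1 + gamma * cmath.exp(phase))


Z0 = 78.0  # assumed characteristic impedance in ohms (hypothetical)

for r_term in (78.0, 780.0):  # matched vs. (hypothetical) wrong resistor
    gamma = reflection_coefficient(r_term, Z0)
    at_antinode = envelope(gamma, 0.0)   # at the termination
    at_node = envelope(gamma, 0.25)      # a quarter wavelength away
    print(f"R={r_term:6.1f}  gamma={gamma:+.3f}  "
          f"anti-node={at_antinode:.3f}  node={at_node:.3f}")
```

With a matched termination the envelope is flat, so every tap sees the same amplitude. With the mismatched value, taps at different positions along the bus see markedly different amplitudes, which is the condition that split the receivers.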

Figure 1: Standing wave caused by incorrect terminating resistors

A more recent example of this problem came closer to causing a disaster.
At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic
fuel for mission STS-124 when a 3-1
disagreement occurred among its GPCs
(GPC 4 disagreed with the other
GPCs). Three seconds later, the split
became 2-1-1 (GPC 2 now disagreed with
GPC 4 and the other two
GPCs). This required that the launch
countdown be stopped. During the subsequent troubleshooting, the
remaining two GPCs disagreed (1-1-1-1
split). See the reports given
in [2] and [3]. This
was a complete system disagreement. However, none of the
GPCs were faulty. The fault was in the
FA 2 MDM. This fault was a crack in a
diode. The photomicrographs in figure 2 show two
views of this diode, rotated 90 degrees. The dark wavy line pointed to
by the red arrows is the crack. The current flow through the diode is
normally left to right through the material shown in these pictures.
This means that the crack was perpendicular to the normal current flow
and completely through the current path. As the crack opened up, it
changed the diode into another type of component: a capacitor. This
transformation is illustrated in figure 3.

Figure 2: The diode crack

Figure with three elements: the left element shows a diode symbol, the
center element shows a diode symbol with a gap through the center of it,
the right element shows a capacitor symbol. \
Figure 3: The transformation of a diode into a capacitor

Figure 4: Normal bus signals

Figure 5: Bad bus signals caused by diode failure

The normal signals that should appear on the data bus between the
MDM and the
GPC are shown in
figure 4. The signals that were produced due to
the diode failure are shown in figure 5. Because
some of the bits in the signal are smaller than they should have been,
some of the GPC receivers could not see
these bits. The ability to see these bits depends on the sensitivity of
the receiver, which is a function of manufacturing variances,
temperature, and its power supply voltage. From the symptoms, it is
apparent that the receiver in GPC 4 was the least sensitive and saw the
errors before the other three GPCs. This caused GPC 4 to disagree with
the other three. Then, as the crack in the diode widened, the bits became
shorter to the point where GPC 2 could no longer see these bits, which
caused it to disagree with the other GPCs. At this point, the set of
messages that was received correctly by GPC 4 was different from the set
of messages that was correctly received by GPC 2, which was different
again from the set of messages that was correctly received by GPC 1 and
GPC 3. This process continued until GPC 1 and GPC 3 also disagreed with
all the other GPCs.
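
The progression from agreement to a 3-1, then 2-1-1, then 1-1-1-1 split can be reproduced with a toy model. The Python sketch below is not derived from shuttle data: the receiver thresholds, the message amplitudes, and the scale factors standing in for the widening crack are all hypothetical values chosen to mirror the sequence described above.

```python
# Hypothetical receiver thresholds (normalized amplitude). GPC 4 is assumed
# to be the least sensitive receiver, then GPC 2, as the symptoms suggest.
THRESHOLDS = {"GPC1": 0.40, "GPC2": 0.55, "GPC3": 0.45, "GPC4": 0.65}

# Hypothetical weakest-bit amplitude of three messages on a healthy bus.
NOMINAL_AMPLITUDES = [1.00, 0.85, 0.70]


def received_set(threshold, amplitudes):
    """Indices of the messages whose weakest bit this receiver can still see."""
    return frozenset(i for i, a in enumerate(amplitudes) if a >= threshold)


def agreement_groups(scale):
    """Group receivers by the exact set of messages they received.

    `scale` models how much the fault has shrunk the affected bits.
    """
    amplitudes = [a * scale for a in NOMINAL_AMPLITUDES]
    groups = {}
    for name, threshold in THRESHOLDS.items():
        groups.setdefault(received_set(threshold, amplitudes), []).append(name)
    # Largest group first, ties broken alphabetically, for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


for scale in (1.0, 0.9, 0.75, 0.6):
    print(scale, agreement_groups(scale))
```

None of the modeled receivers is faulty; each one applies its own threshold correctly. The disagreement comes entirely from a marginal signal being judged by receivers with slightly different sensitivities.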

TTP/C

A databus known as TTP/C was developed
for the needs of the emerging automotive "by-wire"
industry [4]. Its goals were to provide
replica determinism while living within the cost constraints of the
automotive marketplace. Because of its low cost, it has also found
applications within the aerospace market.
TTP/C is a
TDMA-based serial communications
protocol that provides synchronization and deterministic message
communication over dual redundant physical media.
TTP/C also provides a membership
service at the protocol level. The function of the membership service is
to provide global consensus on message distribution and system state.
Addressing the consensus problem at the protocol level can greatly
reduce system software complexity. However, placing a requirement for
protocol level consensus leaves the protocol itself vulnerable to a
Byzantine failure. The FIT project
confirmed the possibility of this vulnerability by observing actual
occurrences of such failures [].

As part of the FIT project, a first generation time-triggered
communication controller (TTP-C1) was irradiated with heavy
ions [6]. The errors caused by this
experiment were not controlled; they were the result of random
radioactive decay. The reported fault manifestations were bit-flips in
register and RAM locations within the TTP-C1
IC. ICs
with improved design are now available for
TTP/C.

During the many thousands of fault injection runs, several system
failures due to Byzantine faults were
recorded [6]. The dominant Byzantine
failure mode observed was due to marginal transmission timing.
Corruptions in the time-base registers within the irradiated IC led it
to transmit messages at periods that were slightly-off-specification (SOS),
i.e. slightly too early or too late relative to the globally agreed upon
time base. A message transmitted slightly too early was accepted only by
the ICs of the system having slightly
fast clocks; ICs with slightly slower
clocks rejected the message. Even though such a timing failure would
have been tolerated by TTP/C's
Byzantine-tolerant clock synchronization
algorithm [7], the dependency of this
service on TTP/C's membership service
prevented it from succeeding. After a Byzantine erroneous transmission,
the membership consensus logic of
TTP/C prevented
ICs that had different perceptions of
this transmission's validity from communicating with each other.
Therefore, following such a faulty transmission, the system is
partitioned into two sets or cliques: one clique containing the
ICs that accepted the erroneous
transmission, the other clique containing the
ICs that rejected the transmission.
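
How a single slightly early frame splits the receivers can be sketched in a few lines of Python. This is a simplified model, not the TTP/C receive logic: the clock offsets, the acceptance window, and the function names are hypothetical and serve only to illustrate the mechanism described above.

```python
# Hypothetical local clock offsets in microseconds relative to the global
# time base (positive = fast clock). All values are illustrative.
CLOCK_OFFSET_US = {"A": 4, "B": 3, "C": -1, "D": -3}
ACCEPT_WINDOW_US = 8  # assumed half-width of each receiver's acceptance window


def accepts(send_error_us, receiver_offset_us, window_us=ACCEPT_WINDOW_US):
    """True if the receiver judges the frame to be on time.

    send_error_us < 0 means the frame was sent early in global time. A
    receiver with a fast clock expects the frame earlier, so it perceives
    the deviation as send_error_us + receiver_offset_us.
    """
    return abs(send_error_us + receiver_offset_us) <= window_us


def cliques(send_error_us):
    """Split the receivers into (accepting, rejecting) lists for one frame."""
    accepted = sorted(n for n, off in CLOCK_OFFSET_US.items()
                      if accepts(send_error_us, off))
    rejected = sorted(n for n in CLOCK_OFFSET_US if n not in accepted)
    return accepted, rejected


print(cliques(0))    # frame sent on time
print(cliques(-10))  # frame sent slightly too early (SOS)
```

In this model the on-time frame is accepted by every receiver, while the early frame is accepted only by the two receivers with fast clocks, which yields the two cliques.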

TTP/C incorporates a mechanism to deal
with these unexpected faults, as long as the errors are transient. The
clique avoidance algorithm is executed on every
IC prior to its next scheduled message.
ICs that find themselves in a minority
clique (i.e. unable to receive messages from the majority of active
ICs) are expected to cease operation
before transmitting. However, if the faulty
IC is in the majority clique or is
programmed to re-integrate after a failure, then a permanent SOS fault
can cause repeated failures. This behavior was observed during the FIT
fault injections. In several fault injection tests, the faulty
IC did not cease transmission and the
SOS fault persisted. The persistence of this fault prevented the clique
avoidance mechanism from successfully recovering. In several instances,
the faulty IC continued to divide the
membership of the remaining cliques, which resulted in eventual system
failure. In later analysis of the faulty behavior, these effects were
repeated with software simulated fault injection. The original faults
were traced to upsets in either the C1 controller time-base registers or
the micro-code instruction RAM [8].
Subsequent generations of the TTP/C
controller have incorporated parity and other mechanisms to reduce the
influence of random upsets (e.g. ROM based microcode execution). SOS
faults in TTP/C can be mitigated with
a central guardian. This guardian assumes fail-detectable behavior and
does not violate end-to-end CRC arguments, which has to be shown by
exhaustive testing.
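
The way a persistent SOS fault defeats clique avoidance can be illustrated with a deliberately simplified simulation. The Python sketch below does not reproduce the TTP/C counters or state machine: it assumes a plain majority rule, a faulty sender "F" that always counts itself in the accepting clique, and hypothetical clock offsets and timing errors.

```python
def accepts(send_error_us, receiver_offset_us, window_us=8):
    """Simplified receiver-side timing window check (assumed model)."""
    return abs(send_error_us + receiver_offset_us) <= window_us


def run_rounds(offsets_us, send_errors_us):
    """Apply a simplified clique-avoidance rule after each faulty frame.

    offsets_us: hypothetical clock offsets of the healthy receivers.
    send_errors_us: timing error of the faulty sender "F" in each round.
    Returns the list of nodes still active after every round.
    """
    active = dict(offsets_us)
    faulty_active = True
    history = []
    for error in send_errors_us:
        if not faulty_active:
            break
        accepted = [n for n, off in active.items() if accepts(error, off)]
        rejected = [n for n in active if n not in accepted]
        # The faulty sender regards its own frame as valid, so it is
        # counted with the accepting clique.
        if len(accepted) + 1 > len(rejected):
            losers = rejected           # rejecting clique is the minority
        else:
            losers = accepted           # accepting clique (with F) loses
            faulty_active = False
        for n in losers:
            del active[n]               # minority clique ceases operation
        history.append(sorted(active) + (["F"] if faulty_active else []))
    return history


offsets = {"A": 4, "B": 3, "C": 2, "D": -1, "E": -3}
print(run_rounds(offsets, [-10, -11, -12]))  # persistent, drifting SOS fault
print(run_rounds(offsets, [-20]))            # gross error: sender is removed
```

In the first run the faulty sender stays in the majority each round, so the rule removes healthy nodes instead of the fault. In the second run the error is large enough that every receiver rejects the frame and the sender itself ends up in the minority.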

Mid-Value Select

Mid-value select is a well-known method for masking the propagation of
failures. It has properties that are similar to an M-out-of-N
voter [9], but does not require any of its
inputs to be bit-for-bit identical. Many mid-value select
implementations in actual fielded systems are merged with other fault
tolerance mechanisms such as reasonableness checks or other fault
detection mechanisms that are then used to block some inputs to the
mid-value selection if they are known to be bad via these other fault
detection mechanisms. This blocking of inputs makes these types of
mid-value selection mechanisms a form of hybrid
nMR. Hybrid
nMR systems can change the "M" and/or
"N" in the M-out-of-N calculations by using reconfiguration. The
objective is to be able to tolerate more faults than can be tolerated by
using nMR alone (details described in
example below). To make things even more complex, these additional fault
detection mechanisms sometimes use a previous output of the mid-value
select in comparison with its inputs to determine if they are faulty
and/or use a previous output of the mid-value select as the replacement
value for a faulty input.
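
The difference between an exact-match voter and a plain mid-value select can be shown with a minimal Python sketch. The function names and sample readings are illustrative assumptions, and the hybrid blocking logic described above is intentionally left out.

```python
from collections import Counter
from statistics import median


def majority_vote(values, m=2):
    """Exact-match M-out-of-N voter: a value wins only if m inputs are identical."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= m else None


def mid_value_select(values):
    """Mid-value select for an odd number of inputs: return the median."""
    if len(values) % 2 == 0:
        raise ValueError("this sketch expects an odd number of inputs")
    return median(values)


readings = [10.02, 9.98, 10.01]                 # healthy but not bit-identical
print(majority_vote(readings))                  # no two inputs match exactly
print(mid_value_select(readings))
print(mid_value_select([10.02, 57.0, 10.01]))   # one wild input is masked
```

The exact-match voter finds no winner among healthy inputs that differ only by noise, whereas the mid-value select still produces a usable output and masks a single wild input.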

One of the most common ways of blocking inputs to a mid-value selector
is to override the value of a faulty input with the value that is the
midpoint of the reasonable range of input values. This is illustrated on
the left side of figure 6, which was taken from a paper
by Stephen Osder [10] and modified
slightly. The "voter" in his figure is actually a mid-value selector,
which due to its similarity with a bit-for-bit M-out-of-N voter can be
called a voter. While Osder uses this example to show how this widely
used mid-value selection design can fail to meet its design purpose for
a non-replicated mid-value selector, we can use the same design to show
how disagreement can arise in replicated mid-value selectors.

Figure 6: Mid-value select failure example

In this example, zero input volts (ground) is assumed to be the middle
of the valid range. The switches in this figure are driven by the fault
detection logic and force faulty inputs to ground. The fault detection
mechanism used in this example is one that is commonly used. It compares
the previous output of the mid-value selection with each of the inputs.
If an input value is more than a certain epsilon away from this last
value, it gets switched to zero. The rationale for this design is that
the mid-value selector could only tolerate one failure (worst case) if
the switches were not included. With the switches (and assuming good
enough failure detection circuitry that controls the switches), this
design hopes to tolerate an additional failure using the following
argument: when the first failure is detected, it is clamped to zero. If
a second failure occurs that is further away from zero than the good
value, the good value is selected by the mid-value selector. If the bad
value is between the good value and ground, the worst value that the
mid-value selection can output is ground, which is the midpoint of the
reasonable input range. Sometimes this is sufficient. When it is not, a
common variation of this idea is to use the previous output of the
mid-value selection instead of the midpoint of the reasonable input
range (ground in this example).
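
The two-failure argument can be checked numerically. The Python sketch below is one possible reading of the design, with hypothetical values: an epsilon of 1.0, a good signal of 5.0, and ground at 0.0 as the midpoint of the valid range.

```python
from statistics import median

GROUND = 0.0  # assumed midpoint of the valid input range, as in the example


def clamp_suspects(inputs, previous_output, epsilon):
    """Force any input farther than epsilon from the last output to ground."""
    return [x if abs(x - previous_output) <= epsilon else GROUND
            for x in inputs]


def gated_mid_value(inputs, previous_output, epsilon):
    """Mid-value select applied after the fault-detection switches."""
    return median(clamp_suspects(inputs, previous_output, epsilon))


good = 5.0
# First failure: a wild input is detected and clamped, the output stays good.
print(gated_mid_value([good, 5.1, 40.0], previous_output=good, epsilon=1.0))

# Second, undetected failure with the first one already clamped to ground:
print(median([GROUND, good, 9.0]))   # bad value beyond good: good is selected
print(median([GROUND, good, 2.0]))   # bad value between ground and good
print(median([GROUND, good, -3.0]))  # bad value beyond ground: ground is output
```

In each of the second-failure cases the output stays between ground and the good value, which is the bound the design argument relies on.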

Mid-value select is used most advantageously where bit-for-bit
M-out-of-N voters cannot be used due to the system's inability to ensure
that the inputs are bit-for-bit identical. Most often this is due to the
inputs being asynchronous with respect to each other. However, it is
this very asynchrony coupled with these hybrid
nMR additions to mid-value select that
can still lead to system disagreement.

Going back to the Osder example, but using replicated mid-value selectors, the
following scenario, as depicted in the simple plot on the right side of
figure 6, is possible: two of the three signals are on either
side of the signal that is currently being selected as the mid-value and are
nearly epsilon away from this mid-value. Given that this middle input value is
sampled asynchronously and is varying to some degree, one of the mid-value
selectors could sample it when it was closer to the more positive of the other
two inputs (i.e., X2 is sampled near point A) and another mid-value
selector could sample it when it was closer to the more negative of the two
other signals (i.e., X2 is sampled near point B). The former
mid-value selector will then block the more negative of the other two inputs
and the latter mid-value select will block the more positive of the other two
inputs. We now have a disagreement between these two mid-value selectors. Even
more perversely, a third mid-value selector could sample the middle input when
it's exactly between the two other inputs and not see either one of them as
being an epsilon away. Thus, we get a three-way split: one blocking the
positive input, one blocking the negative input, and one not blocking any
inputs. The conditions for sustaining a three-way split are highly
unlikely. However, a two-way split would be persistent. In this persistent
state, the replicated mid-value selectors may initially select the same input
(e.g. X2 from the right side of figure 6) but
eventually select different inputs. For example, if X2 continues on
its downward slope and eventually becomes lower than X3, any
replicated mid-value selector that has blocked X1 will select X3
as the mid-value, while any replicated mid-value selector that has blocked
X3 will select (the grounded) X2 as the mid-value. Thus,
this condition creates a system disagreement.
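
The scenario can be played through with concrete numbers. The Python sketch below makes several assumptions that are not in the original figure: the signal values and epsilon are hypothetical, each replica's previous output is taken to equal its own sample of X2, and a blocked input is assumed to stay blocked (latched) afterwards.

```python
from statistics import median

EPSILON = 1.0
X1, X3 = 0.95, -0.95  # hypothetical steady inputs, each nearly epsilon from X2


def blocked_inputs(x2_sample):
    """Inputs a replica blocks, given its own asynchronous sample of X2.

    The replica's previous output is assumed to equal that X2 sample.
    """
    inputs = {"X1": X1, "X3": X3}
    return {name for name, value in inputs.items()
            if abs(value - x2_sample) > EPSILON}


def output(blocked, x2_now):
    """Mid-value output with blocked inputs grounded (blocks assumed latched)."""
    values = {"X1": X1, "X2": x2_now, "X3": X3}
    return median(0.0 if name in blocked else v for name, v in values.items())


samples = {"near A": 0.1, "midway": 0.0, "near B": -0.1}
blocks = {label: blocked_inputs(s) for label, s in samples.items()}
print(blocks)

# Later, X2 has drifted below X3.
print({label: output(b, x2_now=-1.2) for label, b in blocks.items()})
```

Under these assumptions the three replicas block different inputs even though they observe the same three signals, and once X2 drops below X3 the replica that blocked X3 outputs the ground value while the replica that blocked X1 outputs X3.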