As described in this section, the reliability of
a system is a function of the reliability of the various components that comprise it. The
more components in a system, the less reliable a system will be. Furthermore, in terms of
reliability, the chain is truly only as strong as its weakest link. When dealing with a
PC, there are a number of critical components without which the system will not function;
if one of these hardware pieces fails then your array will go down, regardless of the
number of disks you have or how well they are manufactured. This is an important point
that too few people consider carefully enough when setting up a RAID box.
One unreliable component can severely drag down the overall reliability of a system,
because the MTBF of a system will always be lower than the MTBF of the least
reliable component. Recall the formula for reliability of a system:
System MTBF = 1 / ( 1/MTBF1 + 1/MTBF2 + ... + 1/MTBFN )
Also recall that if the MTBF values of all the components are equal (i.e., MTBF1
= MTBF2 = ... = MTBFN) then this boils down to:
System MTBF = Component MTBF / N
This means that if we have four components with an MTBF of 1,000,000 hours each, the
MTBF of the system is 250,000 hours. But what if we have four components, of which three have
an MTBF of 1,000,000 hours and the fourth has an MTBF of 100,000 hours? In this case, the
MTBF of the system is 1 / ( 3/1,000,000 + 1/100,000 ), or only about 77,000 hours, less than one-third of the previous value.
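To check the arithmetic for your own mix of components, the formula above can be sketched in a few lines of Python. This is only an illustration: the function name system_mtbf is made up for this example, and it assumes, as the formula does, that the failure of any one component brings down the whole system.

```python
def system_mtbf(component_mtbfs):
    """Return the MTBF of a system in which any single component
    failure takes the system down (the formula from the text).

    component_mtbfs: list of component MTBF values, in hours.
    """
    if not component_mtbfs:
        raise ValueError("need at least one component")
    # Sum the reciprocals of the component MTBFs, then invert the total.
    return 1.0 / sum(1.0 / mtbf for mtbf in component_mtbfs)


# Four equal components of 1,000,000 hours each.
equal = system_mtbf([1_000_000] * 4)

# Three good components plus one weak one of 100,000 hours.
mixed = system_mtbf([1_000_000] * 3 + [100_000])

print(f"Equal components: {equal:,.0f} hours")
print(f"One weak component: {mixed:,.0f} hours")
```

Swapping in different values for the weak component shows how quickly one poor part comes to dominate the result, no matter how good the other three are.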
What this all means is that you can have the greatest hard disks in the world, and use
multiple RAID levels to protect against drive failure, but if you put it all in a system
with lousy support components, you're not going to have a reliable, high-availability
system. In fact, the situation is actually worse than that. While
RAID reduces reliability by adding more components that can fail, it improves fault tolerance; however, most of the other
components in the system have no fault tolerance. This means that the failure of any one
of them will bring down the PC. Of particular concern are components that affect all the
drives in a system, and which generally have a reputation for problems or relatively low
reliability.
To increase the reliability of the PC as a whole, systems using RAID are usually
designed to use high-quality components. Many systems go beyond this, however, by
introducing fault tolerance into other key components in the system that often fail. Since
many of the most common problems with PCs are related to power, and since without the power supply nothing in the PC will
operate, many high-end RAID-capable systems come equipped with redundant power supplies. These
supplies are essentially a pair of supplies in one, either of which can operate the PC. If
one fails then the other can handle the entire load of the PC. Most also allow hot
swapping, just like hard disk hot swapping in a
RAID array. See this section for more.
Another critical issue regarding support hardware relates to power protection--your PC is completely
dependent on the supply of electricity to the power supply unless you use a UPS. In my
opinion, any application important enough to warrant the implementation of a
fault-tolerant RAID system is also important enough to justify the cost of a UPS, which you can think of as "fault
tolerance for the electric grid". In addition to allowing the system to weather short
interruptions in utility power, the UPS also protects the system against being shut down
abruptly while in the middle of an operation. This is especially important for arrays
using complex techniques such as striping with parity; having the PC go down in the middle
of writing striped information to the array can cause a real mess. Even if the battery
doesn't last through a power failure, a UPS lets you shut the PC down gracefully in the
event of a prolonged outage.
How about components like motherboards, CPUs, system memory and the like? They
certainly are critical to the operation of the system: a multiple-CPU system can handle
the loss of a processor (though that's a pretty rare event in any case) but no PC around
will function if it has a motherboard failure. Most systems running RAID do not provide
any protection against failure of these sorts of components. The usual reason is that
there is no practical way to protect against the failure of something like a motherboard
without going to (very expensive) specialty, proprietary designs. If you require
protection against the failure of any of the components in a system, then you
need to look beyond fault-tolerance within the PC and configure a system of redundant
machines.