If you ask an IT professional to list the reasons why he or she set up a RAID array,
one of the answers likely to be mentioned is "increased reliability". They
probably don't really mean it though. ;^) As I have implied in many other areas of the
site's coverage of RAID, "reliability" is a vague word when it comes to
redundant disk arrays. The answer of increased reliability is both true and not true at
the same time.
The reliability of an individual component refers to how likely the component is to
remain working without a failure being encountered, typically measured over some period of
time. The reliability of a component is a combination of factors: general factors related
to the design and manufacture of the particular make and model, and specific factors
relevant to the way that particular component was built, shipped, installed and
maintained.
The reliability of a system is a function of the reliability of its
components. The more components you put into a system, the worse the reliability of the
system as a whole. That's the reason why complex machines typically break down more
frequently than simple ones. While it is an oversimplification, the number used most often to express
the reliability of many components, including hard disks, is mean time between failures (MTBF). If the MTBF values
of the components in a system are designated as MTBF1, MTBF2, and so
on up to MTBFN, the reliability of the system can be calculated as follows:
System MTBF = 1 / ( 1/MTBF1 + 1/MTBF2 + ... + 1/MTBFN )
If the MTBF values of all the components are equal (i.e., MTBF1 = MTBF2
= ... = MTBFN) then the formula simplifies to:
System MTBF = Component MTBF / N
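As a rough sketch, the formula can be written in a few lines of Python. The function name system_mtbf and the sample values are made up for illustration, and the calculation assumes, as the formula does, that a failure of any single component counts as a failure of the system.

```python
def system_mtbf(component_mtbfs):
    """Combine component MTBF values (in hours) into a system MTBF.

    Assumes any single component failure counts as a system failure,
    so the component failure rates (1/MTBF) simply add up.
    """
    if not component_mtbfs:
        raise ValueError("need at least one component")
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs)
    return 1.0 / total_failure_rate


# With N identical components the result collapses to MTBF / N.
# The 600,000-hour figure is an arbitrary illustrative value.
three_components = [600_000] * 3
print(round(system_mtbf(three_components)))  # 200000
```

Passing a list of unequal values works the same way; the component with the lowest MTBF pulls the system figure down the most.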
The implications of this are clear. If you create a RAID array with four drives, each
of which has an MTBF figure of 500,000 hours, the MTBF of the array is only
125,000 hours! In fact, it's usually worse than that, because if you are using hardware
RAID, you must also include the MTBF of the controller, which wouldn't be needed without
the RAID functionality. For the sake of illustration, let's say the MTBF of the
controller card is 300,000 hours. The MTBF of the storage subsystem would then be:
System MTBF = 1 / ( 1/MTBF1 + 1/MTBF2 + ... + 1/MTBFN )
= 1 / ( 1/500000 + 1/500000 + 1/500000 + 1/500000 + 1/300000 )
= 88,235 hours
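The arithmetic is easy to check in Python. This is only a sketch using the illustrative figures from the text (four 500,000-hour drives plus a 300,000-hour controller); the variable names are arbitrary, and it also compares the result with a single stand-alone drive.

```python
drive_mtbf = 500_000       # hours, per drive (figure from the example)
controller_mtbf = 300_000  # hours, illustrative controller figure from the text
num_drives = 4

# Failure rates (1/MTBF) add when any one failure takes the subsystem down.
failure_rate = num_drives / drive_mtbf + 1 / controller_mtbf
array_mtbf = 1 / failure_rate

# Relative drop compared with a single stand-alone drive.
decrease = 1 - array_mtbf / drive_mtbf

print(f"Array MTBF: {array_mtbf:,.0f} hours")        # Array MTBF: 88,235 hours
print(f"Decrease versus one drive: {decrease:.0%}")  # Decrease versus one drive: 82%
```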
So in creating our array, our "reliability" has actually decreased by about 82% compared with a single drive. Is
that right? Why then do people bother with RAID at all? Well, that's the other side of the
reliability coin. While the reliability of the array hardware goes down, when you include
redundancy information through mirroring or parity, you provide fault tolerance, the ability to
withstand and recover from a failure. This means that although hardware failures become more likely, they can
occur without the array or its data being disrupted, and that's how
RAID provides data protection. Fault tolerance is
discussed in the next section. The reason that most people say RAID improves reliability is that when
they use the term "reliability" they are including the fault
tolerance of RAID; they are not really talking about the reliability of the hardware.
What happens if you don't include redundancy? Well, then you have a ticking time bomb,
and that's exactly what striping without parity, RAID 0, is. A striped array
without redundancy has substantially lower reliability
than a single drive and no fault tolerance. That's why I do not recommend its use unless
its performance is absolutely required, and it is supplemented with very thorough backup
procedures.