In the "good old days" of RAID, fault tolerance was provided through
redundancy, but there was a problem when it came to availability: what do you do if a
drive fails in a system that runs 24 hours a day, 7 days a week? Or even in a system that
runs 12 hours a day but has a drive go bad first thing in the morning? The redundancy
would let the array continue to function, but in a degraded
state. The hard disks were installed deep inside the server case, so the case had
to be opened to reach and replace the failed drive. Furthermore, the other
drives in the array, which had continued to run despite the failure, would have to be powered
off, interrupting all users of the system anyway. Surely there had to be a better way, and
of course, there is.
An important feature that allows availability to remain high when hardware fails and
must be replaced is drive swapping. Now strictly speaking, the term "drive
swapping" simply refers to changing one drive for another, and of course that can be
done on any system (unless nobody can find a screwdriver!). What is usually meant by
this term though is hot swapping, which means changing a hard disk in a system
without having to turn off the power and open up the system case. In a system that
supports hot swap, you can easily remove a failed drive, replace it with a new one and
have the system begin rebuilding onto the replacement immediately. The users of the system
don't even know that the change has occurred.
Unfortunately, "hot swap" is another one of those terms that is used in a
non-standard way by many, frequently leading to confusion. In fact, there is a hierarchy
of different swap "temperatures" that properly describe the state of the system
at the time a drive is swapped:
- Hot Swap: A true hot swap is defined as one where the drive can be
replaced while the rest of the system remains completely uninterrupted. This means the
system carries on functioning, the bus keeps transferring data, and the hardware change is
completely transparent.
- Warm Swap: In a so-called "warm swap", the power remains on
to the hardware and the operating system continues to function, but all activity must be
stopped on the bus to which the device is connected. This is worse than a hot swap,
obviously, but clearly better than a cold one.
- Cold Swap: The system must be powered off before making the swap.
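The three definitions above amount to a small decision rule, which can be sketched in Python. This is only an illustration: the names `SwapType` and `classify_swap` and the two boolean parameters are assumptions made for this sketch, not part of any RAID controller's interface.

```python
from enum import Enum


class SwapType(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def classify_swap(power_off_required: bool, bus_idle_required: bool) -> SwapType:
    """Classify a drive swap by what has to be interrupted to perform it."""
    if power_off_required:
        # The system must be powered off before the swap.
        return SwapType.COLD
    if bus_idle_required:
        # Power and operating system stay up, but the bus must stop all activity.
        return SwapType.WARM
    # Nothing is interrupted; the bus keeps transferring data.
    return SwapType.HOT
```

A system advertised as "hot swap" that still needs the bus halted would come out of this rule as `SwapType.WARM`.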
It is common for a system to be described as capable of hot swapping when it really is
only doing warm swaps. True hot swapping requires support from all of the components in
the system: the RAID controller, the bus (usually SCSI), the enclosure (which must have
open bays for the drives so they can be accessed from the front of the case), and the
interface. It requires special connectors on the drives that are designed to ensure that
the ground connections between the drive and the bus are maintained at any time that the
device has power. This means that when removing a device, the power connection has to be
broken before the ground connection, and when re-inserting a device, the ground connection
has to be made before the power connection is re-established. This is typically done by
designing the connectors so that the ground connector pins are a bit longer than the other
pins. This design is in fact used by SCSI SCA, the most common interface used by
hot-swappable RAID arrays. See the discussion of
SCA for more, as well as the discussion of drive
enclosures.
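The effect of the longer ground pins can be illustrated with a short Python sketch. The pin lengths below are hypothetical values chosen only to show the ordering (the text gives no dimensions), and the function names are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical pin lengths in millimetres; real connector dimensions are not
# given in the text. Only the relative order matters here.
PIN_LENGTHS_MM: Dict[str, float] = {"ground": 5.0, "power": 4.0}


def contact_sequence(pin_lengths: Dict[str, float], inserting: bool) -> List[str]:
    """Order in which pins make contact (inserting) or break contact (removing)."""
    # The longest pin touches first on the way in and lets go last on the way out.
    return sorted(pin_lengths, key=pin_lengths.get, reverse=inserting)


def ground_always_covers_power(pin_lengths: Dict[str, float]) -> bool:
    """True if ground is connected at every moment that power is connected."""
    insert_order = contact_sequence(pin_lengths, inserting=True)
    remove_order = contact_sequence(pin_lengths, inserting=False)
    return (insert_order.index("ground") < insert_order.index("power")
            and remove_order.index("power") < remove_order.index("ground"))
```

With these lengths, insertion connects ground and then power, and removal breaks power and then ground. Swapping the two lengths makes `ground_always_covers_power` return `False`, which is the situation the connector design is meant to rule out.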
As mentioned above, the SCA method on SCSI is most commonly used for hot-swappable
arrays. In the IDE/ATA world, the best you can usually do is warm swapping using drive trays, which "convert" regular
IDE/ATA drives to a form similar in concept to how SCA works, though not quite the same.
This is still pretty good, but not really hot swapping. The system usually needs to be
halted before you remove the drives.
A system that cannot do hot swapping, or even warm swapping, will benefit from the use
of hot spares. If your system can only cold
swap, you will at some point have to take it down to change failed hardware. But if you
have hot spares, you can restore the array to full functionality immediately, and thus
delay shutting the system down to a more convenient time, like 3:00 am (heh, I meant more
convenient for the users, not you, the lucky administrator). In fact, hot
sparing is a useful feature even if you have hot swap capability; see the discussion of
hot spares for more.
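To make the benefit concrete, here is a toy Python model of an array that can only be cold swapped. It is a simplified sketch: the class `ArrayModel` and its fields are hypothetical, and the model assumes the rebuild onto a spare completes instantly, which a real array does not do.

```python
from dataclasses import dataclass


@dataclass
class ArrayModel:
    required: int                   # member drives needed for full redundancy
    active: int                     # member drives currently working
    hot_spares: int = 0             # idle drives already installed and powered
    failed_awaiting_swap: int = 0   # dead drives still sitting in the case

    def state(self) -> str:
        return "optimal" if self.active == self.required else "degraded"

    def drive_fails(self) -> None:
        self.active -= 1
        self.failed_awaiting_swap += 1
        if self.hot_spares > 0:
            # A spare takes over at once (rebuild time is ignored in this model).
            self.hot_spares -= 1
            self.active += 1

    def cold_swap_at_maintenance(self) -> None:
        # The system is down: every failed drive is replaced. Replacements fill
        # missing member slots first, then become new hot spares.
        while self.failed_awaiting_swap > 0:
            self.failed_awaiting_swap -= 1
            if self.active < self.required:
                self.active += 1
            else:
                self.hot_spares += 1
```

With one hot spare, a failure leaves the model "optimal" and the dead drive simply waits for the maintenance window; with no spare, the array stays "degraded" until the system is taken down.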