In an effort to help users avoid data loss, drive manufacturers are now incorporating
logic into their drives that acts as an "early warning system" for pending drive
problems. This system is called Self-Monitoring, Analysis and Reporting Technology
or SMART. The hard disk's integrated controller works with various sensors to
monitor several aspects of the drive's performance, determines from this information whether
the drive is behaving normally, and makes status information available to software
that probes the drive.
The fundamental principle behind SMART is that many problems with hard disks don't
occur suddenly. They result from a slow degradation of various mechanical or electronic
components. SMART evolved from a technology developed by IBM called Predictive Failure
Analysis or PFA. PFA divides failures into two categories: those that can be
predicted and those that cannot. Predictable failures occur slowly over time, and often
provide clues to their gradual failing that can be detected. An example of such a
predictable failure is spindle motor bearing burnout: this will often occur over a long
time, and can be detected by paying attention to how long the drive takes to spin up or
down, by monitoring the temperature of the bearings, or by keeping track of how much
current the spindle motor uses. An example of an unpredictable failure would be the
burnout of a chip on the hard disk's logic board: often, this will "just happen"
one day. Clearly, these sorts of unpredictable failures cannot be planned for.

[Figure: The main principle behind failure prediction is that some failures cause gradual changes in various indicators that can be tracked to detect trends that may indicate overall drive failure. Image © Quantum Corporation. Image used with permission.]
The drive manufacturer's reliability engineers analyze failed drives, along with the
mechanical and electronic characteristics of the drive, to determine correlations:
relationships between predictable failures and the values and trends in drive
characteristics that suggest slow degradation of the
drive. The exact characteristics monitored depend on the particular manufacturer and
model. Here are some that are commonly used:
- Head Flying Height: A downward trend in flying height will often presage a head crash.
- Number of Remapped Sectors: If the drive is remapping many sectors due to internally-detected
errors, this can mean the drive is starting to fail.
- ECC Use and Error Counts: The number of errors
encountered by the drive, even if corrected internally, often signals problems developing
with the drive. The trend is in some cases more important than the actual count.
- Spin-Up Time: Changes in spin-up time can reflect problems with the
spindle motor.
- Temperature: Increases in drive temperature often signal spindle motor
problems.
- Data Throughput: Reduction in the transfer rate of the drive can signal
various internal problems.
(Some of the quality and reliability features I am describing in this part of the site
are in fact used to feed data into the SMART software.)
Using statistical analysis, the "acceptable" values of the various
characteristics are programmed into the drive. If the measurements for the various
attributes being monitored fall out of the acceptable range, or if the trend in a
characteristic is showing an unacceptable decline, an alert condition is written into the
drive's SMART status register to warn that a problem with the drive may be occurring.
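The two tests described above (a value falling outside its acceptable range, or a trend declining too quickly) can be illustrated with a minimal Python sketch. It is not how any particular drive's firmware works: the `Attribute` class, the scale on which higher values mean a healthier drive, and all of the numbers are assumptions made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Attribute:
    """One monitored characteristic (hypothetical layout, for illustration only)."""
    name: str
    readings: Sequence[float]        # oldest first; higher is assumed to mean healthier
    threshold: float                 # lowest acceptable value (made-up number)
    max_decline_per_reading: float   # steepest acceptable average drop per reading


def average_slope(readings: Sequence[float]) -> float:
    """Average change per reading between the first and last measurement."""
    if len(readings) < 2:
        return 0.0
    return (readings[-1] - readings[0]) / (len(readings) - 1)


def check(attr: Attribute) -> List[str]:
    """Return the reasons, if any, that this attribute should raise an alert."""
    problems = []
    # Test 1: the latest measurement is outside the acceptable range.
    if attr.readings and attr.readings[-1] < attr.threshold:
        problems.append("below threshold")
    # Test 2: the value is still acceptable but is declining too quickly.
    if average_slope(attr.readings) < -attr.max_decline_per_reading:
        problems.append("declining too fast")
    return problems


def alert_condition(attrs: Sequence[Attribute]) -> bool:
    """True if any monitored attribute fails either test."""
    return any(check(a) for a in attrs)
```

Note that the second test can trip while the latest value is still inside the acceptable range, which matches the point made earlier that the trend is sometimes more important than the actual count.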
SMART requires a hard disk that supports the feature and some sort of software to check
the status of the drive. All major drive manufacturers now incorporate the SMART feature
into their drives, and most newer PC systems and motherboards have BIOS routines that will
check the SMART status of the drive. So do operating systems such as Windows 98. If your
PC doesn't have built-in SMART support, some utility software (like Norton Utilities and
similar packages) can be set up to check the SMART status of drives. This is an important
point to remember: the hard disk doesn't generate SMART alerts itself; it just makes
status information available. That status data must be checked regularly for this feature to be of
any value.
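Because the drive only exposes its status, something has to ask for it on a schedule. The sketch below shows that polling pattern in Python. The `read_status` callable is a hypothetical stand-in for whatever BIOS routine or utility actually queries the drive, and the function name and parameters are assumptions, not part of any real SMART interface.

```python
import time
from typing import Callable, Optional


def poll_status(
    read_status: Callable[[], bool],
    checks: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Ask for the drive's status up to `checks` times.

    `read_status` is a hypothetical callable that returns True when the
    drive reports an alert condition. Returns the index of the first check
    that saw an alert, or None if no check did.
    """
    for i in range(checks):
        if read_status():
            return i
        # Wait between checks, but not after the final one.
        if i < checks - 1:
            sleep(interval_seconds)
    return None
```

In practice the interval would be hours or days rather than seconds; the `sleep` parameter is there only so that the waiting can be swapped out when trying the function.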
Clearly, SMART is a useful tool, but it is not foolproof: it can detect some sorts
of problems, but has no way of detecting others. A good analogy for this feature is
the warning lights on the dashboard of your car: something to pay
attention to, but not to rely upon. You should not assume that because SMART generated an
alert, there is definitely a drive problem, or conversely, that the lack of an alarm means
the drive cannot possibly be having a problem. It certainly is no replacement for proper
hard disk care and maintenance, or routine and current backups.
If your drive reports a SMART alert, you should immediately stop using it
and contact your drive manufacturer's technical support department for instructions. Some
companies consider a SMART alert sufficient evidence that the drive is bad, and will
immediately issue an RMA for its replacement; others require other steps to be performed,
such as running diagnostic software on the drive. In no
event should you ignore the alert. Sometimes I see people asking others how
they can turn off "those annoying SMART messages" on their PCs. Doing that is, well,
like putting electrical tape over your car's oil pressure light so it won't bother you
while you're driving!