The basis of all error detection and correction in hard disks is the inclusion of
redundant information and special hardware or software to use it. Each sector of data on
the hard disk contains 512 bytes, or 4,096 bits, of user data. Extra bits are added to
each sector to implement an error correcting code or ECC (sometimes also called error
correction code or error correcting circuits). These bits do not contain data; rather,
they contain information about the data that can be used to correct problems encountered
when trying to access the real data bits.
There are several different types of error correcting codes that have been invented
over the years, but the type commonly used on PCs is the Reed-Solomon algorithm,
named for researchers Irving Reed and Gustave Solomon, who first described the general
technique that the algorithm employs. Reed-Solomon codes are widely used for error
detection and correction in various computing and communications media, including magnetic
storage, optical storage, high-speed modems, and data transmission channels. They are
popular because they are easier to decode than most comparable codes, can detect (and
correct) large numbers of missing or damaged bits, and require very few extra ECC bits
for a given number of data bits and a given level of protection.
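One reason Reed-Solomon codes cope well with runs of damaged bits is that they work on multi-bit symbols rather than on individual bits, so a long run of bad bits damages only a handful of symbols. The short Python sketch below counts how many symbols a run of consecutive bad bits can touch. The 8-bit symbol size and the function names are assumptions made for illustration; actual drives may use a different symbol size.

```python
def symbols_touched(start_bit, burst_bits, symbol_bits=8):
    """Count the symbols hit by a run of bad bits starting at start_bit.

    symbol_bits=8 is an assumed symbol size, chosen only for illustration.
    """
    first_symbol = start_bit // symbol_bits
    last_symbol = (start_bit + burst_bits - 1) // symbol_bits
    return last_symbol - first_symbol + 1


def worst_case_symbols(burst_bits, symbol_bits=8):
    """Largest number of symbols a run of burst_bits bad bits can touch."""
    # Trying every alignment inside one symbol covers all possible cases.
    return max(
        symbols_touched(start, burst_bits, symbol_bits)
        for start in range(symbol_bits)
    )


if __name__ == "__main__":
    for burst in (1, 8, 17, 64):
        print(burst, "bad bits touch at most", worst_case_symbols(burst), "symbols")
```

Under these assumptions, a run of 64 bad bits touches at most 9 symbols, so a code able to repair 9 symbols would handle it even though 64 individual bits were lost.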
When a sector is written to the hard disk, the appropriate ECC bits are generated and
stored in the space reserved for them. When the sector is read back, the user data,
combined with the ECC bits, tells the controller whether any errors occurred during the
read. Errors that can be corrected using the redundant information are corrected before
the data is passed to the rest of the system. The system can also tell when there is too
much damage to the data to correct, and will issue an error notification in that event.
The sophisticated firmware present in all modern drives uses ECC as part of its overall
error management protocols. This is all done "on the fly" with no intervention from the
user required, and normally with no slowdown in performance when errors are encountered
and corrected this way.
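The write and read sequence described above can be sketched in Python. This is not Reed-Solomon: to stay short it uses a much weaker stand-in check (the XOR of the positions of all set bits, plus one parity bit) that can repair a single flipped bit and detect two. All names here are hypothetical, and the sketch assumes the check information itself is read back undamaged.

```python
class UncorrectableSectorError(Exception):
    """Raised when the damage exceeds what the check information can repair."""


def compute_check(data):
    """Return (position_xor, parity) for the bits of data.

    position_xor is the XOR of the 1-based positions of all set bits;
    parity is the number of set bits modulo 2.
    """
    position_xor = 0
    parity = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                position_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return position_xor, parity


def write_sector(data):
    """Simulate a write: keep the data together with its check information."""
    return bytes(data), compute_check(data)


def read_sector(stored, check):
    """Simulate a read: verify, repair one flipped bit, or report failure."""
    stored_xor, stored_parity = check
    seen_xor, seen_parity = compute_check(stored)
    syndrome = stored_xor ^ seen_xor
    parity_changed = stored_parity != seen_parity

    if syndrome == 0 and not parity_changed:
        return bytes(stored)  # no error detected
    if parity_changed and 1 <= syndrome <= len(stored) * 8:
        # One flipped bit: the syndrome is its 1-based position.
        repaired = bytearray(stored)
        position = syndrome - 1
        repaired[position // 8] ^= 1 << (position % 8)
        return bytes(repaired)
    # Anything else is treated as too much damage for this simple check.
    raise UncorrectableSectorError("too much damage to correct")
```

A caller would store the pair returned by `write_sector` and later pass both parts to `read_sector`. The three outcomes mirror the text: clean data is returned as-is, a correctable error is fixed before the data is handed on, and anything worse raises an error.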
The capability of a Reed-Solomon ECC implementation is based on the number of
additional ECC bits it includes. The more bits that are included for a given amount of
data, the more errors can be tolerated. There are multiple tradeoffs involved in
deciding how many bits of ECC information to use. Including more bits per sector of data
allows for more robust error detection and correction, but means fewer sectors can be put
on each track, since more of the linear distance of the track is used up with non-data
bits. On the other hand, if you make the system more capable of detecting and correcting
errors, you make it possible to increase areal density or make other performance
improvements, which could pay back the "investment" of extra ECC bits, and then
some. Another complicating factor is that the more ECC bits included, the more processing
power the controller must possess to process the Reed-Solomon algorithm. The engineers who
design hard disks take these various factors into account in deciding how many ECC bits to
include for each sector.
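The capacity side of this tradeoff can be put into rough numbers. A Reed-Solomon code with 2t check symbols can correct up to t damaged symbols, so doubling the ECC doubles the correctable damage while making every sector longer. The Python sketch below assumes 8-bit symbols, a made-up track capacity of 100,000 bytes, and no other per-sector overhead (headers, gaps, and so on); none of these figures come from a real drive.

```python
USER_BYTES_PER_SECTOR = 512  # from the text: 512 bytes of user data per sector


def ecc_tradeoff(ecc_bytes, track_bytes):
    """Return (correctable_bytes, sectors_per_track, user_bytes_per_track).

    Assumes 8-bit symbols and that ECC is the only overhead in a sector.
    track_bytes is a hypothetical raw capacity for one track.
    """
    # Reed-Solomon: 2t check symbols correct up to t symbol errors.
    correctable_bytes = ecc_bytes // 2
    sector_bytes = USER_BYTES_PER_SECTOR + ecc_bytes
    sectors_per_track = track_bytes // sector_bytes
    user_bytes_per_track = sectors_per_track * USER_BYTES_PER_SECTOR
    return correctable_bytes, sectors_per_track, user_bytes_per_track


if __name__ == "__main__":
    for ecc_bytes in (16, 32, 64):
        print(ecc_bytes, ecc_tradeoff(ecc_bytes, track_bytes=100_000))
```

With these assumed figures, going from 16 to 64 ECC bytes per sector raises the correctable damage from 8 to 32 bytes but cuts the track from 189 sectors to 173. Whether that is a good trade depends on how much extra areal density the stronger correction makes possible, which this sketch does not model.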