Disk compression takes advantage of (at least) two characteristics of most files. First
is the fact that most files have a large amount of redundant information, with patterns
that tend to repeat. By using "placeholders" that are smaller than the pattern
they represent, the size of the file can be reduced.
For example, take the sentence "In fact, there are many theories to explain
the origin of man." If you look closely, you will see that the string " the"
(a space followed by "the") appears in this sentence three times. Compression
software can replace this string with a token, for example "#", and store the
sentence as "In fact,#re are many#ories to explain# origin of man." It then
translates each "#" back to " the" when the file is read back.
Further, the software can replace the string " man" with "$" and reduce the
sentence to "In fact,#re are$y#ories to explain# origin of$." Just replacing
those two patterns shrinks the sentence from 62 characters to 47, a reduction of
24%, and this is just a simple example of what full compression algorithms can do,
working with a large number of patterns and rules.
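The substitution above can be reproduced in a few lines of Python. This is only a sketch of the idea, not how any particular compression product works: the function names are made up for illustration, the token table is fixed by hand rather than discovered by the program, and the space needed to store the table itself is ignored.

```python
# Illustrative token substitution using the sentence and tokens from the text.
SENTENCE = "In fact, there are many theories to explain the origin of man."
TOKENS = {" the": "#", " man": "$"}  # pattern -> shorter placeholder


def compress(text, tokens):
    """Replace each pattern with its placeholder."""
    # A placeholder that already occurs in the text would make decoding ambiguous.
    for token in tokens.values():
        if token in text:
            raise ValueError(f"token {token!r} already occurs in the text")
    for pattern, token in tokens.items():
        text = text.replace(pattern, token)
    return text


def decompress(text, tokens):
    """Undo compress() by expanding placeholders in reverse order."""
    for pattern, token in reversed(list(tokens.items())):
        text = text.replace(token, pattern)
    return text


packed = compress(SENTENCE, TOKENS)
saving = 1 - len(packed) / len(SENTENCE)
print(packed)
print(f"{len(SENTENCE)} -> {len(packed)} characters ({saving:.0%} smaller)")
```

Running it prints the 47-character sentence shown above and a saving of 24%.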
The other characteristic of many files that disk compression makes use of is the fact
that while each character in a file takes up one byte, most characters don't require the
full byte to store them. Each byte can hold one of 256 different values, but if you have a
text file, there will be very long sequences containing only letters, numbers, and
punctuation, which together use fewer than 100 of those values. Compression software
uses encoding schemes that pack information like text so that it makes full use of
the 256 values that each byte can hold.
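One simple way to picture this packing is to give every character the same small number of bits, just enough to tell apart the characters that actually occur. The sketch below assumes that fixed-width scheme purely for illustration; real compression software generally uses more sophisticated variable-length codes, and the helper names here are hypothetical.

```python
# Fixed-width bit packing: each character gets only as many bits as the
# text's own alphabet requires, instead of a full 8-bit byte.

def pack(text):
    """Return (packed bytes, alphabet, bits per character)."""
    alphabet = "".join(sorted(set(text)))
    # Smallest bit width that can number every symbol in the alphabet.
    bits = max(1, (len(alphabet) - 1).bit_length())
    value = 0
    for ch in text:
        value = (value << bits) | alphabet.index(ch)
    nbytes = (len(text) * bits + 7) // 8  # round up to whole bytes
    return value.to_bytes(nbytes, "big"), alphabet, bits


def unpack(data, alphabet, bits, length):
    """Rebuild the original text from the packed bytes."""
    value = int.from_bytes(data, "big")
    mask = (1 << bits) - 1
    chars = []
    for i in range(length):
        shift = (length - 1 - i) * bits
        chars.append(alphabet[(value >> shift) & mask])
    return "".join(chars)


text = "the origin of man"
packed_bytes, alphabet, bits = pack(text)
print(f"{len(alphabet)} distinct characters, {bits} bits each")
print(f"{len(text)} bytes -> {len(packed_bytes)} bytes")
```

This phrase uses 12 distinct characters, so 4 bits per character are enough and its 17 bytes pack into 9. The alphabet and the text length must be stored alongside the packed data, which is why the benefit only shows up on longer files.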
The combination of these two effects results in text files often being compressed by a
factor of 2 to 1 or even 3 to 1. Data files can often be compressed even more: take a look
at some spreadsheet or database files and you will find long sequences of blanks and
zeros, sometimes hundreds or thousands in a row. These files can often be compressed 5 to
1, 10 to 1 or even more.
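Long runs of a single repeated character are the easiest case to see. A minimal sketch, assuming simple run-length encoding (one of many possible techniques, and not necessarily the one any given product uses), stores each run as a value and a count; the sample record below is invented for illustration.

```python
from itertools import groupby

# Run-length encoding: store each run of identical bytes as (byte, count).

def rle_encode(data):
    """Collapse consecutive identical bytes into (byte value, run length) pairs."""
    return [(byte, len(list(run))) for byte, run in groupby(data)]


def rle_decode(pairs):
    """Expand (byte value, run length) pairs back into the original bytes."""
    return b"".join(bytes([byte]) * count for byte, count in pairs)


# Hypothetical database-style record: a short name padded with blanks,
# followed by a zero-filled numeric field.
record = b"ACME" + b" " * 200 + b"0" * 300 + b"42"
pairs = rle_encode(record)
print(f"{len(record)} bytes -> {len(pairs)} runs")
```

The 506-byte record collapses into 8 runs, because the 200 blanks and the 300 zeros each become a single pair.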
Finally, compression is also useful in battling slack, the space wasted when a file
does not fill its last cluster. If you have 1,000 files on a hard disk that uses
16,384-byte clusters, and each of these files is 500 bytes in size, you are using
16 MB of disk space to store less than 500 KB of data. The reason is that each file
must be allocated a full cluster, and only 500 of the 16,384 bytes actually hold any
data; the rest (97%!) is slack. If you put all of those files
into a compressed file like a ZIP file, not only will they probably be reduced in size
greatly, but the ZIP file will have a maximum of 16,383 bytes of slack by itself,
resulting in a large amount of saved disk space. The advanced features of DriveSpace 3
volume compression will in fact reduce slack even if file compression isn't enabled.
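The slack figures above can be checked with a short calculation. The sketch below uses the numbers from the example; the function name is hypothetical, and the single-archive case assumes no compression and no archive overhead, so it only illustrates the slack effect.

```python
# Slack arithmetic for 1,000 files of 500 bytes on 16,384-byte clusters.
CLUSTER = 16_384


def allocated_bytes(file_size, cluster_size=CLUSTER):
    """Disk space consumed by a file that must occupy whole clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


sizes = [500] * 1_000
data = sum(sizes)                                  # bytes of real data
on_disk = sum(allocated_bytes(s) for s in sizes)   # bytes actually allocated
slack = on_disk - data

print(f"data:    {data:,} bytes")
print(f"on disk: {on_disk:,} bytes")
print(f"slack:   {slack:,} bytes ({slack / on_disk:.0%})")

# The same data stored as one file wastes less than a single cluster.
archive_slack = allocated_bytes(data) - data
print(f"slack as one file: {archive_slack:,} bytes")
```

The separate files occupy 16,384,000 bytes for 500,000 bytes of data, while the same data stored as a single file leaves only 7,904 bytes of slack.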