Zipping your way to free space: Part 3
In the last two columns (Part 1, Part 2), we contrasted a number of compression tools that you might use to free up disk space on your systems. In comparing compression utilities, the most important issues to consider are speed, the compression ratio (how much space you can expect to save), portability and reliability. Which of these factors ranks highest in your list depends on your application, but the compression tool that you use routinely should work well on most of the files that you need to compress, giving you reasonably good file size reduction and acceptable performance.
What we didn't consider in the earlier columns were decompression time (we looked only at compression time) and the possibility that some of the compression utilities may not work at all given extremely large files. We also didn't get into the issue of patent restrictions -- whether there are hidden strings attached to your use of particular compression tools.
Decompression Timing
With respect to timing, decompression time can be considerably more important than compression time. Why? Because the files that you compress may be delivered to any number of individual users or customers. In other words, every file that is compressed may be uncompressed tens or hundreds of times and is likely to be uncompressed at considerably more time-critical moments.
In glossing over the issue of decompression time, we might have assumed that compression and decompression times would be roughly equivalent for any particular compression tool. This assumption, however, proves not to be the case, particularly with very large files. For some compression tools, the compression operation takes MUCH longer than the corresponding decompression. For others, the compression may be slightly faster. For every tool, however, the contents of a file, not just its size, determine how quickly it will be compressed and decompressed. The script presented last week has, therefore, been modified and presented again below -- this time measuring decompression time along with compression time, with some surprising results.
Working with Very Large Files
If the files that you need to compress are particularly large, you may need to verify that the tool you want to use can compress and decompress them. Some compression utilities may break down when asked to process extremely large files. For example, the pack command may issue the following error message when asked to compress a 2 Gbyte file -- a reference to limitations of the particular compression algorithm (Huffman encoding) that it uses:
Huffman tree has too many levels - file unchanged
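One way to check a tool against a large file before relying on it is a round trip: compress the file to a scratch location, decompress it and compare the result with the original. The function below is only a sketch under some assumptions. The name roundtrip_ok is made up for illustration, the tool must be able to run as a filter (as "gzip -c" and "gzip -dc" do; pack does not), and mktemp -d, while widely available, is not guaranteed on every Unix. You also need enough scratch space for the compressed copy.

```sh
#!/bin/sh
# roundtrip_ok FILE COMPRESS_CMD DECOMPRESS_CMD   (hypothetical helper)
# Returns 0 if FILE survives compression and decompression unchanged.
# The original file is only read, never modified.
roundtrip_ok() {
    rt_src=$1
    rt_comp=$2
    rt_decomp=$3
    rt_tmp=$(mktemp -d) || return 1

    # $rt_comp and $rt_decomp are left unquoted on purpose so that
    # "gzip -c" splits into a command and its option.
    if $rt_comp < "$rt_src" > "$rt_tmp/packed" &&
       $rt_decomp < "$rt_tmp/packed" | cmp -s - "$rt_src"
    then
        rt_status=0
    else
        rt_status=1
    fi

    rm -rf "$rt_tmp"
    return $rt_status
}
```

A call such as roundtrip_ok bigfile.tar "gzip -c" "gzip -dc" && echo "gzip is safe for this file" would then tell you whether the tool copes before you delete anything.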
Working with Very Small Files
If the files that you need to work with are particularly small, you may see little or no space saving from compressing them. In fact, some files will even grow in size if you force a compression, and some tools will simply refuse to operate on files that are extremely small, as shown here.
First, we create an empty file:
> touch zerobytes
Then, we attempt to "compress" it, but the compress command refuses:
> compress zerobytes
zerobytes: -- file unchanged
Next, we gzip the file and look at the size of the resultant file:
> gzip zerobytes
> ls -l zero*
-rw-r--r-- 1 shs staff 30 Feb 8 16:19 zerobytes.gz
After we gunzip the file, we compress it again, this time with bzip2:
> bzip2 zerobytes
> ls -l zero*
-rw-r--r-- 1 shs staff 14 Feb 8 16:19 zerobytes.bz2
After we bunzip2 the file, we compress it with zip:
> zip zerobytes.zip zerobytes
adding: zerobytes (stored 0%)
> ls -l zer*
-rw-r--r-- 1 shs staff 0 Feb 8 16:19 zerobytes
-rw-r--r-- 1 shs staff 150 Feb 8 16:22 zerobytes.zip
Clearly, compression of an empty file yields a non-empty file. The small amount of content, reflecting some overhead associated with the particular compression algorithm, is negligible but points to the fact that compression isn't always a good thing.
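If you want to see this overhead without compressing and uncompressing the file each time, you can send the compressed output to wc instead of to disk. The sketch below does this for gzip and bzip2. The function name size_report is hypothetical, and the loop simply skips any tool that isn't installed.

```sh
#!/bin/sh
# size_report FILE   (hypothetical helper)
# Prints the original size and the size each tool would produce,
# without creating or removing any files.
size_report() {
    sr_file=$1
    sr_orig=$(wc -c < "$sr_file" | tr -d ' ')
    echo "original: $sr_orig bytes"
    for sr_tool in gzip bzip2
    do
        # Skip tools that are not available on this system.
        command -v "$sr_tool" >/dev/null 2>&1 || continue
        sr_size=$("$sr_tool" -c "$sr_file" | wc -c | tr -d ' ')
        echo "$sr_tool: $sr_size bytes"
    done
}
```

Run against an empty file, each tool should report a small but non-zero size. The exact byte counts depend on the tool version and on whether the file name is stored in the header.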
When It's Not Worth It
While file compression can be a sysadmin's friend, not every large file is worth compressing. I've run into numerous cases in which a tar file compresses down to 90-95% of its original size, a reduction of only 5-10%. In cases such as this, the savings are hardly worth the time and trouble of compressing and decompressing. Unless there is something to be gained by keeping your files in the same format (e.g., compressed tar files), you might as well not bother.
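A quick trial run can tell you whether a file falls into this category before you spend the time. The sketch below compresses to a pipe, works out the percentage saved and succeeds only if it meets a threshold. The function name worth_compressing and the default threshold of 10% are assumptions chosen for illustration, and gzip stands in for whichever tool you actually use.

```sh
#!/bin/sh
# worth_compressing FILE [MIN_REDUCTION_PERCENT]   (hypothetical helper)
# Returns 0 if gzip would shrink FILE by at least the given percentage.
worth_compressing() {
    wc_file=$1
    wc_min=${2:-10}          # assumed default: require a 10% saving
    wc_orig=$(wc -c < "$wc_file" | tr -d ' ')

    # An empty file can only grow when compressed.
    [ "$wc_orig" -gt 0 ] || return 1

    wc_comp=$(gzip -c "$wc_file" | wc -c | tr -d ' ')
    wc_reduction=$(( 100 - wc_comp * 100 / wc_orig ))
    [ "$wc_reduction" -ge "$wc_min" ]
}
```

You might use it as worth_compressing archive.tar 10 && gzip archive.tar. Keep in mind that the trial itself costs one full compression pass, so for very large files you may prefer to test a representative sample instead.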
The New Script
Adding decompression commands to the original script was relatively easy, and this step obviated the need to refresh the original file for each compress operation, since each decompression restores it. To make the output a little easier to read, the script now uses separator lines between the output from each of the compression commands. In addition, we display only the real (clock) time, ignoring the breakdown into user and sys time. This makes for cleaner, easier-to-parse output -- such as this.
===== compress =====
compress 2.4
uncompress 1.2
compress 70% reduction
Here's the new compressTest script:
#!/bin/sh
# compressTest: time compression and decompression of one file with several tools
# Note: relies on time being an external command that reports to stderr;
# a shell with a built-in time keyword will not pass its report through the pipe.

if test $# -eq 0; then
    echo "usage: $0 file"
    exit 1
else
    file=$1
fi

# size of the original file in bytes
orig_sz=`ls -l $file | awk '{print $5}'`

for tool in compress pack gzip zip bzip2
do
    echo "=====" $tool "====="
    case $tool in
    compress)
        time compress $file 2>&1 | grep real | sed "s/real/ compress/"
        comp_sz=`ls -l $file.Z | awk '{print $5}'`
        time uncompress $file 2>&1 | grep real | sed "s/real/uncompress/"
        ;;
    pack)
        time pack $file 2>&1 | grep real | sed "s/real/ pack/"
        comp_sz=`ls -l $file.z | awk '{print $5}'`
        time unpack $file 2>&1 | grep real | sed "s/real/ unpack/"
        ;;
    gzip)
        time gzip $file 2>&1 | grep real | sed "s/real/ gzip/"
        comp_sz=`ls -l $file.gz | awk '{print $5}'`
        time gunzip $file 2>&1 | grep real | sed "s/real/ gunzip/"
        ;;
    zip)
        # zip leaves the original in place, so remove it before unzip restores it
        time zip $file.zip $file 2>&1 | grep real | sed "s/real/ zip/"
        rm $file
        comp_sz=`ls -l $file.zip | awk '{print $5}'`
        time unzip $file.zip 2>&1 | grep real | sed "s/real/ unzip/"
        rm $file.zip
        ;;
    bzip2)
        time bzip2 $file 2>&1 | grep real | sed "s/real/ bzip2/"
        comp_sz=`ls -l $file.bz2 | awk '{print $5}'`
        time bunzip2 $file.bz2 2>&1 | grep real | sed "s/real/ bunzip2/"
        ;;
    esac
    # percentage of the original size that was saved (integer arithmetic)
    percent=`expr $comp_sz \* 100 / $orig_sz`
    reduction=`expr 100 - $percent`
    echo $tool ${reduction}% reduction
    echo
done
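The reduction figure that the script reports comes from integer arithmetic, so it is worth seeing how the numbers fall out. The sketch below repeats the same calculation using shell arithmetic instead of expr. The function name reduction_pct and the guard for an empty original are additions made for illustration and are not part of the script above.

```sh
#!/bin/sh
# reduction_pct ORIGINAL_SIZE COMPRESSED_SIZE   (hypothetical helper)
# Prints the percentage saved, computed the same way as the script:
# 100 - (compressed * 100 / original), with integer division.
reduction_pct() {
    rp_orig=$1
    rp_comp=$2
    if [ "$rp_orig" -eq 0 ]; then
        echo "n/a"           # an empty original has no meaningful ratio
        return 0
    fi
    echo $(( 100 - rp_comp * 100 / rp_orig ))
}

reduction_pct 1000 300       # prints 70
reduction_pct 1000 95        # prints 91, though the true saving is 90.5%
```

Because the division truncates before the subtraction, the reported reduction can be up to one point higher than the true figure. That is fine for comparing tools, but it explains small discrepancies if you check the numbers by hand.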
While running the script will provide compression and decompression times for one file, the table below displays compress and decompress times for three separate files:
o a relatively small (8 Kbytes) syslog (text) file
o a medium-sized (14 MBytes) wtmp (data) file
o a very large (2 Gbytes) tar file
syslog file                wtmp file                  2GB file
===== compress =====       ===== compress =====       ===== compress =====
compress 0.0               compress 2.4               compress 6:13.8
uncompress 0.0             uncompress 1.6             uncompress 4:18.3
compress 73% reduction     compress 70% reduction     compress 82% reduction
===== pack =====           ===== pack =====           ===== pack =====
pack 0.0                   pack 2.1                   pack N/A
unpack 0.0                 unpack 3.2                 unpack N/A
pack 33% reduction         pack 46% reduction         pack failed
===== gzip =====           ===== gzip =====           ===== gzip =====
gzip 0.0                   gzip 8.7                   gzip 11:58.6
gunzip 0.0                 gunzip 0.9                 gunzip 4:18.4
gzip 85% reduction         gzip 81% reduction         gzip 88% reduction
===== zip =====            ===== zip =====            ===== zip =====
zip 0.0                    zip 8.8                    zip 12:20.3
unzip 0.0                  unzip 1.0                  unzip 4:15.4
zip 85% reduction          zip 81% reduction          zip 88% reduction
===== bzip2 =====          ===== bzip2 =====          ===== bzip2 =====
bzip2 0.1                  bzip2 26.7                 bzip2 2:09:55.2
bunzip2 0.0                bunzip2 5.4                bunzip2 12:23.4
bzip2 87% reduction        bzip2 82% reduction        bzip2 91% reduction
For the small file, time differences between the compression algorithms were insignificant. The range of compression ratios is fairly wide, but the more popular commands (gzip, zip and bzip2) are very close.
For the medium-sized file, we notice that compression takes significantly longer than decompression for almost every command (pack is the exception). This generally works in our favor, since most of us compress files under more leisurely circumstances than we uncompress them. All of the times shown fall within acceptable ranges, however, so the differences aren't likely to swing votes. Compression ratios are similar to those for small files.
It's when we look at extremely large files that the differences are worth noticing. The pack command fails to compress our 2 Gbyte file. The compress command, on the other hand, ranks number one for compression time and is on par with gzip and zip for decompression. The bzip2 command, always the leader for compression ratio, takes an inordinate amount of time -- over two hours in this case -- to compress our 2GB file. But it reduces the file to 9% or so of its original size and offers a perfectly acceptable decompression time, even if it's three times as long as that of the other commands.
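Comparing times reported as minutes and seconds (or hours, minutes and seconds) is easier once they are converted to plain seconds. The two functions below are a sketch of that arithmetic. The names to_seconds and time_ratio are hypothetical, and the bunzip2 figure for the 2GB file is read here as 12 minutes 23.4 seconds.

```sh
#!/bin/sh
# to_seconds TIME   (hypothetical helper)
# Converts S, M:S or H:M:S (seconds may have a fraction) to seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        printf "%.1f\n", s
    }'
}

# time_ratio TIME_A TIME_B   (hypothetical helper)
# Prints how many times longer TIME_A is than TIME_B.
time_ratio() {
    tr_a=$(to_seconds "$1")
    tr_b=$(to_seconds "$2")
    LC_ALL=C awk -v a="$tr_a" -v b="$tr_b" 'BEGIN {
        if (b == 0) print "n/a"      # avoid dividing by a 0.0 timing
        else printf "%.1f\n", a / b
    }'
}

to_seconds 2:09:55.2           # prints 7795.2
time_ratio 11:58.6 4:18.4      # gzip vs. gunzip on the 2GB file: prints 2.8
time_ratio 12:23.4 4:18.4      # bunzip2 vs. gunzip on the 2GB file: prints 2.9
```

The last line bears out the "three times as long" figure for bzip2 decompression relative to the other commands.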
What about Patents?
An interesting, though troublesome, issue arose some years ago over compression algorithms and patents. Some compression tools (such as the Unix compress command) use an algorithm that became the subject of considerable controversy when Unisys, holder of the Lempel-Ziv-Welch (LZW) patent, decided in 1995 that it should charge a licensing fee. Most of the controversy surrounded not the compression utilities themselves but the use of the LZW algorithm in the production of GIF and TIFF image files. While a rush on the part of webmasters to remove these image files from the Web was averted (Unisys moved its focus to developers and away from end users), many of the alternative compression tools owe their birth to a backlash against the LZW patent.
Tools like gzip and bzip2, for example, were specifically created to sidestep the kind of legal entanglements that could ensue from the use of patented algorithms. The two take different approaches: gzip uses the DEFLATE algorithm (LZ77 combined with Huffman coding), while bzip2 uses the Burrows-Wheeler Transform, which rearranges each block of data so that it can be compressed more effectively by the stages that follow.
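To get a feel for the block rearrangement behind the Burrows-Wheeler Transform, you can try a toy version on a short string: list every rotation of the string, sort the rotations and read off the last character of each. The sketch below does just that. It is a teaching aid, not how bzip2 is implemented. The function name bwt is hypothetical, the "$" end marker is an assumption of this example, and the input is assumed to be a single line of letters and digits so that "$" sorts first in the C locale.

```sh
#!/bin/sh
# bwt STRING   (hypothetical helper; toy Burrows-Wheeler Transform)
# STRING should end with a "$" marker that appears nowhere else in it.
bwt() {
    printf '%s\n' "$1" |
    # Emit every rotation of the input, one per line.
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++)
            print substr($0, i) substr($0, 1, i - 1)
    }' |
    # Sort the rotations byte by byte.
    LC_ALL=C sort |
    # The transform is the last character of each sorted rotation.
    awk '{ printf "%s", substr($0, length($0)) } END { print "" }'
}

bwt 'banana$'        # prints annb$aa
```

Notice how the output groups identical letters together ("nn", "aa"). The transform saves no space by itself, but runs like these are what make the data easier to compress afterward.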
Fortunately, the patents on the original LZW compression algorithm all expired between June 2003 and July 2004. While Unisys has patents pending on improvements to the original algorithms, developers now can choose from a number of patent-free algorithms and your choice of one tool or another can be made without concern for patents (at least until new tools based on newly patented algorithms emerge).
Wrapping Up
There is no reason that you can't use all of the compression utilities at your disposal, selecting the best tool for each and every task. Even so, you're likely to use one or two commands routinely and others only when there is an overriding reason to do so. The script included in this column is intended to help you determine when each command is the best for the job.