Unix Tip: Comparing Files with Checksums

By Sandra Henry-Stocker, ITworld.com



Unix systems provide numerous ways to compare files. The most common way to verify that
you have received or downloaded the proper file is to compute a checksum and compare it
against one computed by a reliable source. MD5 is frequently used to compute checksums
because it is extremely unlikely that two different files will have the same MD5
checksum by chance. Similar commands, such as sum and cksum, also compute checksums, but
with less reliability. Let's look at all three and see why.



One of the first things you'll notice if you compare the output of the sum, cksum and md5
commands is the length of each calculated value. The sum command prints two numbers.
The first (31339 in our example) is a 16-bit checksum. This means that you will get one
of 65,536 distinct values (from 0 to 65,535) for any file. The chance that two particular
files which are different have the same checksum is small, roughly 1 in 65,536 if the
values are evenly spread. If you have 65,000 files to compare, however, it is all but
certain that some of them will share a checksum even though they differ. In fact, you'll
probably have a large number of false matches.

# sum /export/home/jdoe/bigfile.gz
31339 165523 /export/home/jdoe/bigfile.gz
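
To put numbers on the 16-bit limit, here is a minimal Python sketch of the "birthday"
calculation. It assumes checksum values are spread evenly over all 65,536 possibilities,
which real files only approximate. The function name collision_probability is purely
illustrative and is not part of any Unix tool.

```python
import math


def collision_probability(num_files: int, bits: int = 16) -> float:
    """Chance that at least two of num_files share a checksum.

    Assumes every file's checksum is an independent, uniformly
    distributed value of the given bit width (an idealization).
    """
    space = 2 ** bits
    if num_files > space:
        return 1.0  # pigeonhole: more files than possible values
    # log P(all checksums distinct) = sum of log(1 - i/space)
    log_distinct = sum(math.log1p(-i / space) for i in range(num_files))
    return 1.0 - math.exp(log_distinct)


if __name__ == "__main__":
    for n in (2, 300, 1000, 65000):
        print(f"{n:>6} files: {collision_probability(n):.6f}")
```

Under that assumption the odds of at least one false match are already about even at
roughly 300 files, long before you reach 65,000.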

One characteristic of the sum command is that the size of the checksum has some
relationship to the length and contents of the file. If one file contains "abc" and
another contains "abd", the checksums differ by only 1. This command is clearly using a
very simple calculation, better suited to basic integrity checks on a file than to
heavy-duty or high-security file checking.

# sum /tmp/ab*
304 1 /tmp/abc
305 1 /tmp/abd
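
The values 304 and 305 are what you get by simply adding up the bytes, if we assume each
file holds its three letters plus a trailing newline (the article does not say so, but it
fits the sizes cksum reports). The sketch below follows the commonly described System V
style of sum; which variant the article's system ran is an assumption, and sysv_sum is a
hypothetical name.

```python
from typing import Tuple


def sysv_sum(data: bytes) -> Tuple[int, int]:
    """Return (checksum, 512-byte blocks) in the style of System V sum."""
    total = sum(data)  # add up every byte value
    # Fold the 32-bit total down to 16 bits, as the System V variant is
    # usually described; for tiny files this changes nothing.
    folded = (total & 0xFFFF) + ((total & 0xFFFFFFFF) >> 16)
    checksum = (folded & 0xFFFF) + (folded >> 16)
    blocks = (len(data) + 511) // 512  # round up to whole blocks
    return checksum, blocks


if __name__ == "__main__":
    # Assumption: each file holds three letters plus a trailing newline.
    for content in (b"abc\n", b"abd\n", b"cba\n"):
        print(content, sysv_sum(content))
```

Because the bytes are only added together, reordering them (as in "cba") leaves the
checksum unchanged, which shows how little this calculation can detect.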

The second number that sum prints is the number of 512-byte blocks in the file.
This helps considerably to ensure that dissimilar files are recognized as dissimilar.
Unless the files you are comparing also occupy the same number of blocks, the fact that
the checksums are the same can be discounted.



The cksum command works similarly. The first number that it prints is a cyclic
redundancy check (CRC) for the file. As you can see from the sample output below, the CRC
is a fairly large number (a 32-bit value). This decreases the chance that two files will
be taken as being identical when they are not. Notice the difference in the checksums of
our two small files, each of which is four bytes long.

# cksum /tmp/ab*
1112837078      4       /tmp/abc
1197460547      4       /tmp/abd
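
The same effect can be seen with any 32-bit CRC. The sketch below uses zlib.crc32 from
Python's standard library. Note that this is a different CRC variant from the one cksum
uses, so its values will not match the output above; it only illustrates how a small
change in the input scatters across the whole result. The helper names are illustrative.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC-32 as computed by zlib, shown as 8 hex digits."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the two CRC-32 values differ."""
    return bin(zlib.crc32(a) ^ zlib.crc32(b)).count("1")


if __name__ == "__main__":
    # Assumption: each file holds three letters plus a trailing newline.
    first, second = b"abc\n", b"abd\n"
    print(crc32_hex(first), crc32_hex(second))
    print("bits that differ:", differing_bits(first, second))
    # Unlike a plain byte sum, a CRC also notices reordered bytes.
    print(crc32_hex(b"cba\n") != crc32_hex(first))
```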

Using cksum against the large file we saw earlier, we see a checksum of similar magnitude
even though the file is dramatically larger.

# cksum /export/home/jdoe/bigfile.gz
3574185895      84747520        /export/home/jdoe/bigfile.gz

The second number in the cksum output is the number of octets (bytes) in the file. This
is a similar concept to the number of blocks, but is considerably finer grained. Two
files occupying the same number of blocks are still likely to include a different number
of octets.
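
The two size figures for bigfile.gz can be checked against each other. This small sketch
assumes sum rounds a partial block up to a whole one, which is consistent with the
numbers shown above; blocks_512 is a hypothetical helper name.

```python
def blocks_512(num_bytes: int) -> int:
    """Number of 512-byte blocks needed to hold num_bytes (rounded up)."""
    return (num_bytes + 511) // 512


if __name__ == "__main__":
    size = 84747520  # octet count printed by cksum for bigfile.gz
    print(blocks_512(size))  # block count to compare with sum's output
    # Every size in this range lands on the same block count.
    same = [n for n in range(size - 600, size + 600)
            if blocks_512(n) == blocks_512(size)]
    print(len(same), "different byte counts share that block count")
```

Any of 512 different byte counts maps to the same block count, which is why the octet
count from cksum is the finer-grained test.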



The md5 command is the most reliable of the three commands and the only one recommended
for serious file checking. If you are sending a gzipped file to a customer and want the
customer to be confident that the file you have sent is both intact and the file you
intended to send, providing them with an md5 checksum is a very good idea. Notice the
length of the checksum below.

# md5 /export/home/jdoe/bigfile.gz
MD5 (/export/home/jdoe/bigfile.gz) = e1e0aec5c73eeb3bcf4cff4d5a44b067

This 32-digit hexadecimal number can take on any of 2 ** 128 possible values. This is a
bigger number than most of us can think about. It's bigger than billions times billions
times billions times billions. I am told it is exactly:

340,282,366,920,938,463,463,374,607,431,768,211,456

Probably so. I don't even want to think about calculating so large a number.
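
For anyone who does want to check, Python's integers are arbitrary precision, so the
figure takes one line to confirm. The variable names here are illustrative only.

```python
DIGEST_BITS = 128  # an MD5 checksum is 128 bits, i.e. 32 hex digits

possible = 2 ** DIGEST_BITS
formatted = f"{possible:,}"  # insert thousands separators

if __name__ == "__main__":
    print(formatted)
    print(len(str(possible)), "decimal digits")
    print(possible == 16 ** 32)  # 32 hex digits give the same count
```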


The chance of two different files having the same md5 checksum by accident is vanishingly
small. Looking at the two small files, we see that the md5 checksums seem to have no
similarity whatsoever.

# md5 /tmp/ab*
MD5 (/tmp/abc) = 0bee89b07a248e27c83fc3d5951213c1
MD5 (/tmp/abd) = 8f0abafc5f8e6686a882c78cac4bcb9f
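
The same comparison can be sketched with Python's hashlib module. It assumes, as the
four-octet sizes reported by cksum suggest, that each file holds three letters followed
by a newline. The helper names are illustrative.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 digest of data as 32 lowercase hex digits."""
    return hashlib.md5(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length digests differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Assumption: each file holds three letters plus a trailing newline.
    first = md5_hex(b"abc\n")
    second = md5_hex(b"abd\n")
    print(first)
    print(second)
    print(differing_hex_digits(first, second), "of 32 hex digits differ")
```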

Of course, to be valuable, checksums have to compute identically on different systems.
Fortunately for us, this should always be the case.
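
Because the result does not depend on the system, a checksum published by the sender can
be checked with any tool that implements MD5. Here is a minimal sketch of that check in
Python; the function names, the chunk size and the file path in the usage example are all
hypothetical.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large files need little memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published: str) -> bool:
    """True if the file's MD5 equals the checksum the sender published."""
    return file_md5(path) == published.strip().lower()
```

A customer who received bigfile.gz could then call something like
matches_published("bigfile.gz", checksum_from_sender) and treat False as a sign that the
file is damaged or is not the one that was intended.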

 
