Unix: Sorting information that isn't quite numeric
Sorting data numerically and alphanumerically isn't generally much of a challenge on Unix systems, but sometimes .5 is smaller than .11.
Sorting numbers is extremely simple on Unix systems; just use the -n option with your sort commands. But when the numeric information that we want to sort is broken up into chunks of uneven size, as it generally is with IP addresses and section numbers within documents, we need something more than just -n. An address like 10.3.45.7, for example, will show up earlier in our sorted output than 10.3.7.11 even though 7 is smaller than 45, just as Section 3.12 will land ahead of Section 3.5 even though 3.5 precedes 3.12 in the document. So, what do we do?
To begin, we need to prepare ourselves for using a more detailed sort command. One of the key options we will need is sort's -t. It works like the -F option of awk, allowing us to specify that our field separator should be a dot and, thus, allowing us to address each field in our addresses separately. Starting our sort command with sort -t . will allow each IP address (IPv4 anyway) to be sorted with respect to each of its four octets (i.e., each byte in the address).
Before we get into the command for sorting IP addresses, however, let's first look at some simpler examples of sort commands. In this first display, we contrast an alphanumeric sort with a numeric sort. The results are quite different, of course, but the numeric sort works just like we'd expect. The resultant list is clearly in numeric order.
$ cat nums      $ sort nums     $ sort -n nums
98.4            1               1
98.15           100.9           7
98.9            11              11
100.9           21              21
67              7               67
21              67              98.15
11              98.15           98.4
7               98.4            98.9
1               98.9            100.9
The problem with IP addresses is that .15 isn't smaller than .9 and so a numeric sorting of fields (unless they're padded with zeroes) isn't going to work. Another option that sort provides, however, is a -k option that allows us to sort on a particular portion of our numeric data. If we want, for example, to sort a series of phone numbers on just the last four digits, we could use a command like this:
$ sort -t - -k3 phones
410-290-1225
949-987-1234
301-945-1264
410-290-6543
640-465-9681
This command tells sort to sort the data on the third field using a hyphen as the field separator. So, the numbers get sorted on just the last 4 digits, often referred to as "the extension". Phone numbers aren't much of a challenge because they all have the same number of digits (ignoring the possibility of there being both local and international numbers in the list). Sorting numerically on the same-length numeric fields does just what we'd expect.
$ sort -n phones
301-945-1264
410-290-1225
410-290-6543
640-465-9681
949-987-1234
IP addresses, on the other hand, can have anywhere from one to three digits in any field so, if we want to see 3 in the resulting list before we see 11 in the same octet, we have to work a little harder. Just specifying that our field separator is a dot doesn't quite cut it. Notice here how 10.3.7.11 follows 10.3.45.7 in the list.
$ sort -t . IPs
10.1.12.98
10.2.99.21
10.3.45.67
10.3.45.7
10.3.7.11
192.168.0.1
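Throwing -n into the mix doesn't help either. With no sort key specified, -n reads only the leading 10.3 of each address as a number, and the remaining ties are broken character by character. So 10.3.7.11 still falls after 10.3.45.7 because 7 sorts after 4.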
$ sort -n -t . IPs
10.1.12.98
10.2.99.21
10.3.45.67
10.3.45.7
10.3.7.11
192.168.0.1
If we want to sort just on the rightmost field in a set of IP addresses, we could do this by instructing sort to use just the 4th field:
$ sort -n -t . -k 4 IPs
192.168.0.1
10.3.45.7
10.3.7.11
10.2.99.21
10.3.45.67
10.1.12.98
Here, we see that the sort command is sorting on the 4th octet properly with 7 showing up in the list before 11 and so on. This demonstrates what we need to do on the addresses as a whole -- looking at the contents of each field separately and ignoring the dots that separate them. To sort IP addresses on all four fields, each of the four fields needs to be specified in the sort command. Either of these commands should do the trick:
$ sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 IPs
$ sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n IPs
The -n is either applied to the entire command (version 1) or to each field individually (version 2).
The "1,1", "2,2", etc. parts of these commands limit each sort key to a single field: "1,1" means the key starts and ends at field 1. The order of the -k options sets the priority, so the first field is compared first, the second next, etc. And, of course, we could use this kind of sort command in a pipe as well:
$ cat IPs | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
10.1.12.98
10.2.99.21
10.3.7.11
10.3.45.7
10.3.45.67
192.168.0.1
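The same per-field approach should carry over to the section numbers mentioned at the start. Here is a minimal sketch, assuming two-level section numbers; the sample values are made up for illustration and the variable names are hypothetical:

```shell
# Hypothetical sample data: two-level section numbers, one per line
sections='3.12
3.5
10.1
3.1
2.20'

# Compare field 1 numerically, then field 2 numerically, using a dot as separator
sorted=$(printf '%s\n' "$sections" | sort -t . -k 1,1n -k 2,2n)
echo "$sorted"
```

With these keys, 3.5 lands ahead of 3.12 because 5 and 12 are compared as whole numbers rather than as decimal fractions. Deeper numbering (3.5.2 and so on) would need an additional -k for each extra level.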
The sort command has some additional options as well. Some that I have found quite useful are:
-r reverse the order of the sort
-b ignore leading blanks (as far as the sort is concerned, it will not remove them)
-c check and report on whether the input is in sort order
-M sort in month order where months are Jan, Feb, etc.
-m merge files that are already sorted (the files themselves are not re-sorted)
-u remove duplicates
Here are some examples of commands using these options:
Sort dates in month order.
$ sort -M dates
Jan 4
Jan 8
Feb 2
Mar 18
Apr 26
May 1
Jun 26
Jul 26
Aug 6
Sep 10
Sep 14
Sep 23
Sep 25
Sep 4
Oct 19
Sort dates in reverse month order.
$ sort -M -r dates
Oct 19
Sep 4
Sep 25
Sep 23
Sep 14
Sep 10
Aug 6
Jul 26
Jun 26
May 1
Apr 26
Mar 18
Feb 2
Jan 8
Jan 4
Merge two files of dates and display in month order.
$ sort -m dates dates2 | sort -M
Jan 1
Jan 4
Jan 8
Feb 2
Mar 18
Mar 18
Apr 26
May 1
Jun 26
Jul 26
Aug 6
Sep 10
Sep 14
Sep 23
Sep 25
Sep 4
Oct 19
Nov 28
Dec 25
Do the same thing, but remove duplicate dates.
$ sort -m -u dates dates2 | sort -M
Jan 1
Jan 4
Jan 8
Feb 2
Mar 18
Apr 26
May 1
Jun 26
Jul 26
Aug 6
Sep 10
Sep 14
Sep 23
Sep 25
Sep 4
Oct 19
Nov 28
Dec 25
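The -c option from the list above has no example yet, so here is a minimal sketch of how it might be used to test whether a list of addresses is already in IP order. It relies on sort -c exiting with status 0 when the input is in order and a non-zero status when it is not. The temporary file and the two sample addresses are made up for illustration:

```shell
# Hypothetical input: two addresses that are NOT in per-octet numeric order
tmp=$(mktemp)
printf '%s\n' 10.3.45.7 10.3.7.11 > "$tmp"

# sort -c only checks the order; it writes a diagnostic to stderr on disorder,
# which is discarded here so that only the exit status matters
if sort -c -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n "$tmp" 2>/dev/null; then
    status="sorted"
else
    status="not sorted"
fi
echo "$status"

rm -f "$tmp"
```

Since 45 is larger than 7 in the third octet, the check reports that this input is not sorted.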
I have run into enough situations where sorting data by IP address has saved me a lot of time and effort that I have turned my sort-by-IP command into an alias that I can now use anytime I need it.
$ alias byIP='sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4'
$ getNodes | byIP
10.1.12.98
10.2.99.21
10.3.7.11
10.3.45.7
10.3.45.67
192.168.0.1
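One caveat worth noting: bash does not expand aliases in non-interactive scripts by default, so inside a script a shell function may be the more dependable form. A minimal sketch, assuming the same sort keys as the alias; the sample addresses simply stand in for real input:

```shell
# Function form of the byIP alias; "$@" passes along any file names or extra options
byIP() {
    sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n "$@"
}

# Sample addresses standing in for real input
sorted=$(printf '%s\n' 10.3.45.7 192.168.0.1 10.3.7.11 10.1.12.98 | byIP)
echo "$sorted"
```

The function reads standard input when called without arguments, so it drops into a pipe the same way the alias does, and it can also be handed a file name directly, as in byIP IPs.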
Read more of Sandra Henry-Stocker's Unix as a Second Language blog and follow the latest IT news at ITworld, Twitter and Facebook.