Unix: Chopping up CSV files

CSV (comma-separated values) files represent a common standard for transmitting data between applications. Like the more generic "flat files", each line in a CSV file represents a record. Exporting or importing data from and to applications such as Excel in CSV format is standard practice for a lot of applications. Hand a CSV file to a tool such as awk, however, and you may run into some problems. Let's look at the problems and at a handy Perl module that takes the stress out of working with CSV files on Unix systems.

To begin, the format of a CSV file is not quite as easy as when,where,what. If it were, awk would have no trouble selecting any column you might want to see with a command such as this:

$ awk -F, '{print $3}'

Instead, CSV files can contain fields that contain blanks and fields that contain commas. When a field in a CSV file contains commas, for example, the entire field needs to be enclosed in quotes so that applications such as Excel can determine the extent of the field.

1997,Infonet,"El Segundo, CA"
1998,InCap,"Corte Madera, CA"
1998,WPI,"San Francisco, CA"
1999,E*Trade,"San Francisco, CA"
2001,TCS,"Annapolis, MD"

Other fields may be quoted, but don't need to be. Were we to try to pull out the third field in these records using only the command as a separator, we would get this:

"el
"Corte
"San
"Annapolis

Clearly this isn't what we'd want. Similarly, were we to use double quotes, we'd have a different sort of problem. The location field would be $2, but the first field would be 1997,Infonet,. And if we were lucky enough to find that each field was quoted, we'd still have to refer to the fields of interest in an odd manner, calling them $2, $4 and $6. $3, for example would be only a comma for each record. And, note, this only works if your CSV quotes every field religiously. Can you count on that?

"1997","Infonet","El Segundo, CA"
"1998","InCap","Corte Madera, CA"
"1998","WPI","San Francisco, CA"
"1999","E*Trade","San Francisco, CA"
"2001","TCS","Annapolis, MD"

So, what do you do? You can specify " as the separator and live with the $2, $4, $6 way of referring to your data fields or you can use the Text::CSV - comma-separated values manipulation routines from CPAN to separate the fields in your CSV file properly. Here's a very simple script that shows how it works.

#!/usr/bin/perl

use Text::CSV;

my $file = 'jobs.csv';
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;

while (<CSV>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        while (defined($item = shift @columns)){
            print $item, "\n";
        }
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
    print "--\n";
}
close CSV;

What is this script doing? First, it employs the Text::CSV module with the use Text::CSV line. This module isn't likely to be installed on your system by default, so you will likely have to download the Text-CSV-0.5.tar.gz file and then run through these steps to set it up:

perl Makefile.PL
make
make test
make install

Once this is done, you can start using the module. In the script above, we read a file named jobs.csv. In a while loop, we then break out each column using the my @columns = $csv->fields() line. Each field in the input line then becomes an element in the @columns array. We then list each field in the array on a separate line to demonstrate that we've broken each field out from the file properly. If we ran into an error, the else clause will display it to us. At the end of each line, we then print -- to show that we've completed the parsing of each line. You, of course, have the option to do whatever you want within the while loop. In the script below, we're doing something similar, except we're converting the CSV file into a colon-separated file. This requires, of course, that the colon (or any other field that you choose to be your delimiter) doesn't appear within any of the fields. We are also taking special precautions to be sure that we don't append the separator to the end of each line. We don't want 1997:Infonet:El Segundo, CA to show up as 1997:Infonet:El Segundo, CA:. To do this, we're counting the fields by looking at the size of the array and using a separate print statement for the last field. Another way to do this would be to go ahead and add the : each time to the end of a string that you build up within each loop and then remove it before you display it.

#!/usr/bin/perl

use Text::CSV;

my $file = 'jobs.csv';
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;

while (<CSV>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        my $arraySize=@columns;
        $ix=0; $x=$arraySize - 1;
        while ($ix lt $x) {
            print "$columns[$ix]:";
            $ix++;
        }
        print "$columns[$ix]\n";
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;

As written this script will display the output on your screen. To instead, write it to a file, you would just do this instead:

#!/usr/bin/perl

use Text::CSV;

my $file = 'jobs.csv';
my $newfile = 'newjobs.csv';			# create this file
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;
open (NEW, ">", $newfile) or die $!;

while (<CSV>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        my $arraySize=@columns;
        $ix=0; $x=$arraySize - 1;
        while ($ix lt $x) {
            print NEW "$columns[$ix]:";		# write to new file
            $ix++;
        }
        print NEW "$columns[$ix]\n";		# write to new file
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;

The new file would then look like this:

1997:Infonet:El Segundo, CA
1998:InCap:Corte Madera, CA
1998:WPI:San Francisco, CA
1999:E*Trade:San Francisco, CA
2001:TCS:Annapolis, MD

You could manipulate this sort of file easily with awk and also feed it to Excel by specifying the delimiter is a colon. And, note, that you can use another character -- maybe a | or a # if colons don't work for you. Just replace the : characters in the script to the character of your choice. Text::CSV takes the pain out of parsing complex CSV files.

Read more of Sandra Henry-Stocker's Unix as a Second Language blog and follow the latest IT news at ITworld, Twitter and Facebook.

Top 10 Hot Internet of Things Startups
Join the discussion
Be the first to comment on this article. Our Commenting Policies