Unix: Chopping up CSV files

CSV (comma-separated values) files represent a common standard for transmitting data between applications. Like the more generic "flat files", each line in a CSV file represents a record. Exporting or importing data from and to applications such as Excel in CSV format is standard practice for a lot of applications. Hand a CSV file to a tool such as awk, however, and you may run into some problems. Let's look at the problems and at a handy Perl module that takes the stress out of working with CSV files on Unix systems.

By  

To begin, the format of a CSV file is not quite as easy as when,where,what. If it were, awk would have no trouble selecting any column you might want to see with a command such as this:

$ awk -F, '{print $3}'

Instead, CSV files can contain fields that contain blanks and fields that contain commas. When a field in a CSV file contains commas, for example, the entire field needs to be enclosed in quotes so that applications such as Excel can determine the extent of the field.

1997,Infonet,"El Segundo, CA"
1998,InCap,"Corte Madera, CA"
1998,WPI,"San Francisco, CA"
1999,E*Trade,"San Francisco, CA"
2001,TCS,"Annapolis, MD"

Other fields may be quoted, but don't need to be.

Were we to try to pull out the third field in these records using only the command as a separator, we would get this:

"el
"Corte
"San
"Annapolis

Clearly this isn't what we'd want.

Similarly, were we to use double quotes, we'd have a different sort of problem. The location field would be $2, but the first field would be 1997,Infonet,. And if we were lucky enough to find that each field was quoted, we'd still have to refer to the fields of interest in an odd manner, calling them $2, $4 and $6. $3, for example would be only a comma for each record. And, note, this only works if your CSV quotes every field religiously. Can you count on that?

"1997","Infonet","El Segundo, CA"
"1998","InCap","Corte Madera, CA"
"1998","WPI","San Francisco, CA"
"1999","E*Trade","San Francisco, CA"
"2001","TCS","Annapolis, MD"

So, what do you do? You can specify " as the separator and live with the $2, $4, $6 way of referring to your data fields or you can use the Text::CSV - comma-separated values manipulation routines from CPAN to separate the fields in your CSV file properly. Here's a very simple script that shows how it works.

#!/usr/bin/perl

use Text::CSV;

my $file = 'jobs.csv';
my $csv = Text::CSV->new();

open (CSV, "<", $file) or die $!;

while () {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        while (defined($item = shift @columns)){
            print $item, "\n";
        }
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
    print "--\n";
}
close CSV;

What is this script doing? First, it employs the Text::CSV module with the use Text::CSV line. This module isn't likely to be installed on your system by default, so you will likely have to download the Text-CSV-0.5.tar.gz file and then run through these steps to set it up:

perl Makefile.PL
make
make test
make install

Once this is done, you can start using the module.

In the script above, we read a file named jobs.csv. In a while loop, we then break out each column using the my @columns = $csv->fields() line.

Photo Credit: 

flickr / TripleScoop

Join us:
Facebook

Twitter

Pinterest

Tumblr

LinkedIn

Google+

Answers - Powered by ITworld

Join us:
Facebook

Twitter

Pinterest

Tumblr

LinkedIn

Google+

Ask a Question