8 cool tools for data analysis, visualization and presentation
Reporters wrangle all sorts of data, from analyzing property tax valuations to mapping fatal accidents -- and, here at Computerworld, for stories about IT salaries and H-1B visas. In fact, tools used by data-crunching journalists are generally useful for a wide range of other, non-journalistic tasks -- and that includes software that's been specifically designed for newsroom use. And, given the generally thrifty culture of your average newsroom, these tools often have the added appeal of little or no cost.
I came back from last year's National Institute for Computer-Assisted Reporting (NICAR) conference with 22 free tools for data visualization and analysis -- most of which are still popular and worth a look. At this year's conference, I learned about other free (or at least inexpensive) tools for data analysis and presentation.
Want to see all the tools from last year and 2012?
For quick reference, check out our chart listing all 30 free data visualization and analysis tools.
Like that previous group of 22 tools, these range from easy enough for a beginner (i.e., anyone who can do rudimentary spreadsheet data entry) to expert (requiring hands-on coding). Here are eight of the best:
What it does: This utility suite from GitHub has a host of Unix-like command-line tools for importing, analyzing and reformatting comma-separated data files.
What's cool: Sure, you could pull your file into Excel to examine it, but CSVKit makes it quick and easy to preview, slice and summarize.
For example, you can see all your column headers in a list -- which is handy for super-wide, many-column files -- and then just pull data from a few of those columns. In addition to inputting CSV files, it can import several fixed-width file formats -- for example, there are libraries available for the specific fixed-width formats used by the Census Bureau and Federal Elections Commission.
Two simple commands will generate a data structure that can, in turn, be used by several SQL database formats (Mr. Data Converter handles only MySQL). The SQL code will create a table, inferring the proper data type for each field as well as the insert commands for adding data to the table.
The Unix-like interface will be familiar to anyone who has worked on a *nix system, and makes it easy to save multiple frequently used commands in a batch file.
Drawbacks: Working on a command line means learning new text commands (not to mention the likely risk of typing errors), which might not be worthwhile unless you work with CSV files fairly often. Also, be advised that this tool suite is written in Python, so Windows users will need that installed on their system as well.
Skill level: Expert
Runs on: Any Windows, Mac or Linux system with Python installed.
Related tools: Google Refine is a desktop application that can do some rudimentary file analysis as well as its core task of data cleaning; and The R Project for Statistical Computing can do more powerful statistical analysis on CSV and other files.
What's cool: In addition to sortable tables, results can be searched in real time (results are narrowed further with each search-entry keystroke).
Drawbacks: Search capability is fairly basic and cannot be narrowed by column or by using wildcard or Boolean searches.
Skill level: Expert
Learn more: Numerous examples on the DataTables site show many ways to use this plug-in.
What it does: This alpha project from the Knight Digital Media Center at UC Berkeley turns a Google Docs spreadsheet into an interactive, sortable database that can be posted on the Web.
What's cool: In addition to text searching, you can include numerical range-based sliders. Usage is free. End users can easily create their own databases from spreadsheets without writing code.
Drawbacks: My test application ran into some intermittent problems; for example, it wouldn't display my data list when using the "show all records" button. This is an alpha project, and should be treated as such.
In addition, the current iteration limits spreadsheets to 10 columns and a single sheet. One column must have numbers, so this won't work for text-only information. The search widget is currently limited to a few specific choices of fields to search, although this might increase as the project matures. (A paid service like Caspio would offer more customization.) The nine-step wizard might get cumbersome after frequent use.
Skill level: Advanced beginner.
Runs on: Current Web browsers
What's cool: With Highcharts, users can mouse over items for more details; they can also click on items in the chart legend to turn them on and off. There are many different chart types available, from basic line, bar, column and area charts to zoomable time series; each comes with six stylesheet options. Little customization is needed to get a sleek-looking chart -- and charts will display on iOS and Android devices as well as on desktop browsers.
Drawbacks: Highcharts, like Google Maps, does have a distinctive look, so you may want to customize the Highcharts stylesheets so your visualizations don't look like numerous other Highcharts on the Web. While charts displayed fine for me on an Android phone, they weren't interactive (they were on an iPad).
Skill level: Intermediate to Expert.
Runs on: Web browsers
What it does: How often do you have data in one format -- while your application needs it in another? New York Times interactive graphics editor Shan Carter ran into this situation often enough that he coded a tool that converts comma- or tab-delimited data into nine different formats. It's available as either a service on the Web or an open source tool.
What's cool: Mr. Data Converter can generate XML, JSON, ASP/VBScript or basic HTML table formatting as well as arrays in PHP, Python (as a dictionary) and Ruby. It will even generate MySQL code to create a table (guessing at field formats based on the data) and insert your data. If your data is in an Excel spreadsheet, you don't need to save it as a CSV or TSV; you can just copy and paste it into the tool.
Drawbacks: Only CSV or TSV formats can be input, as well as copying and pasting in data from Excel.
Skill level: Beginner
Learn more: You can follow Mr. Data Converter on Twitter at @mrdataconverter.
Related tools: Data Wrangler is a Web-based tool that reformats data to your specifications.
What it does: Panda is less about analyzing or presenting data than finding it amidst the pile of standalone spreadsheets scattered around an organization. It was specifically designed for newsrooms, but could be used by any organization where individuals collect information on their desktops that would be worth sharing. Billed as a "newsroom appliance," users can upload CSV or Excel files to Panda and then search across all available data sets or a within a single file.
What's cool: Panda makes it simple to give others access to information that's been sitting on individuals' hard drives in different stand-alone spreadsheets. Even non-technical users can easily upload and search data. Search is extremely fast, using ApacheSolr.
Drawbacks: Queries are basic -- you can't specify a particular column/field to search, so a search for "Washington" would bring back items containing both the place and a person's name. The required hosting platform is quite specific, requiring Ubuntu 11.1. (Panda's developers have created an Amazon Community Image with the required server setup for hosting on Amazon Web Services EC2.)
Skill level: Beginner (Advanced Beginner for administration)
Runs on: Must be hosted on Amazon EC2 or a server running Ubuntu 11.10. Clients can use any Web browser.
Learn more: Panda documentation, still in the works, gives basics on setup, configuration and use. Nieman Journalism Lab has some background on the project, which was funded by a $150,000 Knight News Challenge grant.
What it does: This free plugin from Microsoft allows Excel 2010 to handle massively large data sets much more efficiently than the basic version of Excel does. It also lets Excel act like a relational database by adding the capacity to truly join columns in different tables instead of relying on Excel's somewhat cumbersome VLOOKUP command. PowerPivot includes its own formula language, Data Analysis Expressions (DAX), which has a similar syntax to Excel's conventional formulas.
What's cool: PowerPivot can handle millions of records -- data sets that would usually grind PowerPivot-less Excel to a halt. And by joining tables, you can make more "intelligent" pivot tables and charts to explore and visualize large data sets with Excel's point-and-click interface.
Drawbacks: This is limited to Excel 2010 on Windows systems. Also, SQL jocks might prefer using a true relational database for multi-table data in order to build complex data queries.
Skill level: Intermediate
Runs on: Excel 2010 on Windows only.
Related tools: Zoho Reports can take data from various file formats and turn it into charts, tables and pivot tables.
What it does: This general-purpose visualization platform allows creation of interactive dashboards with multiple, related visualizations -- for example, a bar chart, scatter plot and map. The open-source project was created by the University of Massachusetts at Lowell in partnership with a consortium of government agencies and is still in beta.
What's cool: The visualizations are slick and highly interactive; clicking an area in one visualization also affects others in the dashboard. The platform includes powerful statistical analysis capabilities. Users can create their own visualizations on a Weave-based Web system, or save and alter the tools and appearances of visualizations that have been publicly shared by others.
Drawbacks: Requires Flash for end-user viewing. It's currently somewhat difficult to install, although a one-click install is scheduled for this summer. And because it's so powerful, some users say that implementations must consider how to winnow down functionality so as not to overwhelm end users.
Skill level: Intermediate for those just creating visualizations; Expert for those implementing a Weave system.
Learn more: The Weave site includes demos, videos and a user guide. For more examples of visualizations that can be built using a Weave platform, see one planner's MetroBoston DataCommon gallery. In addition, I wrote more detailed Computerworld coverage of Weave following a presentation at Northeastern University.
Related tools: Tableau Public is a robust general-purpose visualization platform.
Also see: 22 free tools for data visualization and analysis (April 20, 2011) and Chart and image gallery: 30 free tools for data visualization and analysis.