July 15, 2013, 3:30 PM
R is hot. Whether measured by more than 4,400 add-on packages, the 18,000+ members of LinkedIn's R group or the close to 80 R Meetup groups currently in existence, there can be little doubt that interest in the R statistics language, especially for data analysis, is soaring.
Why R? It's free, open source, powerful and highly extensible. "You have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants," Hal Varian, Google's chief economist, told The New York Times back in 2009.
Because it's a programmable environment that uses command-line scripting, you can store a series of complex data-analysis steps in R. That lets you re-use your analysis work on similar data more easily than if you were using a point-and-click interface, notes Hadley Wickham, author of several popular R packages and chief scientist with RStudio.
That also makes it easier for others to validate research results and check your work for errors -- an issue that cropped up in the news recently after an Excel coding error was among several flaws found in an influential economics analysis report known as Reinhart/Rogoff.
The error itself wasn't a surprise, blogs Christopher Gandrud, who earned a doctorate in quantitative research methodology from the London School of Economics. "Despite our best efforts we always will" make errors, he notes. "The problem is that we often use tools and practices that make it difficult to find and correct our mistakes."
Sure, you can easily examine complex formulas on a spreadsheet. But it's not nearly as easy to run multiple data sets through spreadsheet formulas to check results as it is to put several data sets through a script, he explains.
Indeed, the mantra of "Make sure your work is reproducible!" is a common theme among R enthusiasts.
Why not R? Well, R can appear daunting at first. That's often because R syntax is different from that of many other languages, not necessarily because it's any more difficult than others.
"I have written software professionally in perhaps a dozen programming languages, and the hardest language for me to learn has been R," writes consultant John D. Cook in a Web post about R programming for those coming from other languages. "The language is actually fairly simple, but it is unconventional."
And so, this guide. Our aim here isn't R mastery, but giving you a path to start using R for basic data work: Extracting key statistics out of a data set, exploring a data set with basic graphics and reshaping data to make it easier to analyze.
Your first step
Installing R is actually all you need to get started. However, I'd suggest also installing the free R integrated development environment (IDE) RStudio. It's got useful features you'd expect from a coding platform, such as syntax highlighting and code auto-completion suggestions when you hit the tab key. I also like its four-pane workspace, which better manages multiple R windows for typing commands, storing scripts, viewing command histories, viewing visualizations and more.
Although you don't need the free RStudio IDE to get started, it makes working with R much easier.
The top left window is where you'll probably do most of your work. That's the R code editor allowing you to create a file with multiple lines of R code -- or open an existing file -- and then run the entire file or portions of it.
Bottom left is the interactive console where you can type in R statements one line at a time. Any lines of code that are run from the editor window also appear in the console.
The top right window shows your workspace, which includes a list of objects currently in memory. There's also a history tab with a list of your prior commands; what's handy there is that you can select one, some or all of those lines of code and one-click to send them either to the console or to whatever file is active in your code editor.
The window at bottom right shows a plot if you've created a data visualization with your R code. There's a history of previous plots and an option to export a plot to an image file or PDF. This window also shows external packages (R extensions) that are available on your system, files in your working directory and help files when called from the console.
Learning the shortcuts
Wickham, the RStudio chief scientist, says these are the three most important keyboard shortcuts in RStudio:
Tab is a generic auto-complete function. If you start typing in the console or editor and hit the tab key, RStudio will suggest functions or file names; simply select the one you want and hit either tab or enter to accept it.
Control + the up arrow (command + up arrow on a Mac) is a similar auto-complete tool. Start typing and hit that key combination, and it shows you a list of every command you've typed that starts with those characters. Select the one you want and hit return. This works only in the interactive console, not in the code editor window.
Control + enter (command + enter on a Mac) takes the current line of code in the editor, sends it to the console and executes it. If you select multiple lines of code in the editor and then hit ctrl/cmd + enter, all of them will run.
For more about RStudio features, including a full list of keyboard shortcuts, head to the online documentation.
Setting your working directory
Change your working directory with the setwd() function, which takes the path to the new directory as a quoted string.
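Here's a minimal sketch of how that might look. The folder name mydirectory is a placeholder of my own, and the sketch creates it inside R's temporary directory so that it runs on any system; in practice you'd pass the path to your own project folder.

```r
# Hypothetical folder, created under R's temp directory for illustration only
project_dir <- file.path(tempdir(), "mydirectory")
dir.create(project_dir, showWarnings = FALSE)

old_dir <- setwd(project_dir)  # setwd() invisibly returns the previous directory
getwd()                        # confirm where you are now

setwd(old_dir)                 # switch back when you're finished
```

Saving the value that setwd() returns is a handy way to get back to where you started.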
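Note that the path separators have to be forward slashes (or doubled backslashes), even if you're on a Windows system, because R treats a single backslash inside a quoted string as an escape character. For Windows, the path you hand to setwd() would typically start with a drive letter, such as "C:/".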
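As a rough sketch, with a made-up Windows path (the drive, user name and folder are placeholders, not real locations), the two spellings that R accepts look like this:

```r
# Hypothetical Windows folder, written two equivalent ways
path_forward <- "C:/Users/yourname/Documents/RProjects"
path_escaped <- "C:\\Users\\yourname\\Documents\\RProjects"

# Swapping each backslash for a forward slash gives the same string
identical(gsub("\\", "/", path_escaped, fixed = TRUE), path_forward)  # TRUE

# On a Windows machine where that folder exists, you would then run:
# setwd(path_forward)
```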
If you are using RStudio, you can also use the menu to change your working directory under Session > Set Working Directory.
Installing and using packages
Chances are if you're going to be doing, well, pretty much anything in R, you're going to want to take advantage of some of the thousands of add-on packages available for R at CRAN, the Comprehensive R Archive Network. The command for installing a package is install.packages("thepackagename"), with the package's name in quotation marks.
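If you'd rather not re-download a package you already have, one possible approach is to wrap install.packages() in a small check. The helper name install_if_missing() is my own invention, not part of R, and ggplot2 appears only as an example package name.

```r
# install_if_missing() is a hypothetical helper, not a built-in R function
install_if_missing <- function(pkg) {
  # requireNamespace() returns FALSE when the package isn't installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # download and install from CRAN
  }
  invisible(pkg)
}

install_if_missing("stats")      # ships with R, so nothing is downloaded
# install_if_missing("ggplot2")  # example of a CRAN add-on package
```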
If you don't want to type the command, in RStudio there's a Packages tab in the lower right window; click that and you'll see a button to "Install Packages." (There's also a menu command; the location varies depending on your operating system.)
To see which packages are already installed on your system, type installed.packages() at the console.
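The full listing can be long, so here's a short sketch of a few ways to work with the result. It uses the stats package as the example only because it ships with every R installation.

```r
pkgs <- installed.packages()   # a matrix with one row per installed package

head(rownames(pkgs))           # the first few package names
pkgs["stats", "Version"]       # version number of one particular package
"stats" %in% rownames(pkgs)    # is a given package installed?
```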
Or, in RStudio, go to the Packages tab in the lower right window.
To use a package in your work once it's installed, load it with library("thepackagename").
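For instance, this sketch loads tools, a package that comes with R but isn't attached when you start a session. I picked it only so the example runs without installing anything.

```r
library(tools)                  # attach the package for this session

"package:tools" %in% search()   # search() lists what's currently attached
```

If the package isn't installed, library() stops with an error, which is your cue to install it first.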
If you'd like to make sure your packages stay up to date, you can run update.packages() and get the latest versions for all your installed packages.
If you no longer need or want a package on your system, use the function remove.packages("thepackagename").
If you want to find out more about a function, you can type a question mark followed by the function name, as in ?functionName. It's one of the rare times parentheses are not required in R.
This is a shortcut to the help function, which does use parentheses: help(functionName).
I'm not sure why you'd want to use this, though, as opposed to the shorter ?functionName command.
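To make that concrete, here's a quick sketch using median(), a function I chose only because it's available in every R session:

```r
?median          # opens the help page for median()
help(median)     # the same page via the full help() function
help("median")   # a quoted name works as well
```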
If you already know what a function does and just want to see formats for using it properly, you can type example(functionName) and you'll get a list with examples of the function being used, if there's one available. The arguments (args) function, args(functionName), just displays a list of a function's arguments.
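Sticking with median() as an arbitrary example of my own choosing, the two calls might look like this:

```r
example(median)          # runs the examples from median()'s help page
args(median)             # prints: function (x, na.rm = FALSE, ...)

names(formals(median))   # just the argument names, as a character vector
```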
If you want to search through R's help documentation for a specific term, you can use:
help.search("your search term")
That also has a shortcut:
??("my search term")
No parentheses are needed if the search term is a single word without spaces.
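As a small sketch of the difference, with search terms that are purely my own examples:

```r
help.search("linear model")   # search the installed help pages for a phrase
??"linear model"              # shortcut form; the quotes keep the phrase together
??regression                  # a single word needs neither quotes nor parentheses
```

The results cover only the packages installed on your system, so what you see will vary from one machine to another.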