One of the big lessons that we've learned in the last ten years is that even little pieces of once disparate data, gathered in a single location, can yield startling truths. With Facebook, that lesson crystallized with the introduction of its Graph Search feature in 2013, when the online world realized – to its horror – that the feature could be used to uncover embarrassing details that were hiding in plain sight, or to mine users' social graphs for laughs ("Mothers of Jews who like Bacon" was an oft-cited Facebook Graph search). Graph Search was dismissed as "humorless, creepy and doomed to disappoint." But, in one form or another, it's with us still.
These days, it's not just our personal lives that are laid open to the ravages of search spiders and Google dorking. With the growing popularity of shared code repositories like GitHub, all the world's source code has suddenly become fodder for the clever and curious. Call it "GitHub dorking."
Consider the recent article in these pages noting the prevalence of the string "ugly hack" in GitHub repositories. The article, by ITworld's Phil Johnson, notes that references to the term "ugly hack" were far more common in source code repositories written in the C programming language than any other programming language – and by a large margin. The message: C code is the king when it comes to "down and dirty code fixes."
An unspoken corollary is that GitHub beats a handy and easily followed path to examples of messy and possibly vulnerable code in a wide range of applications – many of them obscure and without value, some of them not.
As with the Facebook Graph Search example, a couple of things had to happen before we arrived at the point of stories being written about funky and revealing GitHub searches. First: GitHub, a hosted source code repository, had to get huge. In the seven years since it launched, it has certainly done that, growing from 6,000 users and 2,500 repositories in 2008 to 9.4 million people collaborating across 22.5 million repositories today.
The other key component was the addition of GitHub's internal search feature in 2013, which made it simple to run queries across public and private GitHub repositories that a given user had access to. Almost immediately, astute observers noted that the feature could be used to reveal private encryption keys and login credentials buried in code checked in to GitHub.
Despite those warnings, there's ample evidence that the practice continues. In March, for example, ride sharing firm Uber was found to have accidentally uploaded database credentials to a GitHub repository. As noted by the publication Ars Technica, searches of GitHub repositories for credentials used for secure FTP reveal thousands of usernames and passwords that could be used to compromise public facing assets.
The ease with which developers can share and re-use code on GitHub is part of the problem, said Bill Ledingham, chief technology officer at Black Duck Software. The company monitors some 300,000 open source software projects that use GitHub, downloading the source code and analyzing it for vulnerabilities. Ledingham said leaked user credentials are inadvertent errors caused by developers too accustomed to the ease with which code can be borrowed, modified and resubmitted to GitHub.
"Developers in some cases are just taking the easiest path forward," he said. "They're checking in code or re-using it and not looking at some of these issues related to security."
There are other security concerns that companies need to be aware of, as well. Leaks of intellectual property may be a concern in organizations in which developers are mixing shared code from GitHub with proprietary code. "Developers are putting their code out on these locations and they may include some of their company's (intellectual property)," Ledingham said.
And, of course, GitHub is just a web-based application and, thus, vulnerable to the same kinds of problems as other, similar online applications: cross-site scripting and information leak vulnerabilities in GitHub.com, the GitHub API or associated tools and services like Gist.
None of these problems are limited to GitHub. Use of hosted repositories like SourceForge, Launchpad and BitBucket carries many of the same risks. But GitHub's popularity puts it in a unique position, and puts enterprises in a tough spot.
"In my opinion there are no exclusive issues to [GitHub], as there are other similar services like GitLab or BitBucket," wrote the security researcher known as Joernchen of Phenoelit, a top contributor to GitHub's bug bounty program. "It's just the case that GitHub is the most popular service, and therefore from the attacker's perspective gives the best results."
Many enterprises and software development shops didn't so much opt for GitHub as acquiesce to a groundswell of adoption and enthusiasm among rank-and-file developers, Ledingham said. Confronted with clandestine or unsanctioned use, these organizations have become reluctant converts, recognizing the power of the platform and the productivity gains that come from its use.
Those firms don't fully appreciate the nuances of GitHub and the ways in which public repositories managed by employees might work to undermine corporate security. GitRob is one example. The command line tool, developed by Michael Henriksen and introduced in January, can be used to analyze all the public GitHub repositories associated with a particular organization. GitRob works by compiling the public repositories belonging to known employees of that firm, then flagging filenames in each repository that match patterns of known sensitive files. Henriksen, an employee of the firm SoundCloud, developed it to help his firm spot sensitive files that might have accidentally been uploaded to public repositories.
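To make the filename-matching idea concrete, here is a minimal Python sketch of that kind of check. It is not GitRob's code and does not use its signature list: the patterns, labels and the function name `flag_sensitive_paths` are all assumptions chosen for illustration.

```python
import re
from pathlib import PurePosixPath

# Hypothetical patterns for illustration only; a real tool ships a much
# longer, curated list. Each entry pairs a filename regex with a label.
SENSITIVE_PATTERNS = [
    (re.compile(r"^id_(rsa|dsa|ecdsa|ed25519)$"), "SSH private key"),
    (re.compile(r"\.(pem|key|p12|pfx)$"), "key or certificate bundle"),
    (re.compile(r"^\.(bash|zsh)_history$"), "shell history"),
    (re.compile(r"^\.htpasswd$"), "Apache password file"),
    (re.compile(r"^(database|secrets)\.yml$"), "application credentials"),
]


def flag_sensitive_paths(paths):
    """Return (path, label) pairs for paths whose filename looks sensitive."""
    findings = []
    for path in paths:
        # Match on the final path component only, not the directories.
        name = PurePosixPath(path).name
        for pattern, label in SENSITIVE_PATTERNS:
            if pattern.search(name):
                findings.append((path, label))
                break  # one label per file is enough for a report
    return findings


if __name__ == "__main__":
    repo_files = ["src/main.c", "home/.ssh/id_rsa", "config/database.yml"]
    for path, label in flag_sensitive_paths(repo_files):
        print(f"{path}: possible {label}")
```

A check like this only looks at names, so it is cheap to run across many repositories, but it will miss secrets pasted into ordinary source files.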
The high level advice for most developers and development organizations is not to do stupid things, said Joernchen. "If someone checks the whole $HOME directory in a Git (repository), that's fine. But if that person is smart enough to publish this (repository) on GitHub, I've got not much compassion," he wrote.
Still, Joernchen acknowledges that many kinds of information leaks are subtle and easily overlooked. He noted research from 2012 on Ruby on Rails session cookie secrets that were being checked in to GitHub by developers unaware that the 64 byte session key was confidential and could be used to hijack Ruby on Rails applications.
The fix for some of these problems is straightforward enough. Joernchen and others recommend using GitHub's search features as well as tools like GitRob to interrogate your company's own code for potential data leaks.
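Interrogating your own code can also mean looking inside files, not just at their names. The Python sketch below scans text line by line for a few tell-tale strings, including a Rails-style `secret_token` assignment of the kind Joernchen described. The regular expressions and the name `scan_text` are illustrative assumptions, not rules taken from GitHub's search or from any particular tool, and they will produce both false positives and misses.

```python
import re

# Hypothetical content patterns for illustration only.
CONTENT_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "Rails secret token": re.compile(r"secret_token\s*=\s*['\"][0-9a-f]{30,}['\"]"),
    "hard-coded password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]{4,}['\"]"),
}


def scan_text(text):
    """Return (line_number, label) pairs for lines that look like leaked secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CONTENT_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


if __name__ == "__main__":
    sample = 'db_password = "hunter2-prod"\nprint("hello")\n'
    for lineno, label in scan_text(sample):
        print(f"line {lineno}: possible {label}")
```

Run against a checkout before it is pushed, a scan like this catches the obvious cases; it is a complement to developer education, not a substitute for it.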
Data leak prevention products can identify and block the movement of proprietary code. Concerted education for developers about best practices and proper security hygiene when downloading and uploading code to shared and searchable source repositories can help prevent head slapping mistakes like the leak of database administrator credentials and private keys.
Companies that are very concerned about such leaks should simply opt for a self-hosted repository like GitHub Enterprise or GitLab, rather than relying on the publicly hosted platform, Joernchen advised.
At the end of the day, GitHub and its tens of millions of open source and proprietary projects will continue to be a powerful draw, "ugly hacks" or no. But with the growth comes a commensurate need to understand and manage the risk, he said.
"First it's about gaining visibility and then it's about what policies to put in place to manage behavior," Ledingham said.