If you’ve been programming for any length of time, no doubt you’ve developed your own coding style. Every developer has preferences not only for things like spacing (e.g., spaces vs. tabs), naming styles (e.g., CamelCase vs. snake_case) and commenting, but also for how he or she implements certain types of functionality. New research now shows that a developer’s coding style is a type of fingerprint, which can be used to identify who wrote an anonymous piece of code with a high degree of accuracy.
Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a “code stylometry” technique, which uses natural language processing and machine learning to determine the authors of source code based on coding style. Their findings, which were recently published in the paper “De-anonymizing Programmers via Code Stylometry,” could be applicable to a wide range of situations where determining the true author of a piece of code is important. For example, it could be used to help identify the author of malicious source code and to help resolve plagiarism and copyright disputes.
The authors based their code stylometry on traditional style features, such as layout (e.g., whitespace) and lexical attributes (e.g., counts of various types of tokens). Their real innovation, though, was in adding features derived from abstract syntax trees, a standard compiler data structure that is similar to a parse tree for a sentence and is built from language-specific syntax and keywords. These trees yield a syntactic feature set which, the authors wrote, “was created to capture properties of coding style that are completely independent from writing style.” The upshot is that even if variable names, comments or spacing are changed, say in an effort to obfuscate, but the functionality is unaltered, the syntactic feature set won’t change.
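To make the idea of a syntactic feature set concrete, here is a minimal sketch. It is an illustration only, not the researchers' feature set: the study worked on C++, while this uses Python's standard `ast` module as a stand-in, and the function name `syntactic_features` and the three snippets are hypothetical. It counts node types and parent-child node pairs in the syntax tree, then compares a snippet with a disguised copy (renamed identifiers, no comment, different spacing) and with a genuinely restructured version.

```python
import ast
from collections import Counter


def syntactic_features(source: str) -> Counter:
    """Count AST node types and parent>child node-type pairs (illustrative only)."""
    features = Counter()
    for node in ast.walk(ast.parse(source)):
        parent = type(node).__name__
        features[parent] += 1
        for child in ast.iter_child_nodes(node):
            features[f"{parent}>{type(child).__name__}"] += 1
    return features


# Hypothetical snippet written in one programmer's style.
original = '''
def total(values):
    # sum the positive entries
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
'''

# Same structure, but with renamed identifiers, no comment, different spacing.
disguised = '''
def f(a):
  b = 0
  for c in a:
    if c > 0: b += c
  return b
'''

# Same behavior, but written with a different construct.
restructured = '''
def total(values):
    return sum(v for v in values if v > 0)
'''

same_after_disguise = syntactic_features(original) == syntactic_features(disguised)
same_after_restructure = syntactic_features(original) == syntactic_features(restructured)
print("disguised copy matches original:", same_after_disguise)
print("restructured copy matches original:", same_after_restructure)
```

Because comments, whitespace and identifier spellings never reach the syntax tree, the disguised copy produces the same counts as the original, while the restructured version does not. Only a change in how the logic is expressed moves these features.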
To test how well their code stylometry can identify the author of a piece of code, the researchers gathered publicly available data from Google’s Code Jam, an annual programming competition which attracts a wide range of programmers, from students to professionals to hobbyists. They looked at C++ source code from the 2008 to 2014 competitions written by more than 100,000 contestants. Their basic approach was to take a group of coders’ solutions to the same set of problems as a training dataset, in order to learn the style of each coder. They then looked blindly at solutions the same coders wrote to another problem and tried to identify the author of each.
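The train-then-identify procedure described above can be sketched in a few lines. This is a toy illustration under loud assumptions: the classifier here is a simple nearest-profile match using cosine similarity, swapped in for whatever machine learning model the researchers used; the code is Python rather than C++; and the authors `author_a` and `author_b`, their snippets, and the helper names are all made up for the example.

```python
import ast
import math
from collections import Counter


def node_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a snippet."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(training: dict) -> dict:
    """Sum the feature counts of each author's training solutions."""
    profiles = {}
    for author, sources in training.items():
        profile = Counter()
        for source in sources:
            profile += node_counts(source)
        profiles[author] = profile
    return profiles


def identify(source: str, profiles: dict) -> str:
    """Return the author whose profile is most similar to the snippet."""
    features = node_counts(source)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


# Made-up training data: both authors solve the same two problems.
training = {
    "author_a": [
        "def squares(n):\n    return [i * i for i in range(n)]\n",
        "def evens(xs):\n    return [x for x in xs if x % 2 == 0]\n",
    ],
    "author_b": [
        "def squares(n):\n    out = []\n    for i in range(n):\n"
        "        out.append(i * i)\n    return out\n",
        "def evens(xs):\n    out = []\n    for x in xs:\n"
        "        if x % 2 == 0:\n            out.append(x)\n    return out\n",
    ],
}

# Unlabeled solutions to a different problem, one in each style.
unknown_1 = "def lengths(words):\n    return [len(w) for w in words]\n"
unknown_2 = (
    "def lengths(words):\n    res = []\n    for w in words:\n"
    "        res.append(len(w))\n    return res\n"
)

profiles = build_profiles(training)
print(identify(unknown_1, profiles))
print(identify(unknown_2, profiles))
```

With only two authors and two tiny samples each, this says nothing about real-world accuracy. It only shows the shape of the experiment: learn a per-author profile from solutions to shared problems, then attribute an unseen solution to the closest profile.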
Here are some of their key findings:
- Their code stylometry achieved 95% accuracy in identifying the author of anonymous code. That was based on data from 250 coders over multiple years, averaging 630 lines of code per author. Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate was even higher, 97%.
- Accuracy rates weren’t statistically different when the code was run through off-the-shelf C++ code obfuscators. Since these tools generally work by renaming identifiers and removing spaces and comments, the syntactic feature set wasn’t changed, so author identification at similar rates was still possible.
- Coding style is more distinctive in solutions to harder problems. The identification accuracy rate improved when the training dataset was based on more difficult programming problems. The authors weren’t sure which way the causality runs, however, writing:
“This might indicate that as programmers become more advanced, they build a stronger coding style compared to newbies. There is another possibility that maybe better programmers start out with a more unique coding style. It is hard to say if good programmers are born or made.”
For this system to be used in practice, of course, someone first has to obtain coding-style samples for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, identifying the author of a piece of code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries.
In any case, though, be aware that your fingerprints are all over your code, for better or for worse.