If you’re a software developer, what percentage of the code you write would you say represents actual functionality, as opposed to filler, fluff, or code the language simply requires in order to run? 95%? 75%? 50%? Whatever you guessed, you’re probably way off: new research has found that only about 5% of written code captures the core functionality it provides.
In a new paper titled A Study of “Wheat” and “Chaff” in Source Code, researchers from the University of California, Davis, Southeast University in China, and University College London theorized that, just as with natural languages, some (and probably most) written code isn’t necessary to convey the point of what it does. Comparing programming methods to stalks of wheat, the authors argued that some parts of the code are the “wheat,” which represents the semantic core of a function, while everything else is the “chaff.”
The authors claimed that the wheat of a function could be encapsulated by small sets of keywords, which they called the “minimum distinguishing subset” or MINSET. The MINSET can be derived by breaking a method down into lexemes (i.e., code delimited by space or punctuation), discarding what’s not important to the behavior of the function and mapping the remaining ones to keywords. Those keywords then make up the MINSET.
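To make the idea concrete, here is a minimal Python sketch of one plausible reading of “minimum distinguishing subset”: the smallest set of a method’s lexemes that no other method in the corpus fully contains. This is an illustration under stated assumptions, not the authors’ implementation. The function names `lexemes` and `minset` and the two toy Java methods are made up for the example, the search is brute force, and the step of mapping lexemes to keywords is skipped entirely.

```python
import re
from itertools import combinations
from typing import List, Optional, Set


def lexemes(source: str) -> Set[str]:
    """Unique lexemes of a method: runs of word characters,
    i.e. code delimited by whitespace or punctuation."""
    return set(re.findall(r"\w+", source))


def minset(target: Set[str], others: List[Set[str]]) -> Optional[Set[str]]:
    """Smallest subset of `target` that is not contained in any other
    method's lexeme set. Brute force, smallest sizes first (assumption:
    fine for toy inputs, far too slow for a real corpus)."""
    for size in range(1, len(target) + 1):
        for combo in combinations(sorted(target), size):
            candidate = set(combo)
            if not any(candidate <= other for other in others):
                return candidate
    return None  # target cannot be told apart from some other method


# Hypothetical two-method "corpus" for illustration only.
sum_src = "int sum(int[] xs) { int total = 0; for (int x : xs) total += x; return total; }"
max_src = "int max(int[] xs) { int best = xs[0]; for (int x : xs) if (x > best) best = x; return best; }"

sum_minset = minset(lexemes(sum_src), [lexemes(max_src)])
print(sum_minset)
```

In this toy corpus a single lexeme is enough to tell the two methods apart, which loosely mirrors the paper’s finding that MINSETs tend to be tiny.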
To test their theory that the MINSETs of functions are, in fact, a small percentage of the code written, in the summer of 2012 the researchers downloaded 1,000 of the most popular Java projects from Apache, Eclipse, GitHub, and SourceForge. That gave them 100 million lines of Java code, from which they tossed out simple methods (those with fewer than 50 tokens). That left just under 1.9 million distinct methods, from which they then randomly sampled 10,000 and determined their MINSETs. The code and data used in the study are available for download from Bitbucket.
Here are their main findings:
- MINSETs are surprisingly small. The mean MINSET size of a method was 1.55 keywords and the largest consisted of 6.
- MINSET size didn’t increase with method size. When looking only at the 1,000 largest methods, the average and max MINSET size actually decreased to 1.12 and 4, respectively. This indicates, the authors wrote, that “minsets are small and potentially effective indices of unique information even for abnormally large methods.”
- Most code is almost all chaff. On average, only 4.6% of the unique lexemes in a method make up the MINSET. That is, over 95% of a method’s unique lexemes are chaff.
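Note that the wheat-to-chaff figure is a ratio over a method’s unique lexemes, not over its lines of code. A rough sketch of that arithmetic in Python follows. The function name `wheat_fraction` and the example counts (2 keywords out of 40 unique lexemes) are assumptions chosen for illustration, not numbers from the paper.

```python
def wheat_fraction(minset_size: int, unique_lexemes: int) -> float:
    """Share of a method's unique lexemes that belong to its MINSET."""
    if unique_lexemes <= 0:
        raise ValueError("a method must have at least one lexeme")
    return minset_size / unique_lexemes


# Hypothetical method: 2 MINSET keywords out of 40 unique lexemes.
wheat = wheat_fraction(2, 40)
chaff = 1 - wheat
print(f"wheat: {wheat:.1%}, chaff: {chaff:.1%}")

# The study's reported average: 4.6% wheat leaves 95.4% chaff.
reported_chaff = 1 - 0.046
```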
A couple of important points are worth keeping in mind here. First, the MINSET itself is not executable; it’s merely the smallest subset of the code that characterizes the core functionality. Some of the other 95% of the code (the chaff) is required to make it run, so it’s not useless. Second, while this study only looked at Java code, the authors expect these findings would hold true for other languages, particularly C and C++, due to the similarities of the languages.
What are the implications of this work? The researchers mention a number of potential applications of MINSETs:
- Improved code search - MINSETs could be used to rank code search results based on similarity to a query.
- Smarter IDEs - IDEs that have an indexed database of MINSETs could propose similar code, support code auto-completion and speed up debugging.
- Alternative forms of programming - MINSETs could be used to support keyword-based programming, i.e., creating usable code from a small set of keywords.
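As a rough illustration of the code search idea, the Python sketch below ranks methods by how closely their MINSETs match a set of query keywords. The article doesn’t say how similarity would be measured, so Jaccard similarity is used here purely as an assumption. The index contents and the names `jaccard` and `rank` are hypothetical as well.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(query: Set[str], index: Dict[str, Set[str]]) -> List[str]:
    """Method names ordered by MINSET similarity to the query,
    best match first; methods with no overlap are dropped."""
    scored = [(jaccard(query, ms), name) for name, ms in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Hypothetical index mapping method names to their MINSETs.
index = {
    "readFile": {"reader", "close"},
    "sortList": {"sort", "compare"},
    "copyStream": {"reader", "writer", "flush"},
}

results = rank({"reader", "close"}, index)
print(results)
```

Because MINSETs are so small, an index like this would stay compact even for a very large corpus, which is part of what makes the idea attractive for search.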
Who knew programmers and farmers had so much in common?