GitHub, as you probably know, has quickly become the main (or, at least, the best-known) host for open source software. It currently holds millions of code repositories in hundreds of programming languages, supported by several million users. The sheer volume of activity on GitHub reflects the growing popularity of open source software and the commitment of many people to working together to improve code across many different programming languages.
But I wondered recently whether some programming languages tend to generate more open source contributions than others. Thanks to Google BigQuery, I was able to poke around the raw GitHub Archive data myself to look into this. Specifically, to try to quantify it, I looked at the average number of pull requests opened per GitHub repository, by programming language. I thought that would be a good (but certainly not perfect) proxy for the number of contributions (or attempted contributions) to a code base by someone other than the repository owner.
First, let me present the results.
Now, here's my methodology:
- First, I queried the number of (non-forked) repositories per programming language using the following:
SELECT repository_language, COUNT(DISTINCT repository_url) AS cnt
FROM [githubarchive:github.timeline]
WHERE repository_fork == "false"
GROUP BY repository_language;
This gave me results for 150 programming languages covering over 4.1 million repositories (I ignored repositories with no programming language specified), for an average of 27,473 repositories per language.
- I then queried the number of pull requests opened per programming language using the following:
SELECT repository_language, COUNT(*) AS cnt
FROM [githubarchive:github.timeline]
WHERE repository_fork == "false"
AND type = "PullRequestEvent" AND payload_action = "opened"
GROUP BY repository_language;
Again ignoring repositories without a programming language specified, this gave me a total of just under 2.8 million pull requests across the 150 programming languages, for an average of 20,567 pull requests per language.
Overall, for the 4.1 million repositories with a programming language specified, the average number of pull requests opened per repository was 0.67.
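To make the arithmetic behind these averages concrete, here is a minimal Python sketch of how the two query results could be combined. It is not the code used for this post: the function names are hypothetical, and the counts are made-up placeholders rather than real GitHub Archive numbers. It assumes each query result has been loaded into a dictionary keyed by language, with rows lacking a language keyed by None.

```python
def pulls_per_repo(repo_counts, pull_counts):
    """Return {language: pull requests opened per repository}."""
    ratios = {}
    for language, repos in repo_counts.items():
        if not language or repos == 0:
            continue  # skip rows with no language specified (or no repositories)
        ratios[language] = pull_counts.get(language, 0) / repos
    return ratios


def overall_average(repo_counts, pull_counts):
    """Total pull requests divided by total repositories, named languages only."""
    total_repos = sum(n for lang, n in repo_counts.items() if lang)
    total_pulls = sum(
        n for lang, n in pull_counts.items() if lang and lang in repo_counts
    )
    return total_pulls / total_repos if total_repos else 0.0


# Placeholder counts for illustration only; not real GitHub Archive data.
repo_counts = {"LangA": 200, "LangB": 100, None: 50}
pull_counts = {"LangA": 150, "LangB": 50, None: 10}

ratios = pulls_per_repo(repo_counts, pull_counts)
overall = overall_average(repo_counts, pull_counts)

# Rank languages from most to fewest pull requests per repository.
for language, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{language}: {ratio:.2f}")
print(f"overall: {overall:.2f}")
```

Note that the overall figure here is total pull requests divided by total repositories, which weights large languages more heavily than a simple mean of the per-language ratios would.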
From the results, then, we see that Python repositories generate the most pull requests, on average, with 0.94 per repository. It's interesting that, by this measure, the Python community is the most giving, even though the language currently ranks only eighth in popularity in the latest TIOBE rankings. So the Python community may be smaller than, say, the Java community (#2 on the TIOBE list), but this suggests it's a tighter group.
Does this really mean that the Python community is more helpful than other language communities? Not necessarily, of course. Using the number of GitHub pull requests as a proxy for outside (non-repository-owner) contributions is far from perfect. Pull requests can come from outsiders forking and updating a repository, or from other project owners working on the same project but using a shared repository model. Maybe Python developers are simply more likely to use shared repositories and pull requests as a development methodology.
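One way to tell those two kinds of pull request apart is to compare the repository a pull request comes from (its head) with the repository it targets (its base). The Python sketch below is only an illustration, not something I ran for this post. It assumes each event is available as a dictionary shaped like the GitHub API's PullRequestEvent payload, which may not match the flattened columns of the BigQuery table. The function name and the sample events are hypothetical.

```python
from collections import Counter


def pull_request_origin(event):
    """Classify a pull request event as 'fork', 'shared', or 'unknown'."""
    pr = event.get("payload", {}).get("pull_request", {})
    head_repo = (pr.get("head") or {}).get("repo")
    base_repo = (pr.get("base") or {}).get("repo")
    if not head_repo or not base_repo:
        return "unknown"  # e.g. the source fork has since been deleted
    if head_repo.get("full_name") == base_repo.get("full_name"):
        return "shared"  # branch pushed to the same repository
    return "fork"  # branch lives in a different repository


def make_event(head_name, base_name):
    """Build a hypothetical sample event for illustration."""
    head = {"repo": {"full_name": head_name}} if head_name else {"repo": None}
    return {
        "type": "PullRequestEvent",
        "payload": {
            "action": "opened",
            "pull_request": {
                "head": head,
                "base": {"repo": {"full_name": base_name}},
            },
        },
    }


# Made-up events; the repository names are placeholders.
events = [
    make_event("outsider/project", "owner/project"),
    make_event("owner/project", "owner/project"),
    make_event(None, "owner/project"),
]

origins = Counter(pull_request_origin(e) for e in events)
print(dict(origins))
```

Counting only the pull requests classified as coming from a fork, per language, would give a somewhat tighter proxy for outside contributions, though a project member can open a pull request from a personal fork too.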
Still, I think the results are interesting and suggest that, if you want to choose a programming language for a project that has a large, active and helpful community of developers behind it, you could do a lot worse than Python.
Read more of Phil Johnson's #Tech blog at ITworld.