Tuesday, September 8, 2020

Open Source Saturation

In Supporting Open Source Software I discussed the critical need for better support for contributors to open source projects. Now, Quo Vadis, Open Source? The Limits of Open Source Growth by Michael Dorner, Maximilian Capraro and Ann Barcomb presents statistical evidence suggesting that this problem is affecting the vitality of the open source environment. Follow me below the fold for the details.

Their abstract reads (my emphasis):
Open source software plays a significant role in the software industry. Prior work described open source to be growing polynomially or even exponentially. However, such growth cannot be sustained infinitely given finite resources. In this study, we present the results of four accumulated measurements on size and growth of open source considering over 224,000 open source projects for the last 25 years. For each of those projects, we measured lines of code, commits, contributors and lifecycle state over time, which reproduces and replicates the measurements of three well-cited studies. We found the number of active open source projects has been shrinking since 2016 and the number of contributors and commits has decreased from a peak in 2013. Open source -- although initially growing at exponential rate -- is not growing anymore. We believe it has reached saturation.
Commit rate
The authors observe (Fig. 4) that the monthly commit rate grew exponentially until 2010, peaked in 2013, and then declined until, in 2019, it matched the rate from 2007.

I would suggest that this is an application of W. Brian Arthur's model of the tech economy. When a new market niche is discovered, many efforts to address it will be started. Over time increasing returns to scale (network effects) will propel a few, perhaps only one, of them to dominate the niche.

The winners will probably grow to dominate the niche by adding features (commits) rapidly. The losers will be unable to keep up, so their commit rates will drop and they will become inactive, and then abandoned. As competition for the winners decreases, they will add fewer features. Their commits will increasingly be bug fixes or responses to vulnerabilities, so their rate will decrease too, but it will stabilize at a lower, roughly constant level. Note that this model does not apply to a few large, aggregate projects such as the Linux kernel.
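To make the increasing-returns argument concrete, here is a minimal toy sketch in Python. It is a deterministic simplification of my own, not Arthur's actual (stochastic) formulation: the function name, the starting shares, and the `returns` exponent are all assumptions chosen for illustration. Each round, a project's pull on new adopters is its current share raised to a power greater than one, so a small early lead compounds.

```python
def iterate_shares(shares, returns=2.0, rounds=5):
    """Toy model: reweight shares so larger projects attract
    disproportionately more adopters (returns > 1)."""
    history = [list(shares)]
    for _ in range(rounds):
        # Attraction grows faster than linearly with current share.
        weights = [s ** returns for s in shares]
        total = sum(weights)
        shares = [w / total for w in weights]
        history.append(shares)
    return history


# Hypothetical starting shares for five projects in one niche.
start = [0.30, 0.25, 0.20, 0.15, 0.10]
for step, shares in enumerate(iterate_shares(start)):
    print(step, [round(s, 3) for s in shares])
```

With these made-up numbers the project that starts with a 30% share holds more than 99% after five rounds. Setting `returns=1.0` leaves the shares unchanged, which is the case without network effects.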

Project lifecycle
Perhaps the most interesting graph is Fig. 6, showing the population of open source projects at different stages of their lifecycle through time. The authors observe that:
we were able to confirm an exponential growth until 2013 over all available projects, while most of the projects are inactive – they do not receiving [sic] a commit within a given month. ... most inactive projects never receive a contribution again – they are abandoned.
And also that:
The portion of actively developed open source projects which receive at least one contribution in a given month is small and approximately constant over time.
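The lifecycle states quoted above can be sketched in a few lines of Python. This is my reading of the quoted definitions, not the authors' code: the function name and the input format (a list of monthly commit counts starting at the project's first commit) are assumptions.

```python
def lifecycle_states(monthly_commits):
    """Label each month as active, inactive, or abandoned.

    active:    at least one commit in that month
    inactive:  no commit that month, but a later commit exists
    abandoned: no commit that month or in any later observed month
    """
    # Index of the last month with any commit (-1 if there is none).
    last_active = max(
        (i for i, c in enumerate(monthly_commits) if c > 0), default=-1
    )
    states = []
    for i, commits in enumerate(monthly_commits):
        if commits > 0:
            states.append("active")
        elif i > last_active:
            states.append("abandoned")
        else:
            states.append("inactive")
    return states


print(lifecycle_states([3, 0, 2, 0, 0]))
# ['active', 'inactive', 'active', 'abandoned', 'abandoned']
```

Note that "abandoned" can only be judged within the observation window; a project that looks abandoned at the end of the data might still receive a commit later.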
If I'm right, the active group of projects will comprise those large projects, typically infrastructure, and a group of smaller projects still competing to dominate their niche. It would be very interesting to stratify Dorner et al.'s active project data by both age and size. One would expect two groups:
  • Larger, older, actively maintained projects.
  • Smaller, younger projects that, over time, become inactive.
Note that a smaller project that is inactive may be widely used. It has dominated its niche, acquired all important features, and shaken out all the easy-to-find bugs. A small proportion of smaller projects may remain in this mature state until obsoleted.
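As a rough sketch of the stratification suggested above, one could bucket active projects by age and size and count each bucket. Everything here is hypothetical: the `Project` record, the cut-off values (5 years, 100,000 lines of code), and the sample data are assumptions for illustration, not values from the paper or from Open Hub.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    age_years: float
    lines_of_code: int


def stratum(p, age_cut=5.0, size_cut=100_000):
    """Return a (size, age) label for one project; cut-offs are arbitrary."""
    size = "larger" if p.lines_of_code >= size_cut else "smaller"
    age = "older" if p.age_years >= age_cut else "younger"
    return (size, age)


def stratify(projects, **cuts):
    """Count active projects per (size, age) stratum."""
    return Counter(stratum(p, **cuts) for p in projects)


# Made-up sample of active projects.
projects = [
    Project("kernel-like", 20, 5_000_000),
    Project("new-tool", 1, 4_000),
    Project("new-lib", 2, 12_000),
    Project("old-small", 12, 8_000),
]
counts = stratify(projects)
for key, n in sorted(counts.items()):
    print(key, n)
```

If the two-group expectation holds, most of the weight should fall in the ("larger", "older") and ("smaller", "younger") buckets, with the other two buckets nearly empty.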

Active contributor rate
The authors observe (Fig. 7) that the number of active contributors per month behaves similarly to the number of commits per month. It grew exponentially until 2010, peaked in 2013, and then declined until, in 2019, it matched the level from 2007.

Of course, if all contributors were equally productive, this is what one would expect. They aren't; it is well known that programmer productivity follows a long-tailed distribution. But presumably the distribution doesn't change much with time, so the difference between the mean and median productivity is relatively stable, leading to the same result.
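A small Python sketch of that argument, using an invented long-tailed productivity profile (the numbers are assumptions, not data from the paper): as long as the profile stays the same, total commits track the number of contributors even though the mean and median differ widely.

```python
from statistics import mean, median

# Hypothetical commits per contributor per month: most contribute
# little, a few contribute a lot.
profile = [1, 1, 1, 2, 2, 3, 5, 8, 20, 57]


def monthly_commits(n_contributors, profile=profile):
    """Total commits if contributors are drawn evenly from a fixed profile."""
    return sum(profile[i % len(profile)] for i in range(n_contributors))


print(mean(profile), median(profile))   # 10 2.5
print(monthly_commits(1000))            # 10000
print(monthly_commits(500))             # 5000
```

Halving the contributor count halves the commit count because the mean of the profile is unchanged; the gap between mean and median affects who does the work, not the total.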

The authors suggest some possible explanations for the saturation of open source that they observe:
  • A decrease in developers willing to volunteer, and no corresponding increase in paid development work
  • The shift from volunteer to paid contributions reducing the effective time for contributing for each participant, due to company resource management
  • An increase in episodic participation [3], with more people preferring to volunteer less
  • A generational shift (the mean age of contributors in 2005 was 31, and in 2017 it was 30 [17,18]) from collective to reflexive volunteering [21], perhaps in response to the growing role of open source participation in career development
  • Increasing code complexity requiring skills fewer developers possess, and discouraging newcomers [45]
  • Increasing formalization of software projects, requiring significant effort on the part of developers to adhere to submission or foundation guidelines
  • A decreased quality of contributions and, therefore, a lower acceptance rate of contributions and an overload for reviewers and committers
These all seem plausible, but I would add one more, again based on W. Brian Arthur's model. Over time, the likelihood that a newly opened market niche is adjacent to an existing winner increases. So demand for new features is increasingly likely to be satisfied by one of its regular contributors adding a small number of commits to an existing project, rather than several new contributors making a larger number of commits to start a new project. This is analogous to the way the absence of anti-trust enforcement allows the tech oligopolists to suppress competition from new startups.

Tip of the hat to Glyn Moody, who concludes:
The new research might be an indication that the open source community, which has selflessly given so much for decades, is showing signs of altruism fatigue. Now would be a good time for companies to start giving back by supporting open source projects to a much greater degree than they have so far.


Stefane Fermigier said...

As indicated in the paper: "The most serious threat to our the validity of our study is the unknown precision and accuracy of Open Hub as measurement system."

I went to my OpenHub profile (https://www.openhub.net/accounts/sfermigier) and, according to it, I haven't committed since 2014 (which is not true at all).

I'm pretty sure the OpenHub data are massively bogus, hence the paper too (Garbage in, Garbage out).

Etienne Juliot said...


These statistics count the number of commits. But since Git is now mainstream for most Open Source projects, it dramatically changes the number of commits you need to push a new feature. And 2010 is close to when lots of Open Source projects switched to Git.
First, I remember when we were working with SVN, we were doing lots of small commits to avoid conflicts and because it was the way to go. Now, with Git AND an efficient CI, we commit when the feature is coded, the tests are OK, the documentation is here, and the build is without regression. It changes the game for this number of commits.
As you can commit locally and then squash your commits to make them coarse-grained, it can influence this number too.
It would be interesting too to know whether this number covers only master, or also all branches and even the review branches. At my company (Obeo), we are working on several Open Source projects of the Eclipse Foundation, and our workflow is to commit all ongoing work to Gerrit, and to use this Gerrit branch to test and validate. Then, only when it is finished, Gerrit pushes the result to master (or an official branch), with one commit. I checked at https://www.openhub.net/p/eclipse_sirius and I highly suspect that commits inside Gerrit, which represent the large majority of our work, aren't counted.