The half-life of code & the ship of Theseus

Erik Bernhardsson
2016-12-05

As a project evolves, does the new code just add on top of the old code? Or does it replace the old code slowly over time? In order to understand this, I built a little thing to analyze Git projects, with help from the formidable GitPython project. The idea is to go back to a series of historical points in the repository and run a git blame at each one. (Making this somewhat fast was a bit nontrivial, as it turns out, but I’ll spare you the details, which involve some opportunistic caching of files, picking historical points spread out in time, using git diff to invalidate changed files, etc.)

In a moment of clarity, I named it “Git of Theseus” as a terrible pun on the ship of Theseus. I’m a dad now, so I can make terrible puns. It refers to a philosophical paradox, where the pieces of a ship are replaced over hundreds of years. If all pieces are replaced, is it still the same ship?

“The ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, insomuch that this ship became a standing example among the philosophers, for the logical question of things that grow; one side holding that the ship remained the same, and the other contending that it was not the same.”

It turns out that code doesn’t exactly evolve the way I expected. There is a “ship of Theseus” effect, but there’s also a compounding effect where codebases keep growing over time (maybe I should call it the “Second Avenue Subway” effect, after the construction project in NYC that’s been going on since 1919).

Let’s start by analyzing Git itself. Git became self-hosting early on, and it’s one of the most popular and oldest Git projects:

This plots the aggregate number of lines of code over time, broken down into cohorts by the year added.
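The cohort breakdown behind a plot like this can be sketched roughly as follows. This is a minimal illustration, not the actual Git of Theseus implementation (which uses GitPython and the caching tricks mentioned earlier). It assumes you already have the text output of `git blame --line-porcelain` for a file at some revision; the function name `lines_per_year` and the trimmed sample input are made up for the example, and real porcelain output carries more header fields per line than shown here.

```python
from collections import Counter
from datetime import datetime, timezone


def lines_per_year(porcelain_text):
    """Count blamed lines by the year of the commit each line is attributed to.

    Expects the output of `git blame --line-porcelain`, where every source
    line is preceded by header lines, one of which is `author-time <epoch>`.
    """
    counts = Counter()
    for line in porcelain_text.splitlines():
        # Source lines start with a tab, so they can never match this prefix.
        if line.startswith("author-time "):
            timestamp = int(line.split()[1])
            year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
            counts[year] += 1
    return counts


# Trimmed, made-up sample: three lines, two attributed to 2006 and one to 2016.
SAMPLE = "\n".join([
    "1111111111111111111111111111111111111111 1 1 1",
    "author-time 1150000000",
    "\tint main(void) {",
    "2222222222222222222222222222222222222222 2 2 1",
    "author-time 1480000000",
    "\t    return 0;",
    "1111111111111111111111111111111111111111 3 3 1",
    "author-time 1150000000",
    "\t}",
])

cohorts = lines_per_year(SAMPLE)
print(dict(cohorts))
```

Summing these counters over all files at each sampled point in history would give one column of the stacked plot. Note that blame attributes a line to the last commit that touched it, so "year added" really means "year last modified".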
I would have expected more of a decay here, and I’m surprised to see that so much code written back in 2006 is still alive in the code base. Interesting!

We can compute the decay for individual commits too. If we align all commits at x=0, we can look at the aggregate decay for code in a certain repo. This analysis is somewhat harder to implement than it sounds, mostly because newer commits have had less time to decay, so the right end of the curve represents an aggregate of fewer commits.

For Git, this plot looks like this:

Even after 10 years, 40% of the lines of code are still present! Let’s look at a broader range of (somewhat randomly selected) open source projects:

It looks like Git is somewhat of an outlier here. Fitting an exponential decay to Git and solving for the half-life gives approximately 6 years.

Hmm… not convinced this is necessarily a perfect fit, but as the famous quote goes: All models are wrong, some models are useful. I like the explanatory power of an exponential decay: code has an expected lifetime and a constant risk of being replaced.

I suspect a slightly better model would be to fit a sum of exponentials. This would work for a repo with some code that changes fast and some code that changes slowly. But before going down a rabbit hole of curve fitting, I reminded myself of von Neumann’s quote: With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. There’s probably some way to make it work, but I’ll revisit it some other time.

Let’s look at a lot of projects in aggregate (also sampled somewhat arbitrarily):

In aggregate, the half-life is roughly 3.33 years. I like that, it’s an easy number to remember. But the spread between different projects is big. The aggregate model doesn’t necessarily have super strong predictive power: it’s hard to point to an arbitrary open source project and expect half of it to be gone 3.33 years later.
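To make the half-life calculation concrete, here is a minimal sketch of fitting an exponential decay and solving for the half-life. It is not necessarily the fitting method used for the plots above. It assumes the surviving fraction follows f(t) = exp(-k t) with f(0) = 1, fits k by least squares on ln f(t), and uses half-life = ln(2) / k. The function names and the data points are invented for illustration.

```python
import math


def fit_decay_rate(ages_years, surviving_fractions):
    """Fit ln(fraction) = -k * age by least squares, forced through the origin.

    Assumes every fraction is strictly positive (log of 0 is undefined).
    """
    numerator = sum(t * math.log(f) for t, f in zip(ages_years, surviving_fractions))
    denominator = sum(t * t for t in ages_years)
    return -numerator / denominator


def half_life(decay_rate):
    """Time until half of the code is gone, for a constant decay rate."""
    return math.log(2) / decay_rate


# Synthetic survival curve generated from a known half-life (not real repo data).
true_half_life = 6.0
ages = [1, 2, 4, 6, 8, 10]
fractions = [0.5 ** (t / true_half_life) for t in ages]

k = fit_decay_rate(ages, fractions)
estimated = half_life(k)
print(round(estimated, 3))
```

On noise-free synthetic data the fit recovers the half-life it was generated from. Real survival curves are noisier, and a sum of exponentials would need a nonlinear fit rather than this log-linear shortcut.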
Moar repos

Apache (aka HTTPD) is another repo that goes way back:

Rails:

Beautiful exponential fit!

Node:

Wanna run it for your own repo? Again, code is available here.

The monster repo of them all

Note that most of these repos took at most a few minutes to analyze, using my script. As a final test I decided to run it over the Linux kernel, which is HUGE: 635,229 commits as of today. This is 16 times larger than the second biggest repo I looked at (rails) and took multiple days to analyze on my shitty computer. To make it faster I ended up computing the full git blame only for commits spaced at least 3 weeks apart, and also limited it to .c files:

The squiggly lines are probably from the sampling mechanism. But look at this beauty: a whopping 16M lines! The code contribution from each year’s cohort is extremely smooth at this scale. Individual commits have absolutely no meaning at this scale, but the cumulative sum of them is very predictable. It’s like going from Newton’s laws to thermodynamics.

Linux also clearly exhibits more of a linear growth pattern. I’m speculating that this has to do with its high modularity. The drivers directory has by far the most files (22,091), followed by arch (17,967), which contains support for various architectures. This is exactly the kind of thing you would expect to scale very well with complexity, since it has a well-defined interface.

Somewhat off topic, but I like the notion of how well a project scales with complexity. Linear scalability is the ultimate goal, where each marginal feature takes roughly the same amount of code. Bad projects scale superlinearly, and every marginal feature takes more and more code.

It’s interesting to go back and contrast Linux to something like Angular, which basically exhibits the opposite behavior:

The half-life of a randomly selected line in Angular is about 0.32 years. Does this reflect on Angular? Is the architecture basically not as “linear” and consistent?
You might say the comparison is unfair, because Angular is new. That’s a fair point. But I wouldn’t be surprised if it does reflect on some questionable design. Don’t mean to be shitting on Angular here, but it’s an interesting contrast.

Half-life by repository

A somewhat arbitrary sample of projects and their half-lives:

project        half-life (years)   first commit
angular        0.32                2014
bluebird       0.56                2013
kubernetes     0.59                2014
keras          0.69                2015
tensorflow     1.08                2015
express        1.23                2009
scikit-learn   1.29                2011
luigi          1.30                2012
backbone       1.48                2010
ansible        1.52                2012
react          1.66                2013
node           1.76                2009
underscore     1.97                2009
requests       2.10                2011
rails          2.43                2004
django         3.38                2005
theano         3.71                2008
numpy          4.15                2006
moment         4.54                2015
scipy          4.62                2007
tornado        4.80                2009
redis          5.20                2010
flask          5.22                2010
httpd          5.38                1999
git            6.04                2005
chef           6.18                2008
linux          6.60                2005

It’s interesting that moment has such a high half-life, but the reason is that so much of the code is locale-specific. This creates a more linear scalability, with a stable core of code and linear additions over time. express is an outlier in the other direction. It’s 7 years old but its code changes extremely quickly. I’m guessing this is partly because of (a) a lack of linear scalability in the code and (b) it probably being one of the first major JavaScript open source projects to hit mainstream popularity, surfing on the Node.js wave. Possibly the code base also sucks, but I have no idea.

Has coding changed?

I can think of three reasons why there’s such a strong relationship between the year the project was initiated and the half-life:

1. Code churns more early on in projects, and becomes more stable a while in
2. Coding has changed from 2006 to 2016, and modern projects evolve faster
3. There’s some kind of selection bias where the only projects that survive are the scalable, stable ones

Interestingly, I don’t find any clear evidence of #1 in the data. The half-life of code written early in old projects is as high as that of later code.
I’m skeptical about #3 as well because I don’t see why there would be a relation between survival and code structure (but maybe there is). My conclusion is that writing code has fundamentally changed in the last 10 years. Code really seems to change at a much faster rate in modern projects.

By the way, see discussion on Hacker News and on Reddit!
